Hello,


I  explain my scalable RWLock algorithm:

You will read this inside the RLock() method:

//==============================================================================
procedure TRWLOCK.RLock(var myid:integer);


begin

myid:=GetCurrentProcessorNumber;

repeat

while (FCount3^.fcount3 = 1)
do if mysleep=0 then sleep(0)
else sleep(1);

LockedExchangeAdd(FCount1^[myid].fcount1,1);

if (FCount3^.fcount3 = 0)
then break
else
 begin
   LockedExchangeAdd(FCount1^[myid].fcount1,-1);
 end;

until false;

end;
//==============================================================================

procedure TRWLOCK.RUnlock(myid:integer);

begin

LockedExchangeAdd(FCount1^[myid].fcount1,-1);

end;

//==============================================================================

First take a look at the source code inside the construtor you will notice that  the fcount1^ array is 64 byte aligned, and each element on the fcount1^ array resides in a cache line , so that it makes my RWLock scalable, and there is as much elements as the number of cores inside fcount1^ array .

As you have noticed i am testing with an: if (FCount3^.fcount3 = 0), to see if there is no writers threads that entered the WLock() method and incremented FCount3^.fcount3, if the thread that entered WLock() has not yet incremented FCount3^.fcount3 the threads on the RLock() side will proceed and the thread that entered WLock() will wait until all the FCount1^[i].fcount1 will equal to 0  so as you have noticed with my algorithm the writers will not be starved. And as you have noticed there is an:  if (FCount3^.fcount3 =  1) just after LockedExchangeAdd(FCount1^[myid].fcount1,-1) to test if the writer has entered and incremented FCount3^.fcount3 by 1, so that the readers will not spin by incrementing and decrementing FCount1^[myid].fcount1 by 1 , but they will spin on a sleep(0) until FCount3^.fcount3 will equal to 0, so my new algorithm is efficient.

Notice carefully that my RWLock is scalable cause each element of the FCount1^ array resides in a  seperate cache line , hence when i am incrementing  with  LockedExchangeAdd(FCount1^[myid].fcount1,1) it's scaling, notice also that i am using the following: myid:=GetCurrentProcessorNumber
so i am puting the processor number inside myid variable.

On the WLock() and WUnlock() side you will read this:

//==============================================================================

procedure TRWLOCK.WLock;

var i:integer;

begin

while not CAS(FCount2^.FCount2,0,1)
do
 begin
  if mysleep=0 then sleep(0)
  else sleep(1);
 end;

FCount3^.fcount3:=1;

for i:=0 to GetSystemThreadCount-1 do
 begin
   while (FCount1^[i].fcount1<>0)
    do
     begin
      //if mysleep=0 then sleep(0)
      //else sleep(1);
     end;
 end;

end;

//==============================================================================

procedure TRWLOCK.WUnlock;
var i:integer;
begin

FCount3^.fcount3:=0;
//switchtothread;
sleep(0);
FCount2^.FCount2:=0;
end;
end.

//==============================================================================


As you have noticed i am using a CAS() as a critical section cause only one thread must enter the WLock(), after that i am assigning 1 to FCount3^.fcount3 to block the new readers threads entering the RLock() method, after that i am waiting for all FCount1^[i].fcount1 to equal 0 that means i am waiting for all the threads that entered the RLock() to exit with a RUnlock(), and in WUnlock() i am assigning 0 to FCount3^.fcount3  to unblock the readers threads and i am using also a sleep(0) so that to give a chance to the readers to run ,  so i think my algorithm is efficient and correct.

So if you have noticed in the "writers" side , each writer verifies if all the  FCount1^[i].fcount1 equal to zero , so you will say that my algorithm have a weakness in the writer side since every writer must transfer many cache lines so it will be slow, but be smart please and  imagine that we have 100 cores, and you have to start 100 threads , so when a writer will enter the RWlock() method it will set all the  FCount1^[i].fcount1 to zero  in its corresponding core,  and since every FCount1^[i].fcount1 resides in a core , so i think as soon a writer will finish to verify the  first or the second  FCount1^[i].fcount1 many other FCount1^[i].fcount1 may have been decremented to zero , so this will make the writer side faster , so i think my algorithm will still be fast on both the reader and the writer  and it will be scalable. Other than that if you have 100 cores you can start fewer than 100  threads and make also my algorithm faster in the "writers" side , so this will also make my algorithm faster on both the readers and the writers and my algorithm will be scalable.


Click here to return to my homepage: Homepage



Thank you,
Amine Moulay Ramdane.