Lock-free Data Structures. The Inside. RCU

C++

RCU

Today I will continue to introduce techniques that help to write lock-free containers. At the same time, I will advertise (hopefully not too obtrusive), my libcds library.

We will talk about one more technique of safe memory reclamation for lock-free containers – RCU. This technique differs significantly from the previously discussed algorithms, such as Hazard Pointer.

Read – Copy Update (RCU) is a synchronization technique designed for «almost read-only», meaning rarely changed, data structures. Typical examples of such structure are map and set, in which the majority of operations are search ones, i.e. reading data. It is believed that more than 90% of operations for a typical map are searches by key.

Therefore, it is important that the search operation is the fastest; search synchronization is not really necessary– when there are no writers, readers can work in parallel. RCU provides the lowest overhead exactly for read-operations.

Where did the Read – Copy Update name come from? At first, the idea was really simple: there is some rarely changed data structure. To modify it, we make a copy of it, and make changes – adding or deleting data – in the copy. At that, readers work with the original, not modified structure. At some safe moment in time, when there are no readers, we can replace the data structure with the modified copy. As a result, all subsequent readers will see changes made by the writer.

Paul McKenney is the creator and an active promoter of the RCU technique. He leads the school of “RCU fans” that has taught many well-known scientists in the field of lock-free and non-traditional synchronization schemes. Also, he is the Linux-kernel RCU maintainer and the author of some works on RCU.

RCU Grow

RCU has been implemented in the Linux kernel in 2002. Since then, it has been growing into the kernel code more and more, look at the graph on the right. For a long time, it has been positioned as a synchronization technique for the operating system kernel specifically. Since the kernel has full control over all threads, – both user and system – it’s quite easy to determine the safe moment in time for replacing data with the modified copy. But we are interested in the RCU application. Is it possible? Before answering this question, let’s take a closer look at the RCU theory and the terminology used in it.

RCU Overview

The provided above description of the RCU idea is very simplistic. As we know, having atomic operations, we don’t have to make a copy of the data, but can change the data structure “on the fly” along with reading it. Then the «reader» becomes the thread that executes any operation, except removing the element from the data structure. The thread, which removes anything from the structure will be called «writer». Removal should be performed at the time when no one has touched the data being removed. Otherwise, we will get a bunch of hard to detect problems — starting from the ABA problem and finishing with the memory corruption. RCU solves all of these problems, and a method is different from the previously considered Hazard Pointers scheme.

Readers in the RCU technique are performed in the read-side critical section. Entering such critical section, a reader invokes the rcu_read_lock() function, and rcu_read_unlock() when exiting it. These are really lightweight functions that have virtually no effect on performance; they weigh nothing in the Linux kernel (zero-overhead).

If the thread is not in the read-side critical section, it is in the quiescent state. Any period of time, during which each thread was in the quiescent state at least once, called the grace period. Each read-side critical section that has started before the grace period, should finish before the grace period completes. Every grace period is known finite, as any read-side critical section is finite (it is assumed that the number of threads of finite, and also that we are good programmers and avoid infinite loops, as well as thread crashes).

A writer thread which removes an element from the data structure, eliminates the element from the structure, and then waits for the grace period to complete. The grace period completion means that no reader has access to the element being removed (look at the chart, “reads” bars are the read-side critical sections).

Therefore, the writer thread can safely physically remove the element. Removal is carried out in two stages: the first stage is “removal” – atomically removes an element from the data structure, but does not perform physical memory deallocation. Instead, the writer defines the beginning of grace-period by calling the special synchronize_rcu() primitive and waits for its completion. The removed element can be available for reading only to readers that have declared their read-side critical section in parallel with the writer (such section are highlighted in grey on the picture above). By definition, all such readers will finish their work before the grace period completion. When the grace period completes, that is, when all read-side critical sections initiated or active during the grace period will complete, the second stage of removal will start – “reclamation” – that is, physical memory reclamation for the element.

As you can see, the RCU synchronization technique is quite simple. The question remains: how to determine the completion of the grace-period in the user code? The original RCU is designed for the Linux kernel, in which it is much easier to determine, as we have full control over all of the threads. Approaches of the original RCU are not applicable to the user space code.

User-space RCU

In 2009, the decision was made by M. Desnoyers, the representative of P. McKenney’s school, in his dissertation, chapter 6, titled “User-Level Implementations of RCU”.

M.Desnoyers offers the following three solutions for the user-space RCU (URCU):

  • Quiescent-State-Based Reclamation RCU is a really simple scheme for readers, but it requires that threads outside of the read-side critical section to periodically announced “I am in the quiescent state”. Such solution is not suitable for a general purpose library, which libcds is, so I will not consider it.
  • User-space RCU (General-Purpose URCU) is the algorithm suitable for general implementation. I will describe it below.
  • User-space RCU via Signal Handling is also an interesting algorithm based on signals (suitable for *nix systems, inapplicable for Windows). It is implemented in the libcds library. It shows a slightly lower performance, than the general-purpose RCU. I will not consider it in this article. If interested, refer to the dissertation by M. Desnoyers, as well as to the source code of libcds.

General-Purpose URCU

M. Desnoyers sufficiently details and carefully examines the URCU algorithm, so I can just follow him, changing only names of some variables and functions, so that they would correspond the ones in libcds.

Two variables are defined in the URCU scheme:

std::atomic     g_nGlobalCtl(1) ;
struct thread_record {
   std::atomic  nThreadCtl;
   thread_record *        pNext;
   thread_record(): nThreadCtl(0), pNext(nullptr) {}
};

The thread_record structure contains local data for the thread and binds all such objects in the list of RCU threads.

The least significant 31 bits of nThreadCtl contain the counter of depth of nested calls of URCU (URCU does allow almost unlimited nesting of read-side critical sections), the most significant bit defines the grace period identifier at the moment of a thread’s entering a read-side critical section. In the described scheme, two identifiers are enough for the grace period. The most significant bit of the g_nGlobalCtl variable contains an identifier of the current grace period, least significant bits serve for the per-thread initialization of nThreadCtl variables and do not change.

To enter/exit a read-side critical section, use access_lock and access_unlock functions respectively:

static uint32_t const c_nControlBit = 0x80000000;
static uint32_t const c_nNestMask =  c_nControlBit — 1;
void access_lock()
{
   thread_record * pRec = get_thread_record();
   assert( pRec != nullptr );
   uint32_t tmp = pRec->nThreadCtl.load( std::memory_order_relaxed );
   if ( (tmp & c_nNestMask) == 0 ) {
       pRec->nThreadCtl.store(g_nGlobalCtl.load( std::memory_order_relaxed ),
            std::memory_order_relaxed );
       std::thread_fence( std::memory_order_acquire );
   }
   else
       pRec->nThreadCtl.fetch_add( 1, std::memory_order_relaxed );
}
void access_unlock()
{
   thread_record * pRec = get_thread_record();
   assert( pRec != nullptr );
   pRec->nThreadCtl.fetch_sub( 1, std::memory_order_release );
}

Entering the URCU critical section, we check whether this call is nested. If it is (i.e. the counter in the least significant 31 bits is not zero), the nesting counter is incremented. But if the call is not nested, the nThreadCtl variable of the current thread is assigned with the value of the g_nGlobalCtl global variable. Thus, we mark that the critical section entering has been carried out during a certain grace period (the most significant bit g_nGlobalCtl), while 1 in least significant bits is initiated by the nesting counter of the current thread. During the first, the most outer entering the read-side critical section, the acquire memory barrier is applied. It ensures that the following code will not be moved (“optimized”) outside the barrier either by the processor, or the compiler. This provides the visibility of the current grace period of a thread to all processors. If we break this order, the URCU algorithm will fall apart. When entering a nested critical section, the barrier is not required, as the current grace period (the most significant bit) does not change.

When we exit a critical section (access_unlock), the nesting counter in nThreadCtl of the current thread is decremented. The release-semantics of an atomic operation is applied here. In fact, we need the release-barrier here only when exiting the highest critical section (when the nesting counter goes from 1 to 0). When exiting a nested critical section, the relaxed-semantics is enough. The release-barrier is required during the counter reset to zero, as when the nesting counter goes from 1 to 0, it’s like declaring that “the thread no longer uses RCU”. Actually, this means that we exit the grace period, which is critical for the URCU algorithm, as the breach of order either by the compiler or the processor will lead to a malfunction of the algorithm. “0 – not 0” situations discernment in the code will require conditional statement, which is unlikely to add performance to the access_unlock function. Besides, the basic pattern of using URCU critical sections is without nesting, so the release-semantics is always used here.

As can be seen, the read-side code is quite lightweight. The atomic read-write and thread-local data is used. It’s definitely not the zero-overhead, but still much better than a mutex or CAS.

Before physically removing an element, the writer thread should make sure that the grace period has completed. There are two conditions of the grace period completion:

  • The nThreadCtl least significant bits (the nesting counter) of each thread are equal to zero, which means that a thread is not in the URCU critical section
  • The nThreadCtl most significant bit does not coincide with the g_nGlobalCtl most significant bit, which means that the reader entered the critical section after the grace period had started.

The following function checks these conditions:

bool check_grace_period( thread_record * pRec )
{
   uint32_t const v = pRec->nThreadCtl.load( std::memory_order_relaxed );
   return (v & general_purpose_rcu::c_nNestMask)
      && ((( v ^ g_nGlobalCtl.load( std::memory_order_relaxed )) & ~c_nNestedMask ));       }

Before physically removing the element, the writer invokes the synchronize function that waits for the current grace period to complete.

std::mutex  g_Mutex ;
void synchronize()
{
   std::atomic_thread_fence( std::memory_order_acquire );
   {
      cds::lock::scoped_lock<:mutex> sl( g_Mutex );
      flip_and_wait();
      flip_and_wait();
   }
   std::atomic_thread_fence( std::memory_order_release );
}

g_Mutex here is a global for the URCU algorithm (URCU is a synchronization technique, so the mutex should definitely be here). Thus, just one writer thread can go into synchronize. Do not forget that RCU is designed for the “almost read-only” data, so there should not be much crowding on this mutex.

The writer waits for the grace period completion by calling the flip_and_wait function:

void flip_and_wait()
{
   g_nGlobalCtl.fetch_xor( c_nControlBit, std::memory_order_seq_cst );
   for (thread_record* pRec = g_ThreadList.head(std::memory_order_acquire);
         pRec!= nullptr; 
         pRec = pRec->m_pNext ) 
   {
     while ( check_grace_period( pRec )) 
     {
        sleep( 10 ); // wait for 10 milliseconds
        CDS_COMPILER_RW_BARRIER ;
     }
   }
}

With the help of the atomic fetch_xor, this function changes the identifier of the grace period, which means the beginning of a new grace period, and waits (by calling check_grace_period) till all reader threads complete the new grace period. In the pseudocode, the waiting is performed by a simple sleep for 10 milliseconds. In the real code, libcds uses a template parameter defining the back-off-strategy.

Why does the writer invoke flip_and_wait twice? To illustrate, let’s consider the following sequence of actions with A and B threads. Assume that there’s just one call for flip_and_wait in synchronize:

  • The A thread invokes access_lock. It is defined in the function body, that the call is not nested, the global g_nGlobalCtl is read, but not assigned to the nThreadCtl variable of the thread yet (everything is performed in parallel, such situation is quite acceptable).
  • The B thread invokes synchronize. flip_and_wait is called first. It changes the identifier bit of the grace period in g_nGlobalCtl. The current identifier of the grace period becomes 1.
  • Since there is nothing in the URCU critical section (remember that the A thread has not assigned a value to its nThreadCtl variable). The B thread completes synchronize.
  • The A thread performs assignment of its nThreadCtl variable. Remember that the thread has read the grace period old value, that is equal to 0.
  • The A thread completes access_lock and continues performing in the critical section.
  • The B thread calls synchronize one more time (apparently, it has something to remove). And again, the current grace period is converted to g_nGlobalCtl, so its identifier is 0 now.

But the A thread is in the critical section that had started earlier than B changed the grace period! It’s the violation of the URCU semantics that will lead to a lot of problems (ABA problem, memory corruption, etc). Remember that synchronize is called by the writer before it physically frees memory for the element.

Calling flip_and_wait twice, that is, waiting twice for the grace period to complete, we solve the above problem, the cause of which is concurrent execution of threads.

Another Solution

We can definitely solve this problem in another way. For example, if we use some counter instead of the identifier bit of the grace period. That’s when we face the problem considered in the article about the tagged pointer algorithm – the counter is subject to overflow. For reliability, the counter should be 32-bit. In this case, we will not be afraid of overflow. But such counter leads to the need for a 64-bit atomic type on 32-bit platforms. Either there is no such type, or it is rather inefficient. Or we will have to abandon nesting the URCU critical sections, which is not very convenient.

Therefore, we will focus on the general solution with a bit as the identifier of the grace period and call two flip_and_wait.

URCU Implementation In libcds

The mentioned above URCU algorithm is all good, except for the fact that it is required to call quite heavyweight synchronize before each removal. Is there any way to improve it?

Yes, there is, using the same method as in the Hazard Pointer algorithm. We should apply deferred reclamation. Instead of removing elements, we will place them in a buffer. We’ll invoke the synchronize function only when the buffer is full. Unlike in Hazard Pointers, a buffer in URCU will be shared by all threads (by the way, nothing prevents us from creating per-thread buffers).

Moreover, not to slow down the writer, that has to clear the buffer in case of overflow, we can ask a separate thread to clear the buffer, meaning actual removal.

The libcds library has five URCU implementations. All of them live in the cds::urcu namespace:

  • general_instant is the implementation that follows the described URCU algorithm: each removal invokes synchronize, no buffering. If removal is a fairly frequent operation, i.e. the structure is not really “almost read-only”, this implementation is rather slow.
  • general_buffered is the implementation with a general lock-free buffer of a predetermined size. Dmitry Vyukov’s queue will be used as the lock-free buffer: cds::container::VyukovMPMCCycleQueue. Such implementation performance is comparable to Hazard Pointer.
  • general_threaded is similar to general_buffered; a dedicated thread performs buffer clearing. Such implementation is inferior to general_buffered due to additional synchronization with a dedicated thread. At the same time, it does not slow down writers.
  • signal_buffered is the analogy of general_buffered, but it is based on the signal-handled URCU; not for Windows systems.
  • signal_threaded is the analogy of general_threaded for the signal-handled URCU; not for Windows systems either.

A big number of URCU implementations raise the problem of writing container specializations for URCU. The thing is that the implementation of containers for the URCU scheme differs significantly from the implementation for the Hazard Pointer. Therefore, a separate specialization for URCU is required. I would like to have one specialization, rather than five.

To simplify the writing of specialization for URCU, the cds::urcu::gc wrapper class has been introduced:

template class gc; in which RCUimpl is one of URCU implementations: general_instant, general_buffered etc.

Using this wrapper, it will be no problem to write only one URCU specialization:

template 
class SplitListMap, Key, Value, Traits > ...

It should be noted that the main function of the URCU algorithm in libcds during removal is not synchronize, but retire_ptr. This function places the removable element in the URCU buffer and calls synchronize at the right time (for example, when the buffer is full). So it is not required to call synchronize explicitly, although it is acceptable. In addition, this solution unifies the URCU interface and the Hazard Pointer.

All of these URCU algorithms are implemented in a typical libcds manner: there is a global singleton object for each variable. Its initialization is performed by calling the constructor of the cds::urcu::gccds::urcu::general_buffered > wrapper object at the beginning of main(), after calling cds::Initialize():

#include   //cds::Initialize и cds::Terminate
#include  // general_buffered URCU
int main(int argc, char** argv)
{
    // Initialize libcds
    cds::Initialize() ;
   {
       // Initialize the general_buffered URCU singleton
       cds::urcu::gc<:urcu::general_buffered> > gbRCU ;
       // If the main thread uses lock-free containers,
       // it should be attached  
       // to the libcds infrastructure 
       cds::threading::Manager::attachThread() ;
      // That’s it, libcds is ready for use
      // Next is your code
      ...
   }
   // Terminate libcds
   cds::Terminate() ;
}

Just like for the Hazard Pointer scheme, each thread using URCU-containers should be initialized in a special way:

// cds::threading::Manager
#include 
int myThreadEntryPoint(void *)
{
    // attaché the thread to the infrastructure of libcds
    cds::threading::Manager::attachThread() ;
    // Now we can use lock-free containers of libcds  
    // in this thread
    ...
   // Detach the thread from libcds
   cds::threading::Manager::detachThread() ;
   return 0;
}

Using URCU containers of the libcds library is simple: just declare a container object with the URCU gc, — and that’s it. All the specifics of working with URCU is hidden inside the URCU specialization of the container. No external synchronization when accessing such container is required.

UPD: Oops! “_No external synchronization when accessing such container is required_” — I was wrong. Actually, some methods of some URCU containers do require prior entering a read-side critical section. As a rule, these are methods of removal (extraction) of a container element. URCU can provide the capability of returning a pointer to the element found by key. This capability is a rare exception in the world of lock-free, in which returning a pointer is deathlike, as the element can be removed any time by a concurrent thread. To work safely with the returned pointer to the element, we should be in the read-side critical section. So in this case, we should explicitly invoke access_lock, before calling the container method, and then access_unlock after finishing our work with the pointer. The exception-safe method is the scoped-lock use in a separate block of code.

The description of each method of the URCU container of the libcds library specifies the way to call this method – in a critical section or not.

If you decide to create your own container class based on the URCU implementation from libcds, you’d better know well the internal structure of the library’s URCU containers. In principle, there’s nothing too special about it: call gc::access_lock() when entering the method, and gc::access_unlock() when exiting it (gc is one of URCU implementations. For safe exceptions, you’d better use the scoped-lock techniques, instead of invoking functions). The only subtle point is element removal. The method of removal should also be in the read-side critical section, but physical removal, performed by calling gc::retire_ptr, should be carried out outside the read-side critical section. Otherwise, a deadlock is possible: inside, the gc::retire_ptr method can invoke synchronize.

Libcds defines URCU specializations for all set and map classes. URCU specializations for containers, such as «queue» and «stack» are not defined – they are not «almost read-only» containers, so URCU is not for them.

Lock-free Data Structures

Comments

  1. Interesting article! Here are a few comments:

    The following two comments apply to your introduction. Having no context on what you mean by «At first», I will assume that you are explaining the RCU technique as today’s state of the art. Else we specific reference would be needed.

    «search synchronization is not really necessary– when there are no writers, readers can work in parallel»

    -> this is slightly misleading. In RCU, readers can progress in parallel with updates.

    «At some safe moment in time, when there are no readers, we can replace the data structure with the modified copy. As a result, all subsequent readers will see changes made by the writer.»

    -> This is inaccurate. In RCU, publcation of the modified copy is performed while there are readers actively accessing the old one. We then wait until no reader can still possibly see the old copy before reclaiming its memory.

    «it’s quite easy to determine the safe moment in time for replacing data with the modified copy.»

    -> again, there appears to be confusion between publication and grace period guarantees.

    «As we know, having atomic operations, we don’t have to make a copy of the data, but can change the data structure “on the fly” along with reading it.»

    -> In RCU, a copy of the old data is done into a new memory area. Publication of that memory area is performed by storing to a word-aligned pointer. The store is an atomic (indivisible) operation, but it’s not what we would generally call an atomic operation in the sense of a LOCK prefix on Intel.

    «Then the «reader» becomes the thread that executes any operation, except removing the element from the data structure.»

    -> untrue. The writer thread(s) are responsible for both publication of new data, and for awaiting grace periods before reclaim.

    «The nThreadCtl most significant bit does not coincide with the g_nGlobalCtl most significant bit, which means that the reader entered the critical section after the grace period had started.»

    -> as you point out later in the article, the algorithm does 2 flip and wait. This matter of fact does not fit with the explanation given in the sentence above.

    «Calling flip_and_wait twice, that is, waiting twice for the grace period to complete»

    -> inaccurate. By doing flip_and_wait twice, we’re actually waiting for a single grace period (by definition).

    Also, I would appreciate if you could please credit the sources of the graph and diagram you use in your article.

    Thanks,

    Mathieu Desnoyers

3,751

Ropes — Fast Strings

Most of us work with strings one way or another. There’s no way to avoid them — when writing code, you’re doomed to concatinate strings every day, split them into parts and access certain characters by index. We are used to the fact that strings are fixed-length arrays of characters, which leads to certain limitations when working with them. For instance, we cannot quickly concatenate two strings. To do this, we will at first need to allocate the required amount of memory, and then copy there the data from the concatenated strings.