Lock-Free Data Structures. The Evolution of a Stack

In the previous articles I described the basis, on which lock-free data structures and basic algorithms of the lifetime management of elements of lock-free data structures are built. Actually, it was a prelude to the description of lock-free containers. But then I faced a problem of how to build the story. Describing the known algorithms would be quite boring, as there would be a lot of [pseudo-]code, plenty of details that are important but quite specific. After all, you can always find them in the references I provide in articles. What I wanted was to tell you an interesting story about exciting things. I wanted to show the development of approaches to designing concurrent containers.

Then the method of presentation should be the following: take some container type – a queue, map, hash map – and make a review of the currently known original algorithms for this container type. Where to start? That’s when the simplest data structure came to my mind. The stack.

What can you tell us about it? — You might say. This data structure is so primitive that there’s hardly anything to tell about.

Indeed, there’re not so many works about the implementation of the concurrent stack. But those available are dedicated more to approaches, than to the stack itself. But approaches are exactly what I am interested in.

Lock-Free Stack

Stack is probably the first data structure a lock-free algorithm was created for. It is believed that R. Kent Treiber was the first one to publish it his article in 1986. Although, there’s evidence that this algorithm was first described in the system documentation for IBM/360.

Historical Digression In general, Treiber’s article is a sort of Old Testament. Perhaps, it’s the first article about lock-free data structures. At least, I don’t know any earlier ones. It is still provided as a reference in many modern papers. Maybe, it’s a sort of a tribute to the founder of the lock-free approach.

The algorithm is so simple, that I will provide its adapted code from libcds (in case you’re interested, it’s an intrusive stack: cds::intrusive::TreiberStack):

// m_Top – the top of the stack
bool push( value_type& val )
{
    back_off bkoff;
    value_type * t = m_Top.load(std::memory_order_relaxed);
    while ( true ) {
        val.m_pNext.store( t, std::memory_order_relaxed );
        if (m_Top.compare_exchange_weak(t, &val, 
                  std::memory_order_release, std::memory_order_relaxed))       
           return true;
        bkoff();
    }
}
value_type * pop()
{
   back_off bkoff;
   typename gc::Guard guard; // Hazard pointer guard
   while ( true ) {
      value_type * t = guard.protect( m_Top );
      if ( t == nullptr )
         return nullptr ;  // stack is empty
      value_type * pNext = t->m_pNext.load(std::memory_order_relaxed);
      if ( m_Top.compare_exchange_weak( t, pNext, 
                  std::memory_order_acquire, std::memory_order_relaxed ))
           return t;
      bkoff();
   }
}

This algorithm has been examined in details many times (for example, here), so I won’t review it here. A brief description of the algorithm leads to using CAS atomic primitive for m_Top, until we get the desired result. Simple and quite primitive.

There are two interesting details:

Safe memory reclamation (SMR) is necessary in the pop method only, as it’s the only place we read m_Top fields in. We don’t read any m_Top fields in push (there’s no call by m_Top reference), so we don’t need to protect anything with Hazard Pointer. It’s interesting, as the SMR is usually required in all methods of the lock-free container class.
The mysterious bkoff object and its bkoff() call, in case if CAS is unsuccessful.

I’d like to dwell on bkoff.

Back-Off Strategies

Why is CAS unsuccessful? Obviously because some thread managed to change the value of m_Top, while reading the current m_Top value and trying to apply CAS. Thus, it is a typical example of concurrency. In case of high load, when N threads execute push/pop, just one of them will win; while other N – 1 will be wasting the processor time and interfere with each other on CAS (remember the MESI cache protocol).

How to offload the processor when such situation is detected? We can back off from the main task execution and do something useful, or just wait. That’s what back-off strategies are meant for.

We definitely won’t be able to “do something useful” in the general case, as we have no idea of the specific problem, so we can only wait. How to wait? We will reject the option with sleep(), as few operating systems can provide such small expectation timeouts. Besides, overhead of the context switching is too big, more than the execution of CAS.

In the academic world, the strategy of the exponential back-off is popular. The idea is really simple:

class exp_backoff {
    int const nInitial;
    int const nStep;
    int const nThreshold;
    int nCurrent;
public:
    exp_backoff( int init=10, int step=2, int threshold=8000 )
         : nInitial(init), nStep(step), nThreshold(threshold), nCurrent(init)
    {}
    void operator()()
    {
          for ( int k = 0; k < nCurrent; ++k )
              nop();
          nCurrent *= nStep;
          if ( nCurrent > nThreshold )
               nCurrent = nThreshold;
    }
    void reset() { nCurrent = nInitial; }
};

That is, we execute nop() in the loop, each time increasing the loop length. Instead of nop(), we can call something more useful, for example, a hint-instruction (if any) that tells the processor that it has time to execute its internal tasks (again, remember the MESI, the processor can have plenty of such tasks).

The problem with the exponential back-off is simple – it is hard to find good values for nInitial, nStep, nThreshold. These constants depend on the architecture and the task. In the above code, values for them are pulled out of a hat by default.

Therefore, in practice yield() back-off – the switch to another thread – will be quite a good choice for desktop processors and entry-level servers. Thus, we give time to another, more successful threads, in the hope that they will do what they should and will not interfere with us (and we with them).

Is there any use in applying back-off strategies? Experiments show that there is: applied in the necessary narrow place, a correct back-off strategy can give a huge performance gain of a lock-free container.

The considered back-off strategies help to solve the problem with the unsuccessful CAS, but they do not assist in the main task execution – the operation with the stack. Is there any way to combine push/pop and back-off, so that the back-off strategy would actively help to perform the operation?

Elimination back-off

Let’s take a look at the problem of the unsuccessful CAS in push/pop from another perspective. Why is CAS unsuccessful? Because another thread has changed m_Top. What does this another thread do? It executes push() or pop(). Now, note that push/pop are complementary operations for the stack: if one thread executes push(), and another one executes pop() at the same time, there’s absolutely no point in calling the top of the m_Top stack, as a push-thread could simply pass its data to a pop-thread, without violating LIFO — the main feature of a stack. We should just figure out how to bring these two threads together and bypass the top of the stack.

In 2004, Hendler, Shavit and Yerushalmi proposed a modification of Treiber’s algorithm, in which the task of transferring data between push- and pop-threads without the top of the stack involved is solved by using a special back-off strategy that they called the elimination back-off strategy.

There is an elimination array of size N (N is a small number). This array is a member of the stack class. When CAS fails, going to the back-off, the thread creates a descriptor of its operation (push or pop) and randomly selects an arbitrary cell of the array. If the cell is empty, it writes a pointer to its descriptor and executes a regular back-off, for example, exponential. In this case, the thread can be called passive.

If the selected cell already contains a pointer to the descriptor of the P operation of some other (passive) thread, the thread (let’s call it active) checks what kind of operation it is. If push and pop operations are complementary, they are mutually eliminated.

If the active thread performs push, it writes its argument to the P descriptor, thus passing it to the pop operation of the active thread;
If the active thread performs pop, it reads the argument of the complementary push operation from the P descriptor.

Then the active thread marks the record in the elimination array cell as processed, so that the passive thread leaving the back-off could see that someone has magically fulfilled its job. As a result, the active and the passive threads will perform their operations without accessing the top of the stack.

If the selected elimination array cell has the same operation (push-push or pop-pop situation), we’re out of luck. In this case, the active thread executes a regular back-off and then tries to execute its push/pop in a usual manner, — CAS tops of the stack.

As for the passive thread, after finishing with the back-off, it checks its descriptor in the elimination array. If the descriptor has the mark that the operation is completed, that is, there’s another thread with a complementary operation, the passive thread successfully completes its push/pop.

All the above actions are performed in the lock-free manner, without any locks. Therefore, the actual algorithm is more complex than the described one, but the meaning is the same.

Descriptor contains the operation code – push or pop, – and the operation argument. In case of push it’s the pointer to the element to be added. In case of pop the pointer field remains empty (nullptr). In case of successful elimination, a pointer to the element of the eliminating push operation is written into it.

Elimination back-off can significantly unload the stack on high load. In case if the load is low, when the CAS of the top of the stack is almost always successful, such a scheme does not introduce any overhead at all. The algorithm requires fine tuning, which is to select the optimal size of the elimination array, depending on the task and the actual load. We can also suggest an adaptive version of the algorithm, when the size of the elimination array changes within a small range in the course of operation, adjusting to the load.

In the case of imbalance, when there are plenty of push and pop operations – a lot of push without pop, then plenty of pop without push – the algorithm will not give any noticeable gain. There should not be much loss in performance either, in contrast to the classical Treiber’s algorithm.

Flat combining

So far, we have been talking about the lock-free stack. Let’s take a look at a regular lock-based stack:

template  class LockStack {
    std::stack m_Stack;
    std::mutex      m_Mutex; 
public:
    void push( T& v ) {
        m_Mutex.lock();
        m_Stack.push( &v );
        m_Mutex.unlock();
    }
    T * pop() {
        m_Mutex.lock();
        T * pv = m_Stack.top();
        m_Stack.pop()
        m_Mutex.unlock();
        return pv;
    }
};

Obviously, its performance will be quite low during concurrent execution: all calls to the stack are serialized on the mutex, everything is executed strictly sequentially. Is there any way to improve performance?

If N threads access the stack at the same time, just one of them will acquire the mutex, the rest will wait for its release. But instead of passively waiting on the mutex, the waiting threads could announce their operations, like in the elimination back-off. And the winner-thread (the mutex owner) could perform tasks from its fellows, in addition to its own job. This idea was the bases of the flat combining approach described in 2010 and developing to this day.

Flat Combining

We can describe the idea of flat combining the following way. A mutex and a publication list of the size proportional to the number of threads that work with the stack are linked with each data structure, the stack in our case. At the first access to the stack, each thread adds its record, located in the thread local storage (TLS), to the list of publications. When performing the operation, the thread publishes a request in its record — a push or pop operation and its arguments – and tries to acquire the mutex. If the mutex is acquired, the thread becomes a combiner. It scans the publication list, collects all requests from it, executes them (in case with the stack, we can apply the considered above elimination), then writes the result into the elements of the publication list and finally releases the mutex. If the attempt to acquire the mutex failed, the thread spins on its publication for the combiner to execute its request and place the result into the publication record.

The publication list is built in such a way as to reduce the overhead of managing them. The key point is that the publication list rarely changes, otherwise we’ll get a situation where the lock-free publication list is attached to the stack, which will hardly speed up the stack. Requests to the operation are added into the already existing record of the publication list. Reminding you that the record is the property of threads and is located in the TLS. Some records of the list can have the “empty” status, meaning that the corresponding thread is not currently executing any actions with the data structure (the stack). From time to time the combiner decimates the publication list and excludes long inactive records (therefore, the list records should have some timestamp), thereby reducing the time of scanning the list.

Flat combining is a very general approach that has been successfully extended to complex lock-free data structures and allows to combine different algorithms: for example, add the elimination back-off to the implementation of the flat-combining stack. The implementation of flat combining in C++ in a shared library is also quite an interesting task. The thing is that the publication list in research papers is generally an attribute of each container object, which can be too costly in real life, as each container should have its own record in the TLS. It would be nice to have a more general implementation with a single publication list for all objects of flat combining, but it is important not to lose speed.

History Develops in a Spiral

Interestingly, the idea of publishing the operation dates back to the beginning of the study of lock-free algorithms. In early 1990s, attempts were made to create common methods of converting traditional lock-based data structures to lock-free ones. From the standpoint of theory, these attempts are successful, but in practice we get low and heavy lock-free algorithms. The idea of these general approaches was to publish the operation before the execution, so that concurrent threads could see it and help to perform it. In practice, such “help” was more of a difficulty, as threads performed the same operation concurrently, jostling and interfering with each other.

It was worth rearranging accents – from the active help to the passive delegation of work to another, more successful thread – and we got a quick flat combining method.

Summary

It’s amazing that such a simple data structure as stack, seeming to be nothing special to write about, has allowed to demonstrate so many interesting approaches to the development of concurrent data structures!

Back-off strategies are applicable everywhere in the design of lock-free algorithms. As a rule, each operation is enclosed in an endless loop on the principle of “keep doing it till we succeed”. At the end of the loop body, that is precisely when the iteration fails, it is a good practice to place a back-off that can reduce pressure on critical data of a container under heavy load.

Elimination back-off is a less general approach that is applicable to a stack, and, to a lesser extent, to a queue.

Developed in recent years, flat combining is a special trend in building concurrent containers, both lock-free and fine grained lock-based.

Hope we’ll meet these techniques in the future.

### Lock-free Data Structures

Comments

John Leidegren 1:55AM on Wednesday, Feb 25, 2015

devnull In this case, lock-free implies that we never block/suspend the execution of the thread/process doing the work. The basic premise is that lock acquisition is expensive. Let’s assume you’re doing work on a modern OS. That OS will use preemptive multithreading. In that situation a thread may enter a wait state. Effectively going into hibernation. This is not free but generally and probably not your average performance bottleneck. Lock-free/wait-free attempts to minimize this overhead.