It's been a long time since we last talked about lock-free containers. I was planning to quickly write another article about queues, but there was one problem: I knew what I wanted to write about, but I had no C++ implementation of those methods. It felt wrong to write about something I had not tried myself, so I set out to implement the new queue algorithms in libcds. Now I can continue the series, and this article will be the last one in it about queues.
Last time we considered some interesting lock-free queue algorithms and looked at how they perform in synthetic tests. To sum up, the results were really bad. Our hope that the lock-free approach, built on the magic of compare-and-swap (CAS), would give us at least some, if not linear, performance gain as the number of threads grows did not come true. Queues are not scalable. Why?
The answer to this question lies in the architecture of modern processors. The CAS primitive is quite a heavy instruction that loads the cache and the internal synchronization protocol. During active use of CAS over the same cache line, the processor is mostly busy maintaining cache coherency. You can read more about this in the article Memory Barriers: a Hardware View for Software Hackers by Paul McKenney.
Stacks and queues are extremely unfriendly to the lock-free approach because they have a small number of entry points. A stack has just one such point, its top; a queue has two, the head and the tail. CASes compete for access to these points, and performance drops even though the processor is 100% loaded: all threads are busy doing something, but only one of them wins access to the tail/head. Does this ring a bell? It's a classic spin-lock! So, using the lock-free approach on CASes, we got rid of external synchronization by a mutex and got internal synchronization at the level of processor instructions instead. We did not win much.
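To make the spin-lock analogy concrete, here is a minimal sketch (my illustration, not libcds code) of a Treiber-style lock-free stack push; a queue's tail CAS loop looks much the same:

```cpp
#include <atomic>

// Illustrative only: the single entry point problem. Under contention,
// only one thread's CAS on `top` succeeds per round; the rest retry on
// the same cache line, which is a spin-lock in disguise.
struct Node {
    int value;
    Node* next = nullptr;
};

std::atomic<Node*> top{nullptr};

void push(Node* n) {
    n->next = top.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current top into
    // n->next, so the loop body is empty: we just spin until we win.
    while (!top.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed))
        ;
}
```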
Bounded Queues
We should note that the above concerns unbounded MPMC (multiple producer/multiple consumer) queues. Queues bounded in the number of elements are usually built on top of an array, and for them the problem may be less pronounced thanks to careful distribution of queue elements over the array. We can also build faster algorithms for queues with a single producer and/or a single consumer.
Researchers are still interested in the problem of efficient implementation of concurrent queues. In recent years it has become clear that the various tricks used in brute-force lock-free queue implementations do not give the expected result, and other approaches need to be thought up. Let's take a look at some of them.
Flat combining
I have already described this method in the article about the stack, but flat combining is a universal technique, so it can be applied to queues as well.
Array of Queues
The method that seems to lie on the surface is actually quite difficult to implement and has a lot of hidden pitfalls. It was introduced in the article titled "Replacing Competition with Cooperation to Achieve Scalable Lock-Free FIFO Queues", which I think is the best one on the current subject.
The idea is really simple: build an array of size K instead of a single queue. Each array element is a lock-free queue represented by a singly-linked list. A new element is added to the next (modulo K) array slot. Thereby we spread the contention on the queue tail: there are K tails instead of one, so we can expect linear scalability up to K parallel threads. For this we need a common atomic, monotonically increasing counter of push operations, so that each insertion is multiplexed to its slot in the array. Of course, this counter becomes the single "sticking point" for all push threads. But the authors argue (and with good reason, I must say) that the xadd atomic-add instruction (an increment in our case) on the x86 architecture is somewhat faster than CAS, so we can expect to win here. On other architectures atomic fetch_add is emulated with CAS, so the prize will be less impressive.
The code for removing an element is similar. There is an atomic counter of removed elements (the pop-counter), which selects the array slot modulo K. To avoid violating the main queue property, FIFO, each element carries an additional tag (a ticket), which is the value of the push counter at the moment the element was added; in effect, the element's serial number. During removal, we search the slot for the element whose ticket equals the pop-counter value. The element found is the result of the pop() operation.
The way the algorithm handles removal from an empty queue is quite interesting. pop() starts by atomically incrementing the counter of removed elements (the pop-counter) and then searches the corresponding slot. The slot can be empty, which means the entire queue is empty as well. But the pop-counter has already been incremented, and using it further would violate FIFO and could even lose elements. We cannot roll it back, as other threads may be adding or removing elements at that very moment ("no rollback" is an integral property of the lock-free approach). That's why, when pop() hits an empty queue, the array is declared invalid, and the next insertion creates a new array with new push and pop counters.
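To illustrate the scheme, here is a deliberately simplified sketch with hypothetical names; I stress that it is not the authors' code. The per-slot lists are guarded by plain mutexes purely to keep the example short, and in-flight pushes, the invalidation machinery, and memory reclamation are glossed over. What it does show is the ticket dispatch via fetch_add:

```cpp
#include <atomic>
#include <cstddef>
#include <list>
#include <mutex>
#include <utility>

template <typename T, std::size_t K>
class QueueArraySketch {
    struct Item {
        std::size_t ticket; // push-counter value at insertion time
        T value;
    };
    struct Slot {
        std::mutex lock; // stands in for the paper's lock-free list
        std::list<Item> items;
    };
    Slot slots_[K];
    std::atomic<std::size_t> push_count_{0};
    std::atomic<std::size_t> pop_count_{0};
    std::atomic<bool> valid_{true};

public:
    void push(T v) {
        // xadd/fetch_add: the single shared "sticking point" for producers
        std::size_t t = push_count_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[t % K];
        std::lock_guard<std::mutex> g(s.lock);
        s.items.push_back(Item{t, std::move(v)});
    }

    bool pop(T& out) {
        // the pop-counter cannot be rolled back once incremented
        std::size_t t = pop_count_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[t % K];
        std::lock_guard<std::mutex> g(s.lock);
        for (auto it = s.items.begin(); it != s.items.end(); ++it) {
            if (it->ticket == t) { // FIFO is preserved via tickets
                out = std::move(it->value);
                s.items.erase(it);
                return true;
            }
        }
        // Empty slot: in the real algorithm the whole array is declared
        // invalid here and rebuilt by the next push.
        valid_.store(false, std::memory_order_relaxed);
        return false;
    }
};
```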
Unfortunately, the authors did not address (as they say, due to lack of space) the problem of memory reclamation, giving it just a few sentences with a shallow description of applying the Hazard Pointer scheme to their algorithm. I have not succeeded in deciphering their hints, so there is no implementation of this algorithm in the libcds library. Besides, the algorithm is prone to unlimited accumulation of removed elements if the queue is never empty, that is, if array invalidation never happens, since removal of elements from a slot's list is not provided: pop() just finds the element with the current ticket in the corresponding slot, but physical removal does not happen until the entire array is invalidated.
To sum up, the algorithm is interesting, but its memory management is not described explicitly enough; additional study is needed.
Segmented Queues
Another way to increase queue scalability is to relax its basic property: first in, first out (FIFO). Is that scary? It depends on the task: FIFO compliance is mandatory for some and quite acceptable, within limits, for others.
We definitely don't want to violate FIFO in a big way; otherwise we get a data structure called a pool, which keeps no order at all: pop() returns an arbitrary element, and it's no queue. For a queue with relaxed FIFO, we'd like some guarantee on the violation. For example: pop() returns one of the first K elements. With K = 2 and a queue containing A, B, C, D, …, this means pop() returns either A or B. Such queues are called segmented, or K-segmented, to emphasize the factor K, the bound on the FIFO violation. Obviously, K = 1 gives a fair FIFO queue.
As far as I know, the segmented queue was first considered in 2010 in a work dedicated to allowable relaxations of the requirements imposed on lock-free data structures. The internal structure of the queue is quite simple (see picture 5 from the above article, K = 5):
A queue is a singly-linked list of segments. Each segment is an array of K slots. Head and Tail point to the first and the last segments in the list. The push() operation inserts a new element into a random free slot of the tail segment; the pop() operation extracts an element from a random occupied slot of the head segment. Clearly, the smaller the segment size K, the less the FIFO property is violated; with K = 1 we get a strict queue. So, by varying K, we can control the degree of FIFO violation.
Each segment slot in this algorithm can be in one of three states: free, occupied (contains an element), and "garbage" (the element has been read from the slot). Note that pop() converts a slot into the "garbage" state, which is not equivalent to "free": "garbage" is a slot's final state, and no value can be written into such a slot. This is quite a drawback of the algorithm, since slots cannot be reused, and new segments get allocated even under such a typical pattern as alternating push()/pop() calls. This drawback was fixed in another work at the cost of significant code complication.
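The slot state machine can be sketched like this (illustrative names of my own; segment-list management, the random probing order, and memory reclamation are all omitted):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One segment with the three slot states described above:
// nullptr = free, a live pointer = occupied,
// a sentinel = "garbage" (read and never reusable).
template <typename T, std::size_t K>
struct Segment {
    std::atomic<T*> slots[K];

    Segment() {
        for (std::size_t i = 0; i < K; ++i)
            slots[i].store(nullptr, std::memory_order_relaxed);
    }

    static T* garbage() {
        // low-bit sentinel; never dereferenced
        return reinterpret_cast<T*>(static_cast<std::uintptr_t>(1));
    }

    // free -> occupied; fails if the slot is occupied or already garbage
    bool try_push(T* item, std::size_t i) {
        T* expected = nullptr;
        return slots[i].compare_exchange_strong(
            expected, item, std::memory_order_release);
    }

    // occupied -> garbage; the slot is now dead and cannot be reused,
    // which is exactly the drawback fixed in the later work
    T* try_pop(std::size_t i) {
        T* cur = slots[i].load(std::memory_order_acquire);
        if (cur == nullptr || cur == garbage())
            return nullptr; // free or already consumed
        if (slots[i].compare_exchange_strong(
                cur, garbage(), std::memory_order_acq_rel))
            return cur;
        return nullptr; // another consumer won the race
    }
};
```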
Let’s Look Into It
Thus, libcds gained implementations of two new queue algorithms: FCQueue and SegmentedQueue. Let's take a look at their performance and decide whether they were worth the effort. Unfortunately, the server I ran the previous tests on crashed, so I had to run them on another, less high-end one: a 2 x 12 core AMD Opteron at 1.5 GHz with 64 GB of memory, running Linux and almost idle (95% idle).
I've also changed the way results are visualized: instead of test execution time, the Y axis now shows millions of operations per second (MOp/s). A reminder: the test is of the classical producer/consumer type, with 20 million operations in total, 10 M pushes and 10 M pops, without any payload emulation. All tests of lock-free queues use Hazard Pointer for safe memory reclamation.
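For reference, the metric is computed in the obvious way; the timing number in this tiny example is made up, purely to show the arithmetic:

```cpp
#include <cstdio>

// MOp/s = total operations / elapsed seconds / 1e6
double mops(double ops, double seconds) { return ops / seconds / 1e6; }

int main() {
    // e.g. 20 million operations (10 M push + 10 M pop) in 2.5 seconds
    std::printf("%.1f MOp/s\n", mops(20e6, 2.5)); // prints 8.0 MOp/s
}
```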
In the new units (MOp/s), the chart from the previous article looks like this:
But what are we aiming at? What do we want to get when talking about scalability?
Perfect scalability is a linear increase of MOp/s as the number of threads grows, provided the hardware can accommodate that many threads. Any increase of MOp/s is a good result. If performance falls as the number of threads grows, the algorithm is either non-scalable or scalable only up to a point.
Now the results for intrusive queues. A reminder: an intrusive container stores a pointer to the user's data rather than its copy, as in STL; this way we avoid allocating memory for copies of elements, which is considered bad form in the lock-free world.
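A tiny illustration of the difference, with hypothetical names rather than the actual libcds interface:

```cpp
#include <queue>

// An intrusive container links the user's own objects through a hook
// embedded in them; an STL-like container copies the element into nodes
// it allocates itself.
struct Hook {
    Hook* next = nullptr;
};

struct MyData {
    int payload = 0;
    Hook hook; // the intrusive container links objects through this field
};

// intrusive push: no allocation, we just wire up the caller's object
void intrusive_push(Hook*& head, MyData& d) {
    d.hook.next = head;
    head = &d.hook;
}

// STL-like push: the container allocates a node and copies the element
void stl_like_push(std::queue<MyData>& q, const MyData& d) {
    q.push(d); // copy-constructed into the container's own storage
}
```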
We can see that Flat Combining was worth implementing: this technique shows a very good result compared to the other algorithms. It does not provide performance growth, but performance does not fall either. Moreover, it barely loads the processor, since just one core works, whereas the other algorithms show 100% CPU load with many threads. The segmented queue is the outsider here. This may be due to specifics of the implementation: memory allocation for segments (the segment size is 16 in this test), the impossibility of reusing segment slots, which leads to constant allocations, and the list of segments implemented with boost::intrusive::slist under a lock (I tried two lock types, spin-lock and std::mutex, with almost identical results).
I hoped that false sharing was the main obstacle here. In the implementation of the segmented queue, a segment is an array of pointers to elements; with a segment size of 16, a segment occupies 128 bytes, i.e., two cache lines shared by unrelated slots. With many threads working on the first and last segments, false sharing can show itself. That's why I added an option to the algorithm: the required padding. With padding equal to the cache-line size (64 bytes), the segment grows to 16 * 64 = 1024 bytes, but false sharing is excluded. Unfortunately, it turned out that padding barely affects SegmentedQueue performance. Perhaps the reason is that the slot-search algorithm is probabilistic and produces many failed attempts to read already-consumed cells, so cache-line traffic remains high anyway; or false sharing is simply not the major obstacle for this algorithm, and the actual reason is still to be found.
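For clarity, here is what the padding option changes in the layout (hypothetical types; the atomics are omitted for brevity):

```cpp
#include <cstddef>

// Without padding, sixteen 8-byte slot pointers share two cache lines;
// with one slot per 64-byte cache line, neighbours no longer interfere,
// at the cost of a 16 * 64 = 1024-byte segment.
struct PackedSegment {
    void* slots[16]; // 128 bytes: two cache lines shared by all 16 slots
};

struct PaddedSlot {
    alignas(64) void* ptr; // one cache line per slot
};

struct PaddedSegment {
    PaddedSlot slots[16]; // 1024 bytes, no false sharing between slots
};

static_assert(sizeof(PaddedSegment) == 16 * 64,
              "each slot occupies its own cache line");
```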
Despite the loss, there's one encouraging thing: SegmentedQueue does not show a performance drop as the number of threads increases. This inspires hope that this class of algorithms has a future; it just needs a different, more efficient implementation.
The situation is similar for STL-like queues, which create a copy of the element:
Just out of interest, take a look at a bounded queue result: Dmitry Vyukov's array-based algorithm, intrusive implementation:
With 2 threads (one consumer and one producer), this queue shows a MOp/s figure too big for the chart. As the number of threads increases, performance degrades. In its defense, we can say that false sharing may also be an obstacle for the Vyukov queue, but libcds provides no cache-line padding option for it.
For Fans of the Strange
I came across an unusual Linux server with two IBM Power8 processors at 3.42 GHz, 10 cores each; each core can run up to 8 hardware threads, giving 160 logical processors in all. Take a look at the results of the same tests on it:
Intrusive queues:
STL-like queues:
As you can see, there are no significant differences, except one: the segmented queue results are for segment size K = 256, the K value that turned out to be the best on this server.
I'd also like to note that the above tests give consumers and producers no payload: all we want is to push into the queue and pop from it. Real tasks always have some load, since we do something with the data read from the queue and prepare data before putting it in. You would expect payload to degrade performance, but practice shows that is not always the case: I have often observed payload leading to a significant growth of MOp/s. To my mind, the reason is that it relieves the internal cache-synchronization protocol (again, see P. McKenney's article). That's why algorithms should be tested on real tasks.
Today we've touched on only a small part of the work dedicated to queues, and unbounded ones only, i.e., without a limit on the number of elements. As I have said, the queue is one of researchers' favorite structures, maybe precisely because it is non-scalable. There are also plenty of works on bounded queues, work-stealing queues used in task schedulers, single-producer or single-consumer queues (their algorithms are specialized for one producer or one consumer, which makes them simpler and/or much more efficient), and so on.
News from the world of libcds
Version 1.6.0 of the library has been released since the last article. It introduced the Flat Combining technique and SegmentedQueue, fixed some bugs, and revised SkipList and EllenBinTree. The repository for the coming 2.0 version has moved to GitHub. Version 2.0 is dedicated to the transition to C++11, cleaning up the code, and unifying interfaces, which will break backward compatibility (meaning some entities will be renamed), as well as to removing workarounds for old compilers.
Despite the fact that queues and stacks are not the friendliest structures for the lock-free approach and modern processors, I am pleased to bring the articles about them to a logical conclusion.
Next we are going to talk about other interesting structures: associative containers (set/map), namely hash maps, with their beautiful algorithms. I hope they will be more scalable.
Talk to you soon…