Lock-free Data Structures. Memory Model. Part 3.

In the previous article we looked inside the processor, even if it’s hypothetic. We have clarified that for the proper parallel code execution we should prompt the processor to what limits he is allowed to execute its internal read/write optimizations. These prompts are memory barriers. They allow regulating memory access to some extent (or rather the cache as the processor interacts with the outer world through cache only). The “weight” of such regulation can be different. Every architecture can provide the entire set of barriers “for choice”. Using these or those barriers we can build different memory models. It’s the set of warranties which will be executed for our programs.

In the article we’ll talk about C++11 memory model.

Historical Mini Review

At first developers didn’t publish an open specification of processor memory model. That is a set of rules according to which a weakly-ordered processor works with the memory. I think they were hoping to gain some space for a maneuver in future (why would they limit themselves in architecture development undertaking some obligations?). But then they let the genie out of the bottle and gigacycles hit the ceiling. So developers started to introduce multi cores everywhere. As a result the actual multithreading has been let out to the multitudes.

The first ones to sound the alarm were operational system developers, as they have to maintain multi cores in the core, and there are no specifications of the weakly ordered architecture. Then the other standards committee joined. The programs are becoming more and more paralleled, it’s time to standardize the language memory model, provide some warranties for concurrent multithread execution, but there are now specifications for processors memory models. As a result, almost all modern processor architectures have open memory model specifications.

Since olden times C++ is famous for its capability to express low-level things in a high level manner. During C++11 memory model development the task was not to break this quality, meaning to give the programmers the maximum flexibility. Having analyzed existing (at that moment) memory models of other languages (mainly Java) and typical examples of the synchronization primitives’ internal structure and lock-free algorithms, developers have introduced three memory models: - sequential consistency - acquire/release semantics - relaxed memory ordering

All these memory models are defined in C++ by one list – std::memory_order. It has the following 6 constants:

memory_order_seq_cst – points to the sequential consistent model
memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_consume – refer to the acquire/release semantics based model
memory_order_relaxed – points to the relaxed model

Before we review all of these models, we should determine the way the memory model is indicated in a program. We should refer to atomic operations one more time. The operations I’ve introduced in the article about atomicity have almost nothing in common with operations defined in C++11. That’s because of the standards — needful memory_order is indicated as an argument of an atomic operation. There are two reasons:

Semantic: as a matter of fact, ordering (memory barrier) refers to an atomic operation we execute. Barriers placement in the read/write-approach is kind of a magic as the barrier itself is not connected with the code it exists in in any way. It’s just an instruction among equivalents. Besides that, read/write barriers placement depends on the architecture a lot.
Practical: Intel Itanium has a special, different from other architectures, method of indicating memory ordering for read/write instructions and RMW operations. The order type in Itanium is indicated as an optional flag of the instruction itself: acquire, release or relaxed. But there are no individual instructions (memory barriers) provided to indicate acquire/release semantics within the architecture. There is only an instruction of a heavy full memory barrier.

Here is how atomicoperations actually look like. Each specialization of std::atomic class should have the following methods at the least:

void store(T, memory_order = memory_order_seq_cst);
T load(memory_order = memory_order_seq_cst) const;
T exchange(T, memory_order = memory_order_seq_cst);
bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst);
bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst);

Separate Memory Barriers

Of course, С++11 has also free separate functions for the memory barrier:

void atomic_thread_fence(memory_order);
void atomic_signal_fence(memory_order);

atomic_thread_fence allows to actually use the approach of separate read/write barriers, which is acknowledged out-of-date. Though memory_order order methods do not provide a way to set either read-barrier (Load/Load) or write-barrier (Store/Store). atomic_signal_fence is intended to be used in the signal handler. As a rule, this function doesn’t generate any code, but is a compiler barrier.

As you can see, the memory model in C++11 is sequential consistency by default. That’s exactly what we will talk about. But at first I would like to say a few words about compiler barriers.

Compiler Barriers

Who can reorder the code, written by us? We have found out that the processor can do it. But there is one more source of reordering. It’s a compiler. Many heuristics and optimization methods have been developed (and are still being developed) in the assumption (not obvious, maybe) of a one-thread execution. It’s quite difficult (even theoretically impossible) for the compiler to understand that your code is multithreaded. So it needs hints – barriers. Such barrier says to the compiler: “Do NOT relocate (mix) the code before the barrier with the code after the barrier, and vice versa”. Compiler barrier doesn’t generate any code. Compiler barrier for MS Visual С++ is a pseudo function _ReadWriteBarrier() (I was always at a loss by this name: a complete association with read/write memory barrier – the heaviest memory barrier). For GCC and Clang it’s a smart __asm__ __volatile__ ( “” ::: «memory» ) construction. It’s worth noting that assembly __asm__ __volatile__ ( … ) insertions are also a sort of barrier for GCC/Clang. The compiler doesn’t have the right to either discard or relocate them up/down the code (as you can see by __volatile__ modifier). memory_order constants impact the C++ supporting compiler to the same extent it impacts the processor — they are the barrier of the compiler, limiting its reordering (i.e. optimization) facilities of our code. So there’s no need to set special compiler barriers. Of course, if compiler completely supports a new standard.

Sequential consistency

Suppose, we have implemented a lock-free stack (it’s the most primitive of lock-free data structures), compiled it and now testing. We get a core file. What’s wrong? We’ll be looking for an error, running in our mind (none of debuggers will help us here) the line-by-line implementation of our lock-free stack, trying to emulate multithreading and answer the question: “Which fatal combination of executing K line of 1 thread and at the same time N line of 2 thread has led to the failure?” Perhaps, we’ll find and fix some errors, but our lock-free stack will keep falling. Why? It turns out that the things we do trying to find an error and comparing in our mind the lines of the program for concurrently executed threads, is called a sequential consistency. It’s a strict memory model which assumes that the processor executes everything in the exact order which is stated in the program. For example, for such code:

// Thread 1
atomic a, b ;
a.store( 5 );
int vb = b.load();

// Thread 2
atomic x,y ;
int vx = x.load() ;
y.store( 42 ) ;

any execution scenarios are acceptable for sequential consistency, except the ones that swap a.store / b.load and x.load / y.store. Note, that in this code I don’t set memory_order-argument directly in the load/store. I rely on the default argument value. The same warranty is expanded on the compiler: it’s forbidden to reorder our code so that operations after memory_order_seq_cst would be relocated above this barrier. And vice versa, the operations before seq_cst-barrier can’t be shifted under the barrier. sequential consistency model is instinctively close to human beings, but it has quite a significant drawback. It’s too strict for modern processors. It leads to the heaviest memory barriers, not allowing the processor to apply speculative execution to the full extent. That’s why a new C++ standard has accepted the following compromise:

Accept sequential consistency as a default model for atomic operations, due to its strictness and understandability
Introduce weaker memory barriers to C++, which allow to implement the potential of modern architectures

A model, based on acquire/release semantics has been offered as an addition to sequential consistency.

Acquire/release Semantics

As you could notice from the title, these semantics is somehow connected with resource acquiring/releasing. It’s true indeed. Resource acquire is reading it from memory to the register, release is writing from the register to memory.

load memory, register ;
membar #LoadLoad | #LoadStore ; // acquire-барьер

// Operation within acquire/release-sections
...

membar #LoadStore | #StoreStore ; // release-barrier
store regiser, memory ;

As you can see, we went without #StoreLoad heavy barrier application. Acquire-barrier, as well as release-barrier are half-barriers. Acquire doesn’t order the previous store-operations with the following load/store. Release doesn’t order the previous load with the following ones, as well as the previous store with the following load. All of this applies to both compiler and processor. Acquire/release are barriers for the entire code, which is between acquire/release. Some operations before the acquire-barrier can leak (can be reordered by processor of compiler) into acquire/release-sections. As well as some operations after the release-barrier can be shifted upwards (once again, by processor or compiler), inwards the acquire/release-section. But the operations enclosed inside acquire-release won’t go beyond its limits.

I guess spin-lock is the easiest example of acquire/release-semantics application.

Lock-free and spin-lock

It may seem strange that I’m providing an example of lock algorithm in the article of lock-free algorithms series. This should be explained. I am not a fan of pure lock-free, no way. Yes, pure lock-free (especially wait-free) algorithm makes me glad. It gladdens me even more when I manage to implement it (that doesn’t always happen). I am the follower of the pragmatic approach: everything efficient is good. So I am fine with applying the locks where they can benefit. Spin-lock can give a considerable benefit comparing with general mutex if it “guards” a very small piece of code – few assembly instructions. Spin-lock is also an inexhaustible source of different interesting optimizations.

That’s how the easiest spin-lock on acquire/release looks like (C++ experts will state that a special atomic_flag should have been used for spin-lock implementation. But I will build the spin-lock on an atomic variable (not even boolean). It’s more clear from the point of view of this article):

class spin_lock 
{
    atomic m_spin ;
public:
    spin_lock(): m_spin(0) {}
    ~spin_lock() { assert( m_spin.load(memory_order_relaxed) == 0);}

    void lock()
    {
        unsigned int nCur;
        do { nCur = 0; }
        while ( !m_spin.compare_exchange_weak( nCur, 1, memory_order_acquire ));
    }
    void unlock()
    {
        m_spin.store( 0, memory_order_release );
    }
};

Note that the thing which disturbs me a lot in this code is that compare_exchange accepts its first argument by reference and modifies it if CAS is unsuccessful. So I have to use do-while with the nonempty body. I use acquire-semantics in the lock method and release-semantics in the unlock method (by the way, acquire/release semantics stems from synchronization primitives. Developers of the standard carefully analyzed the implementation of different synchronization primitives and derived an acquire/release pattern). As I have already mentioned, stated in this case barriers do not allow the code between lock() and unlock() to leak outside. And that’s exactly what we need! m_spin variable atomicity guarantees that no one will manage to acquire the lock, while m_spin=1. And again, that’s what we need! We can see that I use compare_exchange_weak in the algorithm. But what is it?

Weak and Strong CAS

As you remember, processors architecture can refer to one of the two classes. Either the processor implements atomic CAS (compare-and-swap) primitive, or LL/SC (load-linked/store-conditional) pair. LL/SC pair allows to implement atomic CAS, but it’s not atomic itself due to many reasons. One of them is that the code, being executed within LL/SC, can be interrupted by the operational system. For example, at this moment exactly OS decides to push the current thread out. Thereafter, store-conditional won’t response later, after renewal. Our CAS will return false, though the real reason of this false can be not because of the data, but because of the outer event – a thread interruption. This exact consideration pushed developers of the standard to add two compare_exchange primitives – weak and strong. These primitives are named accordingly — compare_exchange_weak and compare_exchange_strong. The weak version can fail, i.e. return false, even if the current variable value is equal to the expected one. So any weak CAS can break CAS semantics and return false, while true should actually be returned (but not vice versa!) Strong CAS can’t do that, it just strictly follows CAS semantics. Of course, it can be worth something. So when the weak and when the strong CAS should be applied? I derived the following rule: if CAS is used in the loop (it’s the basic pattern of CAS usage) and there’re no tons of operations within the loop (i.e. the loop body is light and simple), then I use compare_exchange_weak. Otherwise – the strong one compare_exchange_strong.

Memory Order for Acquire/Release Semantics

As I’ve mentioned above, the following memory_order definitions are determined for acquire/release semantics:

memory_order_acquire
memory_order_consume
memory_order_release
memory_order_acq_rel

For reading(load) memory_order_acquire and memory_order_consume are possible For writing(store) – memory_order_release only. Memory_order_acq_rel is acceptable for RMW operations only – compare_exchange, exchange, fetch_xxx. To be honest, atomic RMW primitive can have memory_order_acquire acquire semantics, memory_order_release release semantics or both memory_order_acq_rel. These constants determine semantics for RMW operations, as read-modify-write primitive concurrently executes atomic read and write. Semantically RMW operation can be considered as either acquire-load or release-store, or as both. It is possible to define RMW operation semantics in the algorithm only, where it’s applied. The parts, somehow similar to spin-lock, can be distinguished in lock-free algorithms. At first we acquire some resource, do something (usually calculate new value) and finally release new resource value. If resource acquisition is executed by a RMV operation (usually CAS), such operation quite likely has an acquire-semantics. If a new value is executed by a RMW primitive, it quite likely has release-semantics. “Quite likely” is used not without the purpose. A detailed algorithm analysis is required before we can understand which semantics fits the RMW-operation. But if RMW-primitive is executed separately (it’s impossible to allocate acquire/release pattern), three variants for semantics are possible: - memory_order_seq_cst – RMW-operation is the key algorithm link. The code reordering, load and store shifting upwards/downwards can lead to an error. - memory_order_acq_rel – is somehow similar to memory_order_seq_cst, but RMW-operation is located inside the acquire/release-section - memory_order_relaxed – RMW-operation shifting (its load and store parts) upwards/downwards the code (for example, within the acquire/release section, if the operation is located inside such section) doesn’t lead to errors

All these pieces of advice should be apprehended as just a try to lay down some basic principles using this or that semantics on a RMW primitive. A detailed analysis should be performed for each algorithm.

Consume-Semantics

There is a separate, weaker sort of acquire-semantics. It’s a consume-semantics of reading. This semantics has been introduced as a “tribute to the memory” of DEC Alpha processor. Alpha architecture has a significant difference from other modern architectures. It could break data dependency. In the following code example:

struct foo {
    int x;
    int y;
} ;
atomic pFoo ;

foo * p = pFoo.load( memory_order_relaxed );
int x = p->x;

It could reorder p->x reading and p acquiring (don’t ask me how it’s possible. It’s one of Alpha features. I haven’t worked with Alpha, so I can neither confirm, nor disprove it). In order to prevent such reordering consume-semantics was introduced. It’s applicable to atomic reading of the pointer to the structure, and then follows the structure fields reading. In the given example pFoo pointer should be read as:

foo * p = pFoo.load( memory_order_consume );
int x = p->x;

Consume semantics is somewhere between the relaxed and acquire semantics of reading. In most modern architectures it’s reflected on relaxed-read.

CAS Again

I provided atomicinterface with two CAS – weak and strong. But there are actually two more variants of CAS. With an additional memory_order argument:

bool compare_exchange_weak(T&, T, memory_order successOrder, memory_order failedOrder );
bool compare_exchange_strong(T&, T, memory_order successOrder, memory_order failedOrder );

But what kind of argument is failedOrder? Let’s remember that CAS is read-modify-write primitive. It executes atomic reading even in case of failure. failedOrder argument determines the semantics of this reading in case of CAS failure. The same values as for common read are supported. “Semantics in case of failure” is rarely required in practice. It depends on the algorithm, of course!

Relaxed Semantics

Finally, the third type of atomic operations model. Relaxed semantics, which is applicable to all atomic primitives – load, store, all RMW, — and which applies almost no restrictions. Therefore it allows the processor to reorder to the almost full extent, demonstrating its great strength. Why almost? First of all, standard requirement is to keep atomicity of relaxed operations. Meaning that even a relaxed operation should be atomic, with no partial effects. Secondly, speculative writing is denied for atomic relaxed writing. These requirements can apply restrictions on the implementation of atomic relaxed operations in some weakly ordered architectures. For example, relaxed load of atomic variable is implemented as load.acq on Intel Itanium (acquire-read, don’t mix up with Itanium acquire with C++ acquire).

Requiem on Itanium

I’ve often mentioned Intel Itanium in my articles. It may seem that I am the fan of Intel architecture, which seems to be slowly dying. I am not its fan, but… Itanium VLIW architecture somehow differs from other by the principle of command system build. Memory ordering is indicated as a suffix of load/store/RMW instructions. You won’t find it in modern architectures. Even used acquire and release terms put an idea into my mind that C++11 might be copied from Itanium. Remembering the history, Itanium is the architecture (or its child) which all of us would use, unless AMD had introduced AMD64 — a 64-bit extension of x86. At that time Intel was unhurriedly developing a new architecture for 64-bit calculations. It hazily hinted at this new architecture. From those hints, we could understand that the desktop Itanium was waiting for us. By the way, the Microsoft Windows port and Visual C++ compiler for Itanium indirectly proof that fact (has anyone ever seen an operating Windows on Itanium?). But AMD destroyed Intel’s plans and the latter had to catch up, integrating 64 bit into x86. Itanium stayed in the server segment, where it was slowly dying away not getting proper resources for development. Meanwhile, Itanium with its set of instructions in a “very long word” (VLIW – very long instruction word) is still an interesting and a breakthrough process. The things modern processors execute by themselves (load executive blocks, reorder operations) were entrusted to the compiler in Itanium. But compilers didn’t cope with the task and generated (still generating) not enough optimized code. As a result, Itanium performance fell several times. So Itanium is our unimplemented future. Who knows, maybe it’s too early to write a requiem?

Happens-Before, Synchronized-With and other Relations

The one familiar with C++11 standard will ask: ”Where are the relations determining semantics of atomic operations: happened before, synchronized with and other?” I’ll reply: ”In the standard”. A good review is provided in a book by Anthony Williams, C++ Concurrency in Action, chapter five. You’ll find plenty of detailed examples there. Developers of the standard have carried out a very important task. They derived the rules for C++ memory model. These rules described not the placement of memory barriers, but warranties of threads interaction. As a result a compact axiomatic specification of C++ memory model appeared. Unfortunately, it’s extremely difficult to apply these relations in practice due to the great number of variants required to be considered in order to prove memory_order indication correctness in more or less complex lock-free algorithm. That’s why sequential consistency is the default model. It doesn’t require to set any special memory_order arguments for atomic operations. As I’ve already mentioned, this model is the most braking. Applying weaker models – acquire/release or relaxed – requires algorithm verification.

UPD: I’ve read somewhere that the last statement isn’t accurate. Indeed, sequential consistency itself doesn’t guarantee anything. You can even write crap with its help. So I’m clarifying that lock-free algorithm verification is always required for any memory model. But in case of weak models – it’s especially required.

Lock-Free Algorithms Verification

I knew just one verification method until recently — relacy library by Dmitriy Vyukov. Unfortunately, this method requires building a special model. In fact, a lite lock-free model should be built in the terms of relacy library as a first step (Why lite? Because when building a model you usually throw away all the useful things that don’t relate to the algorithm under consideration). This model should be debugged. After that only you can write a production version of the algorithm. This approach perfectly fits for developers of lock-free algorithms and data structures. That’s actually it for developers. But implementers do not usually like such a two-staged approach (due to natural laziness) as they need it right here right now, asap. I guess the author of relacy was aware of the drawback (no irony – is indeed a breakthrough project in its niche). He had an idea of building a verification method into the part of a standard library. That was, you wouldn’t have to make any additional model. It seems somehow similar to safe iterators concept in STL. A new ThreadSanitizer instrument has been recently introduced by Dmitriy and his colleagues from Google. This instrument allows to check data race existence in the program. So it’s actually a proper use of reordering in atomic operations. Moreover, this instrument is built not into STL, but even deeper – into compiler (Clang 3.2, and then in GCC 4.8). ThreadSanitizer use is quite simple. You just have to compile a program with specific keys, perform a test run and then enjoy extensive logs analysis. I am going to apply this instrument in my libcds library in order to make sure that libcds is okay.

“I don’t Get It” — Criticizing the Standard

Yes, I do dare to criticize C++11 standard! I just don’t understand why the standards want the semantics to be set as an argument of an atomic operation. It would be much more logical to use templates and do something like:

template 
class atomic {
    template 
    T load() const ;

    template 
    void store( T val ) ;

    template 
    bool compare_exchange_weak( T& expected, T desired ) ;

   // and so forth, and so on
};

I’ll explain why I think that this is more correct. As I’ve mentioned more than once, atomic operations semantics impacts not the processor only, but compiler as well. Semantics is an optimization [half]barriers for the compiler. Besides that, the compiler should monitor an atomic operation to be assigned with proper semantics (for example, release-semantics was applicable to reading). So semantics should be determined at the compiling stage. I can’t imagine what compiler would do in the following code:

extern std::memory_order currentOrder ;
std::Atomic atomicInt ;
atomicInt.store( 42, currentOrder ) ;

Formally, this code doesn’t contradicts C++11 standard. Nevertheless, the only things compiler can do are the following: - either give an error, — but then why to allow the atomic operations interface leading to an error? - or apply sequential consistent semantics – for reasons of “not to make it worse”. But then currentOrder variable is ignored and the program gets issues we wanted to avoid. - or generate a switch/case for all possible currentOrder values. In this case we’ll get an inefficient code instead of one or two assembly instructions. The problem of proper semantics isn’t solved as well. You can call release-read or acquire-write.

While template approach has no such drawbacks. In the template functions we should define the compilation time constant exactly from memory_order list. Yes, atomic operations call can be a bit awkward:

std::Atomic atomicInt ;
atomicInt.store( 42 ) ;
// or even like that:
atomicInt.template store( 42 ) ;

But such awkwardness is compensated by advantages of the template approach, — unambiguity of semantics operation indication during compilation. The only explanation of C++11 approach coming to my mind is compatibility with C. Besides std::atomic class, C++11 standard introduces free C atomic functions as atomic_load, atomic_store, etc. At the times when C++11 was just planned, I implemented atomic primitives in template terms. Then I decided that I should follow the standards and build another libcds for C++11 atomic operations interface.

The End of Basics

This article is the end of Basics.

[original source]

Comments

RussianBydlo 4:53AM on Sunday, May 18, 2014

What’s wrong with you, guys? Where is a link to the original? habrahabr.ru/company/ifree/blog/197520/

Kukuruku Hub 2:42PM on Sunday, May 18, 2014

We’ve added the link. Thanks for pointing that out.

MikeDRush 10:39AM on Sunday, May 18, 2014

What is a coincidence? Original authors just want to publish their article here