C++ Memory Model: Migrating From X86 to ARM
Over the past years, we have seen a shift in processors from the previously dominant x86 architecture to the more energy-efficient (and often cheaper) ARM architecture. This trend is true for both consumer hardware, e.g., Apple’s M1 SoC, and also server hardware, e.g. AWS’s Gravitron processor. Given that the ARM architecture has a more relaxed memory model, this might have some subtle impact on C++ programs.
So far, ArangoDB was only built for x86-based CPUs, but due to the rise of ARM processors, we are faced with an ever-increasing number of requests from users to support ARM. Since performance is at the heart of ArangoDB, there are a lot of places where we use atomic operations. In order to ensure that our code behaves correctly on ARM, we have to revisit all those places and make sure that the operations use the correct memory order. But what does that mean?
With the introduction of multi-threaded execution in C++11, the standard also introduced a memory model. For many developers, this memory model is one of the least well-understood parts of the C++ standard and a common source of confusion. While an in-depth explanation of the C++ memory model is beyond this article’s scope, we will cover some basics, as well as take a brief look at two of the most common hardware memory models and how the C++ memory model maps to these hardware models.
But first, let’s take a step back and consider why we need a memory model in the first place.
A memory model defines what happens whenever we access the memory. Given a potentially concurrent program, the memory model defines the possible values a read might return when a write by one thread becomes visible to other threads, as well as the final values of each location in memory.
The memory model can be refined to differentiate between the programming language memory model and the hardware memory model.
- Language memory model: Defines the optimizations, memory access rewrites, and reorderings a compiler is allowed to perform when transforming a program into code.
- Hardware memory model: Defines the optimizations and memory access reorderings a specific hardware architecture is allowed to perform while executing the machine code.
These optimizations can cause memory accesses to be executed or perceived in orders that differ from what is defined in the source code, leading to distinguishing between the following four orderings:
- Source code order: The order in which the memory operations are specified in the source code
- Program order: The order in which the memory operations are specified in the machine code. Based on the language memory model, the program order can differ from the source code order because compilers can reorder instructions as part of the optimization process.
- Execution order: The order in which the individual memory-reference instructions are executed on a given CPU. The execution order can differ from the program order due to optimizations based on the specific CPU implementation’s hardware memory model (e.g., out-of-order execution).
- Perceived order: The order in which a CPU perceives its and other CPUs’ memory operations. The perceived order can differ from the execution order due to caching, interconnect, and memory-system optimizations defined by the hardware memory model. On some architectures, different CPUs can perceive the same set of memory operations as occurring in different orders.
The sequentially consistent memory model is the simplest memory model and describes how most people intuitively expect memory to behave. A natural view of a multi-threaded program’s execution is as follows: we repeatedly choose a random thread and execute the next step in that thread’s execution until the program terminates. This is effectively equivalent to taking all steps of all threads in order and interleaving them in some way, resulting in a single total order of all steps. Therefore, whenever an object is accessed, the last value stored to the object in this order is retrieved. An execution that can be understood as such, interleaving is referred to as sequentially consistent. This more formal definition of the sequentially consistent model was first proposed by Leslie Lamport [Lamport79].
Due to the strong guarantees sequential consistency provides, it is much easier to reason about the correctness of some code. That is why most publications on concurrent algorithms or data structure assume a sequentially consistent memory model. And for the same reason, all atomic operations in C++ default to sequential consistency unless specified otherwise (we will discuss this in more detail later).
Unfortunately, ensuring sequential consistency is quite expensive, and none of today’s processor architectures provide a fully sequentially consistent memory model. While they allow enforcing sequential consistency at certain points, normal execution depends highly on the specific architecture’s implementation.
The Intel x86 memory model is one of the strongest models amongst today’s modern CPU implementations. For a long time, the information provided by Intel and AMD on their respective x86 architecture implementations was mostly informal, missing concrete examples, and sometimes even inconsistent with the actual implementation.
Sewell et al. [Sewell2010] formally described a new memory model called x86-TSO (Total Store Order), which is consistent with the concrete examples in Intel and AMD’s latest documentation available at that time. This model is illustrated here:
x86-TSO block diagram from [Sewell2010].
As can be seen, the hardware threads interact with a storage subsystem represented by the dotted box. The storage subsystem comprises a shared memory that maps addresses to values, one write buffer per hardware thread (also referred to as ‘store buffer’), and a global lock to indicate when a particular hardware thread has exclusive access to memory. The complete formal definition can be found in [Sewell2010], but the main points are:
- The store buffers are FIFO, and a reading thread must read its own most recent buffered write, if there is one, to that address. Otherwise, reads are satisfied from shared memory. Since writes are buffered, the new value is not visible to other threads until it has propagated to the shared memory.
mfenceinstruction flushes the store buffer of that thread, ensuring that the writes become globally visible.
- To execute a
lock, a thread must first acquire the global lock. At the end of the instruction, it flushes its store buffer and releases the lock. While the lock is held by one thread, no other thread can read. This essentially means that
lockenforces sequential consistency.
lock‘d instructions are read-modify-write instructions with a
lockprefix for atomicity, like
lock xadd(atomic fetch-and-add) or
lock cmpxchg(atomic compare-and-swap).
- A buffered write from a thread can propagate to the shared memory at any time except when some other thread holds the lock.
x86-TSO does not permit local reordering except for reads after writes to different addresses.
It is important to note that x86-TSO is a highly simplified model that does not fully describe actual CPU implementations by Intel or AMD. However, this simplified model is consistent with the concrete examples in Intel and AMD’s latest documentation when the model was defined. So, it should be possible to model any effect that is observable in x86 CPUs with x86-TSO.
ARM and POWER
POWER is an architecture developed by IBM. Even though it is not as widely known as ARM, POWER CPUs are not only used in servers and supercomputers but also in many consumer devices like satellite receivers or game consoles (e.g., Nintendo Wii, Xbox 360, PlayStation 3). Although ARM and POWER are completely different architectures, their memory models are quite similar. In particular, both have considerably more relaxed memory models, allowing a wider range of hardware optimizations. Maranget et al. [Maranget2012] provide a detailed and extensive description of both architectures and their observable behaviors.
While memory order relaxations can improve performance, power efficiency, and hardware complexity, it makes the life of a programmer, who is implementing concurrent data structures, significantly harder. In contrast to TSO models, the following behaviors are possible on these architectures:
- Hardware threads can perform reads and writes out-of-order or even speculatively, i.e., before preceding conditional branches have been resolved. Any local reordering is allowed unless otherwise specified.
- The memory system does not guarantee that a write becomes visible to all other hardware threads simultaneously (this behavior is called non-multicopy-atomicity).
It can be helpful to think of each hardware thread as effectively having its own copy of memory as illustrated:
ARM/POWER storage subsystem from [Maranget2012].
This collection of memories and their interconnect (i.e., everything except the threads) is usually referred to as the storage subsystem. A write by one thread may propagate to other threads in any order, and the propagations of writes to different addresses can be interleaved arbitrarily unless they are constrained by barriers or cache coherence. One can also think of barriers as propagating from the hardware thread that executed them to each of the other threads.
Since a particular ordering of instructions is crucial for the simplest non-blocking data structures, these architectures provide various memory barriers and dependency guarantees that the programmer must use correctly to enforce a desired appropriate ordering of memory operations.
dbm and POWER
sync barrier (fence) instructions can enforce the following orderings between two instructions:
- Read/Read: The barrier ensures that they are satisfied and committed in program order.
- Read/Write: The barrier ensures that the read is satisfied and committed before the write can be committed (and thus propagated and become visible to others).
- Write/Write: The barrier ensures that the first write is committed and has propagated to all other threads before the second write is committed.
- Write/Read: The barrier ensures that the write is committed and has propagated to all other threads before the read is satisfied.
In addition to barriers, these architectures provide the following (implicit) dependencies to enforce orderings:
- Address Dependency: There is an address dependency from a read to a program-order-later read or write if the value read by the first instruction is used to compute the address of the second instruction.
- Control Dependency: There is a control dependency from a read to a program-order-later read/write if the value read by the first instruction is used to compute the condition of a conditional branch that is program-order-before the second instruction.
- Data Dependency: There is a data dependency from a read to a program-order-later write if the value read by the first instruction is used to compute the value written by the second instruction.
The ARMv8 architecture has been revised and now has a multicopy-atomic model. It has also been simplified in other respects, including more straightforward notions of dependency, and the architecture now includes a formal concurrency model. More details are available in [Pulte2017].
The C++ Memory Model
The C++ standard provides a formal definition of an abstract machine with its execution model. Before C++11, this execution model was purely sequential. C++11 redefined this execution model to support multi-threaded executions and introduced the memory model as a common ground between the programmer, the runtime library, the compiler, and the hardware.
Together with the memory model, C++11 also introduced the concept of a data race. A data race occurs when we have two conflicting actions, at least one of which is not atomic, and neither ‘happens-before’ the other. Two actions conflict if both access the same memory location and at least one of them is a write. Any such data race results in undefined behavior. We will discuss in a minute what this ‘happens before’ part means exactly.
The C++ standard contains a paragraph commonly referred to as the ‘as-if’ rule. It states that ‘conforming implementations are required to emulate (only) the observable behavior of the abstract machine.’ Basically, this means implementation is free to disregard any requirement of the standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. This rule is the basis for virtually all compiler optimizations because it effectively allows any and all code transformations that do not change the program’s observable behavior.
The observable behavior of a multi-threaded program is defined by means of the happens-before relationship, which can be roughly described as follows:
Let A and B represent operations performed by a multi-threaded process. If A happens-before B, then the memory effects of A effectively become visible to the thread performing B before B
For C++, the standard defines ‘happens-before’ as follows:
An evaluation A happens before an evaluation B if:
- A is sequenced before B, or;
- A inter-thread happens before B.
Sequenced before is an anti-symmetric, transitive, pair-wise relation between instructions executed by a single thread, which induces a partial order among those evaluations [C++17, 4.6.15, p. 14]. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. So effectively sequenced before matches the source code order as previously defined.
Inter-thread-happens-before describes a happens-before order between two operations from different threads.
An evaluation A inter-thread happens before an evaluation B if:
A synchronizes with B, or;
A is dependency-ordered before B, or;
We have some arbitrary concatenation of sequenced before, synchronizes with and dependency-ordered before relationships.
There are two minor exceptions to the last case of arbitrary concatenations, but that is beyond our scope here. For more details see [C++17, 18.104.22.168, pp. 15-19].
An inter-thread happens-before relation can be established via high-level synchronization objects like
std::mutex or via atomic synchronization operations. Since we are focusing on the memory model, we only look at the latter. Each atomic operation takes a parameter of type
std::memory_order, which is an enumeration type with the following values:
memory_order_seq_cst is the strongest order and used for sequential consistency. This is the default memory order for all operations, so unless specified otherwise, all atomic operations are sequentially consistent. As previously explained, there is a single total order S of all sequentially consistent operations. An operation B that performs a load on an object M will observe the result of the last modification A of M that precedes B in S [C++17, 32.4.3, p. 1347].
memory_order_acquire can only be used for operations that perform a read,
memory_order_release can only be used for operations that perform a write, and
memory_order_acq_rel can only be used for operations that perform a read-modify-write operation. Some implementations check these constraints at runtime (e.g., MSVC in debug builds).
memory_order_consume is used in conjunction with dependency-ordered before. However, the C++17 standard contains the following note:
memory_order_acquire, which provides stronger guarantees than
memory_order_consume. Implementations have found it infeasible to provide performance better than that of
memory_order_acquire. Specification revisions are under consideration.
For that reason, we will only consider the remaining memory orders and the synchronize-with relation.
A synchronize-with relation is established by using acquire/release operations. An atomic operation A that performs a store-release operation on an atomic object M synchronizes with an atomic operation B that performs a load-acquire operation on M and observes the value written by A, or some value from any side effect in the release sequence headed by A. This synchronize-with order is compatible with the inter-thread-happens-before order, i.e., if A synchronizes with B, then A inter-thread-happens-before B. Release sequences are beyond our scope here, so we will only consider the simple case where B observes the value written by A. For more details on release sequences see [C++17, 22.214.171.124, p. 16].
memory_order_relaxed is the weakest order and does not provide any ordering guarantees. In particular, relaxed atomic operations are not synchronization operations, as they do not affect any assignment’s visibility to other threads. However, of course, relaxed atomic operations are still fully atomic, so a relaxed
fetch_add is still guaranteed to increase monotonically without any lost or duplicate updates.
There are also synchronization operations without an associated memory location, which are called fences, but these are beyond this post’s scope.
As previously explained, virtually all compiler optimizations are based on the so-called ‘as-if’ rule. In particular, the compiler is free to reorder instructions as long as the observable behavior is not affected.
One might assume that the definition of sequenced-before relation effectively prevents the compiler from performing any instruction reordering. After all, it states that ‘if A is sequenced before B, then the execution of A shall precede the execution of B.’ Nevertheless, the sequenced-before relation is only defined between instructions executed by a single thread, so as long as these changes are not observable by some other thread (without a data race!), the compiler may freely reorder these instructions. For example, the compiler can freely reorder two relaxed atomic write operations. Remember that the perceived order can differ from the execution order? Since the two operations are relaxed, there is no guarantee in which order they become visible, so they can just as well be reordered.
Essentially, it boils down to this: the compiler is free to reorder operations as long as the observable behavior is retained. Moreover, since for multi-threaded execution the observable behavior is defined by means of the (inter-thread) happens-before relation, the compiler can freely reorder operations that are not ordered by such an (inter-thread) happens-before relation.
As a rule of thumb, an operation cannot be reordered after a release-store or before an acquire-load, because the acquire/release operations can potentially establish a happens-before relation, and such reorderings would spoil the transitivity. However, any correctness argument about concurrent code using relaxed atomics should be based on happens-before relations and not on possible reorderings, because if or when operations can be reordered is merely a result of applying the happens-before rules.
It can be quite difficult to correctly apply for these more relaxed memory orders. And to make matters worse, incorrect usage often does not cause any obvious issues. Consider this example:
This code contains a data race because the initialization of
x in Thread A and the read operation on
x in Thread B is not ordered by a happens-before relation. Nevertheless, this code will probably work on x86. In particular, if the initialization of
x is executed before the
node.store, then the x86 model guarantees that
n->x in Thread B will observe the value written in the initialization, so the assertion is guaranteed to hold. This is because all load/store operations on x86 have acquire/release semantic, even if the C++ code uses
memory_order_relaxed. There is no such guarantee on ARM, so the value returned by
n->x in Thread B is undefined.
It will only probably work on x86 because the memory model not only relates to the hardware memory model but also interacts with the compiler.
The pointer to the newly created node is written with
memory_order_relaxed, which does not provide any ordering guarantees. In particular, it can never establish an inter-thread-happens before relation (ignoring the possibility of combining it with an explicit
atomic_thread_fence, which is beyond this article’s scope). For that reason, the compiler is well in its right to assume that no thread may access that node yet, because doing so would send us to the land of undefined behavior. So there is no need to guarantee that the node is already fully initialized at the time we store the pointer, and the compiler may reorder the code as if it had been written as follows:
Obviously, such optimization may also cause the assertion to fail on x86. But, because such optimization for this example is rather unlikely, chances are that the incorrect code runs flawlessly on your x86 machine, resulting in a false sense of security.
The correct way to fix this is by using acquire/release operations:
Once Thread B observes the value written by Thread A, the acquire-load synchronizes with the release-store, thereby establishing an inter-thread-happens-before relation. This means that all operations preceding the release-store must become visible to the thread performing the acquire-operation. In particular, this means that:
- The compiler must ensure that the initialization is executed before storing the pointer, i.e., the initialization code must not be reordered past the store.
- For weak hardware models, the compiler must generate the necessary instructions to enforce that ordering.
The best way to test whether code contains any data races is using ThreadSanitizer, or short TSan. ThreadSanitizer is yet another sanitizer available in GCC and Clang. If enabled during compilation, it adds instrumentation code to the binary. Typical slowdown introduced by ThreadSanitizer is about 5x-15x, and typical memory overhead is about 5x-10x. TSan can report a number of typical issues in concurrent code (like lock inversion), but more importantly, it also understands the C++ memory model, which allows it to detect data races. Essentially, TSan checks every single memory access operation for conflicting actions (as defined before). If such conflicting actions are not ordered by a happens-before relation, TSan reports a data race. Since these checks are performed at runtime, the fact whether a data race can be detected or not depends on the order in which the program’s code is executed — which TSan has no control over. So it is important to realize that TSan does not guarantee that it can find all data races. Even if the code runs hundreds of times without any warnings from TSan, it does not mean that the code is free of data races, just that TSan could not find any. An alternative to TSan are model checkers like CDSChecker, which exhaustively explore the behavior of code under the C/C++ memory model. But since these checkers essentially explore the complete state space, they are only feasible for small, self-contained examples. For large applications, TSan is still the best tool available to check code for data races.
However, there are some limitations. At the moment, TSan does not support explicit memory fences, nor the failure-order for
compare_exchange operations. Relying on either will almost certainly result in false positives. These are long-standing issues, but there is no roadmap for a fix for either of them to the best of my knowledge. However, dynamic data race analysis is still an active research topic (e.g., Dynamic Race Detection for C++11), but it is unclear if or when such improvements will find their way into the official TSan implementation.
Writing concurrent code that is correct and efficient can already be quite challenging, but throwing relaxed atomics into the mix adds yet another level of complexity. Intuitively, most people expect memory operations to behave sequentially consistent, and it can be mind-boggling to realize that real-world architectures don’t behave that way. Wrapping your head around this fact and coming to grips with how to deal with it can take quite a while.
It takes a lot of practice to use relaxed atomic operations correctly. When writing code with relaxed operations, try to reason about its correctness. To that end, think about the requirements:
- Which operations need to be ordered?
- Where do we need happens-before relations?
- How can these be established?
Even then, code should be run with ThreadSanitizer to ensure (as far as possible) that the assumptions hold and you did not miss anything.
‘It works on my machine’ is never a good argument, even less so for concurrent code using relaxed atomics!
This post is largely based on the white paper Memory Models for C/C++ Programmers, which provides more in-depth details, including those topics that were outside this article’s scope.
Credit: Source link