C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?


























C++11 introduced a standardized memory model, but what exactly does that mean? And how is it going to affect C++ programming?



This article (by Gavin Clarke, quoting Herb Sutter) says:




The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.




Well, I can memorize this and similar paragraphs available online (as I've had my own memory model since birth :P) and can even post as an answer to questions asked by others, but to be honest, I don't exactly understand this.



C++ programmers were developing multi-threaded applications even before C++11, so how does it matter whether those are POSIX threads, Windows threads, or C++11 threads? What are the benefits? I want to understand the low-level details.



I also get this feeling that the C++11 memory model is somehow related to C++11 multi-threading support, as I often see these two together. If it is, how exactly? Why should they be related?



As I don't know how the internals of multi-threading work, and what a memory model means in general, please help me understand these concepts. :-)

















      c++ multithreading c++11 language-lawyer memory-model










      asked Jun 11 '11 at 23:30 by Nawaz; edited Nov 23 '18 at 16:39 by Peter Mortensen
























          6 Answers

































          First, you have to learn to think like a Language Lawyer.



          The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.



          The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.



          Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.



          The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.



          Consider the following example, where a pair of global variables are accessed concurrently by two threads:



                     Global
          int x, y;

          Thread 1                    Thread 2
          x = 17;                     cout << y << " ";
          y = 37;                     cout << x << endl;


          What might Thread 2 output?



          Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".



          Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.



          But with C++11, you can write this:



                     Global
          atomic<int> x, y;

          Thread 1                    Thread 2
          x.store(17);                cout << y.load() << " ";
          y.store(37);                cout << x.load() << endl;


          Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).



          What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.
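
          For concreteness, here is the same example as a complete, runnable program. This is a minimal sketch of mine, assuming only standard C++11 headers; the thread bodies mirror the snippet above, and the lambdas and main() are just scaffolding:

              #include <atomic>
              #include <iostream>
              #include <thread>

              std::atomic<int> x(0), y(0);     // globals, zero-initialized as in the example

              int main() {
                  std::thread t1([] {          // Thread 1
                      x.store(17);             // sequentially consistent by default
                      y.store(37);
                  });
                  std::thread t2([] {          // Thread 2
                      std::cout << y.load() << " ";
                      std::cout << x.load() << std::endl;
                  });
                  t1.join();
                  t2.join();
                  // Possible outputs: "0 0", "37 17", "0 17" -- never "37 0".
              }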



          Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores, that is, if it requires atomicity but not ordering, and can therefore tolerate 37 0 as output from this program, then you can write this:



                     Global
          atomic<int> x, y;

          Thread 1                                    Thread 2
          x.store(17, memory_order_relaxed);          cout << y.load(memory_order_relaxed) << " ";
          y.store(37, memory_order_relaxed);          cout << x.load(memory_order_relaxed) << endl;


          The more modern the CPU, the more likely this is to be faster than the previous example.



          Finally, if you just need to keep particular loads and stores in order, you can write:



                     Global
          atomic<int> x, y;

          Thread 1                                    Thread 2
          x.store(17, memory_order_release);          cout << y.load(memory_order_acquire) << " ";
          y.store(37, memory_order_release);          cout << x.load(memory_order_acquire) << endl;


          This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)
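
          To see how this release/acquire mode is typically used in real code, consider the classic handoff of data through a flag. This is a hypothetical sketch of mine (the names payload, ready, producer, and consumer are made up), not part of the original example: the release store keeps the data write from moving below it, and the acquire load that sees the flag makes that write visible.

              #include <atomic>
              #include <cassert>
              #include <thread>

              int payload = 0;                     // plain, non-atomic data
              std::atomic<bool> ready(false);

              void producer() {
                  payload = 42;                                      // (1) write the data
                  ready.store(true, std::memory_order_release);      // (2) publish; (1) stays above (2)
              }

              void consumer() {
                  while (!ready.load(std::memory_order_acquire)) {}  // (3) spin until the flag is seen
                  assert(payload == 42);                             // guaranteed: (2) synchronizes-with (3)
              }

              int main() {
                  std::thread t1(producer), t2(consumer);
                  t1.join();
                  t2.join();
              }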



          Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).



          So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
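
          Since double-checked locking just came up, here is one commonly cited C++11 rendering of that pattern. This is a sketch under my own names (Widget, get_instance); in real C++11 code, a function-local static would give the same guarantee with less ceremony:

              #include <atomic>
              #include <mutex>

              struct Widget { /* ... */ };

              std::atomic<Widget*> instance(nullptr);
              std::mutex instance_mutex;

              Widget* get_instance() {
                  Widget* p = instance.load(std::memory_order_acquire);  // fast path: no lock
                  if (p == nullptr) {
                      std::lock_guard<std::mutex> lock(instance_mutex);  // slow path: serialize creators
                      p = instance.load(std::memory_order_relaxed);      // re-check under the lock
                      if (p == nullptr) {
                          p = new Widget;
                          instance.store(p, std::memory_order_release);  // publish the finished object
                      }
                  }
                  return p;
              }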



          Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.



          For more on this stuff, see this blog post.

























          • Nice answer, but this is really begging for some actual examples of the new primitives. Also, I think the memory ordering without primitives is the same as pre-C++0x: there are no guarantees. – John Ripley, Jun 12 '11 at 0:37

          • @John: I know, but I am still learning the primitives myself :-). Also I think they guarantee byte accesses are atomic (although not ordered), which is why I went with "char" for my example... But I am not even 100% sure about that... If you want to suggest any good "tutorial" references, I will add them to my answer. – Nemo, Jun 12 '11 at 0:39

          • @Nawaz: Yes! Memory accesses can get reordered by the compiler or CPU. Think about (e.g.) caches and speculative loads. The order in which system memory gets hit can be nothing like what you coded. The compiler and CPU will ensure such reorderings do not break single-threaded code. For multi-threaded code, the "memory model" characterizes the possible re-orderings, what happens if two threads read/write the same location at the same time, and how you exert control over both. For single-threaded code, the memory model is irrelevant. – Nemo, Jun 12 '11 at 17:08

          • @Nawaz, @Nemo – A minor detail: the new memory model is relevant in single-threaded code insofar as it specifies the undefinedness of certain expressions, such as i = i++. The old concept of sequence points has been discarded; the new standard specifies the same thing using a sequenced-before relation, which is just a special case of the more general inter-thread happens-before concept. – JohannesD, Jun 13 '11 at 13:14

          • @AJG85: Section 3.6.2 of the draft C++0x spec says, "Variables with static storage duration (3.7.1) or thread storage duration (3.7.2) shall be zero-initialized (8.5) before any other initialization takes place." Since x and y are global in this example, they have static storage duration and therefore will be zero-initialized, I believe. – Nemo, Jun 13 '11 at 20:16




























          I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
          The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.



          Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.



          Quoting from "A Primer on Memory Consistency and Cache Coherence"




          The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.




          That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That is, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.



          In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.
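
          To make the idea of "different memory images" concrete in C++11 terms, here is a hypothetical sketch (the names are mine): with relaxed atomics, each per-location history stays intact, yet a reader may observe the later store without the earlier one.

              #include <atomic>
              #include <thread>

              std::atomic<int> a(0), b(0);

              void writer() {                              // one thread
                  a.store(1, std::memory_order_relaxed);   // earlier in program order
                  b.store(1, std::memory_order_relaxed);   // later in program order
              }

              void reader() {                              // another thread
                  int rb = b.load(std::memory_order_relaxed);
                  int ra = a.load(std::memory_order_relaxed);
                  // rb == 1 && ra == 0 is a permitted outcome: this thread saw a memory
                  // image (new b, old a) that the writer never passed through.
              }

              int main() {
                  std::thread t1(writer), t2(reader);
                  t1.join();
                  t2.join();
              }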



          [Picture from Wikipedia]



          Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).



          The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).



          In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among "timelike" events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered. (Time in Physics, Craig Callender.)



          In the C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.



          To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"




          For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.



          Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.



          Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.
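
          The "update the data items, then the flag" idiom quoted above maps directly onto C++11 fences. A minimal sketch of mine, with std::atomic_thread_fence playing the role of the FENCE instruction (the variable names are made up):

              #include <atomic>
              #include <thread>

              int data[10];
              std::atomic<bool> flag(false);

              void update() {
                  for (int i = 0; i < 10; ++i)
                      data[i] = i;                                       // order among these is unconstrained
                  std::atomic_thread_fence(std::memory_order_release);   // all stores above complete first
                  flag.store(true, std::memory_order_relaxed);
              }

              void observe() {
                  if (flag.load(std::memory_order_relaxed)) {
                      std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
                      // all ten writes to data are now guaranteed to be visible here
                  }
              }

              int main() {
                  std::thread t1(update), t2(observe);
                  t1.join();
                  t2.join();
              }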




          Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:




          Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
          An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.




          Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.

























          • +1 for the analogy with special relativity; I've been trying to make the same analogy myself. Too often I see programmers investigating threaded code trying to interpret the behavior as operations in different threads occurring interleaved with one another in a specific order, and I have to tell them, nope, with multi-processor systems the notion of simultaneity between different frames of reference (er, threads) is now meaningless. Comparing with special relativity is a good way to make them respect the complexity of the problem. – Pierre Lebeaupin, Jun 26 '14 at 19:42

          • @Ahmed Nassar: the link you shared from Stanford is dead. – Joze, Apr 21 '15 at 12:02

          • @Joze: Thanks. I made it refer to the ACM library instead. It is still freely available elsewhere on the Web. – Ahmed Nassar, Apr 21 '15 at 18:55

          • So should you conclude that the Universe is multicore? – Peter K, Apr 28 '15 at 11:36

          • @PeterK: Exactly :) And here is a very nice visualization of this picture of time by physicist Brian Greene: youtube.com/watch?v=4BjGWLJNPcA&t=22m12s This is "The Illusion of Time [Full Documentary]" at minute 22 and 12 seconds. – Ahmed Nassar, Jul 19 '15 at 2:17



































          This is now a multiple-year-old question, but being very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model. I see no point in summing up the talk below in order to make this yet another full answer, but given that this is the guy who actually wrote the standard, I think it's well worth watching.



          Herb Sutter has a three-hour-long talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical, and covers the following topics:




          1. Optimizations, Races, and the Memory Model

          2. Ordering – What: Acquire and Release

          3. Ordering – How: Mutexes, Atomics, and/or Fences

          4. Other Restrictions on Compilers and Hardware

          5. Code Gen & Performance: x86/x64, IA64, POWER, ARM

          6. Relaxed Atomics


          The talk doesn't elaborate on the API, but rather on the reasoning, the background, and what goes on under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized loads efficiently?).

























          • That talk is indeed fantastic, totally worth the 3 hours you'll spend watching it. – ZunTzu, Aug 31 '15 at 12:50

          • @ZunTzu: on most video players you can set the speed to 1.25, 1.5 or even 2 times the original. – Christian Severin, Dec 15 '15 at 17:48

          • @eran do you guys happen to have the slides? The links on the Channel 9 talk pages do not work. – athos, Aug 30 '16 at 2:33

          • @athos I don't have them, sorry. Try contacting Channel 9; I don't think the removal was intentional (my guess is that they got the link from Herb Sutter, posted it as is, and he later removed the files; but that's just speculation...). – eran, Aug 30 '16 at 6:06

































          It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.



          When you're talking about POSIX threads or Windows threads, that's a bit of an illusion, as you're really talking about (for instance) x86 threads, since running concurrently is a hardware function. The C++0x memory model makes guarantees whether you're on x86, or ARM, or MIPS, or anything else you can come up with.

























          • Posix threads are not restricted to x86. Indeed, the first systems they were implemented on were probably not x86 systems. Posix threads are system-independent, and are valid on all Posix platforms. It's also not really true that it's a hardware property, because Posix threads can also be implemented through cooperative multitasking. But of course most threading issues only surface on hardware threading implementations (and some even only on multiprocessor/multicore systems). – celtschk, Aug 18 '13 at 19:56

































          For languages not specifying a memory model, you are writing code for the language and the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race is when it's possible for multiple cores / hyper-threads to access the same memory concurrently), then your program is not cross-platform, because of its dependence on the processor memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.



          Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross-platform way... So if you are using standard locks in a multithreaded program with no data races, then you don't have to worry about cross-platform memory models.
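
          For example, the "standard locks, no data races" style described above looks like this in C++11 and behaves identically on any conforming platform (a minimal sketch):

              #include <mutex>
              #include <thread>

              std::mutex m;
              long counter = 0;

              void work() {
                  for (int i = 0; i < 100000; ++i) {
                      std::lock_guard<std::mutex> lock(m);  // locks here, unlocks at end of scope
                      ++counter;                            // no data race: accesses are serialized
                  }
              }

              int main() {
                  std::thread t1(work), t2(work);
                  t1.join();
                  t2.join();
                  // counter == 200000 on x86, ARM, or anything else with a conforming compiler
              }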



          Interestingly, Microsoft's compilers for C++ have acquire/release semantics for volatile, which is a C++ extension to deal with the lack of a memory model in C++: http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows runs on x86/x64 only, that's not saying much (the Intel and AMD memory models make it easy and efficient to implement acquire/release semantics in a language).

























          • It is true that, when the answer was written, Windows ran on x86/x64 only, but Windows has run, at some point in time, on IA64, MIPS, Alpha AXP64, PowerPC and ARM. Today it runs on various versions of ARM, which is quite different memory-wise from x86, and nowhere near as forgiving. – Lorenzo Dematté, Dec 6 '16 at 10:12

          • That link is somewhat broken (says "Visual Studio 2005 Retired documentation"). Care to update it? – Peter Mortensen, Nov 5 '17 at 23:09

          • It was not true even when the answer was written. – Ben, Dec 2 '17 at 10:14

          • "to access the same memory concurrently" should read: to access the same memory in a conflicting way. – curiousguy, Jun 13 '18 at 23:22

































          If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.



          Now, if you used atomics, or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and provides portable fences for hand-coded guarantees.



          Previously, atomics would be done using compiler intrinsics, or some higher level library. Fences would have been done using CPU-specific instructions (memory barriers).
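
          To illustrate that shift, here is a small before/after sketch. The "before" lines, shown as comments, use GCC's legacy __sync builtins as one example of such compiler intrinsics; the "after" lines are the portable C++11 equivalents:

              #include <atomic>

              std::atomic<int> counter(0);

              void increment_and_fence() {
                  // Pre-C++11, GCC-specific:  __sync_fetch_and_add(&plain_counter, 1);
                  counter.fetch_add(1, std::memory_order_relaxed);      // portable atomic increment

                  // Pre-C++11 full barrier, GCC-specific:  __sync_synchronize();
                  std::atomic_thread_fence(std::memory_order_seq_cst);  // portable full fence
              }

              int main() {
                  increment_and_fence();
              }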























          • The problem before was that there was no such thing as a mutex (in terms of the C++ standard). So the only guarantees you were provided were by the mutex manufacturer, which was fine as long as you did not port the code (as minor changes to guarantees are hard to spot). Now we get guarantees provided by the standard, which should be portable between platforms. – Martin York, Jun 12 '11 at 0:09

          • @Martin: in any case, one thing is the memory model, and another are the atomics and threading primitives that run on top of that memory model. – ninjalj, Jun 12 '11 at 0:18

          • Also, my point was mostly that previously there was mostly no memory model at the language level; it happened to be the memory model of the underlying CPU. Now there is a memory model which is part of the core language; OTOH, mutexes and the like could always be done as a library. – ninjalj, Jun 12 '11 at 0:36

          • It could also be a real problem for the people trying to write the mutex library. When the CPU, the memory controller, the kernel, the compiler, and the "C library" are all implemented by different teams, and some of them are in violent disagreement as to how this stuff is supposed to work, well, sometimes the stuff we systems programmers have to do to present a pretty facade to the applications level is not pleasant at all. – zwol, Jun 12 '11 at 2:02

          • Unfortunately it is not enough to guard your data structures with simple mutexes if there is not a consistent memory model in your language. There are various compiler optimizations which make sense in a single-threaded context, but when multiple threads and CPU cores come into play, reordering of memory accesses and other optimizations may yield undefined behavior. For more information, see "Threads Cannot Be Implemented as a Library" by Hans Boehm: citeseer.ist.psu.edu/viewdoc/… – exDM69, Jun 13 '11 at 12:45












          protected by Nawaz Oct 3 '17 at 17:06



          Thank you for your interest in this question.
          Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



          Would you like to answer one of these unanswered questions instead?














          6 Answers
          6






          active

          oldest

          votes








          6 Answers
          6






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1901





          +250









          First, you have to learn to think like a Language Lawyer.



          The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.



          The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.



          Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.



          The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.



          Consider the following example, where a pair of global variables are accessed concurrently by two threads:



                     Global
          int x, y;

          Thread 1 Thread 2
          x = 17; cout << y << " ";
          y = 37; cout << x << endl;


          What might Thread 2 output?



          Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".



          Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.



          But with C++11, you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17); cout << y.load() << " ";
          y.store(37); cout << x.load() << endl;


          Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).



          What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.



          Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_relaxed); cout << y.load(memory_order_relaxed) << " ";
          y.store(37,memory_order_relaxed); cout << x.load(memory_order_relaxed) << endl;


          The more modern the CPU, the more likely this is to be faster than the previous example.



          Finally, if you just need to keep particular loads and stores in order, you can write:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_release); cout << y.load(memory_order_acquire) << " ";
          y.store(37,memory_order_release); cout << x.load(memory_order_acquire) << endl;


          This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)



          Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).



          So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.



          Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.



          For more on this stuff, see this blog post.






          share|improve this answer



















          • 31




            Nice answer, but this is really begging for some actual examples of the new primitives. Also, I think the memory ordering without primitives is the same as pre-C++0x: there are no guarantees.
            – John Ripley
            Jun 12 '11 at 0:37






          • 4




            @John: I know, but I am still learning the primitives myself :-). Also I think they guarantee byte accesses are atomic (although not ordered) which is why I went with "char" for my example... But I am not even 100% sure about that... If you want to suggest any good "tutorial" references I will add them to my answer
            – Nemo
            Jun 12 '11 at 0:39








          • 41




            @Nawaz: Yes! Memory accesses can get reordered by the compiler or CPU. Think about (e.g.) caches and speculative loads. The order in which system memory gets hit can be nothing like what you coded. The compiler and CPU will ensure such reorderings do not break single-threaded code. For multi-threaded code, the "memory model" characterizes the possible re-orderings, and what happens if two threads read/write the same location at the same time, and how you excert control over both. For single-threaded code, the memory model is irrelevant.
            – Nemo
            Jun 12 '11 at 17:08






          • 23




            @Nawaz, @Nemo - A minor detail: the new memory model is relevant in single-threaded code insofar as it specifies the undefinedness of certain expressions, such as i = i++. The old concept of sequence points has been discarded; the new standard specifies the same thing using a sequenced-before relation which is just a special case of the more general inter-thread happens-before concept.
            – JohannesD
            Jun 13 '11 at 13:14








          • 15




            @AJG85: Section 3.6.2 of the draft C++0x spec says, "Variables with static storage duration (3.7.1) or thread storage duration (3.7.2) shall be zero-initialized (8.5) before any other initialization takes place." Since x,y are global in this example, they have static storage duration and therefore will zero-initialized, I believe.
            – Nemo
            Jun 13 '11 at 20:16
















          1901





          +250









          First, you have to learn to think like a Language Lawyer.



          The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.



          The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.



          Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.



          The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.



          Consider the following example, where a pair of global variables are accessed concurrently by two threads:



                     Global
          int x, y;

          Thread 1 Thread 2
          x = 17; cout << y << " ";
          y = 37; cout << x << endl;


          What might Thread 2 output?



          Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".



          Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.



          But with C++11, you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17); cout << y.load() << " ";
          y.store(37); cout << x.load() << endl;


          Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).



          What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.



          Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_relaxed); cout << y.load(memory_order_relaxed) << " ";
          y.store(37,memory_order_relaxed); cout << x.load(memory_order_relaxed) << endl;


          The more modern the CPU, the more likely this is to be faster than the previous example.



          Finally, if you just need to keep particular loads and stores in order, you can write:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_release); cout << y.load(memory_order_acquire) << " ";
          y.store(37,memory_order_release); cout << x.load(memory_order_acquire) << endl;


          This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)



          Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).



          So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.



          Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.



          For more on this stuff, see this blog post.






          share|improve this answer



















          • 31




            Nice answer, but this is really begging for some actual examples of the new primitives. Also, I think the memory ordering without primitives is the same as pre-C++0x: there are no guarantees.
            – John Ripley
            Jun 12 '11 at 0:37






          • 4




            @John: I know, but I am still learning the primitives myself :-). Also I think they guarantee byte accesses are atomic (although not ordered) which is why I went with "char" for my example... But I am not even 100% sure about that... If you want to suggest any good "tutorial" references I will add them to my answer
            – Nemo
            Jun 12 '11 at 0:39








          • 41




            @Nawaz: Yes! Memory accesses can get reordered by the compiler or CPU. Think about (e.g.) caches and speculative loads. The order in which system memory gets hit can be nothing like what you coded. The compiler and CPU will ensure such reorderings do not break single-threaded code. For multi-threaded code, the "memory model" characterizes the possible re-orderings, and what happens if two threads read/write the same location at the same time, and how you excert control over both. For single-threaded code, the memory model is irrelevant.
            – Nemo
            Jun 12 '11 at 17:08






          • 23




            @Nawaz, @Nemo - A minor detail: the new memory model is relevant in single-threaded code insofar as it specifies the undefinedness of certain expressions, such as i = i++. The old concept of sequence points has been discarded; the new standard specifies the same thing using a sequenced-before relation which is just a special case of the more general inter-thread happens-before concept.
            – JohannesD
            Jun 13 '11 at 13:14








          • 15




            @AJG85: Section 3.6.2 of the draft C++0x spec says, "Variables with static storage duration (3.7.1) or thread storage duration (3.7.2) shall be zero-initialized (8.5) before any other initialization takes place." Since x,y are global in this example, they have static storage duration and therefore will zero-initialized, I believe.
            – Nemo
            Jun 13 '11 at 20:16














          1901





          +250







          1901





          +250



          1901




          +250




          First, you have to learn to think like a Language Lawyer.



          The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.



          The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.



          Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.



          The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.



          Consider the following example, where a pair of global variables are accessed concurrently by two threads:



                     Global
          int x, y;

          Thread 1 Thread 2
          x = 17; cout << y << " ";
          y = 37; cout << x << endl;


          What might Thread 2 output?



          Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".



          Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.



          But with C++11, you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17); cout << y.load() << " ";
          y.store(37); cout << x.load() << endl;


          Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).



          What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.



          Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_relaxed); cout << y.load(memory_order_relaxed) << " ";
          y.store(37,memory_order_relaxed); cout << x.load(memory_order_relaxed) << endl;


          The more modern the CPU, the more likely this is to be faster than the previous example.



          Finally, if you just need to keep particular loads and stores in order, you can write:



                     Global
          atomic<int> x, y;

          Thread 1 Thread 2
          x.store(17,memory_order_release); cout << y.load(memory_order_acquire) << " ";
          y.store(37,memory_order_release); cout << x.load(memory_order_acquire) << endl;


          This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)



          Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).



          So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
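
          (Since double-checked locking is mentioned: here is a sketch, with my own illustrative names, of how the pattern can be expressed correctly with these primitives; before C++11 there was no portable way to write it at all.)

          #include <atomic>
          #include <mutex>

          struct Widget { /* expensive to construct */ };

          std::atomic<Widget*> instance(nullptr);
          std::mutex init_mutex;

          Widget* get_instance() {
              Widget* p = instance.load(std::memory_order_acquire);   // fast path: no lock
              if (p == nullptr) {
                  std::lock_guard<std::mutex> lock(init_mutex);       // slow path: serialize initializers
                  p = instance.load(std::memory_order_relaxed);       // re-check under the lock
                  if (p == nullptr) {
                      p = new Widget;
                      instance.store(p, std::memory_order_release);   // publish the fully-built object
                  }
              }
              return p;
          }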



          Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.



          For more on this stuff, see this blog post.






          answered Jun 12 '11 at 0:23 by Nemo; edited Dec 1 '18 at 11:30 by Baum mit Augen
          • Nice answer, but this is really begging for some actual examples of the new primitives. Also, I think the memory ordering without primitives is the same as pre-C++0x: there are no guarantees. – John Ripley, Jun 12 '11 at 0:37

          • @John: I know, but I am still learning the primitives myself :-). Also I think they guarantee byte accesses are atomic (although not ordered), which is why I went with "char" for my example... But I am not even 100% sure about that... If you want to suggest any good "tutorial" references I will add them to my answer. – Nemo, Jun 12 '11 at 0:39

          • @Nawaz: Yes! Memory accesses can get reordered by the compiler or CPU. Think about (e.g.) caches and speculative loads. The order in which system memory gets hit can be nothing like what you coded. The compiler and CPU will ensure such reorderings do not break single-threaded code. For multi-threaded code, the "memory model" characterizes the possible re-orderings, and what happens if two threads read/write the same location at the same time, and how you exert control over both. For single-threaded code, the memory model is irrelevant. – Nemo, Jun 12 '11 at 17:08

          • @Nawaz, @Nemo - A minor detail: the new memory model is relevant in single-threaded code insofar as it specifies the undefinedness of certain expressions, such as i = i++. The old concept of sequence points has been discarded; the new standard specifies the same thing using a sequenced-before relation, which is just a special case of the more general inter-thread happens-before concept. – JohannesD, Jun 13 '11 at 13:14

          • @AJG85: Section 3.6.2 of the draft C++0x spec says, "Variables with static storage duration (3.7.1) or thread storage duration (3.7.2) shall be zero-initialized (8.5) before any other initialization takes place." Since x and y are global in this example, they have static storage duration and therefore will be zero-initialized, I believe. – Nemo, Jun 13 '11 at 20:16














          I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
          The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.



          Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.



          Quoting from "A Primer on Memory Consistency and Cache Coherence"




          The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.




          That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That is, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.



          In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved, and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in a different order by other threads.
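
          (A small litmus test makes this last point concrete. The sketch below is my illustration, not part of this answer; on a strongly ordered machine you may need many runs, or a weakly ordered architecture such as ARM, to actually observe the flagged outcome.)

          #include <atomic>
          #include <cstdio>
          #include <thread>

          std::atomic<int> a(0), b(0);

          void writer() {                             // one thread writes two locations in sequence...
              a.store(1, std::memory_order_relaxed);
              b.store(1, std::memory_order_relaxed);
          }

          void reader() {                             // ...another thread may see them in the opposite order
              int rb = b.load(std::memory_order_relaxed);
              int ra = a.load(std::memory_order_relaxed);
              if (rb == 1 && ra == 0)
                  std::printf("saw the second store before the first\n");  // permitted outcome
          }

          int main() {
              std::thread t1(writer), t2(reader);
              t1.join();
              t2.join();
          }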



          [Picture from Wikipedia]



          Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).



          The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).



          In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered (see Time in Physics, Craig Callender).



          In the C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.



          To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"




          For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.



          Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.



          Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.
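
          (The "update the data items, then the flag" idiom from the quote can be expressed in C++11 with fences. The sketch below is my rendering of that idea, with illustrative names; the fence pair plays the role of the FENCE instructions the quote mentions.)

          #include <atomic>
          #include <thread>

          int data[10];                          // plain payload; order among these updates is irrelevant
          std::atomic<bool> flag(false);

          void producer() {
              for (int i = 0; i < 10; ++i)
                  data[i] = i * i;
              std::atomic_thread_fence(std::memory_order_release);  // all updates complete before...
              flag.store(true, std::memory_order_relaxed);          // ...the flag is raised
          }

          void consumer() {
              while (!flag.load(std::memory_order_relaxed))
                  ;                                                 // spin until flagged
              std::atomic_thread_fence(std::memory_order_acquire);  // now data[] is fully visible
              // safe to read data[0..9] here
          }

          int main() {
              std::thread t1(producer), t2(consumer);
              t1.join();
              t2.join();
          }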




          Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:




          Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
          An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.




          Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location, but there can be an unlimited number of observers of any location.

























          • +1 for the analogy with special relativity, I've been trying to make the same analogy myself. Too often I see programmers investigating threaded code trying to interpret the behavior as operations in different threads occurring interleaved with one another in a specific order, and I have to tell them, nope, with multi-processor systems the notion of simultaneity between different threads (read: frames of reference) is now meaningless. Comparing with special relativity is a good way to make them respect the complexity of the problem. – Pierre Lebeaupin, Jun 26 '14 at 19:42

          • @Ahmed Nassar: the link you shared from Stanford is dead. – Joze, Apr 21 '15 at 12:02

          • @Joze: Thanks. I made it refer to the ACM library instead. It is still freely available elsewhere on the Web. – Ahmed Nassar, Apr 21 '15 at 18:55

          • So should you conclude that the Universe is multicore? – Peter K, Apr 28 '15 at 11:36

          • @PeterK: Exactly :) And here is a very nice visualization of this picture of time by physicist Brian Greene: youtube.com/watch?v=4BjGWLJNPcA&t=22m12s This is "The Illusion of Time [Full Documentary]" at minute 22 and 12 seconds. – Ahmed Nassar, Jul 19 '15 at 2:17












          299





          +50









          I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
          The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.



          Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.



          Quoting from "A Primer on Memory Consistency and Cache Coherence"




          The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.




          That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.



          In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.



          [Picture from Wikipedia]
          Picture from Wikipedia



          Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).



          The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).



          In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered.
          Time in Physics, Craig Callender.



          In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.



          To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"




          For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.



          Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.



          Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.




          Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:




          Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
          An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.




          Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.






          share|improve this answer



















          • 47




            +1 for the analogy with special relativity, I've been trying to make the same analogy myself. Too often I see programmers investigating threaded code trying to interpret the behavior as operations in different threads occurring interleaved with one another in a specific order, and I have to tell them, nope, with multi-processor systems the notion of simultaneity between different <s>frames of reference</s> threads is now meaningless. Comparing with special relativity is a good way to make them respect the complexity of the problem.
            – Pierre Lebeaupin
            Jun 26 '14 at 19:42






          • 2




            @Ahmed Nassar: the link you shared from stanford is dead.
            – Joze
            Apr 21 '15 at 12:02






          • 2




            @Joze: Thanks. I made it refer to the ACM library instead. It is still freely available elsewhere on the Web.
            – Ahmed Nassar
            Apr 21 '15 at 18:55






          • 52




            So should you conclude that the Universe is multicore?
            – Peter K
            Apr 28 '15 at 11:36






          • 5




            @PeterK: Exactly :) And here is a very nice visualization of this picture of time by physicist Brian Greene: youtube.com/watch?v=4BjGWLJNPcA&t=22m12s This is "The Illusion of Time [Full Documentary]" at minute 22 and 12 seconds.
            – Ahmed Nassar
            Jul 19 '15 at 2:17
















          299





          +50







          299





          +50



          299




          +50




          I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
          The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.



          Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.



          Quoting from "A Primer on Memory Consistency and Cache Coherence"




          The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.




          That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.



          In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.



          [Picture from Wikipedia]
          Picture from Wikipedia



          Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).



          The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).



          In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered.
          Time in Physics, Craig Callender.



          In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.



          To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"




          For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.



          Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.



          Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.




          Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:




          Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
          An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.




          Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.






          share|improve this answer














          I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System".
          The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.



          Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.



          Quoting from "A Primer on Memory Consistency and Cache Coherence"




          The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.




          That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.



          In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.



          [Picture from Wikipedia]
          Picture from Wikipedia



          Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).



          The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).



          In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered.
          Time in Physics, Craig Callender.



          In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.



          To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"




          For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.



Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other, but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and the correctness of SC.

For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A, even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.



          Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.
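As an illustration of the store-load case described above, here is the classic "store buffering" litmus test, sketched in C++11 with relaxed atomics (the names are mine, not the book's): under sequential consistency at least one of the two loads must observe a 1, but with store-load reordering both threads may read 0, because each load can bypass the other thread's still-buffered store.

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    void thread1() {
        x.store(1, std::memory_order_relaxed);    // store, then...
        r1 = y.load(std::memory_order_relaxed);   // ...load of the other location
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(thread1), b(thread2);
        a.join();
        b.join();
        // r1 == 0 && r2 == 0 is possible -- forbidden under SC, but allowed
        // here (and observable even on x86, whose write buffers permit
        // store-load reordering). std::memory_order_seq_cst would forbid it.
    }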




          Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:




          Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location.
          An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.




Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location, while there can be an unlimited number of observers of any location.







answered Aug 29 '13 at 20:42
Ahmed Nassar
• +1 for the analogy with special relativity; I've been trying to make the same analogy myself. Too often I see programmers investigating threaded code trying to interpret the behavior as operations in different threads occurring interleaved with one another in a specific order, and I have to tell them: nope, with multi-processor systems the notion of simultaneity between different threads (frames of reference, if you will) is now meaningless. Comparing with special relativity is a good way to make them respect the complexity of the problem.
  – Pierre Lebeaupin, Jun 26 '14 at 19:42

• @Ahmed Nassar: the link you shared from Stanford is dead.
  – Joze, Apr 21 '15 at 12:02

• @Joze: Thanks. I made it refer to the ACM library instead. It is still freely available elsewhere on the Web.
  – Ahmed Nassar, Apr 21 '15 at 18:55

• So should you conclude that the Universe is multicore?
  – Peter K, Apr 28 '15 at 11:36

• @PeterK: Exactly :) And here is a very nice visualization of this picture of time by physicist Brian Greene: youtube.com/watch?v=4BjGWLJNPcA&t=22m12s This is "The Illusion of Time [Full Documentary]" at minute 22 and 12 seconds.
  – Ahmed Nassar, Jul 19 '15 at 2:17
















This is now a years-old question, but since it remains very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model. I see no point in summing the talk up to make this yet another full answer, but given that this is the guy who actually wrote the standard, I think the talk is well worth watching.



Herb Sutter has a three-hour talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical, and covers the following topics:




          1. Optimizations, Races, and the Memory Model

          2. Ordering – What: Acquire and Release

          3. Ordering – How: Mutexes, Atomics, and/or Fences

          4. Other Restrictions on Compilers and Hardware

          5. Code Gen & Performance: x86/x64, IA64, POWER, ARM

          6. Relaxed Atomics


The talk doesn't elaborate on the API, but rather on the reasoning, the background, and what goes on under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized loads efficiently?).






answered Dec 20 '13 at 13:22
eran
• That talk is indeed fantastic, totally worth the 3 hours you'll spend watching it.
  – ZunTzu, Aug 31 '15 at 12:50

• @ZunTzu: on most video players you can set the speed to 1.25, 1.5 or even 2 times the original.
  – Christian Severin, Dec 15 '15 at 17:48

• @eran do you guys happen to have the slides? links on the channel 9 talk pages do not work.
  – athos, Aug 30 '16 at 2:33

• @athos I don't have them, sorry. Try contacting channel 9, I don't think the removal was intentional (my guess is that they got the link from Herb Sutter, posted as is, and he later removed the files; but that's just a speculation...).
  – eran, Aug 30 '16 at 6:06














It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations before, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.



When you're talking about POSIX threads or Windows threads, that's a bit of an illusion: what you're really relying on is the underlying hardware's ability to run code concurrently and its particular ordering rules (on x86, for example). The C++0x memory model makes guarantees whether you're on x86, or ARM, or MIPS, or anything else you can come up with.
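As a sketch of what that buys you (my own minimal example, not from the original answer): the following uses only standard facilities, so its result is guaranteed by the C++11 standard on every conforming implementation, with no platform-specific threading API in sight.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        std::atomic<int> hits{0};

        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back([&hits] {
                hits.fetch_add(1);   // well-defined concurrent increment
            });
        for (auto& t : pool)
            t.join();

        std::cout << hits << '\n';   // prints 4 on x86, ARM, MIPS, ...
    }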






answered Jun 11 '11 at 23:42
Puppy
• Posix threads are not restricted to x86. Indeed, the first systems they were implemented on were probably not x86 systems. Posix threads are system-independent, and are valid on all Posix platforms. It's also not really true that it's a hardware property because Posix threads can also be implemented through cooperative multitasking. But of course most threading issues only surface on hardware threading implementations (and some even only on multiprocessor/multicore systems).
  – celtschk, Aug 18 '13 at 19:56














For languages not specifying a memory model, you are writing code for the language plus the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race occurs when multiple cores / hyper-threads can access the same memory concurrently and at least one of the accesses is a write), then your program is not cross-platform, because it depends on the processor's memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.
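For example (a minimal sketch of my own, not from the original answer), the following program contains the simplest possible data race, so C++11 gives it undefined behavior; before C++11, its outcome silently depended on the compiler and the processor's memory model:

    #include <thread>

    int shared = 0;   // plain int: no synchronization whatsoever

    void bump() {
        for (int i = 0; i < 100000; ++i)
            ++shared;   // unsynchronized read-modify-write; races with the other thread
    }

    int main() {
        std::thread a(bump), b(bump);
        a.join();
        b.join();
        // The final value of `shared` is unpredictable (often well below 200000);
        // in C++11 terms, the program's behavior is undefined.
    }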



Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross-platform way. So if you are using standard locks in a multithreaded program with no data races, then you don't have to worry about cross-platform memory models.
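Continuing the sketch above: guarding the counter with a standard lock removes the race, and with it any dependence on the processor's memory model.

    #include <mutex>
    #include <thread>

    int shared = 0;
    std::mutex m;

    void bump() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // the lock supplies ordering and visibility
            ++shared;
        }
    }

    int main() {
        std::thread a(bump), b(bump);
        a.join();
        b.join();
        // Always 200000: no data race, so no cross-platform memory-model worries.
    }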



Interestingly, Microsoft's C++ compilers give volatile acquire/release semantics, a C++ extension meant to compensate for the historical lack of a memory model in C++: http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows has historically run mostly on x86 / x64, that's not saying much (the Intel and AMD memory models make it easy and efficient to implement acquire / release semantics in a language).






answered Jul 26 '11 at 4:27
ritesh
• It is true that, when the answer was written, Windows ran on x86/x64 only, but Windows has run, at some point in time, on IA64, MIPS, Alpha AXP64, PowerPC and ARM. Today it runs on various versions of ARM, which is quite different memory-wise from x86, and nowhere nearly as forgiving.
  – Lorenzo Dematté, Dec 6 '16 at 10:12

• That link is somewhat broken (says "Visual Studio 2005 Retired documentation"). Care to update it?
  – Peter Mortensen, Nov 5 '17 at 23:09

• It was not true even when the answer was written.
  – Ben, Dec 2 '17 at 10:14

• "to access the same memory concurrently": to access in a conflicting way.
  – curiousguy, Jun 13 '18 at 23:22














          If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.



Now, if you use atomics or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and it provides portable fences for hand-coded guarantees.



Previously, atomics would be done using compiler intrinsics or some higher-level library, and fences would be done using CPU-specific instructions (memory barriers).
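For comparison, a minimal sketch (mine, not the answerer's) of what those portable fences look like in C++11: std::atomic_thread_fence replaces both the compiler intrinsics and the CPU-specific barrier instructions, and the implementation maps it to whatever the target actually needs.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> flag{false};

    void producer() {
        data = 1;
        std::atomic_thread_fence(std::memory_order_release);  // portable "store barrier"
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) { }
        std::atomic_thread_fence(std::memory_order_acquire);  // portable "load barrier"
        assert(data == 1);  // guaranteed: the two fences pair up and order the accesses
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }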






answered Jun 11 '11 at 23:49
ninjalj
• The problem before was that there was no such thing as a mutex (in terms of the C++ standard). So the only guarantees you were provided were by the mutex manufacturer, which was fine as long as you did not port the code (as minor changes to guarantees are hard to spot). Now we get guarantees provided by the standard, which should be portable between platforms.
  – Martin York, Jun 12 '11 at 0:09

• @Martin: in any case, one thing is the memory model, and another are the atomics and threading primitives that run on top of that memory model.
  – ninjalj, Jun 12 '11 at 0:18

• Also, my point was mostly that previously there was mostly no memory model at the language level; it happened to be the memory model of the underlying CPU. Now there is a memory model which is part of the core language; OTOH, mutexes and the like could always be done as a library.
  – ninjalj, Jun 12 '11 at 0:36

• It could also be a real problem for the people trying to write the mutex library. When the CPU, the memory controller, the kernel, the compiler, and the "C library" are all implemented by different teams, and some of them are in violent disagreement as to how this stuff is supposed to work, well, sometimes the stuff we systems programmers have to do to present a pretty facade to the applications level is not pleasant at all.
  – zwol, Jun 12 '11 at 2:02

• Unfortunately it is not enough to guard your data structures with simple mutexes if there is not a consistent memory model in your language. There are various compiler optimizations which make sense in a single-threaded context, but when multiple threads and CPU cores come into play, reordering of memory accesses and other optimizations may yield undefined behavior. For more information see "Threads Cannot Be Implemented as a Library" by Hans Boehm: citeseer.ist.psu.edu/viewdoc/…
  – exDM69, Jun 13 '11 at 12:45















