Put a ring on it: a lock-free MPMC ring buffer
One of the reasons few security products work well in busy Linux environments is that they amplify performance risk. You’re popular and your backend’s load is skyrocketing? Well, the typical product is just going to collect more data and do more analysis, which amplifies the degradation.
In the real world, one of the key ways everyone deals with being overloaded is by dropping less essential things.
We can do the same thing with ring buffers, which are fixed-size queues that typically drop old data once they fill up. Yet, they rarely get used outside of single-reader, single-writer scenarios, because it’s hard to build something correct that scales to 1-to-many scenarios, never mind many-to-many scenarios.
But, what if we told you, you can have a scalable ring buffer that doesn’t need any locking, and works with multiple readers and multiple writers at the same time? You might say, “there’s no such thing”, except that now there is.
Wait, that rings a bell 🔔
Ring buffers are fixed-size first-in-first-out (FIFO) queues. Fixed-size queues that separately track the front and the back belong to a category of data structure called the circular buffer.
Some people use the terms circular buffer and ring buffer interchangeably. For me, a ring buffer is a type of circular buffer, but one that explicitly drops data when the queue fills up.
That’s not the only option for a circular buffer. For instance:
You can just block until space becomes available.
You can grow the backing store.
You can just do nothing, except signal an insertion error and let the programmer figure it out.
Ring buffers are more resilient, since they proactively drop when needed. Typically, it's the oldest data that gets dropped. However, there are ring buffers out there that give the option to drop new data instead.
Sure, for some situations, dropping data is the wrong call. But for many situations, especially ones requiring performance, lossy is still much better than bringing the system to a halt due to too much data.
For instance, in the Linux kernel, ring buffers are used in many places. A well-known example is relaying events from the kernel to the userland handler when ebpf probes are attached.
When workloads are having performance issues, probes often end up with more work to do, and if there's not some form of backpressure, things will end badly. And since ebpf probes are intended for observability, and since observability is generally less important than availability, dropping data as a first line of defense is a good idea.
And because the kernel's ring buffer allows you to choose whether to drop off the front or the back, ebpf users get that choice too. Still, dropping older data is generally more common.
When operating on any kind of queue, lock contention can slow things down significantly, because there are two bottlenecks– the head pointer (for enqueuers) and the tail pointer (for dequeuers). But rings have it even worse, because the head pointer can circle around the ring, and meet up with the tail pointer.
Our requirements for a ring buffer:
- An ordered set S, with a maximum size n, with items of a fixed length l.
- An operation, enqueue(item), where item is an arbitrary item of length l.
- An operation, dequeue(), which returns the next item of length l, or indicates that the buffer is empty.
- No thread should be able to detect an inconsistent ordering of operations on the ring, under any circumstances (we'll cover memory ordering after we build our ring buffer).
Trying to minimize the impact of bottlenecks is hard, and so people tend to make compromises somewhere. For instance, ring buffers will try to avoid locks, but will accept constraints such as:
- single producer, single consumer (SPSC)
- single producer, multiple consumer (SPMC)
- multiple-producer, single consumer (MPSC)
Here, “producer” means “enqueuer”, and “consumer” means “dequeuer”. At some point, I wanted a true, lock-free multiple-producer, multiple-consumer (MPMC) ring buffer, but there was nothing out there that would scale, so I came up with an algorithm.
⛓️💥 | When we say lock free, we mean that no thread that gets suspended mid-operation can prevent other threads from completing their own operations. With respect to the algorithm itself, this is often referred to as a system-wide progress guarantee. That doesn't mean any given thread will make 'progress' in a comfortable amount of time; lock-free algorithms usually perform operations that can fail, and need to keep retrying them until successful. A thread could conceptually lose its race till the end of eternity. To get per-thread progress, we need wait freedom, which in practice is often approximated with exponential backoff. But, when OS scheduling tends to be fair, lock-free algorithms essentially can expect not to contend forever, and the extra overhead of wait freedom is often not worth it. |
At some point after I’d come up with my algorithm, I did stumble across a vein of literature calling something a “ring buffer” even though old data couldn’t be dropped; the user was left to resubmit. To my mind, that’s a fixed-size circular FIFO, but not a ring buffer.
Today, we’ll build a true MPMC ring buffer. We’ll focus on dropping old data, but this is one place where extending it yourself would be really pretty simple.
The full code is available at codeberg.
Ordering that ring 👉☎️
The last of our above requirements for a ring is often referred to as linearizability. All operations map to a point on a conceptual timeline, where no thread can ‘see’ operations in an order that would be different from that timeline.
For instance, if thread A enqueued I1 then I2, and thread B is the one to dequeue both, it must always dequeue in the expected order– I1 first, then I2.
Every insertion thus needs to have a well-defined ordering, as does every removal. But it doesn’t have to map to wall time.
For instance, let’s say threads B and C are both dequeuers. B shows up first and starts dequeueing I1, but the scheduler suspends it in the middle of the operation. Meanwhile, C comes in and quickly pulls out I2 before B wakes, and returns its value.
While B might return after C in wall-clock time, that shouldn’t worry us, as long as there’s a well-defined ordering. That well-defined ordering requires a well-defined linearization point. By that, we mean an atomic operation that is considered the point where the overall operation ‘occurs’ or ‘is committed’.
So if B hasn’t returned, but is suspended after the linearization point, it’s no big deal. The algorithm already considers the item dequeued.
Similarly, if B starts its operation before C does, but is suspended BEFORE the linearization point, and C is never suspended, C can absolutely claim I1; when B wakes up, it cannot get I1, and nobody has an inconsistent view of the world.
That works because all threads get equal treatment– it doesn’t matter when the functions they call start or end; the ordering is based on when the operation at the linearization point occurs.
We’ll make sure to clearly identify our linearization points for each operation, which will always be on atomic operations.
👰♂️ Our word is our vow 🤵♀️
We’re going to start with modest expectations and build a ring that operates on values that are exactly one word long. We’re going to go ahead and assume you’re on modern hardware, with a 64-bit word.
When designing our algorithm, in order to ensure correctness, we will want to think through all the places where there’s contention, meaning, we need to cover all cases where threads might try to operate on the same memory at the same time.
The following scenarios are all very realistic things for us to consider:
- Multiple enqueuers can be performing enqueue operations in parallel, and thus compete with each other to get onto the list ‘first’.
- If one enqueuer gets suspended, other enqueuers may wrap around the queue before its operation completes.
- Multiple dequeuers can also run in parallel, and fight over which item they get.
- Dequeuers can drain a list to the point that they’ve caught up to writers, and might be reading their state before an operation completes.
- Or, dequeuers could lag way behind, trying to dequeue from a slot in the ring that a writer is now trying to use for a much more recent value.
To be able to arbitrate disputes around this kind of contention, we’re going to be explicit about our linearization timeline. Items in the array will be associated with an epoch, which is simply a point in time on our timeline, and every enqueue operation will be tied to an epoch.
We will make sure that each enqueue operation is associated with a unique epoch (though it is perfectly fine if we find ourselves needing to skip epochs).
When our dequeuers go to dequeue, they will also be looking to own the dequeue of an item that’s associated with a particular epoch.
When a thread examines a slot in the ring, it will need to know what epoch is associated with that cell, and whether there’s an enqueued item in the cell or not. We’ll want to make sure all that information is definitely tied to a specific value, and that the whole kit and caboodle is always read from (and written to) atomically.
When a thread wants to update that state in a cell of the ring, we need to atomically read the existing state, create a copy that has the state we want, and then try to replace the state inside the cell atomically.
However, if, when we go to update the state, it’s changed from what we expected (based on the copy we made), then we need the operation to FAIL, and we should start the update process over again, based on those changes.
Let’s Swap
Thankfully, there’s a universally available atomic operation that does all of those things, often referred to as a compare-and-swap operation (CAS). The C language standard provides an API call to do this, although it uses the word exchange instead of swap (also a common name for this operation).
Any CAS operation we perform on a cell will be a linearization point for us.
Most hardware platforms can do a compare-and-swap atomically, but not for any arbitrary size. Modern hardware usually limits us to 128 bits that we can atomically operate on. Other atomic operations may limit us to just 64-bit operands.
💻 | The x86 family has long supported a 128-bit compare-and-swap, but until recently, it required instruction-level locking to use, because it did not support atomically loading 128 bits into a register otherwise. So on old hardware, you're technically using a lock with a 128-bit CAS, but 🤷♂️. |
The CAS operation conceptually takes three operands (it can differ, as we’ll see later):
- A pointer to the bytes we want to swap (i.e., the object to swap)
- A pointer to the value we’re expecting to be in that memory before the operation begins (i.e., the expected value)
- The actual value we want to leave behind (i.e., the desired value).
The processor will check that the value currently in the object matches what you passed as the expected value; if it does, the memory pointed to by parameter 1 gets the desired value from parameter 3 (and, since the values matched, the memory behind parameter 2 already holds what was there before the swap). In this case, the operation returns true.
If the operation fails because the memory has changed since your last load:
- Your desired value does not get installed.
- The memory holding the expected value is updated to contain the value that was actually found.
- The function returns false.
On some platforms, there can be two variations of this operation, the strong CAS and the weak CAS. The weak CAS is actually allowed to return false and skip the swap, even if the expected value does match the object to swap. It will usually get it right, but you might occasionally end up retrying where you shouldn’t have needed to retry.
Why the heck would anyone want that behavior? It turns out, some platforms can make this weaker version faster. Even with the potential for spurious failures, if you are using a CAS in a loop until it succeeds, this weaker version is what people will recommend.
However, if your use of a CAS operation doesn’t otherwise require testing for success afterward, using the weaker version would force you to test. If you have to add a loop where there wouldn’t have been one otherwise, then you definitely want to use the strong variant.
But, while that’s the guidance you’ll find all over the internet, I don’t actually know which CPUs this would affect. Maybe it’s old news, I dunno. But it does still seem to make a shade of difference in real-world tests, so 🤷.
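To make that concrete, here is a minimal sketch of both variants using the C11 stdatomic.h API; the variable and function names are just for illustration, not anything from our ring.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// One-shot attempt: use the strong variant, so a spurious failure
// can't force us to add a retry loop we otherwise wouldn't need.
static bool
try_claim(_Atomic uint64_t *slot, uint64_t expected, uint64_t desired)
{
    // 'expected' is passed by address; on failure, it gets overwritten
    // with the value actually found in *slot.
    return atomic_compare_exchange_strong(slot, &expected, desired);
}

// Retry loop: the weak variant is fine (and sometimes cheaper), since
// we're going to loop until the CAS succeeds anyway.
static void
add_one(_Atomic uint64_t *counter)
{
    uint64_t expected = atomic_load(counter);

    while (!atomic_compare_exchange_weak(counter, &expected, expected + 1)) {
        // A failed CAS refreshed 'expected'; just go around again.
    }
}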
Picking your venue ⛪️
How do threads decide where to operate inside the ring buffer, with a minimum of fighting?
Imagine our ring is a large butcher shop with multiple counters. We take a number, and then go to the counter associated with that number.
Our ring will hand out epochs like tickets, giving out each number no more than two times– once to an enqueuer, and once to a dequeuer. To do that, we’ll need two counters that we can atomically update.
If we need to implement giving out tickets like this, we can use an atomic CAS operation to do it, sitting in a loop until our new value is swapped in. Each time we lose, we’ll have to recompute the next value, but that’ll certainly work.
However, we can avoid the loop altogether if we do it through an atomic fetch-and-add (FAA) operation. When implemented in hardware, FAA is guaranteed to complete. You pass in two operands:
- A pointer to the object containing the current value. The result will be stored here at the end of the operation.
- The value you’d like to add.
At the end of the operation, the old value gets returned to the caller.
So, to implement a ticketing system, we can have a variable that holds the value of the next ticket to give out. Each thread gets a ticket by calling FAA, adding the number 1 to the ticket value, but returning the number that was there at the start, which becomes our ticket.
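In C11, that ticket counter is nearly a one-liner; the names here are illustrative, not the ring’s actual fields.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t next_ticket; // the shared "now serving" counter

// The FAA returns the value *before* the add, so every caller gets a
// unique number, with no retry loop.
static uint64_t
take_ticket(void)
{
    return atomic_fetch_add(&next_ticket, 1);
}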
No two enqueuers will get the same ticket. We can then map each ticket to a cell in the ring (easiest done by limiting the number of cells in the ring to powers of two). We take the ticket, divide by the number of cells, and the remainder is the cell index associated with that ticket (i.e., we compute the modulus).
When we have a number of cells that’s a power of two, we can use a cheap binary AND operation to get the modulus.
That trick only works for powers of two, because of some specific properties:
- Every power of two is represented with a single bit set to 1; all other bits are zero.
- If you subtract 1 from a power of two, all bits to the RIGHT of that one bit will be set, and nothing else.
So we get a cheap modulo via a binary AND operation. To be fair, it could be the case that modern processors can compute the modulus just as quickly as a bitwise AND, even though it’s a much more complicated operation. But, that’s why many data structures are backed by arrays of size to powers of two. It’s such a common paradigm, we’ll just roll with it.
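In code, the mapping from ticket to cell looks like this; the size and the names are illustrative.

#include <stdint.h>

#define RING_CELLS 1024             // must be a power of two
#define RING_MASK  (RING_CELLS - 1) // every bit below the single set bit

// Equivalent to ticket % RING_CELLS, but with a single AND.
static inline uint64_t
ticket_to_cell(uint64_t ticket)
{
    return ticket & RING_MASK;
}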
Now, we also need a second ticket for dequeuers; those tickets should be associated with values that have already been enqueued.
However, if we keep those two tickets separate, then it will be really hard for us to build any confidence that we’re detecting some of the contention scenarios we talked about above.
For example, if we read each counter separately, how do we know if the queue is empty? Or, how can we be certain that readers are way behind? (We don’t want to waste a lot of time with readers grabbing tickets for values we already dropped.)
The answer for us is to operate on those tickets (epochs) in one atomic operation.
We can still do that with an FAA. For example, if we define our epoch-of-record datatype like this:
typedef struct {
uint32_t epoch_q; // The next epoch assoc. w/ an enqueue op.
    uint32_t epoch_d; // The next epoch assoc. w/ a dequeue op.
} h4x0r_word_ring_epochs_t;
The compiler is smart enough that it will let you perform operations on structs as if they were integers, as long as the sizes are acceptable, and as long as you are explicit enough with your use of types to convince the compiler you know what you’re doing.
One way to do this is to use a union, which we might declare like this:
typedef union {
h4x0r_word_ring_epochs_t epochs;
uint64_t integer;
} h4x0r_converter_t;
The fields in unions may be individually addressed, but they share the same memory. The bits won’t change, but we can get the compiler to recognize the change-of-type based on the field we end up accessing.
Let’s say we want to add 1 to the queue epoch. We don’t need to understand endianness or how the two fields in that struct are ordered. Here’s one way we could do that:
h4x0r_word_ring_epochs_t to_add_struct = {
    .epoch_q = 1,
    .epoch_d = 0,
};
h4x0r_converter_t conv = {
    .epochs = to_add_struct,
};
uint64_t to_add_num = conv.integer;
We can then convert it back to a struct. And really, we expect those conversions to be free; it’s just a mechanism for expressing our intent to the compiler.
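Here is a minimal sketch of the whole round trip, assuming the types above; the standalone packed_epochs counter and the helper name are stand-ins of mine, not the ring’s real fields.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t packed_epochs; // both 32-bit epochs in one word

static h4x0r_word_ring_epochs_t
bump_enqueue_epoch(void)
{
    // Build the increment as a struct, then view it as an integer.
    h4x0r_converter_t incr = { .epochs = { .epoch_q = 1, .epoch_d = 0 } };

    // One atomic FAA on the packed word; the return value is the pair
    // of epochs as they were *before* our increment.
    h4x0r_converter_t old;
    old.integer = atomic_fetch_add(&packed_epochs, incr.integer);

    // View the old value as a struct again; this "conversion" is free.
    return old.epochs;
}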
While 128-bit CAS operations are commonly supported in hardware, FAA (and similar operations) are more likely to have a 64-bit cap on their operands.
That’s why the two epochs in our data structure are kept to 32 bits.
However, 32 bits isn’t all that large a number, and it wouldn’t take too long for a busy system to overflow a counter. Dealing with that kind of overflow wouldn’t be fun.
If you are confident you’ve got a situation where you won’t use a ring enough to overflow, then by all means, use 32-bit epochs. But we recommend that, by default, you use 64-bit epochs. You can emulate a 128-bit FAA pretty easily with a CAS operation, as we described above.
While a 64-bit FAA should be faster, the CAS probably will be fine (on my machine, it’s a tiny smidge better, but not enough to crow about).
Our full implementation allows you to toggle between the two epoch sizes at compile-time, so you can compare the results if you like.
Anyway, atomically updating the epoch state gets you a ticket, and shows you what the next ticket is for the opposite operation, at the time we were handing your ticket.
A Simple ring 🍩
As we said above, the core state of the cells inside a ring must be no more than 128 bits if we want to operate on it without requiring a lock. So we need to be careful about what we try to jam in a cell.
When we load a cell, we want to know if we’re too slow, whatever our operation. At a bare minimum, we’re going to need:
- Whether an item is enqueued in the slot or has been dequeued.
- Room for that value, which probably needs to be a whole word so we can fit a pointer in (usually 64 bits).
- The epoch associated with the cell, which is how we’ll know if a writer has been lapped (if the epoch is higher than the epoch for our operation, we got lapped). It’s also how readers will figure out the writer is slow and hasn’t finished.
As we said above, 32 bits is not enough space for the epoch if we are building something general-purpose. But, we clearly don’t have room for 64-bit epochs, so we’ll need to compromise.
A boolean generally will take up at least 8 bits, and a 56-bit epoch probably is good enough not to worry about. Though C doesn’t have 56-bit ints.
However, we can instead use C bit slicing, which would even let us get the flag down to a single bit, leaving 63 bits for our epoch:
typedef struct {
_Atomic(void *)item;
uint64_t enqueued : 1;
uint64_t epoch : 63;
} h4x0r_word_ring_item_t;
static_assert(sizeof(h4x0r_word_ring_item_t) == 16,
"Bitfield implementation doesn't pack as expected");
typedef _Atomic(h4x0r_word_ring_item_t) h4x0r_word_ring_cell_t;
typedef _Atomic(h4x0r_word_ring_epochs_t) h4x0r_atomic_epochs_t;
The C standard doesn’t really mandate layout for bit slices, so we check at compile time with the static_assert.
For values that need to be shared between threads, we need to label them as _Atomic. Otherwise, the code that the compiler generates will assume none of our variables are shared across threads, and we are bound to have all sorts of nasty race conditions.
But, we don’t tag _Atomic on h4x0r_word_ring_item_t or h4x0r_word_ring_epochs_t directly, because threads will be keeping their own private versions for updating.
My version of the top-level ring looks like this:
typedef struct h4x0r_word_ring_t h4x0r_word_ring_t;

typedef void (*h4x0r_word_ring_drop_handler)(h4x0r_word_ring_t *, void *);
struct h4x0r_word_ring_t {
h4x0r_word_ring_drop_handler drop_handler;
uint32_t num_cells;
uint32_t last_slot;
h4x0r_atomic_epochs_t epochs;
h4x0r_word_ring_cell_t cells[];
};
The drop handler is a function pointer, and we’d expect it to be set when initializing the ring, defaulting to NULL when we don’t need to do anything in particular to deal with drops (either because we’re not using dynamic memory, or because we have something like a garbage collector to clean up if needed).
The epochs field is tagged as being atomic, though it’s hidden behind the typedef.
Similarly, the variable-length array of cells is really an array of items we want to be atomic. The fact that cell loading and storing requires using the C atomic API is in there, but hidden behind a typedef.
However, the num_cells field and last_slot field are not tagged _Atomic. That’s because these should be set by one thread during initialization, and then never changed. As long as the memory has synced before other threads start to use these fields, we definitely don’t need them to be treated specially.
Usually, if we do initialization in a proper function, the call boundary is going to be a memory barrier that makes sure they’re sync’d when other threads start getting a handle on our ring.
But, if initialization might be inlined, you should probably flag these fields as _Atomic; then, when you access them via the C11 atomic API, ask for memory_order_relaxed, which essentially means, “no synchronization is necessary here, so don’t add instructions to sync”.
In that case, the _Atomic modifier will make sure our writes to those fields are seen by subsequent loads, but we don’t have to slow down those loads once the values are written.
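Here is a hedged sketch of what such a relaxed read can look like; the trimmed-down struct and function names are mine, for illustration only.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t num_cells; // written once at init, never changed after
} example_ring_t;               // hypothetical, trimmed-down ring

static inline uint32_t
example_num_cells(example_ring_t *ring)
{
    // Relaxed is enough here: we only need the one-time initialization
    // write to be visible, with no ordering against other memory.
    return atomic_load_explicit(&ring->num_cells, memory_order_relaxed);
}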
The num_cells field is always going to be the number of cells in our ring, and 2^32 cells should be plenty, thus 32 bits. But because of C alignment rules, our struct is going to end up with a 32-bit hole somewhere. So, we fill that space with last_slot, which is always going to be one less than the value of num_cells, allowing us to perform our fast modulo without first having to subtract one from num_cells.
The dequeue operation
So far, our queue has nothing in it, and we don’t know how to put anything in it yet.
Still, let’s start with our dequeue operation.
Our first order of business is to load the epochs, and if the tail and head are in the same place (or if the tail somehow lapped the head), then the queue is empty, and we shouldn’t even bother taking a ticket, so to speak.
But, if we do see items to dequeue, we’ll play the game! We’ll use FAA to get a unique read epoch. However, it’s 100% possible that other readers beat us, with no writers coming along.
That means, we could actually increment the counter past where the next writer is going to write. We will need to make sure that when we deal with writers, we try to solve that problem.
The epoch we read in won’t ever get another reader, but the writer will need to make sure it doesn’t use the slot we just accidentally burned.
Now, we’ll use our epoch to find the associated cell. We’ll load it, and look at the contents to figure out what to do.
Once we have loaded the cell, if the epoch is the one we’re expecting, and there’s an item in there, then we try to dequeue it by doing a CAS to mark it as dequeued (and generally attempting to set the item to NULL).
If the dequeue succeeds in the right epoch, we’re done.
If we found that another reader wasn’t keeping up, and our writer hasn’t written over it yet, we’d still try the CAS, and if it succeeds, that epoch is invalidated; the writer will eventually figure out that they cannot complete their write, and needs to try again.
But, if we can prove via the epochs that the slow writer we beat was the only outstanding writer at the time of our CAS, then, even though we know someone is trying to enqueue, we take that successful CAS as our linearization point, and declare that the ring was empty.
If we lose the race against a slow writer, no big deal, we start over with the same epoch; the writer probably left us a present.
If our CAS operation fails, before we try again, we look at the value of the expected item, which is the current stored value.
If we find the epoch is lower than ours, then we try to invalidate it with a CAS. If we fail, it could be that the slow writer finished, or it could be that we were suspended and are now behind. If it’s the former, we try again with the same epoch. If it’s the latter, we’ll know because the epoch is higher than ours, and we need to go get another read epoch (we can’t return empty unless we know that because of some CAS that we did, there’s a moment on our timeline where the ring was empty).
That’s already several corner cases we need to worry about. But the most difficult concern here is the last one.
Consider when we use the CAS to dequeue, and that CAS operation failed. But the expected value was the epoch we were looking for. We can just go home and call it a day, right?
It depends:
If we are running with a drop-handler, we absolutely cannot, because the writer will have known it was overwriting, and will be calling the drop handler. We don’t want to risk a double free if there’s dynamic deallocation.
Otherwise, yes; the writer we were competing with absolutely doesn’t care.
For the API call to dequeue, some callers are going to need to distinguish between returning a NULL because the ring is empty, and the case where NULL was enqueued.
The way to deal with this is to have our API call accept a pointer to a boolean. If NULL is passed, then we ignore the parameter. Otherwise, we’ll store a value in the memory pointed to.
The logic around whether to store the pointer is easily lifted out into inline functions:
static inline void *
h4x0r_word_ring_empty(bool *found)
{
if (found) {
*found = false;
}
return NULL;
}
static inline void *
h4x0r_word_ring_found(h4x0r_word_ring_item_t item,
uint64_t *epoch_ptr,
bool *found)
{
if (found) {
*found = true;
}
h4x0r_atomic_fence();
if (epoch_ptr) {
*epoch_ptr = item.epoch;
}
return item.item;
}
Notice that h4x0r_word_ring_found takes a second pointer parameter– that’s for the caller to get the associated epoch. Day-to-day that may not be too useful. However, it’s very useful for correctness testing, to make sure one thread never sees out-of-order dequeues. We’ll use it when we do our testing.
With all that, here’s the body of our dequeue operation:
void *
h4x0r_word_ring_dequeue(h4x0r_word_ring_t *ring, bool *found, uint64_t *ep)
{
h4x0r_word_ring_cell_t *cell;
h4x0r_word_ring_item_t expected;
h4x0r_word_ring_item_t last;
h4x0r_word_ring_item_t candidate;
h4x0r_word_ring_epochs_t epochs = h4x0r_atomic_load(&ring->epochs);
while (true) {
if (epochs.epoch_d >= epochs.epoch_q) {
return h4x0r_word_ring_empty(found);
}
epochs = h4x0r_epochs_increment(&ring->epochs, read_incr);
cell = h4x0r_word_ring_slot_addr(ring, epochs.epoch_d);
expected = h4x0r_atomic_load(cell);
        candidate = (h4x0r_word_ring_item_t){
            .item     = NULL,
            .epoch    = epochs.epoch_d,
            .enqueued = false,
        };
while (expected.epoch <= epochs.epoch_d) {
last = expected;
if (h4x0r_atomic_cas(cell, &expected, candidate)) {
if (epochs.epoch_d == last.epoch) {
return h4x0r_word_ring_found(last, ep, found);
}
                if (epochs.epoch_d > last.epoch) {
                    h4x0r_word_ring_drop(ring, last);
                    // If the slow writer for our slot was the only pending
                    // enqueue, this successful CAS is the linearization
                    // point for declaring the ring empty.
                    if (epochs.epoch_q - epochs.epoch_d <= 1) {
                        return h4x0r_word_ring_empty(found);
                    }
                    break;
                }
                return h4x0r_word_ring_found(last, ep, found);
            }
            else {
                if (last.epoch == epochs.epoch_d && !ring->drop_handler) {
                    return h4x0r_word_ring_found(last, ep, found);
                }
                epochs = h4x0r_atomic_load(&ring->epochs);
                continue;
            }
}
// We got lapped and need a new epoch.
epochs = h4x0r_atomic_load(&ring->epochs);
}
}
You’ll notice there are a few more helper functions in there:
- h4x0r_epochs_increment() uses some inline code to do the 128-bit FAA, using a union for conversion, as discussed above.
static inline h4x0r_word_ring_epochs_t
h4x0r_epochs_increment(_Atomic h4x0r_word_ring_epochs_t *p,
h4x0r_epoch_info_t counter)
{
h4x0r_epoch_info_t result;
result.i = h4x0r_atomic_add((_Atomic(__uint128_t) *)p, counter.i);
return result.s;
}
- h4x0r_word_ring_slot_addr() takes our epoch, performs the modulo operation, and gets us a pointer to our cell. The code to get the address inside that inline function is more compact than the call, but we prefer the extra clarity, and the compiler is expected to inline it, especially if we put it in a header file (or we can annotate it to always inline).
static inline _Atomic(h4x0r_word_ring_item_t) *
h4x0r_word_ring_slot_addr(h4x0r_word_ring_t *ring, uint64_t epoch)
{
return &ring->cells[epoch & ring->last_slot];
}
- There are several h4x0r_atomic_*() calls. Those wrap the C11 API calls so we can apply consistent memory ordering that’s different from C’s default. We’ll discuss this a bit at the end of the article for those who want to understand. We do use C’s new-ish _Generic feature to abstract over fetch-and-add, selecting our own function for 128-bit values:
#define h4x0r_atomic_add(x, y) \
_Generic((y), \
__uint128_t: _h4x0r_atomic_faa_128(x, y), \
default: atomic_fetch_add_explicit(x, y, H4X0R_MO_RW))
static inline __uint128_t
_h4x0r_atomic_faa_128(void *v, __uint128_t val)
{
_Atomic __uint128_t *var = v;
__uint128_t expected;
__uint128_t desired;
expected = h4x0r_atomic_load(var);
do {
desired = expected + val;
} while (!h4x0r_atomic_cas(var, &expected, desired));
return expected;
}
- h4x0r_word_ring_drop() is about as simple as you’d want it to be:
static inline void
h4x0r_word_ring_drop(h4x0r_word_ring_t *ring, h4x0r_word_ring_item_t cell)
{
if (ring->drop_handler && cell.enqueued) {
(*ring->drop_handler)(ring, cell.item);
}
}
The enqueue operation
We’re already more than halfway done with our first ring. All we really need to do is get stuff into it.
Our first order of business when a thread calls our enqueue function is to load the existing epochs, grabbing its own ticket (epoch) in the process, via our fetch-and-add.
Second, we check the epochs to see if they need repair. That could be one of two scenarios:
- We see that the tail is lagging (meaning, the epochs indicate the tail is farther behind than the number of slots); this is due to a relative lack of dequeuers.
- We see that some dequeue operation accidentally took a read slot when the queue was empty, during a race condition (we do not want to write into a slot that no reader could ever read).
If the enqueuer sees either of these two scenarios, it will attempt to ‘fix’ the tail by moving the dequeue epoch to the lowest epoch that might still be enqueued. Here are my helper functions to deal with these scenarios:
#define H4X0R_BACKOFF_START_NS 1000
#define H4X0R_BACKOFF_MAX_NS 65536
static const struct timespec start_sleep = {
.tv_sec = 0,
.tv_nsec = H4X0R_BACKOFF_START_NS,
};
static inline bool
h4x0r_word_ring_needs_repair(h4x0r_word_ring_epochs_t epochs,
uint32_t ring_size)
{
if (epochs.epoch_d + ring_size < epochs.epoch_q) {
return true;
}
if (epochs.epoch_d > epochs.epoch_q) {
return true;
}
return false;
}
static inline void
h4x0r_ring_lag_sleep(struct timespec *sleep_time)
{
// We don't really care if we sleep the whole time or not.
nanosleep(sleep_time, NULL);
sleep_time->tv_nsec <<= 1;
if (sleep_time->tv_nsec > H4X0R_BACKOFF_MAX_NS) {
sleep_time->tv_nsec = H4X0R_BACKOFF_MAX_NS;
}
}
// Returns true if we ever go through the loop, indicating
// we may need to update our own epoch.
static inline bool
h4x0r_word_ring_repair(h4x0r_word_ring_epochs_t epochs,
h4x0r_word_ring_t *ring)
{
struct timespec sleep_time = start_sleep;
bool repair = false;
h4x0r_word_ring_epochs_t candidate;
while (h4x0r_word_ring_needs_repair(epochs, ring->num_cells)) {
repair = true;
if (epochs.epoch_d > epochs.epoch_q) {
candidate = (h4x0r_word_ring_epochs_t){
.epoch_q = epochs.epoch_q,
.epoch_d = epochs.epoch_q,
};
}
else {
candidate = (h4x0r_word_ring_epochs_t){
.epoch_q = epochs.epoch_q,
.epoch_d = epochs.epoch_q - ring->num_cells,
};
}
if (!h4x0r_atomic_cas(&ring->epochs, &epochs, candidate)) {
return true;
}
h4x0r_ring_lag_sleep(&sleep_time);
}
return repair;
}
If the enqueuers don’t do the tail-correction, it penalizes dequeuers who are already behind; they’ll pay the price of going back for tickets, only to find they’re out of date, which can exacerbate problems when they’re behind.
If we do see lag, after attempting to fix it, we should help even more by taking a really short snooze to go ahead and help any pending reader succeed. If we don’t, then we’re at more risk of dequeuers continually being forced to retry because writers are starving them. Here is one place in our algorithm where, if we want to go for full wait-freedom, we can turn this process into an exponential backoff loop.
Once the readers are an acceptable distance behind us, we then go to the cell we’re supposed to be writing to. There, we’re going to want to load the state to see what’s what.
- If we see our epoch is already in there, we were too slow, and some reader invalidated the slot; we need to grab a new epoch and try everything again.
- We do exactly the same thing if we see a HIGHER epoch than ours (writers lapped us, probably because we got suspended).
Next, we add the value, move the state to enqueued, and attempt to install the item via a CAS.
We keep trying until that succeeds.
Once our CAS is successful, then the write is successful. However, we still are not always done.
Specifically, we may need to look at what we overwrote (which should be installed in the expected field).
If we overwrote a queued item, then we need to pass that item to the drop handler. This is a particularly important thing to do when the item we found is a pointer to heap memory.
If we simply overwrite without checking, we might be leaking the memory the pointer references.
Of course, if we have no drop handler, we don’t need the extra step; there’s no problem.
We can just return, successful in our mission.
Here’s my implementation:
uint64_t
h4x0r_word_ring_enqueue(h4x0r_word_ring_t *ring, void *item)
{
h4x0r_word_ring_cell_t *cell;
h4x0r_word_ring_epochs_t epochs;
uint64_t write_epoch;
h4x0r_word_ring_item_t expected;
while (true) {
epochs = h4x0r_epochs_increment(&ring->epochs, write_incr);
write_epoch = epochs.epoch_q;
if (h4x0r_word_ring_repair(epochs, ring)) {
if (write_epoch + ring->num_cells < epochs.epoch_q) {
continue;
}
}
cell = h4x0r_word_ring_slot_addr(ring, write_epoch);
expected = h4x0r_atomic_load(cell);
// This has to be a CAS; we might have another writer who
// laps us between the last epoch check and the coming op.
h4x0r_word_ring_item_t new = {
.item = item,
.epoch = write_epoch,
.enqueued = true,
};
while (expected.epoch < write_epoch) {
if (h4x0r_atomic_cas(cell, &expected, new)) {
// If we overwrote something, it'll need to be dropped.
h4x0r_word_ring_drop(ring, expected);
return write_epoch;
}
}
continue; // too slow; get a new epoch and retry.
}
}
We’ve Inscribed some words on your ring
We have our first lock-free ring. But so far, it’s like we got the ring out of a cracker-jack box. It’s not yet something nice enough to use, since we’re limited to putting in 64-bit values.
However, we will 100% solve that problem.
That 💍 is too small, I want a BIG one
One thing about a ring that’s often valued is that you can operate in a fixed amount of statically allocated memory. Only giving ourselves 64 bits of space per entry will push us towards dynamic allocation, which is a shame.
But we can use our word ring to make a big ring with larger, fixed-sized memory cells.
The basic idea is that we have two circular buffers, one of them being the word ring (we’ll call it R). Then, we’ll create a second circular buffer to store our arbitrary-sized records. We’ll call this one S (for store).
The entries we enqueue into our word ring R will simply be indexes that point to spots in S.
Write threads will peek at the ring’s global epoch info, for two reasons:
- As a hint for where to start in the larger array, and
- So that we know which records conceptually aren’t in the ring anymore and can be overwritten.
Here are the data structures I used in my implementation, along with the state flags I use throughout.
typedef struct h4x0r_ring_t            h4x0r_ring_t;
typedef struct h4x0r_ring_entry_t      h4x0r_ring_entry_t;
typedef struct h4x0r_ring_entry_info_t h4x0r_ring_entry_info_t;

// This is the first item in an outer ring cell; it has to be defined
// before the cell itself, since the cell embeds it by value.
struct h4x0r_ring_entry_info_t {
    uint64_t write_epoch;
    uint64_t state;
};

// The full cell definition for the outer ring cells.
struct h4x0r_ring_entry_t {
    _Atomic h4x0r_ring_entry_info_t info;
    uint64_t                        len;
    char                            data[];
};
struct h4x0r_ring_t {
h4x0r_word_ring_t *word_ring;
h4x0r_ring_entry_t *entries;
_Atomic uint64_t entry_ix;
uint64_t last_entry;
uint64_t entry_len;
};
enum : uint64_t {
H4X0R_RING_EMPTY = 0x00,
H4X0R_RING_RESERVED = 0x01,
H4X0R_RING_DEQUEUE_RESERVE = 0x02,
H4X0R_RING_ENQUEUE_DONE = 0x04,
H4X0R_RING_USED = 0x07,
H4X0R_RING_DEQUEUE_DONE_MASK = ~0x06,
};
I’m going to skip initialization here, but here are two important notes if you’re going to DIY (a quick sketch of both checks follows the list):
- Double-check that the “core” word ring’s size (its number of cells) is a power of two.
- Ensure that cell sizes are properly aligned (probably to a 16-byte boundary to be safe).
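Here is a minimal sketch of those two checks, assuming a 16-byte alignment target; the helper names are mine, not the repo’s actual init code.

#include <stdbool.h>
#include <stdint.h>

// A power of two has exactly one bit set, so n & (n - 1) clears it.
static inline bool
ring_size_is_power_of_two(uint64_t num_cells)
{
    return num_cells != 0 && (num_cells & (num_cells - 1)) == 0;
}

// Round the per-entry payload length up to a 16-byte boundary.
static inline uint64_t
ring_align_entry_len(uint64_t requested_len)
{
    return (requested_len + 15) & ~(uint64_t)15;
}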
The enqueue operation
The write thread starts by grabbing the underlying word ring’s epoch information. It takes the write epoch it finds, and maps that into S (e % len(S), if e is the epoch).
Starting at that position, the writer scans S to find the first valid spot it can claim.
That means, if it notices an operation in progress, it skips that cell and keeps probing until it can claim a cell that’s safe to write to. Once the write completes, we enqueue the position into our word ring. Adding it to the word ring gives us the epoch; we add that into our slot inside S, before we remove the flag that indicates we’re writing to the cell.
As a result, entries in S can be out of order, and that’s totally fine. The correct order will be kept in the word ring.
More specifically, the steps for enqueuers (writers) are as follows:
1. Find a spot in S, and reserve it, so no other writer will touch it.
2. Copy data into the reserved cell.
3. Enqueue a pointer to S into R (our linearization point).
4. Update the slot in S with the epoch that R returned to us when we enqueued, and indicate we’re done with the enqueue.
To make this all work, we really should have S contain more entries than R. We want to have enough to ensure the right number of items can all be enqueued at once in a full queue, and that any write thread will still have a space to write. If we don’t do that, writers will be roaming around a full parking garage until a spot opens up and they’re lucky enough to nab it.
If we have a ceiling on the number of threads we allow, we can use that value. But, if not, just doubling the number of entries should be more than good enough. There will be no competition for the enqueuer’s slot from other enqueuers.
However, a dequeuer can come in after step 3 completes and before step 4 completes, while we’re suspended.
That’s not a problem for us– the linearization point is the enqueue into R. We just need to make sure that enqueuers can only claim a slot if BOTH the enqueuer and the dequeuer are done with their operations (and, of course, only if the slot is no longer conceptually in R).
Note that dequeuers will set a state bit when they start dequeuing, so we can easily avoid taking those slots. However, when figuring out whether we can overwrite a slot that no thread is working in, we also need to check the stored epoch, to make sure it’s far enough behind ours that it’s not conceptually in the queue anymore.
Here’s the core of the enqueue operation:
void
h4x0r_ring_enqueue(h4x0r_ring_t *self, void *item, uint64_t len)
{
uint64_t ix;
uint64_t byte_ix;
uint64_t start_epoch;
h4x0r_ring_entry_info_t expected;
h4x0r_ring_entry_info_t candidate;
h4x0r_ring_entry_t *cur;
h4x0r_word_ring_epochs_t epochs;
if (len > self->entry_len) {
len = self->entry_len;
}
epochs = h4x0r_atomic_load(&self->word_ring->epochs);
start_epoch = epochs.epoch_q;
ix = start_epoch & self->last_entry;
while (true) {
byte_ix = ix * (sizeof(h4x0r_ring_entry_t) + self->entry_len);
cur = (h4x0r_ring_entry_t *)&(((char *)self->entries)[byte_ix]);
        expected = h4x0r_atomic_load(&cur->info);
        candidate.write_epoch = 0;
        candidate.state = H4X0R_RING_RESERVED;
        // Only try to claim the cell once we know it's safe to reuse;
        // otherwise we could clobber a live entry.
        if (!h4x0r_ring_can_write_here(expected,
                                       start_epoch,
                                       self->last_entry)) {
            ix = (ix + 1) & self->last_entry;
            continue;
        }
        if (h4x0r_atomic_cas(&cur->info, &expected, candidate)) {
            break;
        }
        // The failed CAS refreshed 'expected'; give the cell one more
        // look before moving on to the next one.
        if (h4x0r_ring_can_write_here(expected,
                                      start_epoch,
                                      self->last_entry)
            && h4x0r_atomic_cas(&cur->info, &expected, candidate)) {
            break;
        }
        ix = (ix + 1) & self->last_entry;
}
memcpy(cur->data, item, len);
candidate.write_epoch = h4x0r_word_ring_enqueue(self->word_ring,
(void *)ix);
candidate.state = H4X0R_RING_ENQUEUE_DONE;
cur->len = len;
h4x0r_atomic_store(&cur->info, candidate);
}
The only new helper function here is h4x0r_ring_can_write_here() and its helper:
static inline bool
h4x0r_ring_entry_is_being_used(h4x0r_ring_entry_info_t info)
{
if (info.state & H4X0R_RING_USED) {
return true;
}
return false;
}
static inline bool
h4x0r_ring_can_write_here(h4x0r_ring_entry_info_t info,
uint64_t my_write_epoch,
uint32_t last_entry)
{
if (h4x0r_ring_entry_is_being_used(info)) {
return false;
}
if (info.write_epoch > my_write_epoch) {
return false;
}
if (info.write_epoch > (my_write_epoch - (last_entry + 1))) {
return false;
}
return true;
}
The dequeue operation
Dequeuers (readers) take the following steps:
1. Dequeue a value from R, which gives us the index into S we need; at the same time, we use our reference parameter to capture the epoch the writer used, to make sure we don’t read from the future.
2. Attempt to mark the cell in S for reading, and validate it. If validation fails, we restart.
3. Actually perform the read.
4. Mark the cell state in S to indicate the read is done.
Note that a slow dequeuer might find that, by the time it attempts to flag the cell in S for reading, someone has already claimed that cell for writing a newer item. In such cases, the slow dequeuer just needs to try again.
Or, the writer may not have stored its epoch yet. But since we got its index out of R, we know the writer got its ticket, and we can go find the right slot. If the stored epoch is anything less than the epoch we dequeued, it’s definitely our value to dequeue.
For the dequeue, we’ll use these two very simple helpers:
static inline bool
h4x0r_ring_can_dequeue_here(h4x0r_ring_entry_info_t info,
uint64_t expected_epoch)
{
if (info.write_epoch > expected_epoch) {
return false;
}
return true;
}
static inline uint64_t
h4x0r_ring_set_dequeue_done(uint64_t state)
{
return state & H4X0R_RING_DEQUEUE_DONE_MASK;
}
Our dequeue function is going to return whether there was a dequeue or not, instead of returning a value, and use a parameter for people to check if the queue was empty.
That’s because we’re going to need the caller to pass in a buffer for the result. We DEFINITELY don’t want to pass back a pointer to the ring entry; that’s a recipe for disaster.
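One plausible shape for that call, as a sketch only– the parameter names and the length out-parameter are my assumptions; the repo has the real prototype.

// Sketch: returns true if something was dequeued, false if the ring
// was empty; the caller supplies a buffer at least entry_len bytes
// long to receive a copy of the record.
bool
h4x0r_ring_dequeue(h4x0r_ring_t *ring,
                   void         *dst,      // caller-provided output buffer
                   uint64_t     *len_out); // bytes actually copied (if non-NULL)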
Testing our rings
Especially since we’re dealing with concurrency, we want to test thoroughly. We should run in a number of different configurations, and make sure that we only ever see linearized results when we dequeue.
We’re also going to want to count some stuff, then check it all for consistency:
- We should count successful dequeues.
- We should independently count the number of drops, too.
- We should capture the number of times fast readers find an empty queue.
Each thread should collect metrics privately, and then add them to totals, when it’s done with the work.
If we add independently collected dequeues to drops, we should get the total number of enqueues; otherwise we have a bug.
You’ll need to know when to stop trying to dequeue. The simplest path is to have the main thread first join() on all enqueuers; at that point, we know there’s nothing else to enqueue. So when a dequeue thread sees all writers are done, the first time it encounters an empty queue, it knows it’s done. That approach is easy to implement, but leaves a small window where the count of empty dequeues gets pushed up. You can instead have individual writer threads decrement a global counter as they finish, which will shorten that window.
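Here is a hedged sketch of that counter-based shutdown, on top of the word ring’s dequeue API from earlier; the counter and loop are mine, not the test harness’s actual code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic int64_t live_writers; // set to the number of enqueue threads at start

// Each enqueue thread calls this right before exiting.
static void
writer_done(void)
{
    atomic_fetch_sub(&live_writers, 1);
}

static void
drain(h4x0r_word_ring_t *ring)
{
    for (;;) {
        // Check "all writers done" *before* the dequeue attempt, so an
        // empty result can't race with an enqueue we haven't seen yet.
        bool  writers_done = (atomic_load(&live_writers) == 0);
        bool  found;
        void *item = h4x0r_word_ring_dequeue(ring, &found, NULL);

        if (found) {
            (void)item; // record the dequeue in thread-local stats here
            continue;
        }
        if (writers_done) {
            return; // empty, and nothing new can ever show up
        }
    }
}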
Also, even if it’s not real-world, we should test the worst case, running enqueues and dequeues as back-to-back as possible, to help us understand worst-case performance and any other considerations we might need.
I’ll spare you the code, because you can go get it in the codeberg repo. But, it does iterate over both types of ring, using five different sizes of buffer, and with a variety of (mostly imbalanced) readers and writers.
Let’s look at some example output though (I’ve deleted a few columns to keep us under 80 characters).
First, for the main, arbitrary ring, let’s look at our most minimal ring size:
Tests for queue with 16 items:
Test Th+ Th- Time(sec) Q- Q💧 Mop/s (✅)
------------------------------------------------------------------
# 1 1 1 0.0246 251,355 10,781 10.23
# 2 2 2 0.0505 235,606 3,705 5.11
# 3 4 4 0.1124 68,417 187,282 0.67
# 4 8 8 0.2179 199,100 34,775 1.04
# 5 2 1 0.0364 110,678 132,274 3.57
# 6 4 1 0.0624 40,678 210,810 0.82
# 7 8 1 0.0855 7,927 251,875 0.12
# 8 1 2 0.0271 176,210 85,934 6.51
# 9 1 4 0.0472 183,065 79,079 3.88
#10 1 8 0.0930 183,732 78,412 1.97
Here, Th+ is the number of enqueue threads, and Th- is the number of dequeue threads.
Q💧 is the number of drops. Then, the last column denotes how many millions of ops per second we performed in that test case.
The three columns we omitted that you’ll see in the output:
- Q+ is the total number of enqueues, which is always 262,144 in my runs.
- Q∅ is the number of times a dequeuer attempted to dequeue, and found the queue was empty.
- Mop/s (✅+∅) recomputes Mop/sec, including dequeues that find the queue empty.
One thing that should jump out to you is that there are an absurd number of drops in there. And when we get those absurd drops, our overall performance tends to plummet. The table makes it clear that we need to do more to deal with the contention.
That’s the kind of concern you should be looking for when testing parallel algorithms.
The insight makes sense; if the queue is full, the writers help with the tail, but then go right back to competing with the readers where the head and tail meet.
The conclusion I drew is that, before the queue fills, enqueuers should briefly pause to give preference to readers, so as not to contend with them. In the code repo for this article, I did just that for you, setting the threshold to 75%, which gives much better results:
Tests for queue with 16 items:
Test Th+ Th- Time(sec) Q- Q💧 Mop/s (✅)
--------------------------------------------------------------
# 1 1 1 0.0248 262,142 2 10.55
# 2 2 2 0.0450 260,071 43 5.82
# 3 4 4 0.0923 236,809 434 2.83
# 4 8 8 0.2132 208,501 177 1.23
# 5 2 1 0.0400 256,786 32 6.55
# 6 4 1 0.0479 253,364 20 5.47
# 7 8 1 0.0989 219,220 53 2.65
# 8 1 2 0.0384 262,101 43 6.83
# 9 1 4 0.0489 262,084 60 5.36
#10 1 8 0.0868 261,814 330 3.02
If we run more tests, we may see some significant drops, but we’d expect big numbers only when the number of writers greatly outweighs the number of readers. And in that case, the drops are expected. On my machine, this only ever happens for 16-item queues though. Even at 128 items, it looks good (on my machine):
Tests for queue with 128 items:
Test Th+ Th- Time(sec) Q- Q💧 Mop/s (✅)
------------------------------------------------------------------
#11 1 1 0.0147 262,145 0 17.82
#12 2 2 0.0428 262,145 0 6.13
#13 4 4 0.0765 262,144 1 3.42
#14 8 8 0.1879 262,140 5 1.39
#15 2 1 0.0263 262,145 0 9.97
#16 4 1 0.0517 262,145 0 5.07
#17 8 1 0.0936 262,145 0 2.80
#18 1 2 0.0251 262,144 1 10.43
#19 1 4 0.0417 262,142 3 6.28
#20 1 8 0.0859 262,145 0 3.05
If we study these charts, we can compare to see exactly how big an impact that contention actually has. In tests where we were dropping, the number of operations plummeted massively due to the contention.
With the above tweak to the enqueuer’s help rules, my laptop tests never fail to top 1 million operations a second, and raw word-queue performance peaks at about 24 Mop/sec, staying above 3 Mop/sec for all but a couple of configurations (the ones where I’ve overcommitted my cores with 8q+/8q-, so past the point where I’ve maxed out potential parallelism).
Not bad, considering we haven’t really done anything to optimize (I do have a more optimized implementation of the base word-ring algorithm that can get as high as 40 Mop/sec with one enqueuer and one dequeuer, and bottoms out at 4 Mop/sec when the number of dequeuers is lopsided, all on the same machine).
My test bed is there in the repo for you to use if you like.
Did we forget to order something?
When I wrote about futexes and mutexes, I glossed over the topic of memory ordering, because it’s notoriously hard to communicate well. But given we’re into the world of concurrency without locks, I think it’s remiss not to cover it. I’ll try to make it as straightforward as possible.
If you are still confused after reading this section (which is likely), give me feedback on what questions it leaves you with, and I’ll try again.
Your compiler may be gaslighting you
Your compiler wants you to think your code will execute in the order you’d expect while staring at the source code.
Meanwhile, behind your back, it’s generally going way out of its way to thwart your expectations. Though, to be fair, it’s doing it for you. It knows how disappointed you’d be if it performs poorly for you.
So yes, the compiler will absolutely rewrite your code (particularly when you turn on any level of optimization). It has no qualms changing the order you’d expect, with hopes of it running faster for you. But, it’s hoping you won’t notice. The transformations often aren’t semantically identical to what you might have intended, but, at least in the context of a single thread, you’re not likely to notice the difference.
Even for multi-threaded programs, compilers often try hard to optimize what you’re doing, transforming and reordering to take advantage of the CPU.
Still, there are things the compiler won’t do (just like Meatloaf). Specifically, compilers have this idea of “barriers”, which are features in your code that the compiler won’t move stuff past.
For instance, compilers will not move code across the boundary of a mutex, or any op on a variable labeled _Atomic (unless you explicitly allow it at the call site).
The compiler is conservative here, because you expressed your intent, even if the compiler thinks you’re walking away from a big performance gain, and even if it could “prove” that the change wouldn’t alter the semantics of your program. Generally, function boundaries result in a barrier as well, as long as they’re not inlined.
But, even with multi-threaded programs, the compiler does all that analysis as if a single thread were running. So it can move things around in ways that are frustrating across threads. If you don’t pay attention to how you handle concurrency, compiler transformations can definitely make race conditions far worse.
The processor is a bad influence
Making matters worse, your compiler’s friend, the processor, tries to apply parallelism everywhere it can, because of performance. So there are also plenty of architectural considerations that can lead to data races.
For instance, you probably know that the CPU is fast, and memory accesses are slow. And when many threads are competing to load memory at the same time, things can get chaotic, because memory cells loaded into registers on one core don’t magically sync instantly across multiple cores.
And, the processor may do its own reordering of instructions, for instance, to achieve fine-grained parallelism via things like instruction pipelining, which can definitely run your code out of order.
Processors not only have a very complex memory model, but that model can be vastly different across architectures (e.g., x86 and ARM).
Very few programmers are going to be aware of most of that subtlety.
Languages could generate code to make sure everything always happens in a well-defined order (full sequential consistency), but generally they do not. Processors go out of their way to make things faster by moving your code around, and compilers often do a lot to make your code faster too.
So no language is going to find it an acceptable hit to force the processor to run everything in a well-defined order, at least in the case of multiple threads.
How to make this relationship work
Programmers typically will need to tell their compiler where to get conservative, and sacrifice performance for correctness. In C, you can do that by labeling variables as _Atomic, which tells the compiler that those variables will be used across threads.
But _Atomic doesn’t truly mean, “always atomic”, primarily because you can explicitly ask for different behavior on a case-by-case basis.
If you don’t specify anything other than _Atomic, it does mean that you’re not going to get corrupting data races, and it does mean that, by default, the compiler will ensure all operations on that variable will happen in the same order, from the perspective of any thread.
Enforcing that kind of order generally slows things down, so, you can specify to the compiler cases where you want different behavior. For instance, if you’re initializing memory, you probably have exclusive access to the data. Your own view on the ordering is consistent anyway, so at this point you may not care to slow down the processor.
However, that could be problematic for the next thread to read. Since the first access didn’t care, it’s definitely possible for a thread to get a reference to the object, and see the pre-initialization value, but only if that second access explicitly relaxes its requirement for getting access to the variable.
Generally, those kinds of surprises are easy to run into, especially when they involve fields in an object whose pointer you read atomically, but whose fields are not marked as being _Atomic. It’s a great recipe for Heisenbugs.
So generally, if you know multiple threads will handle a variable, not only should you go ahead and declare it as _Atomic, but you should also avoid giving up most of its protection– you’re just inviting disaster.
My memory order arrived damaged
By default, accessing a single _Atomic variable will make sure that any changes to the variable happen in a well-defined (linearized) order. For example, let’s say we have a huge number of threads, and thread 0 goes to modify an _Atomic variable A.
Just by declaring the variable to be _Atomic, the compiler will, if necessary, generate code that makes sure that any changes to A that were in flight when thread 0 goes to modify it all end before its operation. Similarly, any subsequent loads of that variable will see the reflected value.
_Atomic variable access (unless relaxed) acts like a barrier that the compiler will enforce for the variable in question. That enforcement, though, is really done by generating assembly that provides the proper semantics, which is very platform dependent.
If you do not mark a variable as _Atomic, then you should not be using the variable across threads. The compiler will certainly generate code under the assumption that those variables won’t have to deal with concurrent access.
But we do have some options:
- Relaxed Memory Order. In C/C++ parlance, this is the weakest thing you can ask for on an _Atomic variable. It’s a weird name that means there are no ordering guarantees beyond what the underlying architecture would promise for a non-atomic variable, plus the promise of non-corrupting data races. That’s scary.
- Sequentially Consistent Ordering. This lives on the other end of the spectrum from relaxed ordering. Using this is supposed to ensure a global ordering for all atomic operations by default, at the price of some efficiency. This is the default memory ordering, and is the strongest, but it does have some slight issues we’ll discuss in a minute.
- Acquire / Release Ordering. This is in between relaxed and sequentially consistent. It basically does what you want it to do on a variable-by-variable basis, forcing a well-defined ordering.
You’ll often see “Acquire” and “Release” listed as separate strategies. They’re more like two sides of the same coin:
- Acquire semantics apply only to loading an _Atomic variable (i.e., acquiring it). The guarantee is that any modifications to that memory address that were made by other threads will be reflected by the time the value loads, with nothing still in progress.
- Release semantics apply only to storing an _Atomic value (i.e., releasing it). The guarantee here is that the store operation will be reflected in the next subsequent load of that address. That is to say, once the store finishes, no other thread will be able to load the previous value.
For some operations, only one of these two things makes sense. For instance, acquire semantics make sense for an atomic_load() call, but that doesn’t update the value, so release semantics don’t make sense here (and thus, acquire/release doesn’t make sense either).
You don’t specify memory ordering when you declare something _Atomic. By default, every operation on such variables will get the strongest memory ordering.
If you want anything else, you can ask for it on a call-by-call basis, every time the variable is used, via the calls in the C atomic library that let you explicitly pick another ordering, but only for that one operation.
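For instance, here is the classic acquire/release pairing, spelled with the per-call _explicit variants; this is a generic sketch, not code from our ring.

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool ready;
static int          payload; // plain variable, published via 'ready'

static void
publisher(void)
{
    payload = 42;
    // Release: everything we wrote above is visible to any thread that
    // later acquire-loads 'ready' and sees true.
    atomic_store_explicit(&ready, true, memory_order_release);
}

static int
consumer(void)
{
    // Acquire: once we observe true, the write to 'payload' is visible.
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        ; // spin; fine for a sketch
    }
    return payload;
}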
Generally, the defaults are going to be least surprising (in a world where surprises are common, and understanding what’s going on is hard).
That doesn’t mean the strictest memory ordering is perfect: sequential consistency has some issues that prevent it from living up to its name, particularly when you end up mixing accesses with different memory orderings (see section 2 of this paper for more details).
Plus, sequential consistency generally isn’t much stronger than acquire / release semantics, and it can be slower, depending on the architecture. So it’s pretty reasonable to use acquire/release semantics as the default.
But this stuff is trickier than we might think.
For instance, you may have noticed that I declared the item field inside the struct h4x0r_word_ring_item_t to be _Atomic.
If you remove that qualifier, on some platforms our tests will occasionally fail. Why? How is that possible??
Sure, we dequeue an entire h4x0r_word_ring_item_t atomically. But, we store the result of that dequeue into a second memory location, one that isn’t itself marked to be atomic. In our case, in our word ring dequeue algorithm, that’s going to be the variable called last.
So, when we return last.item, we might end up returning the value that was stored in that field before the CAS operation.
Since we also return the associated epoch, we could possibly get an old value there, too. However, since they are only unloaded into the last field together (atomically), we can be pretty confident that, if the item is available, then the epoch will be too.
Still, C doesn’t guarantee that; it’s just a matter of having some knowledge of what’s going on under the hood (and experience to back it up).
But, if you wanted to be really careful, you would want to tag the epoch field as _Atomic too. If you try, though, you’ll get an error, because _Atomic doesn’t work with bit slices directly. You’d have to do something else. Some options are:
- Use a union, with one of the types in the union being an _Atomic uint64_t.
- Declare the thing as a uint64_t, and manage the flag bit manually (but you do need the flag to be a bit).
- Leave it unsynced, and tell the compiler to sync all relevant variables before pulling the epoch out of the item.
This third option you can do with a memory fence, putting it before the assignment in h4x0r_word_ring_found. In our associated code, h4x0r_atomic_fence() will do the trick; this is a very trivial wrapper around a C atomic builtin.
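If you’re curious, such a wrapper can be as small as this; it’s a sketch, and the repo may pick a different ordering, but atomic_thread_fence() is the builtin in question.

#include <stdatomic.h>

static inline void
h4x0r_atomic_fence(void)
{
    // A full fence is the conservative choice for a sketch like this.
    atomic_thread_fence(memory_order_seq_cst);
}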
By the way, if we tag item as being _Atomic, but specify relaxed memory ordering when we copy it out, we could absolutely end up with the same problem, because _Atomic is only atomic until you explicitly tell it otherwise.
Yes, there are a lot of subtleties. Nobody said it would be easy.
Good luck, you’re going to need it.
– Lee T. Solo (tho with a ring on it)