Let's hax0r a GC… (eventually)

See Look Ma, h4x0rs! for an introduction to this series.
AKA, we h4x0r a mere (parallel) arena allocator
There’s absolutely no reason for memory management in C to be as hard as it is. In fact, there’s almost never a good reason for memory management to be hard, which is one of my issues with Rust: why should I work even harder on memory management than I would in C?
Well sure, there’s safety. But if that’s so important, there are a million other languages that don’t make it hard to allocate, and still manage to be safe.
For some of us, maybe safety’s overrated. Living on the edge is much more interesting! More realistically, dealing with C in all its abject horrors is actually one of the best possible ways to gain a deep understanding of what’s going on. Every programming ecosystem, no matter how unsafe, is built on top of a base that is absolutely not going to make safety guarantees. Clearly, we need more people learning how to build their own safety scaffolding.
So by all means, if you want to feel safe, have your parents overnight your “blankie”, and stick to safe languages only. But still, don’t torture yourself with working hard on memory management either, because that’s just tedious, and wholly unnecessary.
Wait, what’s that I hear? “GC is bad?” Are you sure? Probably, you think GC is a bad thing for systems code, because “everybody knows it’s bad”. “Long pause times scanning the whole heap, or something?”
You took the wrong pill, Neo.
We’re going to give you a shot at redemption. Jump down the rabbit hole, or at least, bring yourself to the edge, peer in, and tell yourself you might learn something, if you play along at home!
Over the next few articles, we’re going to build out a full garbage collector, so that you can alloc, and never worry about the deallocation again.
You’ll learn that we don’t have to try very hard to get something that performs about as well as the malloc()
implementation that comes with your standard library. And, you’ll learn we absolutely don’t have to scan much of the heap when it’s time to collect (we’ll only need to touch live data). And if you come on the entire journey here, you’ll find we’ve got some other unexpected benefits for you… yet to be revealed.
On the path to full garbage collection, we’ll start by building an arena allocator, which by itself is far easier than the traditional C interface to memory management. The basic idea is to allocate a huge pool of memory that you can then use for smaller allocations. Instead of worrying about freeing the small individual allocations, you just delete the whole arena at once.
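To make that concrete, here’s a quick sketch of what using the finished API from this article looks like. Everything shown gets built step by step below; the widget type is just for illustration.

#include "h4x0r_alloc.h" // the header we'll build over the course of this article

typedef struct {
    int   id;
    char *name;
} widget_t; // example payload type, nothing special

void
build_widgets(void)
{
    // Ask for an arena with at least 1 MB available for allocations.
    h4x0r_arena_t *arena = h4x0r_new_arena(1 << 20);

    // Allocate freely; memory comes back zeroed, and there's no free().
    widget_t *w = h4x0r_arena_alloc(arena, sizeof(widget_t));

    w->id   = 42;
    w->name = h4x0r_arena_alloc(arena, 64);

    // One call tears down everything we allocated, all at once.
    h4x0r_arena_delete(arena);
}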
If you don’t want to see where the rabbit hole leads, you can just stop there, make your systems programming task easier, and go back to dumping on garbage collection. You can pretend the part where I said your opinions were wrong was just a dream.
Enter the arena (Ready?? Fight!!)
We will not build our allocator on top of malloc()
, or any other third-party allocator. Real h4x0rs know that by writing their own allocator from the ground up, they become experts at how allocators work. Plus, stacking allocation management is extra cost for no value.
So, like real h4x0rs are required to do (it’s part of the moral code), we are going to go directly to the OS and use mmap()
to request memory for our arena.
One of the beautiful things about modern operating systems is the zero-mapped memory page. That means, the way we will make our call to mmap()
it will return a pointer to pages of memory, where each of those pages will start out mapped to a single page of memory that is all zeroes. Modern operating systems take a ‘copy-on-write’ approach, which means that once we start writing to any given memory page, that’s when we’ll really get our own page.
This has benefits:
- We will start out with memory that is already zeroed out, which is always nice, since we can confidently omit default zero-valued initializers (and, let’s face it, people are going to forget them anyway).
- We can, if we like, confidently ask for a large arena, even if we’re not sure we’re going to need all that memory. The pages we don’t touch don’t make it to resident memory. You only need to worry about the maximum amount of memory you’re using at once. And, as it turns out, when we get to Garbage Collection, the higher the percentage of the arena that is garbage, the better our overall performance is going to be.
We’ll want to be wary of some of our biggest issues:
- We’ll want to make sure to handle things gracefully if we asked for too small of an arena, and that arena ends up running out of space.
- We need to be able to handle multiple threads allocating from the same arena all at once. And when multiple threads realize the arena needs to grow, we need to handle the synchronization properly.
- We need to make sure the allocations we return can deal with underlying hardware requirements for memory alignment.
To deal with these initial issues, when building our initial arena implementation, we’re going to:
- Keep a linked list of “segments” in our arena, in case we need to expand. For us, an arena will consist of one or more segments, which will all be freed at the same time (when doing GC, we will collect across all segments of a single arena).
- Keep track of the starts and ends of allocations, along with the location information on where we allocated from, so we can do some correctness checking.
- Keep memory aligned, by over-allocating when a request isn’t properly aligned.
- Use a simple approach for handing out allocations across competing threads that doesn’t require locks (as long as the heap has sufficient memory).
- Use a spin-lock for situations where the arena needs to grow.
To be clear, the way we’re allowing arenas to grow, an arena may not be a single contiguous range of memory. There’s no huge downside to that, and that, plus the use of zero-mapped memory, takes the stress off you when it comes to sizing any given arena properly.
We can do all of the above, and then some in about 300 lines of code, without bending over backwards.
The complete code for this article can be found at: http://www.github.com/h4x0r-org/arena/.
The arena data type
Here’s our basic data type setup for the arena, and the segments that make it up:
// All of this code should be in a header file, say, h4x0r_alloc.h
#include <stdbool.h> // `bool` is a keyword in C23; this keeps older compilers happy.
#include <stdint.h>
#include <stdlib.h> // Not used in this snippet, but we'll need it.

typedef struct h4x0r_segment_t h4x0r_segment_t;

struct h4x0r_segment_t {
    h4x0r_segment_t *next_segment;
    int64_t          size;
    char             mem[];
};

typedef struct h4x0r_arena_t {
    char            *segment_end;
    _Atomic(char *)  next_alloc;
    h4x0r_segment_t *current_segment;
    _Atomic int32_t  mutex;
    bool             heap_allocated;
} h4x0r_arena_t;
In our arena data type, the segment_end
field keeps track of the end of the current segment we’re using to serve allocations. That address isn’t itself valid memory inside the segment; the last writable byte is the one just below segment_end.
The next_alloc
field will keep track of where the allocator is in the segment. Any address inside the segment below the value of next_alloc
has been handed out. When users request memory, we’ll need to make sure that there’s enough space between next_alloc
and segment_end
for their request, our accounting, and whatever padding we need due to alignment restrictions.
When there’s not enough room to serve an allocation, we will add a new segment before we service more allocations, using the mutex
field to coordinate adding the next segment. The mutex is only used for adding segments, so that we can safely update multiple fields at once. For the allocations themselves, we’ll have a lock-free protocol for updating the next_alloc field.
You might be looking at this struct
declaration, and wondering why those two fields are typed char *
instead of void *
. You might feel the latter is more appropriate, since these pointers are pointing to a big wad of memory with no specific type. And, if you were that observant, you probably also noticed that in the segment data type, we used an unsized (flexible) array of type char
for the mem
field.
The reason we do that? Because C is gross.
Specifically, these things are all declared as char *
because we’re going to doing math (and then comparisons) on those pointers, and using a char *
makes the math more natural. For instance, let’s say that the current value of the next_alloc
starts at address 0xff00000000000100
.
Even some people who have written a lot of C can be surprised to find that the result of next_alloc + 16
is dependent on the type we used to declare next_alloc
. If we used a char *
, we’ll get 0xff00000000000110
, which is pretty natural– if you convert the hex to an integer, the new memory address is 16 higher than the old one.
But, if next_alloc were instead declared as a pointer to a word-sized type, say int64_t *, adding 16 would give us 0xff00000000000180, a difference of 128. That’s because, for better or for worse, pointer math assumes you’re doing array access. So adding 16 to next_alloc in C is the same as taking the address of next_alloc[16], and with 8-byte elements on a 64-bit machine, 8 * 16 == 128. (A void * doesn’t help here either: standard C doesn’t allow arithmetic on void * at all, and the compilers that accept it as an extension just treat it as byte math anyway.)
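If you want to see this for yourself, here’s a tiny standalone demo (not allocator code) comparing byte math against word math on the same buffer:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    char buf[256];

    char    *as_bytes = buf;
    int64_t *as_words = (int64_t *)buf;

    // Byte math: advances 16 bytes.
    printf("char *:    +%td bytes\n", (as_bytes + 16) - buf);

    // Word math: advances 16 elements of 8 bytes each, i.e. 128 bytes.
    printf("int64_t *: +%td bytes\n", (char *)(as_words + 16) - buf);

    return 0;
}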
If you are super judgey about using char *
here, but you want unsurprising math, you can cast to unsigned integers. But using integer types obfuscates the fact that we’re dealing with a memory address. And, we’d need to be careful to pick an integer type that works on any platform, which required a bunch of tomfoolery before C23. If you’re willing to require C23 (and we’ll generally make use of C23 here), you can actually use:
typedef unsigned _BitInt(sizeof(void *) * 8) h4x0r_word_t;
There’s no “right answer” here. No matter what approach we pick, we’d expect to hold our nose and do some casting. But, because the byte-based math is more intuitive for most people, we’ll stick with casting our hearts out. Yup, it really is gross, but I think it’s the least gross option, and it’s the accepted idiom in C, for better or worse.
Memory Alignment
We are going to be careful about memory alignment. Meaning, we will only allow allocations to start at specific memory addresses, where those memory addresses are a multiple of our alignment value. For instance, if we were to align to two bytes, we’d make sure addresses we hand out are even.
Hardware platforms will often optimize for particular alignments. Most importantly, pointers are almost universally expected to be word-aligned (where these days a word is typically 8 bytes). On some platforms (x86, notably), unaligned accesses are allowed, but can be significantly slower. On others (e.g., older ARM systems), unaligned accesses will generally result in a crash.
To deal with alignment, compilers tend to lay out data structures with internal padding if necessary, so that each data type is aligned to whatever size the underlying hardware architecture is going to expect.
And, an allocator needs to not muck things up, so it needs to be pretty conservative about alignment.
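As a quick standalone illustration of that padding (nothing allocator-specific here), the compiler quietly inserts bytes so each field lands where the hardware wants it:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    char    tag;   // 1 byte...
                   // ...then 7 bytes of invisible padding, so that:
    int64_t value; // ...this lands on an 8-byte boundary.
} padded_t;

int
main(void)
{
    // Prints 16, not 9: the compiler padded the struct to keep
    // `value` aligned, and the struct's own alignment is 8.
    printf("sizeof:  %zu\n", sizeof(padded_t));
    printf("alignof: %zu\n", _Alignof(padded_t));
    return 0;
}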
Particularly, we will align to 16 bytes, which matches what malloc() guarantees on typical 64-bit systems (alignment suitable for any standard type). That means we only want to hand out addresses where the lowest four bits are all 0.
Ensuring alignment is only going to require basic math. We’ll build ourselves a simple primitive that allows us to round up to the nearest value that happens to be a multiple of a specific power-of-two (that specific power-of-two in this case being 16, the one we need to use for alignment).
When rounding to a power of two, we can do what’s called “strength reduction”, replacing the ‘more expensive’ modulus operator with a binary AND operator.
Numbers that are powers-of-two have only a single bit set. But if you subtract one from that value, you’ll always get a number where the bits to the right of that one bit are set to 1 (the rest of the bits are always zero).
If we need to get the remainder when dividing by a power of two, we can simply use a binary AND with the power minus one; it gives the same result as a modulus by the power, removing all the bits at or above our power of two, and preserving all the bits below.
AND, if we instead flip all the bits of that power-minus-one value before our AND operation, we round down instead of taking the remainder.
If we want to avoid a conditional in critical code, we can skip testing whether the value is already aligned: always add one less than the power of two, then round down.
For instance, when aligning by 16 (0x10
), if we enter the value 0
we get (0 + 0x0f) & ~0x0f
which is 0. If we enter the value 15, we get (0x0f + 0x0f) & ~0x0f
, which is 0x1e & ~0x0f
, which gives us 16.
This makes our power-of-two rounding algorithm:
// All of this code can go in the same header file as above.

#define H4X0R_ALIGN 16

// We will only call this where the first argument is a power of 2.
// It will not work if it isn't!
//
// If we want to make this round up in the general case, it should
// be reasonable to replace the below with:
//
// ((to_round + power - 1) / power) * power
//
// Honestly, that's not going to be significantly slower, particularly
// because your compiler will most likely tend to strength reduce itself
// given this is inlined, and, in the hot path, the first argument is
// probably always going to be a compile-time constant.
//
// And even if it doesn't, even w/ frequent allocations, it's kinda
// going to be in the noise.
static inline uint64_t
h4x0r_pow2_round_up(uint64_t power, uint64_t to_round)
{
    uint64_t modulus = power - 1;

    return (to_round + modulus) & ~modulus;
}
// An inlined helper to deal with the casting transparently.
// This is basically a no-op; it just keeps the casting in one place.
static inline void *
h4x0r_alloc_align(void *addr)
{
    return (void *)h4x0r_pow2_round_up(H4X0R_ALIGN, (uint64_t)addr);
}
Page alignment
We also will need to know the page size, because when we request memory from the operating system via mmap()
, we always get whole pages of memory back, so we’ll keep our requests to multiples of the page size (the kernel is going to round our mapping up to a page boundary regardless). And it just so happens, pages are always aligned to whatever the operating system’s page size is.
We should dynamically query the page size from the operating system, since on some OSes it is configurable. But then we can be confident that page size isn’t going to change across calls to mmap()
— we only need to do it once. Instead of dealing with a once
function at this point (they’re a pain in the neck for us in pthreads), we will just keep an atomic global variable and only try to set it when we create new arenas.
extern size_t h4x0r_get_page_size(void);

#if defined(__GNUC__) && !defined(__clang__)
#define _GNU_SOURCE // necessary on Linux for `getpagesize()`; must appear before any #include
#endif
// Above here, probably stick it in your header file.

#include <stdatomic.h>
#include <unistd.h>

static _Atomic size_t h4x0r_page_size = 0;

size_t
h4x0r_get_page_size(void)
{
    size_t result = atomic_load(&h4x0r_page_size);

    // Write contention here doesn't really matter;
    // multiple threads will write out the same value.
    if (!result) {
        result = getpagesize();
        atomic_store(&h4x0r_page_size, result);
    }

    return result;
}
Creating an Arena
Now, we have the tools we need to be able to allocate and initialize a new arena. We’ll have our allocator take a desired arena size; we’ll make sure the user has at least that much memory available in the first segment by adding in the initial arena overhead, and page-aligning the result.
extern h4x0r_arena_t *h4x0r_new_arena(uint64_t);
extern void           h4x0r_initialize_arena(h4x0r_arena_t *, uint64_t);

#include <sys/mman.h>

#define H4X0R_MPROT (PROT_READ | PROT_WRITE)
#define H4X0R_MFLAG (MAP_PRIVATE | MAP_ANON)

// The below in a .c file...
static void h4x0r_add_arena_segment(h4x0r_arena_t *arena, uint64_t request_len);

h4x0r_arena_t *
h4x0r_new_arena(uint64_t user_size)
{
    uint64_t hdr_size = h4x0r_pow2_round_up(h4x0r_get_page_size(),
                                            sizeof(h4x0r_arena_t));

    h4x0r_arena_t *result = mmap(NULL,
                                 hdr_size,
                                 H4X0R_MPROT,
                                 H4X0R_MFLAG,
                                 -1,
                                 0);
    if (result == MAP_FAILED) {
        abort();
    }

    h4x0r_initialize_arena(result, user_size);
    result->heap_allocated = true;

    return result;
}

void
h4x0r_initialize_arena(h4x0r_arena_t *arena, uint64_t size)
{
    atomic_store(&arena->next_alloc, NULL);
    atomic_store(&arena->mutex, 0);

    arena->segment_end     = NULL;
    arena->current_segment = NULL;
    arena->heap_allocated  = false;

    h4x0r_add_arena_segment(arena, size);
}
Now, let’s deal with what we do when we need to add a segment to the arena.
We’ve got a value in our data structure named mutex
(which, if you don’t recognize as a mutual exclusion primitive, means you’ve been lucky enough to avoid any exposure to parallel programming; can we trade lives?). We need a mutex here, because, when multiple threads share an allocator, it’s very likely that more than one thread is going to notice running out of room at about the same time. We need them to cooperate; we don’t need extra segments.
The easiest approach is for each thread that notices it doesn’t have enough space to acquire the lock. And once any given thread acquires the lock, it should re-check to see if the segment has been updated by someone else. The most straightforward way to do that is to re-check whether there’s now enough space for our allocation.
To keep our life simple, we’re always going to make any new segment the same size as the previous segment, unless doing so wouldn’t leave us with enough memory to service the thread adding the new segment. In that case, we’ll take whatever the needed amount is, and align it to the next highest page boundary.
static void
h4x0r_add_arena_segment(h4x0r_arena_t *arena, uint64_t request_len)
{
    h4x0r_segment_t *segment;

    // Spin lock.
    while (atomic_fetch_or(&arena->mutex, 1))
        /* No body */;

    // Check to see if someone else added a segment. If so,
    // unlock and return.
    char *next = atomic_load(&arena->next_alloc);

    if (next && next + request_len < arena->segment_end) {
        atomic_store(&arena->mutex, 0);
        return;
    }

    // Give ourselves at least a page of overhead. Use the accessor so the
    // page size is definitely initialized, even if the arena was set up
    // via h4x0r_initialize_arena() directly.
    uint64_t needed = request_len + h4x0r_get_page_size()
                    + sizeof(h4x0r_segment_t);
    uint64_t size   = 0;

    if (arena->current_segment) {
        size = arena->current_segment->size;
    }
    if (size < needed) {
        size = needed;
    }

    size = h4x0r_pow2_round_up(h4x0r_get_page_size(), size);

    // Here, we're actually getting pages from the OS. We'd expect the
    // map to fail only if we're out of memory.
    segment = mmap(NULL, size, H4X0R_MPROT, H4X0R_MFLAG, -1, 0);

    if (segment == MAP_FAILED) {
        abort(); // out of memory.
    }

    segment->size         = size;
    segment->next_segment = arena->current_segment;

    arena->next_alloc      = h4x0r_alloc_align(segment->mem);
    arena->segment_end     = ((char *)segment) + size;
    arena->current_segment = segment;

    atomic_store(&arena->mutex, 0);
}
It would be perfectly reasonable to use a “proper” mutex instead of the spin lock here (the difference being, if there’s a lot of contention, a “proper” mutex can proactively suspend threads; with a spin lock, they eat CPU until they acquire the lock). Before too long, we’ll implement our own concurrency primitives, and then favor those.
Here, the critical section is small, and even when many threads are trying to allocate from the same heap at once, the contention goes away quickly once the lock is released. And, most mutex implementations will start with a spin lock before suspending threads anyway. Which approach is better for this case is unclear enough that I’d want to measure pretty extensively before having a real opinion. That means either approach is fine, unless someday you see profiling data that indicates you should do something different here.
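For the record, here’s roughly what the mutex flavor would look like. This is just a sketch: it assumes you add a pthread_mutex_t field (called lock here) to h4x0r_arena_t and initialize it with pthread_mutex_init() during arena setup.

#include <pthread.h>

// Hypothetical variant of h4x0r_add_arena_segment() using a pthread mutex
// instead of our atomic spin lock.
static void
h4x0r_add_arena_segment_mtx(h4x0r_arena_t *arena, uint64_t request_len)
{
    pthread_mutex_lock(&arena->lock); // assumed extra field; see above

    // The re-check for space, the mmap(), and the bookkeeping from
    // h4x0r_add_arena_segment() would go here, unchanged.

    pthread_mutex_unlock(&arena->lock);
}

We’ll stick with the spin lock in this series, since we’ll eventually replace it with our own primitives anyway.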
Allocations
We’ve got all of our arena scaffolding done, so we’re almost at the finish line. While we could have our allocation just hand out raw memory and be done, we’re going to go ahead and do accounting work, saving a bit of metadata for each allocation. We’re doing this for two reasons:
- Having the information is valuable when you need to debug.
- We’re going to need to keep some per-allocation information when we extend our code to build a full garbage collector.
Like most memory allocators, we’re going to take the easy path and put this metadata in the heap we’re allocating from. This isn’t the safest path, though. Keeping the metadata outside the heap would be a whole lot safer, but it does make the accounting work we need to do for garbage collection significantly harder, so we’ll leave it as an exercise to the reader (meaning, we’re not going to deal with this any time soon).
But, we’ll do ourselves a solid, by adding a random 64-bit ‘guard’ value at the very front of all allocations. This will serve two purposes:
- It’ll allow us to do some simple checking for overflows.
- It’ll give us an easy way to take a given heap address and find the beginning of its allocation, without a complicated data structure.
Additionally, let’s go ahead and add to our accounting information:
- The total length of the allocation. As implemented, the start address of the allocation, plus the length field will give us the location of the next guard word (useful for scanning a segment start-to-finish for overflows).
- The file and line number associated with the allocation. This is, of course, optional. However, it’s good information to have around for debugging, and keeping them doesn’t cost significant CPU, since all those values are statically known at compile time.
We’ll order these things so that the 64-bit items are next to each other, to make sure to avoid unnecessary padding of the data structure.
// This is also destined for our header file.
typedef struct h4x0r_alloc_info_t {
    uint64_t guard;
    char    *file_name;
    uint32_t alloc_len;
    uint32_t line_number;
} h4x0r_alloc_info_t;
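As a teaser for that second purpose, here’s a hypothetical helper (we don’t build or use it in this article) showing how the guard lets you walk back from a pointer into the middle of an allocation to its header. It leans on the fact that every chunk starts on a 16-byte boundary with the guard word first, uses the h4x0r_guard_word we set up below, and assumes the address you hand it really is inside a live allocation:

// Hypothetical: find the metadata for an address inside an allocation by
// scanning backward, one 16-byte boundary at a time, for the guard word.
static h4x0r_alloc_info_t *
h4x0r_find_alloc_info(void *addr)
{
    // Round down to the nearest 16-byte boundary.
    char *p = (char *)((uint64_t)addr & ~(uint64_t)(H4X0R_ALIGN - 1));

    // If `addr` isn't actually inside an allocation, this walks off the
    // front of the segment; a real version would also check bounds.
    while (*(uint64_t *)p != h4x0r_guard_word) {
        p -= H4X0R_ALIGN;
    }

    return (h4x0r_alloc_info_t *)p;
}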
For the guard’s value, it’s best for security reasons to have it be chosen randomly each program execution. So, we should add our code to initialize it when the program is starting up, when there’s definitely no thread contention. For now, we can just put it in h4x0r_get_page_size()
(at the end of the if
block), and assume we’re not racing to create the very first arena.
For getting the random value for the guard, on Linux we can use the getrandom()
system call. With most other modern Unix systems, we can rely on the presence of arc4random_buf()
.
For getting random bytes, let’s abstract away the underlying implementation:
#if defined(__linux__)
#include <sys/random.h>

static inline void
h4x0r_random_bytes(void *bufptr, size_t len)
{
    if (getrandom(bufptr, len, GRND_NONBLOCK) == -1) {
        abort();
    }
}
#else
#include <stdlib.h>

#define h4x0r_random_bytes(bufptr, len) arc4random_buf(bufptr, len)
#endif
It’s now trivial to create our guard value when we need to:
static uint64_t h4x0r_guard_word;

static inline void
h4x0r_guard_init(void)
{
    h4x0r_random_bytes(&h4x0r_guard_word, sizeof(uint64_t));
}
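Per the plan above, the call can piggyback on h4x0r_get_page_size(). Assuming h4x0r_guard_init() is visible above it in the file, the updated function would look like this:

size_t
h4x0r_get_page_size(void)
{
    size_t result = atomic_load(&h4x0r_page_size);

    if (!result) {
        result = getpagesize();
        atomic_store(&h4x0r_page_size, result);
        // One-time guard setup, riding along with first-arena creation.
        h4x0r_guard_init();
    }

    return result;
}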
Nutritious (and delicious!) portions of zero-mapped memory
Now that we can create our guard value, let’s work on our allocation function.
The biggest challenge here is going to be dealing with multiple threads wanting to allocate at the same time. While we could use our mutex for this, we can avoid locking altogether by using an atomic operation, if we approach the algorithm thoughtfully.
For any given thread, the space we’re going to need from the arena to service the allocation doesn’t need to be known by other threads. That includes the space the user asked for, plus the space for our accounting information, all aligned properly. So in any world, we should compute that total request before we do anything else.
In a world with no contention, either the next_alloc
field is our return value, or we need to create a new segment. But if there’s space, we need to update the value of next_alloc
for subsequent callers, by adding in the size of our allocation.
The check to see if we’d be out of space should certainly happen before we modify the pointer, and that can happen in parallel to other threads. We can also compute what we want the value of next_alloc
to be.
Our critical section (when no new segment is necessary) is simply making sure that we can atomically update the value of next_alloc
in the face of thread contention.
We can do that using a spin-lock to acquire our mutex. But if we have to set up a loop for locking anyway, we can instead use an atomic operation that TRIES to change the value. And if it fails, we re-compute next_alloc
and try again.
Specifically, the operation we need is the compare-and-swap
operation. That operation gives us a way to update the value of a memory address, but only if it has the value we’re expecting it to have. And, we can of course, detect when it fails, and try again.
Once a thread gets a compare-and-swap operation to succeed, we know we’re the thread allowed to return the original allocation address that we read. If it fails, we lost the race, so we re-compute what we’d need next_alloc to be, and try again.
extern void *_h4x0r_arena_alloc(h4x0r_arena_t *, uint64_t, char *, uint32_t);

#define h4x0r_arena_alloc(arena, request) \
    _h4x0r_arena_alloc(arena, request, __FILE__, __LINE__)

// The above should go in the header file, below in the implementation.
void *
_h4x0r_arena_alloc(h4x0r_arena_t *arena,
                   uint64_t       request,
                   char          *file,
                   uint32_t       line)
{
    // The first round-up call is unnecessary if you're sure the info
    // structure is 16-byte aligned. But still add in the header size.
    // The second round-up is always needed.
    uint32_t overhead = h4x0r_pow2_round_up(H4X0R_ALIGN,
                                            sizeof(h4x0r_alloc_info_t));

    request = request + overhead;
    request = h4x0r_pow2_round_up(H4X0R_ALIGN, request);

    char *found_value;
    char *desired_value;

    do {
        found_value   = atomic_load(&arena->next_alloc);
        desired_value = found_value + request;

        if (desired_value > arena->segment_end) {
            h4x0r_add_arena_segment(arena, request);
            // Note: `continue` in a do/while jumps to the condition, so
            // we'll do one doomed compare-and-swap against stale values
            // (next_alloc just changed), then re-load at the top.
            continue;
        }
    } while (!atomic_compare_exchange_strong(&arena->next_alloc,
                                             &found_value,
                                             desired_value));

    // `found_value` now holds the start of our allocation chunk.
    // The metadata comes first, then we return the address `overhead`
    // bytes after.
    //
    // Remember, while we are returning a void *, we are keeping it `char *`
    // to get byte math.
    h4x0r_alloc_info_t *info   = (h4x0r_alloc_info_t *)found_value;
    char               *result = found_value + overhead;

    *info = (h4x0r_alloc_info_t) {
        .guard       = h4x0r_guard_word,
        .file_name   = file,
        .line_number = line,
        .alloc_len   = request,
    };

    return result;
}
Realistically, whether we use a mutex or the atomic operation, the difference will be insignificant when there’s no contention. Under heavy contention, the relative performance can vary by architecture, but both approaches should still be very fast.
With a mutex, we’d probably still use a spin-lock. Our biggest performance delay would come if a thread gets put to sleep by the scheduler while holding the lock, even though this should be rare. But if we use a full-blown lock for such a small critical section, threads not holding the lock will have a good chance of sleeping longer than we’d need them to.
So if you wanted to use our spin lock, the do
/ while
loop could be replaced with:
while (atomic_fetch_or(&arena->mutex, 1))
    /* No body */;

found_value   = atomic_load(&arena->next_alloc);
desired_value = found_value + request;

if (desired_value > arena->segment_end) {
    h4x0r_add_arena_segment(arena, request);
    // The new segment reset next_alloc, so recompute both values.
    found_value   = atomic_load(&arena->next_alloc);
    desired_value = found_value + request;
}

// With the lock held, a plain store is enough to claim the space.
atomic_store(&arena->next_alloc, desired_value);
atomic_store(&arena->mutex, 0);
You’d then need to remove the similar code inside h4x0r_add_arena_segment()
, so that we don’t deadlock waiting for a lock we already hold.
I tend to go with atomics over locks when the critical section really is just one value being updated, since that’s what they’re there for.
Still, in this case, the atomic approach does suffer from having to re-check the bounds every time the compare-and-swap operation fails, whereas with the mutex each thread only makes that comparison a single time (it just compares against the value of the lock over and over instead).
Ultimately, either approach is fine, I won’t die on any hill here.
You can’t be perfect, so at least be l33t
Sometimes it’s helpful for debugging to spam yourself with information about allocations. Let’s give ourselves a compile-time capability to print out allocation information to stderr
as we allocate.
But let’s not spam ourselves TOO badly— when compiled in, we’ll want it off by default, and give ourselves an option to turn it on and off easily.
As with any kind of debugging functionality like this, we want code to stay as readable as possible, so we’ll build a little API that is automatically compiled in or out depending on the value of relevant compile-time defines that you can pass.
Let’s start with that API:
#if defined(H4X0R_ALLOC_LOG)

#if !defined(H4X0R_ALLOC_LOG_DEFAULT)
#define H4X0R_ALLOC_LOG_DEFAULT (0)
#endif

extern void h4x0r_alloc_log_on(void);
extern void h4x0r_alloc_log_off(void);
extern bool h4x0r_alloc_log_status(void);

#define h4x0r_alloc_log(...) \
    (h4x0r_alloc_log_status() ? fprintf(stderr, __VA_ARGS__) : 0)

#else

#define h4x0r_alloc_log_on()
#define h4x0r_alloc_log_off()
#define h4x0r_alloc_log_status() (false)
#define h4x0r_alloc_log(...)

#endif
When H4X0R_ALLOC_LOG
is off at compile time, our calls will all automatically be deleted. But when compiled in, h4x0r_alloc_log
will be replaced with code that conditionally checks the status at runtime, and if the status function returns true, then calls fprintf()
, hardcoding the first parameter, but passing the rest (from the format string onward) through directly.
Notice we make sure H4X0R_ALLOC_LOG_DEFAULT
is always set, which lets us choose at compile time what the default value at program startup will be. If you don’t supply your own #define
then the default ends up 0
.
Now, let’s implement that API:
#if defined(H4X0R_ALLOC_LOG)
#include <stdio.h>

static _Atomic bool alloc_log = (bool)H4X0R_ALLOC_LOG_DEFAULT;

void
h4x0r_alloc_log_on(void)
{
    atomic_store(&alloc_log, true);
}

void
h4x0r_alloc_log_off(void)
{
    atomic_store(&alloc_log, false);
}

bool
h4x0r_alloc_log_status(void)
{
    return atomic_load(&alloc_log);
}
#endif
That was insultingly easy. Of course, here, all threads share one value. If you think you would want to change this per thread, well then wait a couple of articles, because we’ll make it easy to do that.
Now, we can add a call to h4x0r_alloc_log()
at the end of our allocation function, without fear of paying the cost when it’s not compiled in:
void *
_h4x0r_arena_alloc(h4x0r_arena_t *arena,
                   uint64_t       request,
                   char          *file,
                   uint32_t       line)
{
    ...

    h4x0r_alloc_log("%s:%u: alloc %llu bytes @%p (hdr @%p)\n",
                    file,
                    line,
                    (unsigned long long)request, // keep %llu honest for uint64_t
                    result,
                    info);

    return result;
}
While you can set flags like this in your build script, whatever build system you like to use, we can give ourselves a simple way to do it via our Makefile. Our Makefile will automatically apply any defines we put into the environment variable CFLAGS
, so we can do something like the following at the command line:
$ export CFLAGS="-DH4X0R_ALLOC_LOG -DH4X0R_ALLOC_LOG_DEFAULT=1"; make
That would not only turn on the debug logging, but swap the default startup status, so you’ll get maximum spam, unless turned off at runtime.
In real projects, we normally vomit in our mouth if people use Make
, but it is basically universal, and this project is trivial, so let’s just live with the shame:
.PHONY: clean

COMPILE_FLAGS=-Wall -g -I./include -ffast-math $(CFLAGS)
LINK_FLAGS=-flto $(LDFLAGS)

obj_files = alloc.o test.o

all: $(obj_files)
	$(CC) $(LINK_FLAGS) -o alloc_test $(obj_files)

clean:
	rm alloc_test *.o

$(obj_files): %.o: %.c
	$(CC) -c $(COMPILE_FLAGS) $< -o $@

# Dependencies. No fancy Make magic, just done by hand, as an act of
# self-punishment.
alloc.o: alloc.c include/h4x0r/alloc.h include/h4x0r/random.h
test.o: test.c include/h4x0r/alloc.h include/h4x0r/random.h include/h4x0r/timing.h
That makefile does reference a few files we didn’t show above:
- A test file. We provide a dumb test program in the repo if you want it, or insert your own.
- A header file called
timing.h
. That’s got a little API to make it easy to do comparative timing, which we use in the test program. And when we say little, we mean it. Here’s the whole thing:
#pragma once

#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

#define h4x0r_save_timestamp(x) clock_gettime(CLOCK_REALTIME, x)
#define H4X0R_NSEC_PER_SEC      1000000000

static inline bool
h4x0r_timespec_lt(struct timespec *t1, struct timespec *t2)
{
    if (t1->tv_sec < t2->tv_sec) {
        return true;
    }
    if (t1->tv_sec > t2->tv_sec) {
        return false;
    }

    return t1->tv_nsec < t2->tv_nsec;
}

static inline void
h4x0r_timespec_diff(struct timespec *item1,
                    struct timespec *item2,
                    struct timespec *result)
{
    int64_t          tmp;
    struct timespec *smaller, *bigger;

    if (h4x0r_timespec_lt(item1, item2)) {
        smaller = item1;
        bigger  = item2;
    }
    else {
        smaller = item2;
        bigger  = item1;
    }

    result->tv_sec = bigger->tv_sec - smaller->tv_sec;
    tmp            = bigger->tv_nsec - smaller->tv_nsec;

    if (tmp < 0) {
        tmp += H4X0R_NSEC_PER_SEC;
        result->tv_sec -= 1;
    }

    result->tv_nsec = tmp;
}
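Here’s a minimal sketch of how you might use it, say, to time a batch of arena allocations (the numbers and the function name are just for illustration, not necessarily what the repo’s test program does):

#include <stdio.h>

void
time_allocations(h4x0r_arena_t *arena)
{
    struct timespec start, end, diff;

    h4x0r_save_timestamp(&start);

    for (int i = 0; i < 1000000; i++) {
        h4x0r_arena_alloc(arena, 64);
    }

    h4x0r_save_timestamp(&end);
    h4x0r_timespec_diff(&start, &end, &diff);

    printf("1M allocations took %lld.%09ld seconds\n",
           (long long)diff.tv_sec,
           diff.tv_nsec);
}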
Deign to delete
Generally, we should use the macro h4x0r_arena_alloc()
to call the function (which has the same name, but with a leading underscore), in order to automatically add in the file and line information for debugging.
The only critical thing we have left to do is the code that frees the whole arena when we’re done with it. Here, we can generally assume there won’t be a race for deallocation– you will have determined somehow that it’s time to ditch your arena.
If you might have a race, you’ll need a single global deallocation lock– you absolutely can’t put the lock in memory you’re about to deallocate. Well, I suppose you can, but you sure won’t like what happens when you do.
But, if you want the lock, we will leave that one as an exercise to the reader (laziness FTW!).
// Add a prototype for this one in the header too.
void
h4x0r_arena_delete(h4x0r_arena_t *arena)
{
    // Walk the segment list, unmapping each segment. The first segment's
    // `next_segment` is NULL, which ends the loop.
    h4x0r_segment_t *segment = arena->current_segment;
    h4x0r_segment_t *next;

    while (segment) {
        next = segment->next_segment;
        munmap(segment, segment->size);
        segment = next;
    }

    // Finally, unmap the arena record itself, but only if we mmap()'d it
    // ourselves in h4x0r_new_arena().
    if (arena->heap_allocated) {
        munmap(arena, h4x0r_pow2_round_up(h4x0r_get_page_size(),
                                          sizeof(h4x0r_arena_t)));
    }
}
While we didn’t worry much about performance, this arena allocator should be about as good as malloc()
implementations for single threaded applications, because even though we didn’t think about optimization, we don’t have to fuss with trying to reuse memory blocks.
If you use multiple threads, we’d expect our approach to do us proud (especially when there’s contention in the allocator). But more importantly, it’ll be much easier to manage your memory, even if you don’t use a full-bore garbage collector, because you won’t need to worry about freeing every little thing.
You can instead create arenas that have a defined static scope, and then delete the whole thing when the scope exits (we’ll eventually go through how to make that basically automatic with a defer
-like capability).
But I think you’ll see as we progress that full-on garbage collection will still be very fast, and will make life far easier for you.
Sure, Rust will always be safer, but safer is not l33t (it’s in the newest revision of the h4x0r rule book, I swear). Instead, you’ll be able to rib your friends who write Rust about how much less time you spend managing your memory!
Extra Credit
The full code, for today, along with some basic test code is at: http://www.github.com/h4x0r-org/arena/.
There’s certainly more that we can do to enhance this allocator. For instance:
- As mentioned above, you can move the metadata out of the heap, or give yourself an API where keeping metadata at all is an option.
- You can align allocs to the page size, and use mprotect() to lock all memory pages that haven’t been handed out yet. Essentially, you would lock them when you create a segment, and unlock them as you’re allocating, before you start using the page. This helps find a lot of scribbles due to array indexing errors.
- You can probably also add a guard page to the beginning and end of your segments, for better overflow protection (see the sketch after this list). This just makes the math a bit more involved; the mprotect() call isn’t hard to use.
- You can add a check function that will scan through each segment from the beginning, making sure the guard value and the length fields are all consistent (for instance, do this as you finish off a segment, and when you deallocate an arena).
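For the guard-page item, the heart of it is one extra page in the mapping plus one mprotect() call. Here’s a minimal sketch, with a hypothetical helper; it assumes you’d fold this into h4x0r_add_arena_segment() and adjust segment_end so allocations never reach the guard page:

#include <sys/mman.h>

// Sketch: map `size` usable bytes plus one trailing guard page, and make
// the guard page inaccessible so overruns fault immediately.
static h4x0r_segment_t *
h4x0r_map_segment_with_guard(uint64_t size)
{
    uint64_t page = h4x0r_get_page_size();

    h4x0r_segment_t *segment = mmap(NULL,
                                    size + page,
                                    H4X0R_MPROT,
                                    H4X0R_MFLAG,
                                    -1,
                                    0);
    if (segment == MAP_FAILED) {
        abort();
    }

    // Any read or write that runs off the end of the segment now crashes
    // at the exact spot, instead of silently scribbling on a neighbor.
    if (mprotect(((char *)segment) + size, page, PROT_NONE) == -1) {
        abort();
    }

    segment->size = size;
    return segment;
}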
Next time, we’ll build an actual working garbage collector, building on top of the allocator we just built.
Until then, happy hacking!
— Lee T. H., esquire