A queue implementation using atomic instructions (with benchmarks)

Everyone says using atomic instructions is faster than using mutual exclusion locks. Given the problems that come with using mutual exclusion locks (priority inversion, dead locks … etc) atomics seems to be the solution. But there is a higer price to pay from programmer productivity. Because rolling out your own stuff (which are correct) by using atomics is a hard job to achieve. The major problem however is checking whether your program has race conditions or not, I’ve been used to valgrind over the years for even simple debugging. But valgrind does nothing to help you with atomics[1]. Therefore you the developer is truly alone in rolling out your own even debugging other developers’ code. A static analysis tool might be the one which really addresses this issue directly but until then we have have to depend on our brains for debugging. One approach that I think which would work well is to test what you have written very thoroughly so that you can conclude that your algorithms do not fail atleast most of the time.

This is new territory for me since I only started learning about these in depth a few days ago, so my understanding might differ from the truth but my implementation seems to survive the stress test that I threw at it. It is not entirely my code[5] (note that I said its better to use someone else’s stuff in this, you can always roll your own but the time you’ll have to put in to that will be very large) but the stress test and the comparison tests are results from the literature survey I did on the topic[2][3][4].

Lets get on to the core data structure/algorithms here, which is a ringbuffer and operations on it. There is an excellent explanation on how ringbuffers work and why they work in [5] so I am not doing reiterate all that here. But I’d like to explain on the things that were hard to understand for me hoping that others who follow would benefit from it.

Lets have some assumptions inplace before we get in to the details,

  1. We assume that the code we write will be executed in the same order as we write them (which is a big lie for all code by the way, we’ll see how to guarantee this in real life code as we go along but for now stick with this)
  2. We also assume integer read and store are atomic, meaning they adhere to all-or-nothing & consistency principal from ACID properties.
  3. Only one thread can operate on pushing things in to the queue, and only one thread can pop things out. 1-consumer and 1-producer.

So now that we made these things clear lets look at some code,

// initialization
contatiner = new T[size];
rear = front = 0;

bool Pop(T& t)
{
  if (rear == front)
  {
    // the ringbuffer is empty!
    return false;
  } else {
    t = container[rear];
    rear = (rear+1) % size;
    return true;
  }
}
bool Push(T t)
{
  int new_front = (front+1) % size;
  // as long as adding something doesn't make 
  // rear == front we can keep adding
  if (rear == new_front)
  {
    return false;
  } else {
    container[front] = t;
    front = new_front;
    return true;
  }
}

It is easy to say that this will work well if you protect each function with a mutex lock. We also see that the only race condition that is possible here happens when the producer and the consumer tries to modify the same memory which is a point in the container. Note that only the producer modifies ‘front’ and only the consumer modifies ‘rear’ while both reads each value. Now with the assumptions we made above lets see how we can use two independent threads to produce and consume. And lets benchmark this with the mutual exclusion lock to see what kind of a gain we can get. We add independently and remove independently only while these two operations see each other’s changes consistantly, meaning that the code should be executed exactly as written and the changes on the integer variables ‘rear’ and ‘front’ must be atomic (this is not a big deal since we already assumed those!). So we can say clearly that when we read a value for ‘rear’ or ‘front’ all above code that made that change has been executed for sure otherwise you will not be seeing that change in the first place. Code might run interlaced but it can never run out of order!

But wait! This is just a mind game we played. How can we write this so that it would run on an actual machine? How can we make the assumptions we made true in a computer? Computers suck at running what we wrote, they do out of order execution to speed things up all the time. How can we guarantee that things won’t be changed? Compilers change code too! Making things worse in some architectures even integer reads and writes are not atomic! Enter C++11, not only C++11 alot of new languages (Java/.NET) have allowed the modification of this out of order execution nature to make way for algorithms that make use of such features. Modern processors have features to allow atomic operations with special instructions they do all the needed synchronizations under the hood (keeping caches up to date, keeping memory up to date with caches).

Lets see how to code this in C++11,

#include <atomic>
#include <vector>
#include <iostream>
#include <cstdlib>

using namespace std;
template<typename T>
class Queue
{
private:
    alignas(64) atomic<int> front_,rear_;
    int size_;
    T *container;
public:
    Queue(int size) :
        size_(size),
        bpop_(false),
        bpush_(false)
    {
        g_front_ = front_ = 0;
        g_rear_ = rear_ = 0;

        container = static_cast<T *>( malloc(sizeof(T)*size) );

    }
    bool Pop(T& t)
    {
        int cr=rear_.load();
        if (cr == front_.load())
        {
            return false;
        }
        t = container[cr];
        rear_.store((cr+1)%size_);
        return true;
    }
    bool Push(T t)
    {
        int cf=front_.load(memory_order_relaxed);
        int new_cf=(cf+1) % size_;
        if (new_cf == rear_.load()) {
            return false;
        }
        container[cf]=t;
        front_.store(new_cf);
        return true;
    }
    ~Queue()
    {
        free(container);
    }
};

So this is code for a C++11 queue class that does things in atomic instructions. Remember our assumption that things are sequential? C++11 memory model by default gives that guarantee for atomic variables. It has internal mechanisms to detect and stop optimizations of code reordering where the atomic guarantees fail. The guarantee is as follows: any thread reading an atomic variable will see the changes that are applied to the same variable in another thread prior and operations that depend on it will not be reordered. In other words, you won’t see the new value of ‘rear’ before the ‘container’ has been updated with the last run pop() where it changes ‘rear’ value after changing the ‘container’. C++11 also guarantees the reads and writes are done as transactions just like in ACID, it’s all or nothing. 3rd assumption holds because we are going to hold it!

As always C++11 also allows for more optimizations (ahh, the things to love about a language!). The above sequential assumption is called ‘memory_order_seq_cst’ (short for sequentially consistant). There are other things such as ‘memory_order_relaxed’ (there is no order guarantees here, just the fact that your variable has atomic read and writes), ‘memory_order_acquire’ (this goes with loads, where all changes before the load is seen as sequential at the load) and ‘memory_order_release’ (this goes with stores, where all changes done up until it will synchronize with the next acquire load on the same vaiable). I am not very sure that I have grasped the exact ideas behind these new ways of memory ordering but there exists implementations using these, and you can use them with some knowledge of them (and with some more reading on the internet).

On to the tests!

These numbers are the number of pushes and pops done per second on each thread independently, you will see sometimes the POPs appearing before any PUSHes this is because I have not synchronized the console (I don’t even take a lock for the console, but this must be done for anything serious!)

Using Mutex Locks,

POP 4.35772e+06
PUSH 4.3822e+06
POP 4.32842e+06
PUSH 4.34622e+06
POP 4.36313e+06
PUSH 4.3764e+06
POP 4.2914e+06
PUSH 4.30392e+06
POP 4.3737e+06
PUSH 4.39303e+06
POP 4.33139e+06
PUSH 4.35881e+06
POP 4.33681e+06
PUSH 4.34405e+06

With Atomic Instructions, memory_order_seq_cst

POP 1.10804e+07
PUSH 1.21797e+07
POP 1.14811e+07
PUSH 1.31541e+07
POP 1.21784e+07
PUSH 1.34477e+07
POP 1.20657e+07
PUSH 1.35652e+07
POP 1.17625e+07
PUSH 1.32726e+07
POP 1.20278e+07
PUSH 1.33463e+07
POP 1.20177e+07
PUSH 1.33821e+07
POP 1.22883e+07
PUSH 1.33557e+07

Using Atomic Instructions, memory_order_relaxed, memory_order_acquire and memory_order_release with alignment set to 64. Without alignment set to 64 for atomic variables the results do not always give results near to the results given below. It sometimes gives nearer results but most of the time it gives a bit worse results.

PUSH 2.13774e+07
POP 2.23716e+07
PUSH 2.04712e+07
POP 2.16111e+07
PUSH 2.19893e+07
POP 2.24591e+07
PUSH 1.98706e+07
POP 2.12814e+07
PUSH 2.12732e+07
POP 2.2586e+07
PUSH 2.14926e+07
POP 2.22773e+07
PUSH 2.22551e+07

So in conclusion it gives almost about 5x increase in performance even for two threads! Which is great news! I always wondered whether using atomic instructions for a 1-consumer,1-producer queue implementation would have any performance increases seems it does.

Download the code from here.

If you find any error please comment and let me know!

References

1. http://valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use

2. https://www.youtube.com/watch?v=c1gO9aB9nbs

3. https://www.youtube.com/watch?v=CmxkPChOcvw

4. https://www.youtube.com/watch?v=X1T3IQ4N-3g

5. http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular

6. https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync

7. http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2008/n2664.htm

Special thanks to Jeff Preshing and Herb Sutter for their phenomenal constributions for explaining the details of C++11 Atomics and Memory ordering.

Advertisements