A queue implementation using atomic instructions (with benchmarks)

Everyone says using atomic instructions is faster than using mutual exclusion locks. Given the problems that come with mutual exclusion locks (priority inversion, deadlocks, etc.), atomics seem to be the solution. But there is a higher price to pay in programmer productivity, because rolling your own correct code using atomics is a hard job. The major problem, however, is checking whether your program has race conditions. I've relied on valgrind over the years for even simple debugging, but valgrind does nothing to help you with atomics[1]. So you, the developer, are truly alone, whether rolling your own code or debugging other developers' code. A static analysis tool might be what really addresses this issue directly, but until then we have to depend on our brains for debugging. One approach that I think works well is to test what you have written very thoroughly, so that you can conclude your algorithms do not fail at least most of the time.

This is new territory for me since I only started learning about these things in depth a few days ago, so my understanding might differ from the truth, but my implementation seems to survive the stress test that I threw at it. It is not entirely my code[5] (note that I said it's better to use someone else's work here; you can always roll your own, but the time you'll have to put into that will be very large), while the stress test and the comparison tests are results of the literature survey I did on the topic[2][3][4].

Let's get to the core data structure/algorithms here: a ring buffer and operations on it. There is an excellent explanation of how ring buffers work, and why they work, in [5], so I am not going to reiterate all that here. But I'd like to explain the things that were hard for me to understand, hoping that others who follow will benefit from it.

Let's have some assumptions in place before we get into the details:

  1. We assume that the code we write will be executed in the same order as we write it (which is a big lie for all code, by the way; we'll see how to guarantee this in real-life code as we go along, but for now stick with this).
  2. We also assume that integer reads and stores are atomic, meaning they adhere to the all-or-nothing and consistency principles from the ACID properties.
  3. Only one thread pushes things into the queue, and only one thread pops things out: 1 consumer and 1 producer.

So now that we have made these things clear, let's look at some code:

// initialization
container = new T[size];
rear = front = 0;

bool Pop(T& t)
{
  if (rear == front)
  {
    // the ringbuffer is empty!
    return false;
  } else {
    t = container[rear];
    rear = (rear+1) % size;
    return true;
  }
}
bool Push(T t)
{
  int new_front = (front+1) % size;
  // as long as adding something doesn't make 
  // rear == front we can keep adding
  if (rear == new_front)
  {
    return false;
  } else {
    container[front] = t;
    front = new_front;
    return true;
  }
}

It is easy to see that this will work well if you protect each function with a mutex lock. We can also see that the only possible race condition here happens when the producer and the consumer try to modify the same memory, which is a slot in the container. Note that only the producer modifies 'front' and only the consumer modifies 'rear', while both read each value. Now, with the assumptions we made above, let's see how we can use two independent threads to produce and consume, and let's benchmark this against the mutual exclusion lock to see what kind of gain we can get. We add independently and remove independently, as long as these two operations see each other's changes consistently; that is, the code must be executed exactly as written, and the changes to the integer variables 'rear' and 'front' must be atomic (not a big deal, since we already assumed both!). So we can say clearly that when we read a value for 'rear' or 'front', all the code above the line that made that change has been executed for sure; otherwise you would not be seeing that change in the first place. Code might run interlaced, but it can never run out of order!
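
For the baseline comparison later on, here is a minimal sketch of what the mutex-protected variant could look like (my reconstruction; the single coarse lock and the names are assumptions, not necessarily the exact benchmarked code):

#include <mutex>

template<typename T>
class LockedQueue
{
    std::mutex m_;
    int front_ = 0, rear_ = 0;
    int size_;
    T *container;
public:
    explicit LockedQueue(int size) : size_(size), container(new T[size]) {}
    ~LockedQueue() { delete[] container; }
    bool Pop(T& t)
    {
        std::lock_guard<std::mutex> lock(m_);  // one lock guards both indices
        if (rear_ == front_) return false;     // ring buffer is empty
        t = container[rear_];
        rear_ = (rear_+1) % size_;
        return true;
    }
    bool Push(T t)
    {
        std::lock_guard<std::mutex> lock(m_);
        int new_front = (front_+1) % size_;
        if (new_front == rear_) return false;  // ring buffer is full
        container[front_] = t;
        front_ = new_front;
        return true;
    }
};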

But wait! The lock-free reasoning above is just a mind game we played. How can we write this so that it runs on an actual machine? How can we make the assumptions we made true in a computer? Computers suck at running exactly what we wrote; they do out-of-order execution to speed things up all the time. How can we guarantee that things won't be reordered? Compilers change code too! Making things worse, on some architectures even integer reads and writes are not atomic! Enter C++11. And not only C++11: a lot of newer languages (Java/.NET) allow you to constrain this out-of-order nature to make way for algorithms that exploit such features. Modern processors provide special instructions for atomic operations and do all the needed synchronization under the hood (keeping caches up to date, keeping memory up to date with caches).

Let's see how to code this in C++11:

#include <atomic>
#include <vector>
#include <iostream>
#include <cstdlib>

using namespace std;
template<typename T>
class Queue
{
private:
    alignas(64) atomic<int> front_,rear_;
    int size_;
    T *container;
public:
    Queue(int size) :
        front_(0),
        rear_(0),
        size_(size)
    {
        // raw storage; fine for the trivially copyable types tested here
        container = static_cast<T *>( malloc(sizeof(T)*size) );
    }
    bool Pop(T& t)
    {
        int cr=rear_.load();            // consumer owns rear_
        if (cr == front_.load())
        {
            return false;               // ring buffer is empty
        }
        t = container[cr];
        rear_.store((cr+1)%size_);      // publish the freed slot
        return true;
    }
    bool Push(T t)
    {
        int cf=front_.load();           // producer owns front_
        int new_cf=(cf+1) % size_;
        if (new_cf == rear_.load()) {
            return false;               // ring buffer is full
        }
        container[cf]=t;
        front_.store(new_cf);           // publish the new element
        return true;
    }
    ~Queue()
    {
        free(container);
    }
};

So this is code for a C++11 queue class that does things with atomic operations. Remember our assumption that things are sequential? The C++11 memory model gives that guarantee for atomic variables by default: the compiler and the hardware are prevented from reordering operations in ways that would break it. The guarantee is as follows: when a thread reads an atomic variable and sees a value written by another thread, it also sees everything that thread did before that write, and dependent operations will not be reordered around it. In other words, you won't see the new value of 'front' before 'container' has been updated with the new element, because Push() changes 'front' after writing to 'container' (and likewise, you won't see a new 'rear' before Pop() has finished reading that slot). C++11 also guarantees that atomic reads and writes are indivisible, just like transactions in ACID: it's all or nothing. The 3rd assumption holds because we are going to hold it!
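
To make that guarantee concrete, here is a minimal illustration of the visibility rule with the default ordering (the names 'data' and 'ready' are mine, just for this example):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                  // plain, non-atomic payload
std::atomic<int> ready(0);     // operations default to memory_order_seq_cst

void writer() {
    data = 42;       // 1: write the payload...
    ready.store(1);  // 2: ...then publish; cannot be reordered before 1
}
void reader() {
    while (ready.load() != 1) { }  // spin until the flag is seen
    assert(data == 42);            // seeing 2 implies seeing 1
}
int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}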

As always, C++11 also allows for more optimization (ah, the things to love about a language!). The sequential assumption above is called 'memory_order_seq_cst' (short for sequentially consistent). There are weaker orderings too: 'memory_order_relaxed' (no ordering guarantees at all, just that the variable has atomic reads and writes), 'memory_order_acquire' (this goes with loads: everything written before the matching release store is visible after the acquire load) and 'memory_order_release' (this goes with stores: all changes done before it will synchronize with the next acquire load on the same variable). I am not very sure that I have grasped the exact ideas behind these new memory orderings, but implementations using them exist, and you can use them with some knowledge of them (and with some more reading on the internet).
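
The third benchmark below uses these weaker orderings, so here is a sketch of how the Pop() and Push() of the class above could be rewritten with them spelled out (a reconstruction of the usual single-producer/single-consumer pattern, not necessarily the exact benchmarked code):

    bool Pop(T& t)
    {
        // only the consumer writes rear_, so it may read it relaxed
        int cr=rear_.load(memory_order_relaxed);
        // acquire pairs with the producer's release store to front_:
        // if we see the new front_, we also see the element written before it
        if (cr == front_.load(memory_order_acquire))
        {
            return false;
        }
        t = container[cr];
        // release lets the producer know the slot may be reused
        rear_.store((cr+1)%size_, memory_order_release);
        return true;
    }
    bool Push(T t)
    {
        int cf=front_.load(memory_order_relaxed);  // producer owns front_
        int new_cf=(cf+1) % size_;
        // acquire pairs with the consumer's release store to rear_
        if (new_cf == rear_.load(memory_order_acquire)) {
            return false;
        }
        container[cf]=t;
        // release publishes the element before the new index becomes visible
        front_.store(new_cf, memory_order_release);
        return true;
    }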

On to the tests!

These numbers are the number of pushes and pops done per second on each thread, measured independently. You will sometimes see POPs appearing before any PUSHes; this is because I have not synchronized the console (I don't even take a lock for the console, but this must be done for anything serious!).
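
For completeness, here is a sketch of the kind of driver that produces numbers like these (a reconstruction: the queue capacity, the batched clock sampling and the one-second reporting window are assumptions, not the original test code):

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    Queue<int> q(1024);  // the Queue from above; capacity is arbitrary

    auto worker = [&q](bool producing) {
        const char *label = producing ? "PUSH" : "POP";
        long ops = 0, iters = 0;
        int v = 0;
        auto start = std::chrono::steady_clock::now();
        for (;;) {
            if (producing ? q.Push(v) : q.Pop(v)) ++ops;
            // sample the clock only now and then so timing calls don't dominate
            if ((++iters & ((1<<20)-1)) == 0) {
                std::chrono::duration<double> d =
                    std::chrono::steady_clock::now() - start;
                if (d.count() >= 1.0) {
                    // console left unsynchronized, as noted above
                    std::cout << label << " " << ops / d.count() << "\n";
                    ops = 0;
                    start = std::chrono::steady_clock::now();
                }
            }
        }
    };
    std::thread producer(worker, true);
    std::thread consumer(worker, false);
    producer.join();
    consumer.join();
}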

Using Mutex Locks,

POP 4.35772e+06
PUSH 4.3822e+06
POP 4.32842e+06
PUSH 4.34622e+06
POP 4.36313e+06
PUSH 4.3764e+06
POP 4.2914e+06
PUSH 4.30392e+06
POP 4.3737e+06
PUSH 4.39303e+06
POP 4.33139e+06
PUSH 4.35881e+06
POP 4.33681e+06
PUSH 4.34405e+06

With Atomic Instructions, memory_order_seq_cst

POP 1.10804e+07
PUSH 1.21797e+07
POP 1.14811e+07
PUSH 1.31541e+07
POP 1.21784e+07
PUSH 1.34477e+07
POP 1.20657e+07
PUSH 1.35652e+07
POP 1.17625e+07
PUSH 1.32726e+07
POP 1.20278e+07
PUSH 1.33463e+07
POP 1.20177e+07
PUSH 1.33821e+07
POP 1.22883e+07
PUSH 1.33557e+07

Using atomic instructions with memory_order_relaxed, memory_order_acquire and memory_order_release, and with the atomic variables aligned to 64 bytes. Without the 64-byte alignment the results are not always close to the ones below; they occasionally come near, but most of the time they are a bit worse (see the note after the numbers for why).

PUSH 2.13774e+07
POP 2.23716e+07
PUSH 2.04712e+07
POP 2.16111e+07
PUSH 2.19893e+07
POP 2.24591e+07
PUSH 1.98706e+07
POP 2.12814e+07
PUSH 2.12732e+07
POP 2.2586e+07
PUSH 2.14926e+07
POP 2.22773e+07
PUSH 2.22551e+07
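
The reason the alignment matters is false sharing: without alignas(64), both indices can land on the same 64-byte cache line, so every store by one thread invalidates the line the other thread is reading. A minimal illustration (the struct names are just for show):

#include <atomic>

struct Unaligned {
    std::atomic<int> front_;  // likely shares a cache line with rear_
    std::atomic<int> rear_;
};
struct Aligned {
    alignas(64) std::atomic<int> front_;  // its own cache line
    alignas(64) std::atomic<int> rear_;   // its own cache line
};
static_assert(sizeof(Aligned) >= 128, "indices live on separate lines");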

So, in conclusion, we get about a 5x increase in performance, even with just two threads, which is great news! I always wondered whether using atomic instructions for a 1-consumer, 1-producer queue implementation would give any performance increase; it seems it does.

Download the code from here.

If you find any error please comment and let me know!

References

1. http://valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use

2. https://www.youtube.com/watch?v=c1gO9aB9nbs

3. https://www.youtube.com/watch?v=CmxkPChOcvw

4. https://www.youtube.com/watch?v=X1T3IQ4N-3g

5. http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular

6. https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync

7. http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2008/n2664.htm

Special thanks to Jeff Preshing and Herb Sutter for their phenomenal contributions in explaining the details of C++11 atomics and memory ordering.

FFmpeg Audio Playback Sample

It was very hard to find proper sample code that would compile with the newest FFmpeg library, and when some samples compiled, they didn't play correctly. This one, however, runs great. The planar audio channel playback was added by me, and later support for float playback (double PCM follows the same pattern but wasn't added due to media files being scarce), so now it plays the majority of MP3s, WMAs, AACs and just about any audio format that you can name, I guess. I modified the code found at http://stackoverflow.com/questions/9799560/decode-audio-using-libavcodec-and-play-using-libao (many thanks to its author). This code was adapted from C, so some things have not been thoroughly converted to C++; it compiles and runs anyway.

EDIT:

Please refer to this updated code

Compile with

g++ -std=c++11 test.cpp `pkg-config --libs --cflags libavcodec libavformat libavutil ao`

For MSVC, use this libao version: https://github.com/manushanga/libao-win32

UTF-16 reading and writing with C

It seems almost no one does this kind of thing with C; it was very hard to find material on it on the net. So I thought of posting it so that everyone can benefit. The code is for an XML/HTML tag stripper, which strips all tags. Make sure you keep the error checking intact; it spits out some uncommon errors that at first glance you won't understand.

A major plus point of using "ccs=UTF16[BE|LE]" is that it handles endianness and the BOM automatically. But you have to know what you are dealing with. I *think* "ccs=UTF16" uses native endianness, which means LE for the majority of PCs in use. The files I had to strip were a mix of BE and LE, so I had to check the BOM manually; 0xFF 0xFE means little endian.

Note: this code was only tested on Linux; it should work without changes on any POSIX system. Windows is a different story, as always: it may need some changes in the "ccs=..." part. I saw in some material that UTF-16BE/UTF-16LE has been used, but it didn't work that way on Linux; if this code doesn't work on Windows, you might have to change to that.

#include <stdio.h>
#include <wchar.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[])
{
	int i;
	for (i=1;i<argc;i++){
		printf("%s\n",argv[i]);
		FILE *o=fopen(argv[i],"r");
		FILE *f;
		if (fgetc(o)!=0xFF){
			f=fopen(argv[i],"r,ccs=UTF16BE");	
		} else {
			f=fopen(argv[i],"r,ccs=UTF16LE");

		}
		fclose(o);
		char new[32];
		sprintf(new,"./striped/%s.txt",argv[i]);
		FILE *g=fopen(new,"w,ccs=UTF16LE");

		int write=1;
		wint_t c=fgetwc(f);
		while (c!=WEOF) {
			if (c==L'<')	/* stop copying when a tag opens */
				write=0;
			if (write){
				fputwc(c,g);
			}
			if (c==L'>')	/* resume once it closes */
				write=1;
			c=fgetwc(f);
		}
		printf("%s\n",strerror(errno));
		fclose(g);
		fclose(f);

	}
	return 0;
}

Wrapper for fopen in Windows and OSes that implement POSIX

This is a wrapper that I wrote for Vrok to make fopen behave sanely and accept UTF-8 strings on all systems. The defines FOPEN_RB, FOPEN_WB, etc. are needed because the binary 'b' switch for fopen only matters on Windows (POSIX systems accept it but ignore it). You can add this code to a header and use fopenu instead of fopen; on POSIX systems there is no overhead in calling fopenu, since it's just a #define.

#include <cstdio>
#if defined(_WIN32) && defined(UNICODE)
#include <windows.h>
inline FILE *fopenu(const char *path,const char *opt){
    wchar_t wpath[1024];
    wchar_t wopt[8];
    // convert the UTF-8 path and mode string to UTF-16 for _wfopen
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, 1024);
    MultiByteToWideChar(CP_UTF8, 0, opt, -1, wopt, 8);

    return _wfopen(wpath, wopt);
}
#define FOPEN_RB "rb"
#define FOPEN_WB "wb"
#define FOPEN_AB "ab"
#else
#define fopenu fopen
#define FOPEN_RB "r"
#define FOPEN_WB "w"
#define FOPEN_AB "a"
#endif
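
Usage looks like this (the file name here is just an example):

#include <cstdio>
// ... the fopenu definitions from above ...

int main() {
    // the path is plain UTF-8 on every platform
    FILE *f = fopenu("日本語ファイル.txt", FOPEN_RB);
    if (f) {
        // ... read ...
        fclose(f);
    }
    return 0;
}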

Smart pointers in C? Why not?

The new C++ standard has smart pointers, which do the memory management for you, with some overhead. C does not support these and never will. All these things really do is control the scope of an allocation. You can get around the troubles of memory allocation by using the stack extensively, but then you always need preallocated space, and for that you need to know the size of the allocation in advance; so many *think* that C will never have a reasonable replacement for smart pointers.

And I think what I’ve written up is reasonable, and that is why you are reading this.

First we need to decouple the allocation as an entity from its scope. These two can only be declared together in C; in other words, if you declare "int i", i's scope is wherever you declared it. That's bad, and that's where all the problems begin: because of this you can't demote or promote i's lifespan according to your needs, and if the declaration is for some data type that may grow or does not have a constant size, things get much worse.

What if the thing you declare controls the scope of your allocations? What if you can bind it to any arbitrary scope? Things become plain simple from there: you just place lists on whatever scopes you like and give them names, have all the pointers to your allocations go into a list, and when you want to free a scope, just free the things in its list. If you want more control, make more of these lists!

And here’s my implementation, it’s based on a linked list approach for the lists and has a few defines that the user should use,

mmdecl(varname) - varname is the name we give to the scope variable; do not use this name again. mmdecl is only a declaration, so it can be placed in the static part of an executable.

mminit(varname) - this is a define that wraps a routine, so it is not a declaration; you must place it in the code path.

mmalloc(varname, size) - this is what you use instead of malloc. For custom mallocs, please redefine malloc and free before you include mmanage.h in your project.

mmclean(varname) - the big cleanup.
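
Putting it together, usage looks roughly like this (a sketch, assuming the macros behave as described above; mmanage.h itself is in the download below):

#include "mmanage.h"

mmdecl(temp);   /* a scope named 'temp'; fine at file scope */

int main(void)
{
    mminit(temp);   /* must run before the first mmalloc */

    char *buf  = (char *) mmalloc(temp, 128);
    int  *nums = (int *)  mmalloc(temp, 64 * sizeof(int));

    /* ... use buf and nums freely ... */

    mmclean(temp);  /* frees every allocation made on 'temp' at once */
    return 0;
}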

Source with Sample

Compile with

cc test.c mmanage.c

Rabin Karp and Boyer Moore implemented for byte data

This is another assignment from the university. I'm posting it here mostly because there aren't many examples or proofs of these methods on the internet that are understandable (I have to agree that the Boyer-Moore explanation in the report needs some work from the reader, but that's because we were limited in page count for explanations).

This implementation features string matching on any data given; its alphabet size is 256 (unlike other implementations found on the internet).
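
To give a flavor of the byte-oriented approach, here is a small sketch of the Rabin-Karp rolling hash with base 256 (a generic illustration, not the assignment code; the modulus choice is arbitrary):

#include <cstddef>
#include <cstdint>
#include <cstring>

// returns the index of the first match of pat[0..m) in txt[0..n), or -1
long rabin_karp(const uint8_t *txt, size_t n, const uint8_t *pat, size_t m)
{
    if (m == 0 || m > n) return -1;
    const uint64_t B = 256;            // one symbol per possible byte value
    const uint64_t Q = 1000000007ULL;  // a large prime modulus (arbitrary)
    uint64_t hp = 0, ht = 0, pw = 1;   // pattern hash, window hash, B^(m-1)
    for (size_t i = 0; i + 1 < m; i++) pw = (pw * B) % Q;
    for (size_t i = 0; i < m; i++) {
        hp = (hp * B + pat[i]) % Q;
        ht = (ht * B + txt[i]) % Q;
    }
    for (size_t i = 0; ; i++) {
        // verify on a hash hit to rule out collisions
        if (hp == ht && memcmp(txt + i, pat, m) == 0)
            return (long)i;
        if (i + m >= n) return -1;
        // roll the window: drop txt[i], append txt[i+m]
        ht = ((ht + Q * B - txt[i] * pw % Q) * B + txt[i + m]) % Q;
    }
}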

Download PDF

Download Source