Notes on VrokNext: Have we arrived at an optimal playback engine yet?

In my opinion the answer to that question is: not yet, as far as current general-purpose (non-real-time) operating systems and their APIs are concerned. Most of them appear to overdo the idea of playing back a sound buffer. Rightly so, since even the fastest CPUs will hiccup on playback if it is done with a bad design. But single-thread performance has kept rising, and multi-thread performance tends to be better still if used correctly; the trouble is that there are too many ways to actually do it.

We have always considered multi-threaded designs superior for audio playback and processing, for the simple reason that we can do more in a given amount of time. Plain playback has been fluid for years on generally available CPUs, even on phones and other small devices. Playback with live DSP is a whole different problem. Vrok (older versions) tried to answer it with a two-thread model: a decoder thread and a playback thread, with DSP done on the decoder thread. The design had little reason for doing so; the only reason was "why not!". It worked well, but on weaker CPUs the buffer had to be made bigger to reduce artifacts, which added a notable amount of latency to the audio.

Even after years of DSP on audio, almost everyone in the audio business has been doing it linearly: one effect after another. There is little reason for this choice; right now I can't think of one other than ease of implementation. Non-linear DSP should be great, at least in theory. This is not the first time someone has thought of it: JACK (JACK Audio Connection Kit) offers a great implementation on supported operating systems. But JACK is a bulky system, recommended for people with some knowledge of sound servers and patch bays; it is a no-go for someone who only wants to listen to some music in their free time. So why not a mini-JACK-like system that is faster, portable and extendable?

Below is a sketch of how VrokNext would work; each node will run on a runnable unit (let's call it a thread for now, the differences are detailed further below).
[Figure: Buffer Graph]

Communication between threads is done by two queues which are built to be fast (preferably lock-free). Buffer management could be done with plain allocation and deallocation, but two things would add overhead: every allocation and deallocation would need to be synchronized, and a large number of allocations and deallocations over a long period has an adverse effect on the allocator because of fragmentation (you may think of this as over-engineering, but on a system with low resources these can be significant). Preallocating everything before you use it and reusing what you have allocated is always the better approach, so that is what I chose to implement.
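
To make the idea concrete, here is a minimal sketch of a preallocate-and-reuse buffer pool in the spirit of the above. This is not VrokNext's actual code: the names (Buffer, BufferPool) and the fixed buffer size are my own assumptions, and the pool itself is single-threaded (in the real system free buffers would travel back to the producer through one of the queues).

#include <cstddef>
#include <vector>

struct Buffer {
    std::vector<float> samples;            // fixed-size audio buffer
    explicit Buffer(std::size_t frames) : samples(frames) {}
};

class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t frames) {
        storage_.reserve(count);           // reserve so pointers stay valid
        for (std::size_t i = 0; i < count; ++i) {
            storage_.emplace_back(frames); // all allocation happens here, once
            free_.push_back(&storage_[i]);
        }
    }
    // Hand out a preallocated buffer; returns nullptr instead of allocating
    // when the pool is exhausted.
    Buffer* Acquire() {
        if (free_.empty()) return nullptr;
        Buffer* b = free_.back();
        free_.pop_back();
        return b;
    }
    // Take a buffer back for reuse; nothing is ever freed on the hot path.
    void Release(Buffer* b) { free_.push_back(b); }
private:
    std::vector<Buffer> storage_;
    std::vector<Buffer*> free_;
};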

[Figure: Buffer Flow]

If every node is run on a different thread, there will at some point be a lot of threads, and what slows things down is their own contention. On an N-core system with Hyper-Threading enabled, there is no way you are going to get more than 2N threads running at the same time. Therefore it is useful to have a mini scheduler that schedules the work of every node without the overhead of a lot of threads: each thread runs a preset set of processing functions (a sketch follows below). This is in no way a finalized design, but it has the potential to use whatever hardware there is to its maximum without forcing the DSP or the playback to be of low quality.
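
As a rough illustration of the mini scheduler idea, here is a sketch that partitions nodes across a fixed set of worker threads. The Node interface and the static every-n-th partitioning are my own simplifications, not VrokNext's actual design; a real scheduler would also need to respect the edges of the buffer graph.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node interface; the real one would read from input
// queues and write to output queues.
struct Node {
    virtual void Process() = 0;
    virtual ~Node() {}
};

class MiniScheduler {
public:
    explicit MiniScheduler(std::vector<Node*> nodes)
    {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 2;                 // fall back if unknown
        running_.store(true);
        for (unsigned t = 0; t < n; ++t) {
            // Each worker repeatedly runs its preset slice of nodes, so
            // the thread count never exceeds what the CPU can actually run.
            workers_.emplace_back([this, t, n, nodes]() {
                while (running_.load()) {
                    for (std::size_t i = t; i < nodes.size(); i += n)
                        nodes[i]->Process();
                }
            });
        }
    }
    ~MiniScheduler()
    {
        running_.store(false);
        for (auto& w : workers_) w.join();
    }
private:
    std::atomic<bool> running_;
    std::vector<std::thread> workers_;
};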

The implementation so far,

https://github.com/manushanga/VrokNext

Working Binaries (the Windows & Linux versions work similarly to shells; more information in the repo README)

Windows Binary

Android APK

A queue implementation using atomic instructions (with benchmarks)

Everyone says using atomic instructions is faster than using mutual exclusion locks. Given the problems that come with mutual exclusion locks (priority inversion, deadlocks, etc.), atomics seem to be the solution. But there is a higher price to pay in programmer productivity, because rolling your own (correct) constructs out of atomics is a hard job. The major problem is checking whether your program has race conditions or not. I have been used to valgrind over the years for even simple debugging, but valgrind does nothing to help you with atomics[1]. So you, the developer, are truly alone, whether rolling your own code or debugging other developers'. A static analysis tool might be the thing that really addresses this issue directly, but until then we have to depend on our brains for debugging. One approach that I think works well is to test what you have written very thoroughly, so that you can at least conclude that your algorithms do not fail most of the time.

This is new territory for me since I only started learning about these in depth a few days ago, so my understanding might differ from the truth, but my implementation seems to survive the stress test that I threw at it. It is not entirely my code[5] (note that I said it's better to use someone else's work here; you can always roll your own, but the time you will have to put into it will be very large); the stress test and the comparison tests are results of the literature survey I did on the topic[2][3][4].

Let's get on to the core data structure and algorithms here: a ring buffer and operations on it. There is an excellent explanation of how ring buffers work and why they work in [5], so I am not going to reiterate all that here. But I would like to explain the things that were hard for me to understand, hoping that others who follow will benefit.

Let's put some assumptions in place before we get into the details,

  1. We assume that the code we write will be executed in the same order as we wrote it (which is a big lie for all code, by the way; we'll see how to guarantee this in real-life code as we go along, but for now stick with this).
  2. We also assume integer reads and stores are atomic, meaning they adhere to the all-or-nothing and consistency principles from the ACID properties.
  3. Only one thread pushes things into the queue, and only one thread pops things out: one producer and one consumer.

So now that we have made these things clear, let's look at some code,

// initialization; 'size' is the capacity of the ring buffer
container = new T[size];
rear = front = 0;

bool Pop(T& t)
{
  if (rear == front)
  {
    // the ringbuffer is empty!
    return false;
  } else {
    t = container[rear];
    rear = (rear+1) % size;
    return true;
  }
}
bool Push(T t)
{
  int new_front = (front+1) % size;
  // as long as adding something doesn't make
  // rear == front we can keep adding
  // (one slot is always left empty to tell "full" apart from "empty")
  if (rear == new_front)
  {
    return false;
  } else {
    container[front] = t;
    front = new_front;
    return true;
  }
}

It is easy to see that this will work if you protect each function with a mutex lock. We also see that the only possible race condition here happens when the producer and the consumer try to modify the same memory, which is a slot in the container. Note that only the producer modifies 'front' and only the consumer modifies 'rear', while both read each value. Now, with the assumptions we made above, let's see how we can use two independent threads to produce and consume, and let's benchmark this against the mutual exclusion lock to see what kind of gain we can get. We push independently and pop independently, as long as these two operations see each other's changes consistently, meaning that the code is executed exactly as written and the changes to the integer variables 'rear' and 'front' are atomic (not a big deal, since we already assumed both!). So we can say clearly that when we read a value for 'rear' or 'front', all the code above that change has been executed for sure; otherwise we would not be seeing that change in the first place. Code might run interlaced, but it can never run out of order!

But wait! This was just a mind game we played. How can we write this so that it runs on an actual machine? How can we make the assumptions we made true on a computer? Computers suck at running what we wrote: they do out-of-order execution to speed things up all the time. Compilers change code too! Making things worse, on some architectures even integer reads and writes are not atomic. Enter C++11. Not only C++11: a lot of newer languages (Java/.NET) allow taming this out-of-order nature to make way for algorithms that exploit it. Modern processors provide special instructions for atomic operations and do all the needed synchronization under the hood (keeping caches up to date, keeping memory up to date with the caches).

Let's see how to code this in C++11,

#include <atomic>

using namespace std;

template<typename T>
class Queue
{
private:
    // each index gets its own cache line to avoid false sharing
    alignas(64) atomic<int> front_;
    alignas(64) atomic<int> rear_;
    int size_;
    T *container;
public:
    Queue(int size) :
        size_(size)
    {
        front_ = 0;
        rear_ = 0;
        container = new T[size];
    }
    bool Pop(T& t)
    {
        int cr = rear_.load();
        if (cr == front_.load())
        {
            return false;
        }
        t = container[cr];
        rear_.store((cr+1) % size_);
        return true;
    }
    bool Push(T t)
    {
        // only the producer writes front_, so a relaxed load of our
        // own index is safe here
        int cf = front_.load(memory_order_relaxed);
        int new_cf = (cf+1) % size_;
        if (new_cf == rear_.load()) {
            return false;
        }
        container[cf] = t;
        front_.store(new_cf);
        return true;
    }
    ~Queue()
    {
        delete[] container;
    }
};

So this is code for a C++11 queue class that does things with atomic instructions. Remember our assumption that things are sequential? The C++11 memory model gives that guarantee for atomic variables by default. It has internal mechanisms to stop compiler and hardware reordering where it would break the atomic guarantees. The guarantee is as follows: any thread reading an atomic variable will see the changes applied to that variable in another thread beforehand, and operations that depend on it will not be reordered. In other words, you will not see the new value of 'front' before 'container' has been updated by the Push() that changed it, since Push() stores into 'container' before advancing 'front' (and likewise for 'rear' and Pop()). C++11 also guarantees that the reads and writes are done as transactions, just like in ACID: it's all or nothing. The third assumption holds because we enforce it ourselves!
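
The classic way to picture this guarantee is a data-and-flag pair, which is exactly what 'container' and 'front_' are in the queue. A minimal illustration (my own example, not part of the queue code):

#include <atomic>
#include <cassert>

std::atomic<bool> flag(false);
int data = 0;

void producer() {
    data = 42;          // plain write
    flag.store(true);   // seq_cst store: publishes the write above
}

void consumer() {
    if (flag.load())            // seq_cst load: if we see true...
        assert(data == 42);     // ...the data write is guaranteed visible
}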

As always, C++11 also allows for more optimizations (ah, the things to love about a language!). The sequential assumption above is called 'memory_order_seq_cst' (short for sequentially consistent). There are others: 'memory_order_relaxed' (no ordering guarantees, just the fact that your variable has atomic reads and writes), 'memory_order_acquire' (goes with loads: all changes made before the matching release store become visible at the load) and 'memory_order_release' (goes with stores: all changes done up to the store will synchronize with the next acquire load on the same variable). I am not completely sure I have grasped the exact ideas behind these memory orderings, but implementations using them exist, and you can use them with some knowledge (and some more reading on the internet).
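
For what it's worth, here is how Pop() and Push() look with these explicit memory orders, which is the variant used in the third benchmark below. This reflects my current reading of the acquire/release rules, so treat it as a sketch rather than the final word.

bool Pop(T& t)
{
    // only the consumer writes rear_, so reading our own index relaxed is fine
    int cr = rear_.load(memory_order_relaxed);
    // acquire pairs with the producer's release store to front_: if we see
    // the new front_, we also see the write into container
    if (cr == front_.load(memory_order_acquire))
        return false;
    t = container[cr];
    // release publishes that this slot has been read and may be overwritten
    rear_.store((cr+1) % size_, memory_order_release);
    return true;
}
bool Push(T t)
{
    int cf = front_.load(memory_order_relaxed);
    int new_cf = (cf+1) % size_;
    // acquire pairs with the consumer's release store to rear_
    if (new_cf == rear_.load(memory_order_acquire))
        return false;
    container[cf] = t;
    // release makes the container write visible before the new front_
    front_.store(new_cf, memory_order_release);
    return true;
}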

On to the tests!

These numbers are the numbers of pushes and pops done per second on each thread independently. You will sometimes see POPs appearing before any PUSHes; this is because I have not synchronized the console output (I don't even take a lock for the console, but that must be done for anything serious!).

Using Mutex Locks,

POP 4.35772e+06
PUSH 4.3822e+06
POP 4.32842e+06
PUSH 4.34622e+06
POP 4.36313e+06
PUSH 4.3764e+06
POP 4.2914e+06
PUSH 4.30392e+06
POP 4.3737e+06
PUSH 4.39303e+06
POP 4.33139e+06
PUSH 4.35881e+06
POP 4.33681e+06
PUSH 4.34405e+06

With Atomic Instructions, memory_order_seq_cst

POP 1.10804e+07
PUSH 1.21797e+07
POP 1.14811e+07
PUSH 1.31541e+07
POP 1.21784e+07
PUSH 1.34477e+07
POP 1.20657e+07
PUSH 1.35652e+07
POP 1.17625e+07
PUSH 1.32726e+07
POP 1.20278e+07
PUSH 1.33463e+07
POP 1.20177e+07
PUSH 1.33821e+07
POP 1.22883e+07
PUSH 1.33557e+07

Using atomic instructions with memory_order_relaxed, memory_order_acquire and memory_order_release, and with the atomic variables aligned to 64 bytes. Without the 64-byte alignment the numbers do not reliably come close to the results below; they sometimes get near, but most of the time they are a bit worse.

PUSH 2.13774e+07
POP 2.23716e+07
PUSH 2.04712e+07
POP 2.16111e+07
PUSH 2.19893e+07
POP 2.24591e+07
PUSH 1.98706e+07
POP 2.12814e+07
PUSH 2.12732e+07
POP 2.2586e+07
PUSH 2.14926e+07
POP 2.22773e+07
PUSH 2.22551e+07

So, in conclusion, this gives roughly a 5x increase in performance over the mutex version, even with just two threads, which is great news! I always wondered whether using atomic instructions for a one-consumer, one-producer queue implementation would bring any performance increase; it seems it does.

Download the code from here.

If you find any error please comment and let me know!

References

1. http://valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use

2. https://www.youtube.com/watch?v=c1gO9aB9nbs

3. https://www.youtube.com/watch?v=CmxkPChOcvw

4. https://www.youtube.com/watch?v=X1T3IQ4N-3g

5. http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular

6. https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync

7. http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2008/n2664.htm

Special thanks to Jeff Preshing and Herb Sutter for their phenomenal contributions in explaining the details of C++11 atomics and memory ordering.

LaTeX package manager in Arch Linux

It was news to me that there exists a LaTeX package manager for Arch Linux, so you no longer need to manually install anything LaTeX-related. You can enter the package manager, and it works in a similar way to apt-get on Ubuntu/Debian.

Get the following package from the AUR: texlive-localmanager-git, or use yaourt/packer to install it. Stay away from the non-git version; it exits with an error as of now.

After installing, run tllocalmgr. Then type,

install <package name>

Run texhash as the super user to refresh the hashes. The packages should be available to you now, thanks to fellow Archers!
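
Putting the steps together, a session might look roughly like this (the package name is only an example, and the exact prompt text is illustrative):

$ tllocalmgr              # opens the tllocalmgr shell
install pgfplots          # typed inside the tllocalmgr shell
$ sudo texhash            # refresh the hashes afterwards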

Arch Wiki Page

Remove telecom pop-ups from Android

Aren't you annoyed by the pop-ups from telcos trying to sell you something as soon as your phone is idle? Stuff pops up every minute. I hated it and I removed it; here's how.

You need to have your phone rooted.

Use an Android terminal emulator and run the commands below,

su # get a super user console
rm /system/app/Stk.apk # remove the SIM Toolkit app, which shows the pop-ups
touch /system/app/Stk.apk # leave an empty file so it is not installed again
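
One caveat: on most devices /system is mounted read-only, so the rm above may fail with a read-only filesystem error. Remounting it writable first (and read-only again afterwards) should help; the exact mount options can vary per device, so treat this as an assumption:

mount -o remount,rw /system # make /system writable
# ... do the rm and touch from above ...
mount -o remount,ro /system # put it back to read-only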


Seeking with FFmpeg

So people (including me) struggle with FFmpeg. It's very powerful and also a bit hard to understand for the average developer who doesn't have a background in media decoding/encoding. I struggled with playback first, then I found a good implementation on Stack Overflow and built on top of that. This post details that.

Soon after we get playback working, we need to get seeking working. If you are a Vrok (<=3) user, you might have experienced its buggy seeking. As I dug into FFmpeg this vacation it became clear, with the help of dranger's good ol' tutorial; PTS/DTS are detailed well there within the context of FFmpeg. The documentation of FFmpeg tells half the story of its design, which is based on multiple streams. Each stream has its own context, and each context has its separate time base. What's a time base? It's just a number that decides how fine the time scale is: with a finer time base you can store more fine-grained time intervals. But this should be balanced against the media duration and the storage size of the variable the number is stored in.

FFmpeg has a separate context for the whole file ('container' in the code below). This context has a time base defined as AV_TIME_BASE, and it is this time base that is used to store durations for the whole file, namely the total duration. So the following code gives you the total duration in seconds.

duration_in_seconds = container->duration / AV_TIME_BASE;

Each audio stream has a different time base, which you can get by accessing,

container->streams[audio_stream_id]->time_base

This is expressed as an AVRational; all you need to know is that it is a rational number (numerator/denominator) that can be converted to a decimal. So you can take any PTS (Presentation Time Stamp) and turn it into seconds using this time base too, which can be used to show the current playback position.
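
For instance, turning a packet's PTS into seconds can be done with FFmpeg's av_q2d() helper, which converts an AVRational into a double; the packet variable here is illustrative:

// current position in seconds, from the packet's PTS and the stream time base
double position_seconds = packet.pts *
    av_q2d(container->streams[audio_stream_id]->time_base);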

av_seek_frame(container, audio_stream_id, seek_to * audio_st->time_base.den / audio_st->time_base.num, AVSEEK_FLAG_ANY);

This seeks to the correct frame (it is not exact, but it does the job; FFmpeg only seeks to key frames). Here seek_to is stored in seconds, and audio_st is container->streams[audio_stream_id]; as you can see, I have converted seek_to into time-base units by dividing it by the time base (yes, the convention is a bit different from the AVFormatContext's time base). Hope this clears up FFmpeg seeking! Refer to the Vrok source for sample code.
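
A slightly more robust way to do the same conversion is av_rescale_q(), which performs the multiply and divide without overflowing the intermediate value. A sketch, assuming seek_to holds seconds as above:

// rescale a position in seconds (expressed in AV_TIME_BASE units) into
// the audio stream's time base, then seek as before
AVStream *audio_st = container->streams[audio_stream_id];
int64_t target = av_rescale_q((int64_t)(seek_to * AV_TIME_BASE),
                              AV_TIME_BASE_Q, audio_st->time_base);
av_seek_frame(container, audio_stream_id, target, AVSEEK_FLAG_ANY);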

ALC1150 on MSI Z87-GD65: ALSA setup

The ALC1150 chip on this motherboard fails to mix multiple streams played to it. A simple,

pcm.!default{
   type hw
   card 1
}

works, but fails at mixing and does not accept SND_PCM_FORMAT_FLOAT input (maybe because it writes directly to the card's buffer, which doesn't support float input). The fix is to use dmix, but a lot of the dmix setups I tried didn't work at all; they all ended in "can't open output" errors. Put the following in /etc/asound.conf and it'll work like a charm, no need for sound servers! It is a modified version of the conf found at http://wiki.gentoo.org/wiki/ALSA.

pcm.dmixed {
        type asym
        playback.pcm {
                type dmix
                ipc_key 5678293
                ipc_perm 0660
                ipc_gid audio

                slave {
                        channels 2 # set to 6 for 5.1 output
                        pcm {
                                format S16_LE # S32_LE
                                rate 48000 # can also be 44100
                                type hw
                                card 1 # your card
                                device 0 # your device
                                subdevice 0 #important?
                        }

                        period_size 1024
                        buffer_size 8192
                }

                bindings {
                        0 0
                        1 1
# Uncomment below if using 6 channel
#                       2 2
#                       3 3
#                       4 4
#                       5 5
                }
        }
        capture.pcm "hw:0"
}

pcm.!default {
        type plug
        slave.pcm "dmixed"
}
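
To check that mixing actually works, play two streams at once; for example, speaker-test can generate one while aplay plays a file (the file name is just a placeholder):

speaker-test -D default -c 2 -t wav &   # generate a test stream in the background
aplay -D default some_file.wav          # if both are audible, dmix is working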