C++ vs. Java benchmarks?

Note: I think this should go on the top: This post has been attracting alot of views(which I am very much enjoying). And a many of that think I am wrong to do it in the first place, guys! it’s something that I did in my spare time I have not come to any conclusion other than my optimmized code is faster than the one given by the original author. Who was flaming C++ for being slow. I do not know what content in this post makes my post look like that it was written by someone who says C++ is the only language that should be used. And if you are offended by this post, all I can tell is to read the last paragraph again. This post is not about “what’s the best language evar!”, it is about C++ optimizations and how it is easy to write slow code in C++. And ofcause, the links are given here for a reason read them first.

The internet is full of surprises, today I was searching for a better C++ IDE(I know about KDevelop and how awesome it is, yet there might be something better!). I stumbled upon this conversation,

http://www.cplusplus.com/forum/unices/67839/ and from a link from there this, http://www.cplusplus.com/forum/unices/64290/

Have to say that I was amazed to see them proving C++ slower than Java in the most uninformed way possible! The code has been crafted from a Java perspective and has been implemented in C++. Which is insane because, no one in their right mind would think that the given C++ code is optimal. Unlike in Java where the developer can do very little to optimize the code, C++ has a wide range of optimizations that should be done by the developer. C/C++ is considered not as easy as Java just because of that, it is so much easier to write slower code with it because there are too many ways to get one thing done.

Java is garbage collected, so it does no freeing of memory until its next collection round. This can be a huge problem if memory isn’t preallocated outside loops and again there’s the space-time trade off which any comp. sci. student would know mostly by experience GC systems achieve high performance by killing the space, it eats up as much memory as possible and gives you some performance increase. Given this amount of memory(preallocated) to unmanaged code you will see performance boosts more than you will ever see with a GC, that’s the only thing that I’ve changed in the code that the author claiming C++ is slower has given, I just did a preallocation. This will be faster if everything happened on stack, but that needs a resize of the stack to be run correctly so I used new and delete. shared_ptr was left because the user has claimed its being widely used and needed here(?). He should be informed that there are good garbage collector implementations in C++.

The Java code which is on the post(one similar to the below C++ code) runs in 5.6 seconds on host machine. With following arguments,

-XX:+UseCompressedOops -XX:+DoEscapeAnalysis -XX:+EliminateLocks -XX:+UseBiasedLocking

Host runs Arch Linux, so Java detected that its a server and added -server! Further, the extra arguments had no effect on this machine. And it was Oracle JDK.

#include <vector>
#include <string>
#include <memory>
#include <ctime>
#include <iostream>

void test() {
  const int COUNT = 1000000;

  std::shared_ptr<std::string> *v=(std::shared_ptr<std::string> *) ::operator new(sizeof( std::shared_ptr<std::string>)*COUNT);
  std::shared_ptr<std::string> sp(new std::string("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"));

  for(size_t i = 0; i < COUNT; ++i)
      v[i]=sp;
  delete v;
}

int main() {

    std::clock_t b, e;
    b = std::clock();

    for (int i = 0; i < 1000; ++i)
        test();

    e = std::clock();

    std::cout << "Time: " << (double(e - b)/CLOCKS_PER_SEC) << '\n';
    return 0;
}

Above code just runs on 4.6 seconds. With just -O3.

Memory usage

C++ code: 15 MB

Java code: 320 MB

Java VM only(taken from running a infinite loop): 8 MB

There’s a lot of optimizations one would do to the above C++ code, I just did one and mentioned another. But there is another, which runs even faster!

const int COUNT = 1000000;
char v[COUNT*sizeof(std::shared_ptr<std::string>)];
void test() {
  std::shared_ptr<std::string> sp(new std::string("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"));

  for(size_t i = 0; i < COUNT; ++i)
      memcpy((void*)&v[i*sizeof(std::shared_ptr<std::string>)],(void*)&sp,sizeof(std::shared_ptr<std::string>));
}

And still keeping shared_ptr! This one runs on 3.6 seconds! Original author might disagree with the usage of memcpy and static allocation(specially when Java doesn’t have static allocation), well C++ doesn’t have a GC by default either!

Languages have their differences and there are claims that people should avoid from when one of the languages in question has two levels of complexities one on actual hardware optimization and another on VM code optimization. Not to mention this post is not and was not about what’s the best language it is just a post detailing C++ optimizations.

Deadlock Synchronization

UPDATE: If you want a fast playback system which can do async requests for buffers better go with a lock-free audio queue. Vrok’s backend will be changed soon to a lock-free audio queue which is the most optimal design that I found after some research.

NOTE: If you want a bug free program in a constrained amount of time, this won’t be the most optimal method; for now at least.

ANOTHER NOTE: DO NOT use common lock names, std::mutex, pthread_mutex_lock ..etc and roll out your own and use Thread Error Detectors they actually think its the standard version of these classes/functions!

After you have read the above note, you can continue reading of my experience in using something that was always considered to avoid. I did it because it did not occur to me in the first place. I did not use much of the common standards that were in use in the field. But I however used the “producer-consumer” problem to gain some ground.

I am talking about my experience while implementing the decoder-playback loop of  Vrok. The loop is documented below,

Decoder Thread

mutex[0].lock()
// work
mutex[1].unlock()

Playback Thread

if (atomic_check(&pause_check))
    atomic_set(&pause, true)
    pause_mutex.lock()
    pause_mutex.unlock()
    atomic_set(&pause, false)
    atomic_set(&pause_check,false)
mutex[1].lock()
// work
mutex[0].unlock()

Anyone would be implementing something like this if they were to follow a producer-consumer approach, and then I remembered that this is a player and I have to pause this from time to time, and I did not pick to just pause the output or stop the looping altogether; I chose to deadlock purposely which led to alot of problems there after, thread error detector programs like Valgrind and Intel’s Inspector XE reported deadlocks in huge amounts that I started to wonder how does it even run at all with all of these deadlocks! And there were actual deadlocks because of the “irresponsible” implementations that I have done in the pause() function.

I had to make up my mind either to keep on using what ever that I have written or go read the source of some successful non-freezing player! It was then that I remembered that I haven’t refered to any documentation nor any proper engineering approach to answer this problem. After drawing the locking diagram, I could see it in a way that I did not see before. Thread Error detectors can help you if you have used the basic elements that are given by standard thread libraries, if you roll your own you are alone! After rethinking about the design I was able to get the more serious problems resolved, problems(i.e. freezing of the control thread: Qt’s main thread) do not appear at normal playback of a song but at the end or the start of a song due to the deadlocks that I have introduced purposely, forget to release the locks in a way that it gurantees that the system resolves the deadlock it is in you will suffer endless nights thinking what you did wrong. Only later you find out that you were doing it right but not in the correct order. I still haven’t seen any documentation on how to implement/debug systems that use deadlocks as a synchronization mechanism, I’d like to see some if some of you know of any drop a comment or a mail if you know of anything that would contribute.

For people who are interested in the technical details of the implementation,

Thread/lock implementation, you need a mutex that’s unlockable from any thread(not just the locked thread). I do know of RW locks but the rwlock() and rdlock() did not suite for this purpose for me because all the existing code was written to use a single lock() function.

Decoder thread code, the implementations of rewind() and pause() are now the same because of some fixes, this was done because to keep the API stable, it’ll be changed soon.

Playback thread code

Single header utility library for C++

For now it only includes string functions such as, split, to_string, from_string, to_upper, to_lower …etc. More will be added in the future, all functions are added to the std namespace if they are related to any std class, such as std::string. If they are not related they’ll be in the cpplib namespace.

License: Public Domain

https://github.com/manushanga/cpplib