Thursday, July 18, 2013

Creating a thread-safe C++ Singleton Instance with TBB

http://goparallel.sourceforge.net/creating-a-thread-safe-c-singleton-instance-with-tbb

Singleton instances have their place in programming, but can present a problem in multithreaded programming. You only want a single thread to create the initial instance and to perform the initialization. With the help of the atomic type in Threading Building Blocks, you can safely create your instance.
Last time we met up, we talked about the inner workings of the atomic template in Threading Building Blocks. That template helps you with some high-performance mutexes and adding and swapping by making use of some assembler-level operations. These operations use what’s called fencing to make such operations possible. The end result is that you can create fully thread-safe classes that, for example, include a usage count.
Quick Theory: Atomic Template
When might this type of template might be useful? If you want to have some initialization take place only after an object is accessed the first time, for example, then some cleanup when the object is no longer used—all while being multi-thread, multi-core safe. The good news is that you don’t have to write a single line of assembler code. The template already has done that part for you.
Without such a template, you might use operating-system level mutexes. But then you end up taking a performance hit. By letting the template work at the assembler level, time spent is minimal.
Here’s the idea: When you need to increment a value, you want to grab the current value, add one to it, and store it back in. Without any kind of mutex in place, two competing threads might both grab the current value at almost the exact same time, thus getting the same value. They both add one to it, and store the new value back in. So instead of incrementing by two as it should, the value only increments by one, resulting in a bug. At the assembler level, the processor allows a single thread to grab the value, add on to it, and write it back on as an atomic operation (hence the name of the template, atomic). This fencing happens extremely fast, so the wait time for other threads is very minimal. That’s far more efficient than creating a mutex at the operating-system level.
And Quick Practice…
Let’s try it out. Remember, we’re dealing with templates, so we’ll be using the template as a starting point for our own class. We’ll create a singleton class. The first time the instance is needed, we’ll create the instance. To keep this simple, we won’t do a decrement or cleanup; feel free to try that part yourself and discuss it in the comments, and we’ll look at in the future if there’s interest.
Here’s some code that does it. This isn’t the only approach, but it’s one way that works:
01
class Singleton
02
{
03
    static tbb::atomic<Singleton *> inst;
04
    int x;
05
    int y;
06
    Singleton(): x(10), y(20) { }
07
public:
08
    int getX() {
09
        return x;
10
    }
11
    int getY() {
12
        return y;
13
    }
14
    static Singleton &getInst() {
15
        if (inst == 0) {
16
            Singleton* temp = new Singleton();
17
            if (inst.compare_and_swap(temp, NULL) != NULL) {
18
                delete temp;
19
            }
20
        }
21
        return *inst;
22
    }
23
};
24
tbb::atomic<Singleton *> Singleton::inst;
25
 
26
int _tmain(int argc, _TCHAR* argv[])
27
{
28
 
29
    cilk_for(int i=0; i<8; i++) {
30
        Singleton& s = Singleton.getInst();
31
        cout << s.getX();
32
    }
33
}
Before diving into this code, I need to point out something about static initialization. In C++, the static line inside the class defines the member, but doesn’t actually declare the member. (In other words, it sets up a name for it, but doesn’t create the storage for it.) You create the storage for it after the class with a line like this:
1
tbb::atomic<Singleton *> Singleton::inst;
The C++ compiler will then allocate the storage. But we have a bit of a problem. I’d like to initialize this, but I can’t. The reason is the static initialization treats the initialization as a call to a copy constructor (which doesn’t exist), instead of using the overloaded = operator to assign the value.
The solution is a bit odd, but it’s the officially sanctioned solution from Intel. The compiler initializes the data to 0, so that’s what we’ll go with. In fact, that’s exactly what we want, an initial value of 0 (which is the same as NULL). So it works as is, without us initializing the data. It seems a little odd, and for purists it might seem outright dangerous, but we’re fine. We have a guarantee from Intel that their compiler will initialize it to 0.
Now for the important part of the code. If the instance exists, we just return it in the getInst function. If it doesn’t exist, we attempt to create it. This is where the thread-safe part comes in. We don’t just create it and shove it into the private inst variable, because there’s a slight possibility that another thread could be doing the same thing at the same time. Instead, we create an instance and save it in a temporary variable. Then in a single atomic operation, we check if the private inst variable is 0, and if so, save the new instance into the variable. We use the compare_and_swap function to do that. In the event another thread managed to create the instance in between the time it took to check if the inst variable is 0, and when we run the compare_and_swap, we’ll be safe; that other thread did it in an atomic operation, and basically beat us to it. So in that case, we’ll get back a non-NULL value, in which case we just delete the temporary variable.
This means you won’t want to do any initialization in the constructor, of course; put that in a separate function, add an else statement and do it there. 
In the end, you’ll have a singleton instance that you can access in your threads with only a single thread guaranteed to create the final instance used.