Lately the clock speeds of processors in computers and mobile devices have reached a plateau. Clock speeds have been steadily increasing since the advent of the modern processor, faithfully following the prediction of Intel’s co-founder Gordon Moore (known as Moore’s Law) that the amount of transistors in a CPU would double every year. Around 2005 this changed. Processor manufacturers hit a wall, not only was it not practical to increase the clock speed from a power consumption point of view – which is much more of an issue currently then it was back when Moore made his famous law – but processor manufacturers actually reached the physical limits of how small transistors can be made before they fail to work efficiently. Instead, CPU manufacturers have opted to increase the amount of cores in a processor in order to improve the processing power.
At first this seems like this could be a valid solution. But when given more thought, a problem becomes apparent: all the algorithms and data structures intrinsic to software development, that are taught at universities and power all the software we use, were designed in a time where the common computer had a single processor. There was little thought given to writing algorithms to perform well concurrently (with the possible exception of software written for expensive server and mainframe systems). With the amount of cores squeezed onto a single processor increasing at a phenomenal rate with processors such as the 64 core TILE64 and 100 core Tilera, computer scientists need to start thinking outside the box to develop new efficient versions of common algorithms and data structures that work and scale efficiently in a multiprocessor environment. By that I don’t mean just tacking on fixes to make current algorithms and data structures compatible with multiple processors, I mean rethinking the whole design from the ground up.
Although there have been modifications of existing algorithms to scale in concurrent systems, it is rare to get a one-to-one speed up between the number of cores and the performance of algorithms. In other words, the overheads of concurrent memory access from different cores create bottlenecks. This suggests that more cores don’t necessarily mean more speed. This raises the question of whether the fundamental data structures and algorithms of computer science can ever scale over multiple processors effectively.
An interesting online lecture from MIT given by Nir Shavit I watched earlier in the year, which has sadly been made private for unknown reasons, showed me that it is indeed possible to break free of these bottlenecks if you are willing to think outside traditional methods. As an example he used a simple stack, one of the most basic, well known data structures in computer science. A standard stack is a last-in first-out data structure, meaning that the data that has been in the stack for the least time will be the first to come out of it, (think of a shelf at a supermarket). It is a simple data structure to implement. There is a pointer to to the current data element which is incremented when an item is added or ‘pushed’, and decremented when an item is taken out or ‘popped’. There is also a counter to keep track of the number of elements in the stack so you can’t pop when it is empty or push too many items into it. Although they are simple, accessing a classic stack with multiple processors concurrently is a surefire recipe for data corruption. As the order of operations are not deterministic the results are unpredictable.
An easy way to fix this is to only let one processor access the stack at one time. Although this is 100 percent effective at preventing data corruption due to concurrency, it is very inefficient. When one processor is using the stack the others are just doing nothing and waiting their turn. This also defeats the purpose of using multiple cores, as it turns each operation into a linear series of events rather than a concurrent one. There are a myriad of ways to improve the performance of this kind of concurrent stack, such as optimistic concurrency, but none of the methods get a one-to-one speed up. In fact, most of the benefits of these methods decrease as you start using more threads which prevent them from being truly scale-able. Among the different methods, the asymmetric rendezvous stack which was mentioned in the MIT lecture – an example of which is outlined here – is the only one that stands out.
The rendezvous stack works on a simple concept, if a stack is a last-in first-out data type, and there are many processes trying to both pop and push at the same time, why not just hand the data directly from the pushers to the poppers? (if you excuse my terrible use of language). To explain this in a practical example imagine a warehouse where trucks drop off pallets of goods at random and other trucks come to take away the pallets (also at random). As the last pallet put into the warehouse is at the front and easily accessible it is picked up first and taken away. This system works if we only have a single delivery per hour but what if hundreds of trucks arrive at once? This causes a bottleneck which could take days to clear and all the while more trucks are arriving. Then lets say that the frustrated drivers then have an idea: what if the trucks delivering the pallets transfer the pallets directly to the trucks taking the pallets away? With the middleman cut out, the trucks pair up and all work concurrently to transfer the pallets and the bottleneck is cleared in no time at all. This is the basic principal of the rendezvous stack.
Not only is this an extremely simple idea, it increases the performance of the stack by an incredible amount. I copied a graph from the scientific paper I mentioned earlier and put it below as a reference. Shown in purple as a baseline is the standard Java stack implementation which is flat as it cannot use more than one thread at once. The FC-Parallel, FC-Single and ED-Tree all use different methods to allow stacks to operate concurrently and allow for modest scaling as the number of threads increase. The definite stand-out, however, is the AdaptiveAR rendezvous method which scales incredibly well with the number of threads. What is really impressive is that at times the improvement is super-linear. Not only is there a one-to-one speed up between the number of threads and the speed of the algorithm – which is an achievement in itself – but the algorithm actually works faster with more threads than with less. As you can see, after eight threads the rate of improvement actually increases, especially when using the sequential IDs that are mentioned in the paper. In this case increasing the number of cores in a CPU is a boon rather than a hindrance!
It might seem that this post is just a promotion for the work of Afek et al. but I am just using their work as an example of the kind of thinking needed to improve current outdated single core methodologies. Their innovations have brought the dated stack data structure into the multi-core era. I think that the field of concurrency is only going to get more important as the number of cores in a CPU grow and people demand more processing power for the new technologies that are coming out with ever-increasing requirements. By encouraging people to think concurrently, rather than teaching the same outdated methods that have been around for decades, universities could spearhead the transition to properly scalable algorithms to build software for the modern era.