Problems with Old Motherboard Chipsets and New SSDs

I recently installed Windows 8 as a dual-boot on my old Vostro 1500 laptop (I usually run Linux on it) and found that it suffered from terrible bouts of freezing when running Windows. It happened periodically, roughly every one to five minutes: anything accessing the hard drive would stall for about a minute and then resume as if nothing had happened. Everything that wasn't touching the drive kept working perfectly, which suggested the problem lay with the storage controller or the drive itself.

It was extremely frustrating and I couldn't really get any work done. After not having much luck with any settings, I stumbled upon the Windows Event Viewer, which I had never paid much attention to before, and found a whole heap of warnings that matched up with the times of each freeze, along the lines of:

disk: The IO operation at logical block address xxxxxxx for Disk 0 was retried.

storahci: Reset to device, \Device\RaidPort0 was issued.

After a bit of online investigation I found that other people were having a similar problem, but there didn't seem to be a definitive solution. One common factor, however, was that they were all using solid state drives. I tried some suggested fixes, such as turning off dynamic ticks, but this had no effect, and I didn't want to switch my IO controller to basic IDE mode, as AHCI provides important features such as the TRIM command for SSDs, which keeps them running fast, among other things.

I finally had some luck when I installed some old storage drivers from the Intel website. The freezing stopped completely and Windows 8 ran like a dream, which was surprising given the laptop's old hardware. There was one catch, though: TRIM appeared to be disabled, so I couldn't optimise the drive and unused blocks were not being marked for clearing in the background. Since the Intel ICH8 chipset used in the Vostros was made long before SSDs became a big thing, it seems the chipset doesn't support the TRIM command at all. That would explain the freezes: the TRIM command is sent periodically, and each time it probably crashed the storage driver. Further confirmation comes from the fact that optimising the SSD (which is supposed to run TRIM across the whole drive) doesn't work and instead causes a freeze. I had a similar problem last year with a standard hard drive installed: whenever Windows 7 sent some sort of power-saving command to the drive, the storage controller would crash and reset, causing a freeze. Turning off a particular power-saving option fixed that, which again points to an incompatibility with the old chipset. I'm not sure why Windows doesn't simply detect that the storage controller doesn't support certain commands.

So, in summary, if you have an SSD paired with an old chipset like the ICH8, set your controller to IDE mode in your BIOS or install older drivers to avoid the freezing errors. I still have to confirm whether the TRIM command works in Linux. I assume it doesn't, but I probably haven't noticed because Linux fails more gracefully than Windows does. If anyone has a solution, be sure to comment.
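For checking on the Linux side, the kernel reports each block device's discard (TRIM) capability through sysfs. Below is a minimal Python sketch; the device name sda is just an assumption for illustration, and this only shows whether the kernel thinks discard is supported, not whether the old controller actually handles the command cleanly.

    # Minimal sketch: check whether the kernel reports TRIM/discard support
    # for a block device. The device name "sda" is an assumption -- adjust
    # it to match your system. A value of 0 in discard_max_bytes means the
    # kernel will not issue discard (TRIM) commands to that device.
    from pathlib import Path

    def supports_trim(device="sda"):
        path = Path(f"/sys/block/{device}/queue/discard_max_bytes")
        try:
            return int(path.read_text().strip()) > 0
        except FileNotFoundError:
            return False

    if __name__ == "__main__":
        print("TRIM/discard supported:", supports_trim("sda"))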

Concurrent Algorithms: The Next Frontier

Lately the clock speeds of processors in computers and mobile devices have reached a plateau. Clock speeds had been rising steadily since the advent of the modern processor, alongside the prediction of Intel co-founder Gordon Moore (known as Moore's Law) that the number of transistors in a CPU would double roughly every two years. Around 2005 this changed. Processor manufacturers hit a wall: pushing clock speeds higher was no longer practical from a power-consumption and heat point of view, which is much more of an issue now than it was when Moore made his famous prediction, and transistors were approaching the physical limits of how small they can be made before they stop working reliably. Instead, CPU manufacturers have opted to increase the number of cores in a processor in order to improve processing power.

Intel CPU clock speed over time. (Source)

At first this seems like a valid solution. But on closer thought a problem becomes apparent: the algorithms and data structures intrinsic to software development, which are taught at universities and power all the software we use, were designed in a time when the common computer had a single processor. Little thought was given to writing algorithms that perform well concurrently (with the possible exception of software written for expensive server and mainframe systems). With the number of cores squeezed onto a single processor increasing at a phenomenal rate, with chips such as Tilera's 64-core TILE64 and 100-core TILE-Gx, computer scientists need to start thinking outside the box to develop new versions of common algorithms and data structures that work and scale efficiently in a multiprocessor environment. By that I don't mean just tacking fixes onto current algorithms and data structures to make them compatible with multiple processors; I mean rethinking the whole design from the ground up.

Although existing algorithms have been modified to scale on concurrent systems, it is rare to get a one-to-one speed-up between the number of cores and the performance of an algorithm: the overheads of coordinating memory access between cores create bottlenecks, so more cores don't necessarily mean more speed. This raises the question of whether the fundamental data structures and algorithms of computer science can ever scale effectively over multiple processors.

An interesting online MIT lecture by Nir Shavit that I watched earlier in the year (sadly since made private for unknown reasons) showed me that it is indeed possible to break free of these bottlenecks if you are willing to think beyond traditional methods. As an example he used a simple stack, one of the most basic, well-known data structures in computer science. A standard stack is a last-in first-out data structure, meaning that the data that has been in the stack for the least time is the first to come out of it (think of a shelf at a supermarket). It is simple to implement: there is a pointer to the current top element, which is incremented when an item is added or ‘pushed’ and decremented when an item is taken out or ‘popped’, along with a counter tracking the number of elements so you can't pop when the stack is empty or push when it is full. Simple as they are, accessing a classic stack from multiple processors concurrently is a surefire recipe for data corruption: because the order of operations is not deterministic, the results are unpredictable.
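To make the race concrete, here is a minimal sketch of such a stack in Python (my own illustration, not code from the lecture). The comments point out where two threads can interleave and corrupt the structure.

    # Minimal single-threaded stack sketch: an array, plus a count that also
    # serves as the index of the next free slot. Illustration only.
    class SimpleStack:
        def __init__(self, capacity=1024):
            self.items = [None] * capacity
            self.count = 0

        def push(self, value):
            if self.count == len(self.items):
                raise OverflowError("stack is full")
            self.items[self.count] = value   # step 1: write the element
            self.count += 1                  # step 2: advance the top

        def pop(self):
            if self.count == 0:
                raise IndexError("stack is empty")
            self.count -= 1                  # step 1: move the top back
            return self.items[self.count]    # step 2: read the element

    # With two threads pushing at once, both can read the same value of
    # count before either increments it; one element is silently lost and
    # the count ends up wrong -- the corruption described above.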

A basic diagram of a stack. (Source)

An easy way to fix this is to let only one processor access the stack at a time. Although this is completely effective at preventing data corruption, it is very inefficient: while one processor is using the stack, the others do nothing but wait their turn. It also defeats the purpose of using multiple cores, as it turns the operations into a linear series of events rather than a concurrent one. There is a myriad of ways to improve the performance of this kind of concurrent stack, such as optimistic concurrency, but none of them achieves a one-to-one speed-up. In fact, most of their benefits shrink as you add more threads, which prevents them from being truly scalable. Among the different methods, the asymmetric rendezvous stack mentioned in the MIT lecture (an example of which is outlined here) is the one that stands out.
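In code, the one-at-a-time approach is just a coarse lock around every operation. A minimal Python sketch (again my own illustration, using a plain threading lock):

    import threading

    # Coarse-grained locked stack: correct under concurrency, but every push
    # and pop serialises on a single lock, so extra cores mostly just wait.
    class LockedStack:
        def __init__(self):
            self._items = []
            self._lock = threading.Lock()

        def push(self, value):
            with self._lock:
                self._items.append(value)

        def pop(self):
            with self._lock:
                if not self._items:
                    raise IndexError("stack is empty")
                return self._items.pop()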

The rendezvous stack works on a simple concept: if a stack is a last-in first-out data type, and there are many processes trying to both pop and push at the same time, why not just hand the data directly from the pushers to the poppers (if you'll excuse my terrible use of language)? To explain this with a practical example, imagine a warehouse where trucks drop off pallets of goods at random and other trucks come to take pallets away (also at random). As the last pallet put into the warehouse is at the front and easily accessible, it is picked up first and taken away. This system works if we only have a single delivery per hour, but what if hundreds of trucks arrive at once? That causes a bottleneck which could take days to clear, and all the while more trucks are arriving. Then let's say the frustrated drivers have an idea: what if the trucks delivering pallets transfer them directly to the trucks taking pallets away? With the middleman cut out, the trucks pair up, all working concurrently to transfer the pallets, and the bottleneck is cleared in no time at all. This is the basic principle of the rendezvous stack.
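A rough sketch of that pairing idea in Python is below. It is only a toy to show the concept: a pusher and a popper that arrive at around the same time exchange a value directly through a shared slot instead of touching the stack, and if nobody shows up in time they fall back to the normal (locked) stack. The real algorithm in the paper (AdaptiveAR) uses an adaptive array of lock-free slots rather than a single lock-protected one.

    import threading

    # Toy rendezvous ("elimination") slot. A pusher offers a value; a popper
    # that arrives in time takes it directly, so neither touches the shared
    # stack. Real implementations use an array of lock-free slots; this
    # sketch uses one slot and a condition variable purely to show the idea.
    class RendezvousSlot:
        EMPTY = object()

        def __init__(self):
            self._cond = threading.Condition()
            self._value = self.EMPTY
            self._taken = False

        def try_push(self, value, timeout=0.001):
            """Offer a value to a waiting popper; True if it was taken."""
            with self._cond:
                if self._value is not self.EMPTY:
                    return False                 # slot busy with another pusher
                self._value = value
                self._taken = False
                self._cond.notify_all()
                self._cond.wait_for(lambda: self._taken, timeout)
                if self._taken:
                    return True
                self._value = self.EMPTY         # nobody came; use the stack instead
                return False

        def try_pop(self, timeout=0.001):
            """Take a value directly from a pusher, if one is waiting."""
            with self._cond:
                if self._cond.wait_for(lambda: self._value is not self.EMPTY, timeout):
                    value, self._value = self._value, self.EMPTY
                    self._taken = True
                    self._cond.notify_all()
                    return True, value
                return False, None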

Not only is this an extremely simple idea, it improves the performance of the stack by an incredible amount. I copied a graph from the paper I mentioned earlier and put it below for reference. Shown in purple as a baseline is the standard Java stack implementation, which is flat as it cannot use more than one thread at a time. FC-Parallel, FC-Single and ED-Tree each use different methods to let stacks operate concurrently and allow modest scaling as the number of threads increases. The definite stand-out, however, is the AdaptiveAR rendezvous method, which scales incredibly well with the number of threads. What is really impressive is that at times the improvement is super-linear: not only is there a one-to-one speed-up between the number of threads and the speed of the algorithm, which is an achievement in itself, but the algorithm actually works faster with more threads than with fewer. As you can see, after eight threads the rate of improvement actually increases, especially when using the sequential IDs mentioned in the paper. In this case, increasing the number of cores in a CPU is a boon rather than a hindrance!

The speed of different concurrent stack implementations run on an UltraSPARC T2 Plus multicore machine. AdaptiveAR uses a rendezvous model.
Source: Fast and Scalable Rendezvousing. Yehuda Afek, Michael Hakimi and Adam Morrison. DISC 2011.

It might seem that this post is just a promotion for the work of Afek et al., but I am only using it as an example of the kind of thinking needed to move beyond outdated single-core methodologies. Their innovations have brought the dated stack data structure into the multi-core era. The field of concurrency is only going to become more important as the number of cores in a CPU grows and people demand more processing power for new technologies with ever-increasing requirements. By encouraging people to think concurrently, rather than teaching the same methods that have been around for decades, universities could spearhead the transition to properly scalable algorithms for building the software of the modern era.

Packaging LaTeX dependencies with your .tex files

A problem I've had for a while now is that when I move my LaTeX code around to different computers or send it to colleagues, different LaTeX installations often have different sets of packages installed. This usually results in lots of missing-file errors and the slow process of discovering all the missing dependencies manually and installing them. Looking for a way to do this automatically, I did a quick web search, which didn't turn up any results. One mailing list post, however, pointed me to a LaTeX switch called ‘-recorder’, which records the file operations performed by the LaTeX compiler and dumps them into a ‘.fls’ file. Not wanting to wade through this file looking for dependencies manually, I wrote a Python script called latex-packager.py, which you can find in my GitHub repository (a simplified sketch of the dependency-collection step follows the list below). If you run it with a ‘.tex’ file as a parameter it will run through the following steps:

  1. Check if the current ‘.fls’ is valid (i.e. it exists and is up to date).
  2. If the ‘.fls’ file is invalid it will (optionally) run the latex compiler for you with the recorder switch to generate a new ‘.fls’ file.
  3. Create a temporary folder and copy in all the files used to compile your ‘.tex’ file.
  4. Put all the files into an archive file with the format of your choosing.
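For anyone curious how step 3 can work: the ‘.fls’ file lists every file the compiler touched as INPUT and OUTPUT lines, so collecting dependencies is mostly a matter of filtering the INPUT entries. A simplified Python sketch (not the actual latex-packager.py code) might look like this:

    # Simplified sketch of the dependency-collection step; not the actual
    # latex-packager.py code. The .fls file written by "pdflatex -recorder"
    # lists every file the compiler opened as INPUT/OUTPUT lines.
    import shutil
    from pathlib import Path

    def collect_dependencies(fls_path, dest_dir):
        dest = Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
        copied = set()
        for line in Path(fls_path).read_text().splitlines():
            if not line.startswith("INPUT "):
                continue
            src = Path(line[len("INPUT "):].strip())
            if src.is_file() and src not in copied:
                shutil.copy(src, dest / src.name)  # flatten into one folder
                copied.add(src)
        return copied

    # Example: collect_dependencies("report.fls", "packaged")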

I was worried that the LaTeX compiler wouldn't look in the working directory for all the dependencies, so I tested it by packaging one of my ‘.tex’ files and extracting the archive into an empty directory. When I compiled the document with the recorder switch, the ‘.fls’ file showed the compiler sourcing all the files from the current directory, which suggests the current directory is the first place it looks for any file.

The next step will be to find some way to store all the dependencies in a subdirectory and reference it from there, so that the main directory isn't littered with so many files. Even as it stands, though, it's fine for quickly sending a ‘.tex’ file to someone without having to worry about whether they have all the dependencies needed to compile it.

Basic Usage: ./latex-packager.py tex_file output_file

There are numerous switches that you can use to customise the behaviour of latex-packager:

-v: Print more verbose information while executing.
-f: Force the script to ignore non-critical errors (not recommended).
-[z | j | s | p]: Use gzip, bzip2, 7zip and zip compression formats respectively (default is gzip).
--latex <spec>: Specify the latex compiler to use and any custom switches (default is pdflatex).