Unity3D Chroma Key Shader

I thought that finding a shader that makes an area of a texture of a certain colour transparent would be easy. It turns out that finding an efficient and free chroma key shader required a little bit more digging.

The best implementation I found was on a blog called Pilcrow Pipe, but it was written in the OpenGL shading language rather than Unity3D's ShaderLab, which works across Unity's target platforms and lets you change parameters such as the key colour and tolerance through the Unity GUI. So without further ado, here is the ShaderLab port of the chroma key shader.

// Adapted from http://pilcrowpipe.blogspot.jp/2013/03/chromakeyingtransparentbackground.html

Shader "Custom/ChromaKey" {

    Properties {
        _MainTex ("Base (RGB)", 2D) = "white" {}
        _thresh ("Threshold", Range (0, 16)) = 0.8
        _slope ("Slope", Range (0, 1)) = 0.2
        _keyingColor ("Key Colour", Color) = (1, 1, 1, 1)
    }

    SubShader {
        Tags {"Queue"="Transparent" "IgnoreProjector"="True" "RenderType"="Transparent"}
        LOD 100

        Lighting Off
        ZWrite Off
        AlphaTest Off
        Blend SrcAlpha OneMinusSrcAlpha

        Pass {
            CGPROGRAM
            #pragma vertex vert_img
            #pragma fragment frag
            #pragma fragmentoption ARB_precision_hint_fastest
            #include "UnityCG.cginc"

            sampler2D _MainTex;
            float3 _keyingColor;
            float _thresh; // 0.8
            float _slope;  // 0.2

            float4 frag(v2f_img i) : COLOR {
                float3 input_color = tex2D(_MainTex, i.uv).rgb;
                float d = abs(length(abs(_keyingColor.rgb - input_color.rgb)));
                float edge0 = _thresh * (1.0 - _slope);
                float alpha = smoothstep(edge0, _thresh, d);
                return float4(input_color, alpha);
            }
            ENDCG
        }
    }

    FallBack "Unlit"
}
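If you want to experiment with the threshold and slope values before touching the shader, the keying maths the fragment shader performs is easy to reproduce offline. Below is a small Python sketch of the same calculation (for experimentation only; it is not part of the shader):

```python
import math

def smoothstep(edge0, edge1, x):
    """Hermite interpolation from 0 to 1 between edge0 and edge1,
    matching the Cg/GLSL smoothstep() used in the shader."""
    t = max(0.0, min(1.0, (x - edge0) / (edge1 - edge0)))
    return t * t * (3.0 - 2.0 * t)

def chroma_alpha(pixel, key, thresh=0.8, slope=0.2):
    """Alpha for an RGB pixel (components in 0..1): near 0 close to
    the key colour, rising smoothly to 1 as the colour distance grows."""
    d = math.sqrt(sum((p - k) ** 2 for p, k in zip(pixel, key)))
    edge0 = thresh * (1.0 - slope)
    return smoothstep(edge0, thresh, d)
```

A pixel exactly matching the key colour gets alpha 0 (fully transparent), while a distant colour such as magenta against a green key gets alpha 1 (fully opaque).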

I hope that somebody finds this useful.

Again, all credit for the logic behind the shader goes to this blog post.

How to Host Multiple Domains on a Single Google App Engine Instance

I have two domain names that I manage: one for my online business card, the other a place to access my projects. I had them both hosted on the same Amazon EC2 instance until the free tier expired recently, at which point I went looking for another free place to host my websites. I settled on Google App Engine: although it is a PaaS and gives me less direct control over the instance, it is very generous with its free quota and lets me host my websites for as long as I want without spending a cent. I can also still write rich, dynamic web content in Python (or Java or Go if I want to) and have plenty of control over how my websites function.

The first thing I did was set up the 'app.yaml' file to route all requests to my main Python application, which serves out the content for each domain. I couldn't use the 'static' handler because, even though it is more efficient than serving files through an application, it cannot differentiate between domains. Therefore, I passed all requests to an application handler which determines which domain the request came from and sends back the required content. It's a little more involved than just reading the requested file and sending it to the client: as a bare minimum, the server should send the MIME type in the headers so the browser knows how to interpret the file. To do this I simply imported the Python 'mimetypes' package and it worked out of the box. I also added some basic cache checking to avoid sending down a whole file again if it hasn't changed since the last request.
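The host-based dispatch at the core of such a handler can be sketched in a framework-neutral way. The domain and directory names below are illustrative placeholders, not my real configuration, and the App Engine request plumbing is omitted:

```python
import mimetypes
import os

# Hypothetical mapping from request host to a per-domain content
# directory (the names here are placeholders).
DOMAIN_ROOTS = {
    'www.business-card.example': 'card_site',
    'www.projects.example': 'projects_site',
}

def resolve(host, path):
    """Pick the file to serve for a given Host header and URL path,
    and guess the MIME type to send in the response headers."""
    root = DOMAIN_ROOTS.get(host, 'default_site')
    # Directory-style requests fall back to index.html.
    if path == '' or path.endswith('/'):
        path += 'index.html'
    filepath = os.path.join(root, path.lstrip('/'))
    mime, _ = mimetypes.guess_type(filepath)
    return filepath, mime or 'application/octet-stream'
```

In the real handler, the application would then read the file from disk, set the `Content-Type` header from the guessed MIME type, and compare the file's modification time against the `If-Modified-Since` header to decide between a 304 response and sending the full file.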

If you are like me and just like to sink your teeth straight into an example, have a look at the code I hosted on my GitHub account.

If you aren’t happy using your application's processing power to serve static files, a great way to take load off the server is a caching proxy service called Cloudflare. Cloudflare caches your website automatically and serves your files through its CDN, taking the load off your server and speeding up responses to your clients. Best of all, it’s free. Not to mention that it provides protection from things like denial-of-service attacks.

Clearer Web Design: Separating Content and Design

As I have recently started work as a web developer, I have been exposed to many more examples of how websites can be constructed, both at work, and online when I am looking for inspiration and solutions to problems that I’m having. Web development seems to be a task that can allow many bad habits to form as there are many ways of coercing a browser to lay out your web page according to the design you want. When you are under time constraints, or if you’re just starting in web development, it’s tempting to take what initially appears to be the easy path. However, building a website the wrong way is like building a house with a crumbling foundation. As soon as you want to change anything the whole house falls down which causes much more grief than if you had built the foundation properly in the first place.

I could list many of these ‘shortcuts’ and poor development practices, but today I want to focus on the classic argument over whether to lay out a page with content elements such as divs or with table elements. People argue vehemently from each camp. Supporters of tables argue that tables are simpler, and that if the end product is the same it shouldn’t matter how you code your web page, since the average audience member won’t see the source. Supporters of content elements argue that content should be kept separate from style, to allow rapid changes to layout and design among other things. However, many people take this too far, believe that table elements are discouraged entirely, and make their own lives harder by creating strange nested div layouts to display the very tabular data the table element is designed for. I think it all comes down to a big misunderstanding of the way a web site should be constructed.

A web page is not just a big table with invisible borders or a big image; it is primarily a document. Think of the way you write a book: you write the content first and then divide it up into sections and headings, but in the first stages you are primarily concerned with the content. You then take your manuscript to a publisher, and they apply a layout and style to your content to create the final book. Building a web page is exactly the same idea: you write the content you want to convey to your audience and divide it into logical sections, and afterwards you apply a style to it, which can be altered at any time without having to touch the content at all.

A perfect example of the separation of content and style being a boon is when you want to use the same content in different situations. For example, your web page could be viewed on a smartphone, tablet or even printed, each having different layouts and styles. Tables do not allow for this kind of flexibility. Another example is screen readers for the vision impaired. If your content is separated from the layout the screen readers can easily pick out the relevant information and ignore the style which just gets in the way.

Below is an example of the same web page laid out using content tags and using tables. It’s clear that separating the content from the layout produces much more readable code. A screen reader could easily traverse the example on the left and know the context of the content, but would have a hard time working out how the content fits together in the table layout. If at any time I wanted to change the layout of the left example, I could simply change a definition in the CSS file and the content wouldn’t have to be touched. For a great example of this, see the CSS Zen Garden.


An example of the same web page using content tags (left) and tables (right).

In this golden age of information we need to make sure that information is presentable and easily accessible. It is no longer the ’90s: the standards are there, and there is an abundance of information on how to build websites properly that you can find with a quick web search. For more examples of why tables should only be used for tabular data, check out this website, a great introduction to these principles that goes into much more depth than I have.

Problems with Old Motherboard Chipsets and New SSDs

I recently installed Windows 8 as a dual-boot on my old Vostro 1500 laptop (I usually run Linux on it) and found that it was suffering from terrible bouts of freezing when running Windows. It would happen periodically, about every 1-5 minutes. Anything accessing the hard drive would stall for about a minute and then resume like nothing had happened. Everything not accessing the hard drive would work perfectly which showed it was something to do with my storage controller or the hard drive itself.

It was extremely frustrating and I couldn’t really get any work done. After not having much luck with any settings I stumbled upon the Windows Event Viewer which I had never paid much attention to before and found a whole heap of warnings that matched up with the times of each freeze that went something along the lines of:

disk: The IO operation at logical block address xxxxxxx for Disk 0 was retried.

storahci: Reset to device, \Device\RaidPort0 was issued.

After a bit of online investigation I found that other people were having a similar problem, but there didn’t seem to be a definite solution. One common factor, however, was that they were all using solid state drives. I tried some suggested fixes, such as turning off dynamic ticks, but this didn’t have any effect, and I didn’t want to switch my IO controller to basic IDE mode, as AHCI has important features such as the TRIM command for SSDs, which keeps them running fast, among other things.

I finally had some luck when I installed some old storage drivers from the Intel website. The freezing completely stopped and Windows 8 ran like a dream, which was surprising given the old hardware in the laptop. There was one catch though: TRIM seemed to be disabled, so I couldn’t optimise the drive, and unused blocks were not marked for clearing in the background. Since the Intel ICH8 chipset used in the Vostros was made long before SSDs became a big thing, it seems the chipset does not support the TRIM command at all. This would also explain the freezes: the TRIM command is sent periodically and probably crashed the storage driver each time. Further confirmation comes from the fact that optimising the SSD (which is supposed to run TRIM on the whole drive) doesn’t function and causes a freeze. I had a similar problem last year when I had a standard hard drive installed: when Windows 7 sent some sort of power-saving command to the hard drive, the storage controller would crash and reset, causing a freeze. Turning off a certain power-saving option fixed it, which again points to an incompatibility in the old chipset. I’m not sure why Windows doesn’t just detect that the storage controller doesn’t support certain commands.

So in summary, if you have an SSD paired with an old chipset like the ICH8, set your controller to IDE mode in your BIOS or install old drivers to avoid the freezing. I still have to confirm whether the TRIM command works in Linux – I assume it doesn’t, but I probably haven’t noticed because it fails more gracefully than Windows. If anyone has a solution, be sure to comment.

Concurrent Algorithms: The Next Frontier

Lately the clock speeds of processors in computers and mobile devices have reached a plateau. Clock speeds had been steadily increasing since the advent of the modern processor, alongside the prediction of Intel’s co-founder Gordon Moore (known as Moore’s Law) that the number of transistors in a CPU would keep doubling at a regular interval. Around 2005 this changed: processor manufacturers hit a wall. Not only was it impractical to keep increasing clock speeds from a power-consumption point of view – much more of an issue now than it was when Moore made his famous prediction – but manufacturers were also approaching the physical limits of how small transistors can be made before they fail to work reliably. Instead, CPU manufacturers have opted to increase the number of cores in a processor to improve processing power.


Intel CPU clock speed over time. (Source)

At first this seems like a valid solution, but when given more thought a problem becomes apparent: the algorithms and data structures intrinsic to software development – the ones taught at universities that power all the software we use – were designed in a time when the common computer had a single processor. Little thought was given to writing algorithms that perform well concurrently (with the possible exception of software written for expensive server and mainframe systems). With the number of cores squeezed onto a single processor increasing at a phenomenal rate – consider Tilera’s 64-core TILE64 and its 100-core successors – computer scientists need to start thinking outside the box to develop new versions of common algorithms and data structures that work and scale efficiently in a multiprocessor environment. By that I don’t mean just tacking fixes onto current algorithms and data structures to make them compatible with multiple processors; I mean rethinking the whole design from the ground up.

Although existing algorithms have been modified to scale on concurrent systems, it is rare to get a one-to-one speed-up between the number of cores and the performance of an algorithm. In other words, the overheads of concurrent memory access from different cores create bottlenecks, so more cores don’t necessarily mean more speed. This raises the question of whether the fundamental data structures and algorithms of computer science can ever scale effectively over multiple processors.

An interesting online lecture from MIT given by Nir Shavit, which I watched earlier in the year (and which has sadly since been made private for unknown reasons), showed me that it is indeed possible to break free of these bottlenecks if you are willing to think outside traditional methods. As an example he used a simple stack, one of the most basic, well-known data structures in computer science. A standard stack is a last-in first-out data structure, meaning that the data that has been in the stack for the least time will be the first to come out of it (think of a shelf at a supermarket). It is a simple data structure to implement: there is a pointer to the current data element, which is incremented when an item is added or ‘pushed’ and decremented when an item is taken out or ‘popped’, and a counter to keep track of the number of elements so you can’t pop when the stack is empty or push too many items into it. Simple as they are, accessing a classic stack from multiple processors concurrently is a surefire recipe for data corruption: as the order of operations is not deterministic, the results are unpredictable.


A basic diagram of a stack. (Source)

An easy way to fix this is to only let one processor access the stack at a time. Although this is 100 percent effective at preventing data corruption, it is very inefficient: while one processor is using the stack, the others do nothing but wait their turn. This also defeats the purpose of using multiple cores, as it turns the operations into a linear series of events rather than a concurrent one. There are a myriad of ways to improve the performance of this kind of concurrent stack, such as optimistic concurrency, but none of them achieves a one-to-one speed-up. In fact, the benefit of most of these methods shrinks as you add more threads, which prevents them from being truly scalable. Among the different methods, the asymmetric rendezvous stack mentioned in the MIT lecture – an example of which is outlined here – is the one that stands out.
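The single-lock approach described above takes only a few lines of Python (a sketch of the concept; a real implementation would live much closer to the metal):

```python
import threading

class LockedStack:
    """A last-in first-out stack guarded by one coarse-grained lock.

    Safe under concurrent access, but every operation serialises on
    the lock, so extra cores give no speed-up at all."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self):
        with self._lock:
            if not self._items:
                raise IndexError('pop from empty stack')
            return self._items.pop()
```

However many threads call `push` and `pop`, only one is ever inside the critical section; the rest block on the lock, which is exactly the linearisation that kills scalability.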

The rendezvous stack works on a simple concept: if a stack is a last-in first-out data type, and there are many processes trying to both pop and push at the same time, why not just hand the data directly from the pushers to the poppers (if you’ll excuse my terrible use of language)? To explain this with a practical example, imagine a warehouse where trucks drop off pallets of goods at random and other trucks come to take pallets away (also at random). As the last pallet put into the warehouse is at the front and easily accessible, it is picked up first and taken away. This system works if we only have a single delivery per hour, but what if hundreds of trucks arrive at once? The result is a bottleneck that could take days to clear, and all the while more trucks are arriving. Then let’s say the frustrated drivers have an idea: what if the trucks delivering the pallets transfer them directly to the trucks taking pallets away? With the middleman cut out, the trucks pair up and all work concurrently to transfer the pallets, and the bottleneck is cleared in no time at all. This is the basic principle of the rendezvous stack.

Not only is this an extremely simple idea, it increases the performance of the stack by an incredible amount. I copied a graph from the scientific paper I mentioned earlier and put it below as a reference. Shown in purple as a baseline is the standard Java stack implementation, which is flat as it cannot use more than one thread at once. FC-Parallel, FC-Single and ED-Tree all use different methods to let stacks operate concurrently and allow modest scaling as the number of threads increases. The definite stand-out, however, is the AdaptiveAR rendezvous method, which scales incredibly well with the number of threads. What is really impressive is that at times the improvement is super-linear: not only is there a one-to-one speed-up between the number of threads and the speed of the algorithm – an achievement in itself – but the algorithm actually works faster with more threads than with fewer. As you can see, after eight threads the rate of improvement actually increases, especially when using the sequential IDs mentioned in the paper. In this case increasing the number of cores in a CPU is a boon rather than a hindrance!


The speed of different concurrent stack implementations run on an UltraSPARC T2 Plus multicore machine. AdaptiveAR uses a rendezvous model.
Source: Fast and Scalable Rendezvousing. Yehuda Afek, Michael Hakimi and Adam Morrison. DISC 2011.

It might seem that this post is just a promotion for the work of Afek et al., but I am only using their work as an example of the kind of thinking needed to improve current single-core methodologies. Their innovations have brought the dated stack data structure into the multi-core era. The field of concurrency is only going to get more important as the number of cores in a CPU grows and people demand more processing power for new technologies with ever-increasing requirements. By encouraging people to think concurrently, rather than teaching the same methods that have been around for decades, universities could spearhead the transition to properly scalable algorithms for building the software of the modern era.

Packaging Latex Dependencies with Your .tex Files

A problem I’ve had for a while now is that when I move my Latex code to different computers or send it to colleagues, different Latex installations often have different sets of packages installed. This usually results in lots of missing-file errors and the slow process of discovering all the missing dependencies manually and installing them. Looking for a way to do this automatically, I did a quick web search which didn’t seem to turn up any results. One mailing list post, however, pointed me to a Latex switch called ‘-recorder’ which records the file operations performed by the Latex compiler and dumps them into a ‘.fls’ file. Not wanting to wade through this file looking for dependencies manually, I wrote a Python script called latex-packager.py, which you can find in my GitHub repository. If you run it with a ‘.tex’ file as a parameter, it will run through the following steps:

  1. Check if the current ‘.fls’ is valid (i.e. it exists and is up to date).
  2. If the ‘.fls’ file is invalid it will (optionally) run the latex compiler for you with the recorder switch to generate a new ‘.fls’ file.
  3. Create a temporary folder and copy in all the files used to compile your ‘.tex’ file.
  4. Put all the files into an archive file with the format of your choosing.

I was worried that the Latex compiler wouldn’t look in the working directory for all the dependencies, so I tested it by packaging one of my ‘.tex’ files and extracting the archive into an empty directory. When I compiled the document with the recorder switch, the ‘.fls’ file showed that the compiler was sourcing all the files from the current directory, which shows that the current directory is always the first place it looks for any file.

The next step will be to find some way to store all the dependencies in a subdirectory and reference it, so that the directory is not littered with so many files. As it stands, though, it is fine for quickly sending a ‘.tex’ file to someone without having to worry about whether they have all the dependencies needed to compile it.

Basic Usage: ./latex-packager.py tex_file output_file

There are numerous switches that you can use to customise the behaviour of latex packager:

-v: Print more verbose information while executing.
-f: Force the script to ignore non-critical errors (not recommended).
-[z | j | s | p]: Use gzip, bzip2, 7zip and zip compression formats respectively (default is gzip).
--latex <spec>: Specify the latex compiler to use and any custom switches (default is pdflatex).