Before we start optimizing the wc command-line tool we wrote, let’s first find a simple way to compare this implementation with the wc tool available on my Linux system (written in C). I created a file with 197,000 dummy lines (each line just over 80 characters long) and measured how long it takes each tool to count the number of lines.
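The post doesn’t show how the test file was produced; a throwaway script along the following lines would generate one (gen.hs and bigfile.txt below are placeholder names, not from the original):

```haskell
-- gen.hs: emit 197,000 dummy lines, each just over 80 characters long.
module Main where

import Control.Monad (replicateM_)

main :: IO ()
main = replicateM_ 197000 (putStrLn (unwords (replicate 14 "dummy")))
```

Running it with `runghc gen.hs > bigfile.txt` gives the same input file to feed to both tools.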
So the current Haskell implementation is 37x slower than the native C version. The first thing to note is that running the Haskell program uncompiled, through the interpreter, is not efficient at all. So let’s put together a simple Makefile and use ghc to compile the .hs file to a native executable. Here’s a possible Makefile:
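(A minimal sketch of what it could contain, assuming the source file is named wc.hs:)

```make
wc: wc.hs
	ghc -o wc wc.hs

clean:
	rm -f wc wc.hi wc.o
```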
So after a simple compilation, the results we’re now getting are:
Which puts us at 24x slower, which is already some progress with absolutely no code changes. Now, the GHC compiler also lets you pass a few optimization flags that can make the generated code quicker. Let’s use the basic -O2 optimization and see how much we can gain. We’ll actually add the compilation directive to the .hs file with the following line:
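In modern GHC spelling, that’s an OPTIONS_GHC pragma at the very top of the source file:

```haskell
{-# OPTIONS_GHC -O2 #-}
```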
Now after recompiling, here is the current performance of our tool:
Not much of a boost really, and it’s not a surprise: our program isn’t very complicated, so we can’t expect the compiler to be able to save time that easily. There are a few things to look at before we start profiling, and here’s a list of the usual suspects when it comes to bad performance in Haskell:
- String is painfully slow and is known for being 20x slower than a similar C implementation. The fix is to use the ByteString type, which is known to be only 2x slower than a similar C implementation.
- read and show are also known to perform badly; you should use the equivalent functions that operate on the ByteString datatype instead.
So let’s start by importing the required library for the ByteString datatype and switching to the functions that handle ByteString. Be aware that it can be a bit of a hassle to get your functions working with the ByteString data type, and quite a bit of work to get all the types lined up just right, but here’s what the wc implementation looks like now that it uses ByteStrings:
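The full listing isn’t shown, but a rough sketch of the line-counting path, built on Data.ByteString.Char8, could look like the following (the name countlines matches the cost centre that shows up in the profile later; the exact structure is my approximation):

```haskell
module Main where

import qualified Data.ByteString.Char8 as B

main :: IO ()
main = B.interact countlines

-- Count newlines in the whole input, then render the count followed by
-- the trailing newline mentioned above.
countlines :: B.ByteString -> B.ByteString
countlines input = B.pack (show (B.count '\n' input) ++ "\n")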
Just a few tweaks, really, in terms of the functions being used and how the new ByteString datatype is handled. I also realized that the tool wasn’t outputting a necessary newline at the end of the output, so I added that. After this small change, surprisingly enough, we’re now really close to the C implementation’s speed:
Now, we’re still 69% slower than the C implementation, which means we have room for improvement, but the interesting part is that we’re actually 72% faster than the C implementation when we have to calculate the number of lines, words and characters at the same time:
Now we’ve reached the point where, if we want our line counting to perform as well as the C implementation, we’re going to have to profile our Haskell program. To profile, we need to compile with a few additional flags:
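Concretely, assuming the Makefile sketched earlier, the build rule might become something like:

```make
wc: wc.hs
	ghc -prof -auto-all -o wc wc.hs
```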
We added -prof -auto-all to build with profiling enabled; -auto-all generates cost centres for all top-level functions, and you can read more about that in the GHC documentation on profiling. If running make wc again now fails, the error will typically be GHC complaining that one of the packages we depend on wasn’t built with its profiling libraries.
Just run the cabal install command like so for each package:
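For example, to rebuild the bytestring package with profiling support (old-style cabal; substitute whatever package the error actually names):

```
cabal install --reinstall --enable-library-profiling bytestring
```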
That will reinstall the package and make sure the required profiling information gets compiled. You can now run the tool again, this time telling the runtime to produce a profile:
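For example (bigfile.txt is just a placeholder for the test file; on newer GHC versions you may also need to compile with -rtsopts for the program to accept RTS flags):

```
./wc +RTS -p < bigfile.txt
```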
You’ll now have a nice wc.prof file to look at, which contains information like this (slightly reformatted to fit):
With the above we can now see that we’re spending roughly 50% of our time in countlines, parsing and counting, and the other 50% in the main function, most likely inside the interact call reading the input. Since we don’t have access to the inside of the interact function, what we can do is create cost centres within the countlines function and see if they identify something we weren’t expecting. So we’ll add this to our source:
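A sketch of what those annotations might look like, using GHC’s SCC pragmas on the individual steps of countlines (my reconstruction rather than the original source):

```haskell
countlines :: B.ByteString -> B.ByteString
countlines input =
  let n   = {-# SCC "count" #-} B.count '\n' input    -- count the newlines
      str = {-# SCC "show"  #-} show n                 -- Int -> String
      out = {-# SCC "pack"  #-} B.pack (str ++ "\n")   -- String -> ByteString
  in  out
```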
So when we recompile and run with the above cost centres, we can then get more detail from profiling:
We can see that we’re spending about 25% of our time in each of the places where we set those cost centres. The thing to do now would be to try to understand why it takes 25% of the time to do the pack call or the show call when they’re just converting an Int to a String and then to a ByteString, but I’ll stop here: I just wanted to show how to profile an existing tool, and that Haskell can in fact be as quick as C without any major changes, just by using the right types.
So we’ve pretty much optimized our wc implementation and the only thing left to do is a detailed comparison of the performance of our implementation vs the C implementation:
Counting lines in a file with 500,000 lines where all lines have 80 columns:
- Ref Implementation: 0.084s
- Our Implementation: 0.149s
Counting words in a file with 500,000 lines where all lines have 80 columns:
- Ref Implementation: 1.243s
- Our Implementation: 0.742s
Counting chars in a file with 500,000 lines where all lines have 80 columns:
- Ref Implementation: 1.224s
- Our Implementation: 0.116s