Optimizing Haskell Programs

Before we start optimizing the wc command line tool we wrote lets first find a simple way to compare this implementation with the wc command tool available on my linux (written in C). I created a file with 197,000 dummy lines (with each line just over 80 characters long) and measured how long it takes to count the number of lines, with each tool:

time bash -c "cat test | wc -l" 197000 bash -c "cat test | wc -l" 0.00s user 0.02s system 65% cpu 0.043 total time bash -c "cat test | ./wc.hs -l" 197000 bash -c "cat test | ./wc.hs -l" 1.46s user 0.07s system 95% cpu 1.590 total

So the current Haskell implementation is 37x slower than the native C version. The first thing to note is how running the haskell program without compiling is not efficient at all. So lets put together a simple make file and use ghci to compile the .hs file to a native executable. Here’s a possible make file:

init:
    mkdir -p build
    cp *.hs build

wc: wc.hs init
    cd build ; ghc --make wc.hs -o wc

clean:
    rm -fr build *.o *.hi

So after a simple compilation the results we’re now getting are:

time bash -c "cat test | build/wc -l" 197000 bash -c "cat test | build/wc -l" 0.96s user 0.04s system 95% cpu 1.049 total

Which puts at at 24x slower which is already some progress with absolutely no code changes. Now the ghc compiler also allows you to use a few optimizing flags that can help make the output code quicker. Lets use the basic -O2 optimization and see how much we can gain. We’ll actually add the compilation directive to the .hs file with the following line:

#! /usr/bin/env runhaskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# OPTIONS_GHC -O2 #-}
..._removed other code from here

Now after recompiling, here is the current performance of our tool:

time bash -c "cat test | build/wc -l" 197000 bash -c "cat test | build/wc -l" 0.96s user 0.04s system 95% cpu 1.049 total

Not much of a boost really and its not a surprise since our program isn’t very complicated we can’t expect the compiler to be able to save time that easily. There are a few things to look at before we start profiling and here’s a list of the usual suspects when it comes to bad performance in Haskell:

String is painfully slow and is known for being 20x slower than a similar C implementation. The fix is to use the ByteString type which is known to only be 2x slower than a similar C implementation.
read and show are known to also perform badly and you should use the same functions that manipulate the ByteString datatype.

So lets start by importing the required library to use the ByteString datatype and using all of the functions that handle ByteString. Be aware that it can be a bit of hassle to get your functions working with the ByteString data type and quite a bit of work to get all the types lined up just write, but here’s what the wc command implementation looks like that now uses ByteStrings:

#! /usr/bin/env runhaskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# OPTIONS_GHC -O2 #-}

module Main (main) where

import System....
import Control.Arrow
import qualified Data.ByteString.Char8 as C

data WC = WC {chars :: Bool, lines_ :: Bool, words_ :: Bool}
    deriving (Show, Data, Typeable)

wc = WC { chars = def &= name "m" &= help "print the byte counts",
          lines_ = def &= help "print the character counts",
          words_ = def &= help "print the word counts" }
        &= help ("Print newline, word, and byte counts for each FILE, " ++
                 "and a total line if more than one FILE is specified." ++
                 " With no FILE, or when FILE is -, read standard " ++
                 "input.  A word is a non-zero-length sequence of " ++
                 "characters delimited by white space.")
        &= summary "wc v0.0.1, (C) Rodney Gomes"

countwords = C.pack . show . length . C.words
countlines = C.pack . show . length . C.lines
countchars = C.pack . show . C.length

space = C.pack " "
flat (a,(b,c)) = C.concat( a : space : b : space : c : [] )

addnl x = C.concat( x : (C.pack "\n") : [])
optionHandler WC{chars=True} = addnl . countchars
optionHandler WC{lines_=True} = addnl . countlines
optionHandler WC{words_=True} = addnl . countwords
optionHandler _ = addnl . flat . ( countlines &&& countwords &&& countchars )

main = cmdArgs wc >>= C.interact . optionHandler

Just a few tweaks really in terms of the functions being used and how to handle the new ByteString datatype. I also realized that the tool wasn’t outputting a necessary newline at the end of the output so I added that. After this small change surprisingly enough we’re now really close to the C implementation speed:

time bash -c "cat test | build/wc -l" 197000 bash -c "cat test | build/wc -l" 0.02s user 0.04s system 77% cpu 0.073 total

Now we’re still 69% slower than the C implementation which means we have room for improvement, but the interesting part is we’re actually 72% faster than the C implementation when we have to calculate the number of lines, words and characters at the same time:

time bash -c "cat test | wc" 197000 3743000 19306000 bash -c "cat test | wc" 0.57s user 0.02s system 95% cpu 0.613 total time bash -c "cat test | build/wc" 197000 3743000 19306000 bash -c "cat test | build/wc" 0.30s user 0.04s system 94% cpu 0.357 total

Now we’ve reached the point where if we want to get our counting of lines to perform as well as the C implementation we’re going to have to profile our Haskell program. To profile we need to compile with a few additional flags:

wc: wc.hs init
    cd build ; ghc -prof -auto-all -rtsopts --make wc.hs -o wc

We added the -prof -auto-all to build with profiling enabled, the -auto-all generates cost centres for all top level functions, you can read more about that in the Haskell documentation on profiling. When you try to run the make wc again if you get something like so:

wc.hs:7:8: Could not find module `System....': Perhaps you haven't installed the profiling libraries for package `cmdargs-0.9'? Use -v to see a list of the files searched for.

Just run the cabal install command like so for each package:

cabal install --reinstall -p cmdargs

That will reinstall the package and make sure to compile the required profiling information. You can now run the same command like so:

wc -l +RTS -p -RTS

You’l now have a nice wc.prof file to look at which contains information like this in it (slightly reformatted to fit):

Mon Nov 14 19:05 2011 Time and Allocation Profiling Report (Final) wc +RTS -p -RTS -l total time = 0.04 secs (2 ticks @ 20 ms) total alloc = 13,593,584 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc main Main 50.0 1.3 countlines Main 50.0 98.5 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 358 3 50.0 1.3 100.0 99.9 optionHandler Main 361 1 0.0 0.0 50.0 98.5 addnl Main 363 1 0.0 0.0 0.0 0.0 countlines Main 362 2 50.0 98.5 50.0 98.5 wc Main 360 0 0.0 0.0 0.0 0.0 CAF Main 352 33 0.0 0.0 0.0 0.0 addnl Main 364 0 0.0 0.0 0.0 0.0 wc Main 359 1 0.0 0.0 0.0 0.0 CAF Data.Typeable 350 5 0.0 0.0 0.0 0.0 CAF GHC.Show 348 1 0.0 0.0 0.0 0.0 CAF Data.HashTable 290 3 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 288 3 0.0 0.0 0.0 0.0 CAF GHC.IO.FD 272 4 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.Internals 252 1 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding.Iconv 246 2 0.0 0.0 0.0 0.0 CAF GHC.Conc.Signal 243 1 0.0 0.0 0.0 0.0 CAF Data.Data 227 3 0.0 0.0 0.0 0.0 CAF System.....Implicit.Global 226 3 0.0 0.0 0.0 0.0 CAF System.....Implicit.Reader 223 1 0.0 0.0 0.0 0.0 CAF Data.Generics.Any.Prelude 222 2 0.0 0.0 0.0 0.0 CAF System.....Explicit 209 6 0.0 0.0 0.0 0.0 CAF System.....Explicit.Type 202 1 0.0 0.0 0.0 0.0 CAF System.....Explicit.Help 189 1 0.0 0.0 0.0 0.0 CAF System.....Implicit.Ann 187 4 0.0 0.0 0.0 0.0 CAF System.....Annotate 186 1 0.0 0.0 0.0 0.0 CAF Data.ByteString.Char8 184 1 0.0 0.0 0.0 0.0

With the above we can now see that we’re spending 50% of our time in countlines and the other 50% in the main function which is most likely in the interact function reading the input and 50% of the time parsing and counting in the countlines. Since we don’t have access to the interact function what we can do is create cost centres for the countlines function and see if it identifies something we weren’t expect. So we’ll add this to our source:

countwords = C.pack . show . length . C.words
countlines = {-# SCC "C.pack" #-} C.pack . {-# SCC "show" #-} show . \
             {-# SCC "length" #-} length . {-# SCC "C.lines" #-} C.lines
countchars = C.pack . show . C.length

So when we recompile and run with the above cost centres we can then get more detail from profiling:

COST CENTRE MODULE no. entries %time %alloc %time %alloc$ $ MAIN MAIN 1 0 75.0 0.5 100.0 100.0$ C.pack Main 358 1 0.0 0.0 25.0 99.4$ show Main 359 1 0.0 0.0 25.0 99.4$ length Main 360 1 0.0 0.0 25.0 99.4$ C.lines Main 361 1 25.0 99.4 25.0 99.4$

We can see that we’re spending about 25% of our time each of the places we set those cost centres. The thing to do now is try to understand why it takes 25% of the time to do the pack call or the show call when they’re just converting an Int to String and then to a ByteString. I just wanted to show how to profile an existing tool and that Haskell can in fact be as quick as C without having to any major changes and using the right types.

So we’ve pretty much optimized our wc implementation and the only thing left to do is a detailed comparison of the performance of our implementation vs the C implementation:

Counting lines in a file with 500,000 lines where all lines have 80 columns:

Ref Implementation: 0.084s
Our Implementation: 0.149s

Counting words in a file with 500,000 lines where all lines have 80 columns:

Ref Implementation: 1.243s
Our Implementation: 0.742s

Counting chars in a file with 500,000 lines where all lines have 80 columns:

Ref Implementation: 1.224s
Our Implementation: 0.116s