All Unkept
Posted in: Haskell  —  August 03, 2009 at 09:58 PM

Haskell string support

by Luke Plant

This is my suggestion about what needs to go into the Haskell Platform.

Consider the following extremely simple program:

s = "λ"

main = do
    writeFile "test.txt" s
    s2 <- readFile "test.txt"
    print (s == s2)

No prizes for guessing that the output of this program is not "True". It highlights an essential problem with the Haskell standard library — many of the functions provided by the Prelude, System.IO, System.Posix and many others are completely broken (by design) and silently corrupt your data, unless it is composed only of ASCII characters.

The problem is that these APIs use Strings for operating system calls (such as reading/writing files, reading environment variables, etc.). A String is a list of Unicode Chars, but none of the operating system calls have a clue what Unicode characters are: they work entirely with bytes, which are a completely different kind of thing. Result: your program breaks without warning if you don't happen to be using ASCII.
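To make the Char/byte distinction concrete, here is a minimal sketch that turns a String into the bytes a UTF-8 encoding would produce. The function name encodeCharUtf8 is made up for illustration and is not part of any standard library; the encoder is hand-rolled purely to show the idea, and it makes no attempt to reject invalid code points such as surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode a single Char as UTF-8 bytes.
-- Illustration only; surrogate code points are not rejected.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
    | n < 0x80    = [fromIntegral n]
    | n < 0x800   = [0xC0 .|. hi 6, cont 0]
    | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
    | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, for the first byte.
    hi k = fromIntegral (n `shiftR` k)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

main :: IO ()
main = do
    let s = "λ"
    print (length s)                    -- 1: the String holds one Char
    print (concatMap encodeCharUtf8 s)  -- [206,187]: two bytes in UTF-8
```

One Char, two bytes: any API that takes a String and hands "bytes" to the operating system has to choose an encoding somewhere, whether or not it tells you which one.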

Even worse, many libraries are built on Strings and the standard library functions, and they inherit these same problems, so as a user of those libraries you can end up with problems that you can't even work around. For the library developer, too, it can be a very nasty problem: you start developing code using Strings, which works fine for ages, but a long time later you realise that supporting only ASCII is not enough, and that you really need Data.ByteString. That means changing function signatures, or duplicating existing code if you don't want to break compatibility.

This is a rather embarrassing situation for the standard library of a modern language. What's worse is that even if you include the Haskell Platform as it currently stands, as far as I can see there is no solution to this bug — no correct way to simply write a string out to disk and read it back! I presume this is because there is no universally accepted library for dealing with encodings. Personally, I'd like to see the standard library change to remove the pretence that you can talk Unicode to the operating system, but at the very least we need a standardised way of doing the right thing, so that developers (of both programs and libraries) don't have to use those broken functions, and know what the correct alternatives are.
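For illustration, here is a rough sketch of what "doing the right thing" could look like: the encoding is named explicitly at the point where the String meets the file. It assumes a version of base that provides hSetEncoding and utf8 in System.IO, which appeared in GHC releases after this post was written. The helper names writeFileUtf8 and readFileUtf8 are invented for this example and are not a proposal from the post.

```haskell
import System.IO

-- Hypothetical helper: write a String to a file as UTF-8.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path str = withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h str

-- Hypothetical helper: read a UTF-8 file back into a String.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- hGetContents is lazy, so force the whole String before
    -- withFile closes the handle.
    length contents `seq` return contents

main :: IO ()
main = do
    let s = "λ"
    writeFileUtf8 "test.txt" s
    s2 <- readFileUtf8 "test.txt"
    print (s == s2)
```

With the encoding fixed on both sides, the round trip no longer depends on whatever the runtime or the environment happens to assume.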
