All Unkept
Posted in: Haskell  —  3 August 2009

Haskell string support

This is my suggestion about what needs to go into the Haskell Platform.

Consider the following extremely simple program:

    s = "λ"

    main = do
        writeFile "test.txt" s
        s2 <- readFile "test.txt"
        print (s == s2)

No prizes for guessing that the output of this program is not "True". It highlights an essential problem with the Haskell standard library: many of the functions provided by the Prelude, System.IO, System.Posix and other modules are broken by design and silently corrupt your data unless it is composed only of ASCII characters.

The problem is that these APIs use Strings for operating system calls (such as reading and writing files, reading environment variables, etc.). A String is a list of Unicode Chars, but none of the operating system calls have a clue what Unicode characters are; they work entirely with bytes, which are a completely different kind of thing. Result: your program breaks without warning if you don't happen to be using ASCII.
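
To make the difference between Chars and bytes concrete, here is a small sketch that uses only the base library. The names encodeChar and encodeString are hypothetical helpers written for this illustration, not standard library functions, and the encoder is a simplified UTF-8 encoder that does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Char (a Unicode code point) as UTF-8 bytes.
-- Simplified for illustration: surrogate code points are not rejected.
encodeChar :: Char -> [Word8]
encodeChar c
    | n < 0x80    = [fromIntegral n]
    | n < 0x800   = [0xC0 .|. top 6, cont 0]
    | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
    | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first byte.
    top k  = fromIntegral (n `shiftR` k)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

-- Hypothetical helper: encode a whole String.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

main :: IO ()
main = do
    print (map ord "λ")       -- one code point: [955]
    print (encodeString "λ")  -- two bytes:      [206,187]
```

The single Char 'λ' is one code point (955) but two bytes on disk, which is why an API that quietly treats each Char as one byte cannot round-trip it.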

And even worse, many libraries are built on the use of Strings and standard library functions, and they inherit these same problems, so as a user of those libraries, you can end up with problems that you can't even work around. For the library developer, too, it can be a very nasty problem — you start developing code using Strings, which works fine for ages, but a long time later you realise you can't support just ASCII, and really you need Data.ByteString, which requires changing function signatures or duplicating existing code if you don't want to break compatibility.

This is a rather embarrassing situation for the standard library of a modern language. What's worse is that even if you include the Haskell Platform as it currently stands, as far as I can see there is no solution to this bug — no correct way to simply write a string out to disk and read it back! I presume this is because there is no universally accepted library for dealing with encodings. Personally, I'd like to see the standard library change to remove the pretence that you can talk Unicode to the operating system, but at the very least we need a standardised way of doing the right thing, so that developers (of both programs and libraries) don't have to use those broken functions, and know what the correct alternatives are.

Comments §

§ On 3 August 2009, Don Stewart wrote:
The problem here of course is not the Char or String type, but the existing library IO functions.

To solve this there are several approaches:


    import Prelude hiding (writeFile, readFile, print)
    import System.IO.UTF8

    s = "λ"

    main = do
        writeFile "test.txt" s
        s2 <- readFile "test.txt"
        print (s == s2)

i.e. add utf8-string to the Haskell Platform.

Or wait for GHC 6.12 and use the new Unicode IO layer. Google for "Rewrite of the IO library, including Unicode support".
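
As a rough sketch of what that second option could look like, assuming GHC 6.12 or later, where System.IO exports hSetEncoding and utf8. The name roundTrip is a hypothetical helper chosen for this example, not part of any library.

```haskell
import System.IO

-- Hypothetical helper: write a String as UTF-8, read it back as UTF-8,
-- and report whether the two are equal.
roundTrip :: FilePath -> String -> IO Bool
roundTrip path s = do
    withFile path WriteMode $ \h -> do
        hSetEncoding h utf8          -- encode Chars as UTF-8 on output
        hPutStr h s
    s2 <- withFile path ReadMode $ \h -> do
        hSetEncoding h utf8          -- decode UTF-8 bytes on input
        contents <- hGetContents h
        -- hGetContents is lazy, so force the whole string
        -- before withFile closes the handle.
        length contents `seq` return contents
    return (s == s2)

main :: IO ()
main = roundTrip "test.txt" "λ" >>= print
```

Setting the encoding explicitly on each handle means the result does not depend on the locale of the machine running the program.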


§ On 4 August 2009, Johan Tibell wrote:
This is why I wrote network-bytestring. It solves the problem for socket I/O. There's an open ticket to merge network-bytestring into the network library. Unfortunately we can't easily remove the old String-based functions from network, as there's a lot of legacy code that depends on them. I suggest adding a `Network.Socket.ByteString` module and deprecating the String-based functions.

I can see two ways to work with Unicode I/O.

    1. Perform all I/O in terms of ByteStrings and use Data.Text.Encoding to do the translation.
    2. Have a stream wrapper on top of e.g. Handle that does the translation.
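
A minimal sketch of the first approach, assuming the bytestring and text packages are installed. The names writeUtf8 and readUtf8 are hypothetical helpers for this example, not library functions, and UTF-8 is an assumed choice of encoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Hypothetical helper: encode explicitly, then write raw bytes.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path = B.writeFile path . encodeUtf8 . T.pack

-- Hypothetical helper: read raw bytes, then decode explicitly.
-- Invalid UTF-8 is reported as a Left value instead of an exception.
readUtf8 :: FilePath -> IO (Either String String)
readUtf8 path = do
    bytes <- B.readFile path
    return $ case decodeUtf8' bytes of
        Left err   -> Left (show err)
        Right text -> Right (T.unpack text)

main :: IO ()
main = do
    writeUtf8 "test.txt" "λ"
    result <- readUtf8 "test.txt"
    print (result == Right "λ")
```

All I/O here happens in terms of bytes, and the translation between bytes and text is a separate, visible step, so a decoding failure shows up as a value the caller has to handle.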

§ On 4 August 2009, luke wrote:
@Don: Thanks for the comment. I use utf8-string already, which works very nicely, but I've also seen the 'encoding' package, which is much broader, and in terms of solving the problem that seems better. Should I stick with utf8-string for now?

§ On 4 August 2009, Don Stewart wrote:
It depends on what format you want to encode things in. I'd use utf8-string for now, then use the new encoded IO layer in 6.12:

http://www.haskell.org/pipermail/cvs-libraries/2009-June/010890.html
