Breaking “provably correct” Leftpad

I don’t know much about formal methods, so a while back I read Hillel Wayne’s post Let’s prove Leftpad with interest. However, I thought I’d take a peek and do some testing on these Leftpad implementations that are all “provably correct”.

Methodology

I’ll pick a few simple, perfectly ordinary inputs at random and work out what I think the output should be. This is a pretty trivial problem, so I’m expecting that all the implementations will match my output. [narrator: He is expecting no such thing]

I’m also expecting that, even if for some reason I’ve made a mistake, all the implementations will at least match each other. [narrator: More lies] They’ve all been proved correct, right?

Here are my inputs and expected outputs. I’m going to pad to a length of 10, and use - as padding so it can be seen and counted more easily than spaces.

Item  Input   Length  Expected padding  Expected Output
1     𝄞       1       9                 ---------𝄞
2     Å       1       9                 ---------Å
3     𓀾      1       9                 ---------𓀾
4     אֳֽ֑       1       9                 ---------אֳֽ֑
5     résumé  6       4                 ----résumé
6     résumé  6       4                 ----résumé

[“ordinary”, “random” - I think my work here is done…]

I’ve used a monospace font so that the right-hand sides of the outputs all line up as you’d expect.

Entry 6 is not a mistake, by the way, it just does “e acute” in a different way to entry 5. Nothing to see here, move along…

Implementations

Not all of the implementations were that easy to run. In fact, many of them didn’t take any kind of “string” as input, but vectors or lists or such things, and it wasn’t obvious to me how to pass strings in. So I discounted them.

For the ones I could run, I attempted to do so by embedding the test inputs directly in the program, if possible.

Liquid Haskell

Embedding the characters directly in Haskell source code kept getting me “lexical error in string/character literal”, so I wrote a small driver program that read from a file.

Java

The leftpad function provided didn’t take a string, but a char[]. Thankfully, it’s easy to convert from String objects, using the .toCharArray() function. So I did that.

Lean4

There is a handy online playground, and the implementation had a helpful #eval block that I could modify to get output. You can play with it here.

Rust

The code here had loads of extra lines regarding specs etc., which I stripped so I could run it easily; after that it worked fine.

More tricky was that the code didn’t take a string, but some Vec<Thingy>. As I know nothing about Rust, I got ChatGPT to tell me how to convert from a string to that. It gave me two options; I picked the one that looked simpler and less <<angry>>. I didn’t deliberately pick the one which made Rust look even worse than all the others, out of peevish resentment for every time someone has rewritten some Python code (my go-to language) in Rust and made it a million times faster – that’s ridiculous.

Some competition!

To make things interesting, let’s compare these provably correct implementations with one vibe-coded by ChatGPT, in some random language, like, say, um, Swift. It gave me this code:

import Foundation

func leftPad(_ string: String, length: Int, pad: Character = " ") -> String {
    // string.count counts user-perceived characters (extended grapheme clusters)
    let paddingCount = max(0, length - string.count)
    return String(repeating: pad, count: paddingCount) + string
}

You can play with it online here.
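
As a quick sanity check of my own (matching the results table below), both spellings of résumé come out the same, because Swift’s .count works in user-perceived characters:

print(leftPad("résumé", length: 10, pad: "-"))                 // ----résumé (precomposed é)
print(leftPad("re\u{0301}sume\u{0301}", length: 10, pad: "-")) // ----résumé (combining accents)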

Results

Here are the results, green for correct and red for … less correct.

Input   Reference   Java        Haskell     Lean        Rust       Swift
𝄞       ---------𝄞  --------𝄞   ---------𝄞  ---------𝄞  ------𝄞    ---------𝄞
Å       ---------Å  --------Å   --------Å   --------Å   -------Å   ---------Å
𓀾      ---------𓀾 --------𓀾  ---------𓀾 ---------𓀾 ------𓀾   ---------𓀾
אֳֽ֑       ---------אֳֽ֑  ------אֳֽ֑     ------אֳֽ֑     ------אֳֽ֑     --אֳֽ֑       ---------אֳֽ֑
résumé  ----résumé  ----résumé  ----résumé  ----résumé  --résumé   ----résumé
résumé  ----résumé  --résumé    --résumé    --résumé    résumé     ----résumé

And pivoted the other way around so you can compare individual inputs more easily:

Language   𝄞           Å           𓀾          אֳֽ֑           résumé      résumé
Reference  ---------𝄞  ---------Å  ---------𓀾  ---------אֳֽ֑  ----résumé  ----résumé
Java       --------𝄞   --------Å   --------𓀾   ------אֳֽ֑     ----résumé  --résumé
Haskell    ---------𝄞  --------Å   ---------𓀾  ------אֳֽ֑     ----résumé  --résumé
Lean       ---------𝄞  --------Å   ---------𓀾  ------אֳֽ֑     ----résumé  --résumé
Rust       ------𝄞     -------Å    ------𓀾    --אֳֽ֑         --résumé    résumé
Swift      ---------𝄞  ---------Å  ---------𓀾  ---------אֳֽ֑  ----résumé  ----résumé

Comments

Rust, as expected, gets nul points. What can I say?

Vibe-coding with Swift: 💯

Other than that, we can see:

  • The only item that all implementations (apart from Rust) get correct is entry 5, the first of the two résumé options.

  • Java is mostly consistent with the others, but it appears it doesn’t like musical notation, or Egyptian hieroglyphics (item 3 is “standing mummy”), which seems a little rude.

The score so far:

  • Fancy-pants languages and formal verification: 0

  • Vibe-coding it with ChatGPT: 1

Explanation

OK, I’ve had my fun now :-)

(The original “Let’s Prove Leftpad” project was done “because it is funny”, and this post is in the same spirit. I want to be especially clear that I’m not actually a fan of vibe-coding.)

What’s actually going on here? There are two main issues, both tied up with the concept of “the length of a string”.

(If you already know enough about Unicode, or don’t care about the details, you can skip to the “What went wrong?” section to continue the discussion of formal verification.)

First:

What is a character?

Strings are composed of “characters”, but what are they?

Most modern computer languages, and all the ones I included above, use Unicode as the basis for answering this. Unicode, at its heart, is a list of “code points” (although it is a bit more than this). A code point, however, is not exactly a character.

Many of the code points you use most often, like Latin Capital Letter A U+0041, are exactly what you think of as a character. But many are not. These include:

  • combining characters, which are used to add accents and marks to other characters. These have some complexity:

    • First, whether you regard the accent as part of the character or not can be a matter of debate. For example, you might think of é, “e acute”, as a different letter to e, or as an e plus an acute accent. In some languages this distinction is critical: in Turkish, for example, ç is not just “c with some decoration”, it’s an entirely distinct letter.

    • Second, Unicode often has multiple ways of generating these accented characters, either “pre-composed” as a single code point, or as a combination of multiple code points. So, there is both:

      • é: Latin Small Letter E With Acute U+00E9

      • é: Latin Small Letter E U+0065 + Combining Acute Accent U+0301

  • various spacing characters

  • control characters such as bidi isolation characters.

  • and probably more

So, Unicode has another concept, called the “extended grapheme cluster”, or “user-perceived character”, which more closely maps to what you might think of as a “character”. That’s the concept I’m implicitly using for my claim of what leftpad should output.
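
To make the é example concrete, here is a small Swift sketch of my own (Swift is convenient here because it exposes both views on the same string):

let precomposed = "\u{00E9}"        // é as a single code point
let decomposed = "\u{0065}\u{0301}" // e + combining acute accent

print(precomposed.count, decomposed.count)   // 1 1  (both are one extended grapheme cluster)
print(precomposed.unicodeScalars.count,
      decomposed.unicodeScalars.count)       // 1 2  (different numbers of code points)
print(precomposed == decomposed)             // true (Swift compares canonical equivalence)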

Secondly, there is the question of:

How does a programming language handle strings?

Different languages have different fundamental answers to the question of “what is a string?”, different internal representations of them (how the data is actually stored), and different ways of exposing strings to the programmer.

Some languages, especially performance-oriented ones, provide little to no insulation from the internal representation, while others provide a fairly strong abstraction. Some languages, like Haskell, provide multiple string types (which can be used with string literals in your code via the OverloadedStrings extension).

At this point, as well as “code points”, we’ve got to consider “encodings”. If you have a code point, even a “simple” one like U+0041 (A), you have said nothing about how you are going to store or transmit that data. An encoding is a system for doing that. Two relevant ones here:

  • UTF-8 is probably the most common/popular one. It uses anything from 1 to 4 bytes to express each code point. It has lots of useful properties, but an important one is backwards compatibility with ASCII – if you happen to be limited to the 128 characters found in ASCII, then UTF-8 encoded Unicode text is one byte per code point, and is byte-for-byte the same as ASCII.

  • UTF-16 is an encoding where most code points (specifically those in the Basic Multilingual Plane or BMP) take up 2 bytes, and the remainder can be specified using 4 bytes and a system called “surrogate pairs”.

    UTF-16 exists because originally Unicode had less than 65,536 code points, meaning you could represent all points with just two bytes, and it was thought we would never need more.
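
Here is a quick Swift illustration of both encodings (the counts are in code units, so the UTF-16 counts are 16-bit units, not bytes):

let a = "A"            // U+0041, within the ASCII range
let clef = "\u{1D11E}" // 𝄞, outside the BMP

print(a.utf8.count, a.utf16.count)       // 1 1  (one byte; one 16-bit unit)
print(clef.utf8.count, clef.utf16.count) // 4 2  (four bytes; a surrogate pair)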

In terms of languages today, with some simplification we can say the following:

  • In Haskell, Lean, Python, and many other languages, strings are sequences of Unicode code points.

    There is little attempt to hide the idea of code points from you, although in some, like Python, the internal representation itself might be hidden (see PEP 393).

  • In Java and JavaScript, strings are sequences of UTF-16 code units.

    Originally strings were intended to be sequences of code points, but when Unicode grew, to avoid breaking all existing Java code, it morphed into the UTF-16 compromise.

  • In Rust, strings are UTF-8 encoded Unicode code points (see String).

    Rust does also have an easily accessible “char” concept, which corresponds to a Unicode code point. I didn’t use this above - Rust would have behaved the same as Haskell/Lean if I had.

  • In Swift, strings are sequences of user-perceived characters.

    I’m not a user of Swift, but from the docs it appears to do a pretty good job of abstracting away from the “sequence of code points” paradigm. For example, iterating over a string gets you the “characters” (i.e. extended grapheme clusters) one by one, and the .count property gives you the number of such characters. Other views are available too, such as .unicodeScalars, whose .count gives you the number of code points.
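
Conveniently, Swift exposes all of these views on a single string, so you can reproduce each language’s notion of “length” in one place. A sketch using my reconstruction of the Hebrew input (the exact choice of marks is an assumption, but any Hebrew marks give the same counts):

let hebrew = "\u{05D0}\u{05B3}\u{05BD}\u{0591}" // Aleph + vowel marks + cantillation mark

print(hebrew.count)                // 1  user-perceived characters: Swift
print(hebrew.unicodeScalars.count) // 4  code points: Haskell, Lean, Python
print(hebrew.utf16.count)          // 4  UTF-16 code units: Java, JavaScript
print(hebrew.utf8.count)           // 8  UTF-8 bytes: Rust (as I used it)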

Putting them together

These differences, between them, explain the differences in output above. In more detail:

  • the treble clef 𝄞 is a single code point, but it is outside the BMP and requires 2 UTF-16 code units. So Java considers the length to be 2, and only 8 padding characters were added, in contrast to the other languages.

  • I created the Å using two code points, A plus Combining Ring Above U+030A (although there is a pre-composed version of this character).

  • The אֳֽ֑ is Hebrew, and is composed of an Aleph followed by multiple vowel marks and a cantillation mark, bringing it up to 4 code points.

  • résumé was spelled in two different ways, one with precomposed é which is one code point, the other with combining characters where each é requires two code points.

  • None of the inputs was wholly in the ASCII range, so encoding them as UTF-8 requires more bytes, which is why Rust (as I used it) behaved as it did.
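
One way to see all four behaviours side by side (a sketch of my own, not taken from any of the verified implementations) is to parameterise leftpad over what “length” means:

func leftPadBy(_ string: String, length: Int, pad: Character,
               count: (String) -> Int) -> String {
    // The same algorithm as before, but with the notion of "length" swapped out.
    return String(repeating: pad, count: max(0, length - count(string))) + string
}

let input = "re\u{0301}sume\u{0301}" // résumé with combining accents

print(leftPadBy(input, length: 10, pad: "-") { $0.count })                // ----résumé (Swift-style)
print(leftPadBy(input, length: 10, pad: "-") { $0.unicodeScalars.count }) // --résumé (Haskell/Lean)
print(leftPadBy(input, length: 10, pad: "-") { $0.utf16.count })          // --résumé (Java)
print(leftPadBy(input, length: 10, pad: "-") { $0.utf8.count })           // résumé (Rust, as I used it)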

What went wrong?

For me, the biggest issue is not the “code points” vs “characters” debate, which is responsible for most of the variation shown, but the issue that resulted in the difference in the Java output, i.e. UTF-16. All of the others (if I hadn’t stitched up Rust) would at least have resulted in the same output.

Apparently, nothing in the process of doing the formal verification forced the implementations to converge, and I think it is pretty fair to conclude that at least one of the implementations must be faulty, given that they produce different output.

So what went wrong?

Lies, damned lies and natural language

English (or any natural language) is at the heart of the problem here. We should start with the phrase “provably correct”. It has a technical definition, but I’m not convinced those English words help us. The post accompanying the LiquidHaskell entry for Leftpad puts it this way:

My eyes roll whenever I read the phrase “proved X (a function, a program) correct”.

There is no such thing as “correct”.

There are only “specifications” or “properties”, and proofs that ensure that your code matches those specifications or properties.

For these reasons, I think I’d prefer talking about “formally verified” functions – it at least prompts you to ask “what does that mean”, and maybe suggests that you should be thinking about what, specifically, has been verified.

The next bit of English to trip us up is “the length of a string”. It’s extremely easy to imagine this is an easy concept, but in the presence of Unicode it really isn’t.

Hillel’s original informal requirements don’t actually use that phrase; instead they use the pseudo-code max(n, len(str)). Looking at the implementations, it appears people have subconsciously interpreted this as “the length of the string”, and then assumed that functions like length or size that their language provides do “the right thing”.

We could conclude that this is in fact a problem with informal requirements – it was at the level of interpreting those requirements that things went wrong. Therefore, we need more formal specification and verification, not less. But I don’t think this gets round the fact that we’ve got to translate from informal to formal at some point, and at that point you’ve got trouble.

What is correct?

The issue I haven’t addressed so far is whether my reference output and the Swift implementation are actually “correct”. The reality is that you can make arguments for different things.

Implicitly, I’m arguing that left pad should be used for visual alignment in a fixed width context, and the implementation that does the best at that is the best one. I think that is a pretty reasonable case. But I’m sure you could make a case for other output – there isn’t actually anything that says what left pad should be used for. It’s possible that there are use cases where “the language’s underlying concept of the length of a string, whatever that may be” is the most important thing.

In addition, I was hiding the fact that “fixed width” is yet another lie:

I was originally going to use a flag character like 🏴󠁧󠁢󠁥󠁮󠁧󠁿 as one of my inputs, which is a single “extended grapheme cluster” that uses no fewer than 7 code points. It also results in 14 UTF-16 units in Java. The problem was that this character, like most emoji and many other characters from wide scripts like Chinese, takes up double width even in a monospace font.

To maintain the subterfuge of “look how these all line up neatly in the correct output”, I was forced to use other examples. In other words, the example use case I was relying on to prove that these leftpad implementations were broken is itself a broken concept. But I would still maintain that my reference output is closer to what you would expect leftpad to do.
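
You can verify those counts in Swift (I’m assuming the England flag tag sequence here, which is what the numbers above imply):

let flag = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}" // 🏴󠁧󠁢󠁥󠁮󠁧󠁿

print(flag.count)                // 1   a single user-perceived character
print(flag.unicodeScalars.count) // 7   waving black flag + five tag letters + cancel tag
print(flag.utf16.count)          // 14  every one of those code points needs a surrogate pair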

A big point is this: even if we argue that a given implementation is “correct” (in that it does what its specifications say it does), that doesn’t mean you are using it correctly. Are you using it for its intended purpose and context? That seems like a really hard question to answer even for leftpad, and many other real-world functions are similar.

So, I’m not sure what my final conclusion is, other than “programming is hard, let’s go shopping… let’s eat chocolate” (the alternative suggested by my wife; that’s my plan for the evening then).

Confessions and corrections

  • The Swift implementation was indeed written by ChatGPT, and it got it right first time, with just the prompt “Implement leftpad in Swift”. However:

    • Swift is the only language I know where an implementation that does what I wanted it to do is that simple.

    • I followed up by getting ChatGPT to produce a Python version, and it had all the same problems as Haskell/Lean and similar.

    • I noticed that Swift doesn’t calculate string lengths the way my use case would need for some characters, such as Zero Width Space U+200B and Right-To-Left Isolate U+2067, which I would want to count as zero width (see the quick check after this list).

  • As mentioned, the other way to use the Rust version has the same behaviour as the Haskell/Lean/etc versions. ChatGPT did actually point out to me that this way was better, and the other way (the one I used) was adequate only if the input was limited to ASCII.

    I do feel slightly justified in my treatment of this solution however - the provided code didn’t give a function that actually took a string, nor did it force you to use it in a specific way. It dodged the hard part, so I punished it.
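
A quick check of the zero-width point from above (just the simple standalone cases):

let zws = "\u{200B}" // Zero Width Space
let rli = "\u{2067}" // Right-To-Left Isolate

print(zws.count, rli.count) // 1 1 - each counts as a character despite occupying no visual width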

This is my personal blog, and does not necessarily reflect the opinions of my clients/employer or my church.
