lukeplant.me.uk

# Zero-based indexing in the real world

Posted in: Python, Software development

It’s often said that the zero-based indexing common in programming languages is a completely “unnatural” (and therefore wrong) way to do indexing. If you count (and label) a list of things in the real world, who in their right mind would start with zero?

This post describes a place in “the real world” where zero-based indexing makes perfect sense, whether or not it is the most obvious thing to do initially. This may be helpful to someone new to programming, when it seems that this zero-based indexing has been deliberately invented just to make like difficult.

Suppose you sell books (hi Stephen!), and you need to store them.

You have shelves that can store 100 books. If you need more than that, you can get more shelves - you can stack 10 shelves on top of each other into an aisle. If you get more than 1000 books, then you can have multiple aisles. Since you want to keep things organised, you keep going with the whole 10 thing - so you have 10 aisles to a room, 10 rooms to a floor, 10 floors to a building etc.

Suppose you haven’t got that far yet, and you’ve just got a single room of aisles and shelves, and can store up to 10,000 books.

You “natually” decided to start labelling things at 1. The first book is book 1, the first shelf is shelf 1, the first aisle is aisle 1.

Now, someone orders book 8650, and you need to find it. Where is it? It’s on aisle 9, shelf 7. Hmm, not so obvious. But it’s fairly simple to add 1 to each digit, right? So book 8699 is also on aisle 9, shelf 7. And book 8700 is on — oh, actually, that is also on aisle 9, shelf 7. Book 8701 is the first one on aisle 9 shelf 8. So the add 1 thing doesn’t really work. The rules are:

For book XYZZ: Add 1 to X to get the aisle, add 1 to Y to get the shelf. And if ZZ is 0, subtract 1 from the shelf number you calculated. No wait, what about book 1000? - that would give me aisle 2, shelf 0, which doesn’t exist. OK, try again…

No, let’s start again.

You think about things slightly longer, and you decide to start labelling things at 0. The first book is 0, the first shelf is 0, the first aisle is 0.

Now, someone orders book 8650, and you need to find it. Where is it? It’s on aisle 8, shelf 6 - just like every book labelled 86ZZ. That’s it.

So, it turns out that the most “natural” way to number things ended up being a pain in the neck.

In fact, if I was given the task of indexing things in a library or book store etc., I’m fairly certain that I would consider starting at zero to be the most natural and obvious choice, because of my experience with computers. It might not be the most obvious choice for someone else, or the easiest to learn initially, but it is certainly easier to use long term (at least in my hypothetical book store).

From the analogy, when would it make sense to start from zero when labelling things?

• If you want to label things with numbers.
• If the length of the numbers (expressed in a certain base) is constrained.
• If you have a lot of things, and want to organise them into groups.

These are things that are true of computer memory. There are lots of pieces of data in a computer that we need to be able to reference. For practical reasons, computers tend to use base 2, not base 10, so there are constraints on the number of binary digits rather than the number of decimal digits. A 32-bit computer can’t (easily) address a number that requires more than 32 binary digits to express.

So computer memory is best labelled starting at zero. But it’s not just memory. To organize the amount of data that computers can process, we often store the size of the data in the data itself, and include this when we pass it around - for example when we send it over a network, or store it in a file. So, a file, or a section of a file, might have a header at the beginning that contains 4 bytes (a 32 bit number) that indicates how long that file or section is, or might contain an offset indicating where another piece of data is.

So, when you deal with computer data, you are often having to deal with labels for the data, and numbers that represent offsets into data etc. To avoid all the adding 1 etc., and errors or head-aches associated with that, the zero-based indexing makes most sense.

Now, there are many problems in programming where the constraints of the underlying architecture do not show through. Very often, the upper limits are so far away that they are of no concern, and the way things are organized in memory is of no concern, and you aren’t dealing with numbers that are constrained to be 32-bits, for example.

But that leads to a dilemma. Should you use the more “natural” 1-based indexing for those problems, and stick to zero-based indexing for the others? Doing so can cause huge problems. Having more than one convention is very confusing. And often, code can be general enough that it will work for a wide variety of problems, so for a given line of code, there is no clue which type of problem it applies to.

So it’s best to stick to one convention to avoid confusion. Which one? Well, that’s the subject of wars, and there are lots more arguments (in both directions). But if you are struggling with the choice of zero-based indexing, it wasn’t made just to annoy you, it was made because it works really well for some types of problems. And you might find it becomes natural eventually.