Is static type checking a redundant testing mechanism?

by Luke Plant

— November 9, 2009 15:45

As there has been discussion about not writing unit tests recently, I thought I'd use my recent experience in finishing a non-trivial Haskell program to comment on the issue of writing tests (unit tests and other automated tests) in the context of real code.

I'm especially prompted by this comment by Ned Batchelder that I came across a few weeks ago:

Since static type checking can't cover all possibilities, you will need automated testing. Once you have automated testing, static type checking is redundant.

(that's in a comment on his own blog post)

To some extent I agree with this, but I want to give some reasons why a strong and powerful static type checker really does eliminate the need for automated tests in some cases – that is to say, there are instances when the static type checking makes the automated (unit/integration) tests redundant and not the other way around, and does a better job.

I have very few tests in my Haskell blog software. There are significantly more in the Ella library which I wrote alongside it, but still far from complete coverage. While I like test driven development, and did it for some parts of this project, many times it felt like a waste of time. In some cases it was perhaps misdirected laziness, but I'm not convinced it always was. So what are the characteristics of code that doesn't benefit from automated/unit tests?

Trivial code

If code is extremely simple, it can actually be worse to have tests than to not have them.

In defending that statement, the first thing to remember is that tests can have bugs in them too. Now, many bugs in the tests will be caught, as long as you follow the rule of making sure the test fails, then writing the code, then making sure it passes. However, bugs of omission, which are also very common, will not be caught, i.e. cases where the test fails to test something it ought to.

Second, there is always a cost to writing tests. So, as the probability of making a mistake in your code tends to zero, the usefulness of tests against that code also tends to zero—and not just to zero, it can go negative. You spent x minutes writing a test for something that didn't need testing, which is lost time and money already, and lost opportunities, and you also have extra (test) code to maintain in the future, and a longer test suite to run.

Third, you can write an infinite number of tests, and still have bugs. You can have 100% code coverage, and still have bugs. (I'll leave you to do the research on code coverage if you don't believe me). So, you have to stop somewhere, and therefore you need to know when to stop.

It is always a bad idea to write a test whose cost outweighs its value. That is, there is no neutral code – it is always positive or negative, because merely by existing it has a maintenance cost – not even counting the cost of producing it in the first place.

So suppose you write a utility function that is used to sanitise phone numbers that people might enter. It removes '-' and ' ' characters. (The result will of course be validated separately, but we want to allow people to enter phone numbers in a convenient way). In Python:

def sanitise_phone_number(s):
    return s.replace("-", "").replace(" ", "")

The testing fanatics might stop to write a unit test, but not the rest of us, because:

You would mainly be testing that the built-in string library works.
If you think of the ways that the function is likely to be wrong, the test is just as likely to fail to catch it. For example, the function above might really need to strip newline chars as well, but that's not going to be tested unless I think to write a test for that.
If there actually is a bug here, or the implementation gets more complex so that it merits a test, I can cross that bridge when I come to it, and it won't cost me extra.
It's more likely that I'll forget to use this function than that I get it wrong. Therefore, an integration test would be far more useful. But in some cases, integration tests can be extremely expensive, both to write and to run, especially when testing JavaScript-based web frontends, or GUIs that are not very testable. I'm almost certainly going to test this code by at least one manual integration test, and after that, do I really need to write an automatic one?

However, if I was writing the function in a language that was less capable than Python, I might well write a test for the above.
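To make the second point in that list concrete, here is a minimal sketch of the kind of test one might write for sanitise_phone_number. The test name and the sample numbers are made up for illustration. The test passes, yet it says nothing about the newline case, because nobody thought to ask:

```python
def sanitise_phone_number(s):
    return s.replace("-", "").replace(" ", "")


def test_sanitise_phone_number():
    # Checks only the cases the author of the test happened to think of.
    assert sanitise_phone_number("01234 567-890") == "01234567890"
    assert sanitise_phone_number("") == ""


test_sanitise_phone_number()  # passes without complaint

# The omission: a trailing newline is left in place, and no test notices.
leftover = sanitise_phone_number("01234 567890\n")
```

The value of leftover still ends in a newline, so the suite is green while the behaviour we actually cared about goes unchecked.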

Declarative code

(You could argue that this is an extension of trivial code, but it feels slightly different, and the case is even stronger).

Imagine your spec says that you should have 5 news items on the front page of your web site. You are using a library that has utility code for getting the first n items, or page x of n items each. And of course you are going to use a constant for that 5, rather than code it right in. So somewhere you are going to write (assuming Python):

NEWS_ITEMS_ON_HOME_PAGE = 5

Are you going to write a test that ensures that this value stays at 5, and doesn't accidentally get changed? Then your code base violates DRY—you now have two places where you are specifying the number of news items on the home page. That is, to some extent, the nature of all tests, but it's worse in this case. With non-declarative code and tests, one instance specifies behaviour, the other implementation, and it's usually obvious which is correct. But with declarative code, if one instance is different, how do you know which is correct?
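Spelled out, such a test would be little more than the constant restated. The test name here is hypothetical:

```python
NEWS_ITEMS_ON_HOME_PAGE = 5


def test_news_items_on_home_page():
    # The same number, written down a second time.
    assert NEWS_ITEMS_ON_HOME_PAGE == 5


test_news_items_on_home_page()
```

If the spec later changes to 7 items, whoever edits the constant has to edit the test as well, and a mismatch tells you only that the two copies disagree, not which one is right.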

Or are you going to write a test for the actual home page having 5 items? That would be pointless, because it's just testing that you are capable of calling a trivial API, which itself belongs to thoroughly tested code. You might want a sanity check that you haven’t made a typo, but checking that the page returns anything with a 200 code will often be enough.

What about something like a Django model? Your spec says that a 'restaurant' needs to have a 'name' which is a maximum of 100 chars. You write the following code:

class Restaurant(models.Model):
    name = models.CharField("Name", max_length=100)
    # ...

Are you going to write code to test that you've typed this in correctly? It would again be violating DRY. Are you going to check that this interfaces with the database correctly? There are already hundreds of tests in Django which cover this. Are you going to write tests that are effectively checking for typos? Well, if you use this model at all, it's going to be very obvious if you've made a mistake, and some other simple integration test is going to catch it.

Haskell

Now, coming to Haskell. You can guess the point I'm going to make.

In Haskell, a lot of code is either trivial or declarative.

Further, many of the types of errors you could make are caught by the compiler. Typos, missing imports and the like are always caught, and many other errors besides.

Functional programming languages, especially pure ones, eliminate a lot of the kind of mistakes that are easy in imperative languages. Everything being an expression helps a lot—it forces you to think about every branch and return a value. In monadic code it becomes possible to avoid this, but a lot of your code is pure functional.

Example 1

Imagine a more complex function than our sanitise_phone_number above. It's going to take a list of 'transformation' functions and an input value and apply each function to the value in turn, returning the final value. In some languages, that would be just about worth writing a test for. You might have to worry about iterating over the list, boundary conditions, etc. But in Haskell it looks like this:

import Data.List (foldl')

apply = foldl' (flip ($))

In the above definition, there is basically nothing that can go wrong. We already know that foldl' works, and isn't going to miss anything, or fail with an empty list. You can't forget to return the return value, like you can in Python. The compiler will catch any type errors. If the function doesn't do anything approaching what it's supposed to then you'll know as soon as you try to use it. I've used point-free style, so there isn't any chance of doing something silly with the input variables, because they don't even appear in the function definition!

For something like the above, you would often write your type signature first:

apply :: a -> [a -> a] -> a

Once you've done that, it's even harder to make a mistake. It's almost possible to try vaguely relevant code at random and see if it compiles. For something like this, if it compiles, and it looks very simple, it's probably correct. (There are obviously times when that will fail you, but it's amazing how often it doesn't. You often feel like you just have to keep doing what the compiler tells you and you'll get working code.)

Is the above code 'trivial' or 'declarative'? Well, that's a tough call. A lot of code in Haskell quickly becomes very declarative in style, especially when written point free.
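For comparison, this is roughly what the same function might look like in Python. Both versions are sketches written for illustration, not code from my blog software. The second one contains the "forgot to return" mistake, which Python accepts without complaint:

```python
from functools import reduce


def apply(value, transformations):
    # Feed the value through each function in turn, left to right.
    return reduce(lambda acc, f: f(acc), transformations, value)


def apply_buggy(value, transformations):
    for f in transformations:
        value = f(value)
    # Missing "return value": the function silently returns None.


steps = [lambda x: x + 1, lambda x: x * 2]
good = apply(3, steps)
bad = apply_buggy(3, steps)
```

Here good is 8 and bad is None. Nothing complains about apply_buggy until some caller tries to use the result, which is exactly the kind of mistake that would make a test worth writing in Python.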

Example 2

But what about something much bigger—say the generation of an Atom feed? With a library that makes use of a strong static type system, this can be actually quite hard to get wrong.

In my blog software, I use the feed library for Atom feeds. The code I've had to write is extremely simple—a matter of creating some data structures corresponding to Atom feeds. The data structures are defined to force you to supply all required elements. Where there is a choice of data type, it forces you to choose – for example the 'content' field has to be set with either HTMLContent "<h1>your content</h1>" or TextContent "Your content". (For those who don't know Haskell, it should also be pointed out that there is no equivalent to 'null'. Optional values are made explicit using the Maybe type).

After filling in all the values for these feeds, I wrote some very simple 'glue' functions that fed in the data and returned the result as an HTTP response. I created 4 different feeds, all of which worked perfectly first time, as soon as I got them to compile. I cannot see any value, and only cost, in adding tests for this. A check for a 200 response code and non-empty content might be worth it, but would be much easier to write as a bash script that uses 'curl' on a few known URLs.

Had I written this in Python, I might have wanted tests to ensure that the HTML in the Atom feed content was escaped properly and various other things, in addition to a simple check for status 200. But the API of the feed library, combined with the type checking that the compiler has done, has made that redundant, and has tested it far more easily and thoroughly than I could have done with tests.

And it's not in general true that the simple functional test will catch any type errors, because often it will only exercise one route through the code, ignoring the fact that in many places dynamically typed code can return values of different types, which can cause type failures etc.
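A small sketch of that situation, using hypothetical parse_quantity and total functions: the test exercises only the route on which every value is an int, so it passes, while the other route hands back a str that blows up later.

```python
def parse_quantity(raw):
    """Turn a piece of form input into a number."""
    if raw.isdigit():
        return int(raw)
    # Second route through the code: the input comes back unchanged, as a str.
    return raw


def total(raw_quantities):
    return sum(parse_quantity(raw) for raw in raw_quantities)


def test_total():
    # Only exercises the route where every value parses as an int.
    assert total(["3", "4"]) == 7


test_total()  # passes

# The route the test never took fails with a type error at runtime.
try:
    total(["3", "four"])
except TypeError as exc:
    failure = exc
else:
    failure = None
```

After running this, failure holds a TypeError raised when sum tries to add an int to a str, even though the only test in sight is green.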

Example 3

One final example of reducing the need for automated tests is the routing system I've used in Ella. OK, it's really a chance to show off the only slightly clever bit of code that I wrote, but hopefully it will explain something of the power of a strong type system :-)

Consider the following bits of code/configuration in a Django project, which are responsible for matching a URL, pulling out some bits from it and dispatching it to a view function.

### myproject/urls.py

from django.conf.urls.defaults import patterns

urlpatterns = patterns('',
   (r'^members/(\d+)/$', 'myproject.views.member_detail'),
   # etc...
)

### myproject/views.py

def member_detail(request, memberid):
    memberid = int(memberid)
    member = get_member(memberid)
    # etc...

Now, there are a number of possible failure points in this code that you might want some regression tests for. For example, if in the future we change it so that the URL uses a string such as a user name, rather than an integer, we will need to change the URLconf, the line in member_detail that calls int, and the definition of get_member (or use a different function).
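A toy reproduction of that failure mode, assuming a hand-rolled dispatcher in place of Django's real URL resolver (URL_PATTERNS, VIEWS and dispatch are stand-ins invented for this sketch). The pattern has been loosened to accept user names, but the view has not been updated to match:

```python
import re

# Toy stand-in for a URLconf: the pattern was changed from (\d+) to (\w+)
# so that user names are accepted...
URL_PATTERNS = [
    (re.compile(r"^members/(\w+)/$"), "member_detail"),
]


def member_detail(request, memberid):
    # ...but this line still assumes an integer.
    memberid = int(memberid)
    return "member #%d" % memberid


VIEWS = {"member_detail": member_detail}


def dispatch(path, request=None):
    """Find the first matching pattern and call its view with the captures."""
    for pattern, view_name in URL_PATTERNS:
        match = pattern.match(path)
        if match:
            return VIEWS[view_name](request, *match.groups())
    return None  # a real framework would return a 404 here


ok = dispatch("members/42/")

try:
    dispatch("members/alice/")
except ValueError as exc:
    failure = exc
else:
    failure = None
```

The old URL still works, so a test of the happy case stays green, while members/alice/ only fails once somebody actually requests it.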

There is a DRY or OAOO failure here – the fact that we are expecting an integer is specified multiple times, either implicitly or explicitly. This is one of the causes of fragility in this chunk of code – if one is changed, the others might not be updated, introducing bugs of different kinds. Now, there are things you can do about this, with some small or large changes to how URLconfs work. But they are not complete solutions, and one solution not open to Python developers is the one I coded in Ella.

The equivalent bits of code, with type signatures and explanations of them for those who don't know any Haskell, would look like this in my system.

----- MyProject/Routes.hs

import MyProject.Views

routes = [
   "members/" <+/> intParam //-> memberDetail $ []
   -- etc...
]

----- MyProject/Views.hs

-- memberDetail takes an 'Int' and an HTTP 'Request' object, and returns an
--  HTTP 'Response' (or 'Nothing' to indicate a 404), doing some IO on the
--  way.
memberDetail :: Int -> Request -> IO (Maybe Response)
memberDetail memberId request = do
   member <- getMember memberId
   -- etc...

You should read <+/> as ‘followed by’ and //-> as ‘routes to’. Just ignore the $ [] bit for now (it exists to allow decorators to be applied easily in the routing configuration, but we are applying no decorators, hence the empty list).

intParam is a ‘matcher’: it attempts to pull off the next chunk of the URL (ending in a '/'), match it and parse it as an integer. If it can do so, it passes the parsed value on to memberDetail as a parameter i.e. it partially applies memberDetail with an integer.

The beauty of this system is that nothing can go wrong any more. We still have DRY violations at the moment, but they don't cause a problem, because the compiler checks for consistency.

In fact, we can even remove the DRY violation. We could change the code like this:

----- MyProject/Routes.hs

import MyProject.Views

routes = [
   "members/" <+/> anyParam //-> memberDetail $ []
   -- etc...
]

----- MyProject/Views.hs

memberDetail memberId request = do
   member <- getMember memberId
   -- etc...

We've replaced intParam with anyParam, which is a polymorphic version that can match any parameter of type class Param. You can define your own Param instances, so this is completely extensible (and you can also define your own matchers, for complete power). We've also removed the type signature from memberDetail. So how can anyParam know what type of thing to match?

This is where type inference comes in. The function getMember will probably have a type signature, or it will use its parameter in such a way that its type signature can be inferred. From that, the type of memberId can be inferred. From that, the type of value that anyParam must return can be inferred. And from that, finally, the instance of Param can be chosen. The compiler is using the type system to pick which method should be used to match and parse the URL parameters based on how those parameters are eventually used.

This is very nice. (At least I think so :-). We've removed the DRY violation, or, if we choose to use type signatures or explicitly specify types in routes, DRY violations don't matter because the compiler will catch them for us.

Would unit or functional tests have caught any problems? Well, they might. If they checked the happy case, they will prove whether that still works. But they're unlikely to check whether the URLconf is too permissive or not. But the compiler can do that kind of consistency check.

The end result is that there are just fewer things that can possibly go wrong. I'm not saying that you wouldn't bother to write any tests. But in this case, if memberDetail was really just glue, you might decide to only test its component parts (for example, by testing the template that it relies on). Since most of the glue has been constructed so that it can't go wrong, you can focus tests on what can go wrong. And some sections of the code sink below the threshold at which tests provide positive value.

There are many other ways in which static type checking can make automated tests redundant. Parsers are a great example – a spec might define a syntax in BNF notation. In Haskell, you might well implement that using parsec. But if you look at the code, it will have pretty much a one-to-one correspondence with the BNF definitions. Any tests you write will simply check that a few examples happen to be parsed correctly, as you cannot begin to cover the input space. It's therefore far better to spend your time manually checking that the code matches the BNF spec than writing lots of tests.
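parsec is a Haskell library, so the following is only a Python stand-in for the idea, with a made-up two-rule grammar. The point it tries to show is the correspondence: one function per grammar rule, each readable side by side with the rule it implements.

```python
# Hypothetical grammar, in BNF-like notation:
#   expr   ::= number ("+" number)*
#   number ::= digit+
DIGITS = "0123456789"


def parse_number(text, pos):
    """number ::= digit+   Returns (value, next position)."""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError("expected a digit at position %d" % pos)
    return int(text[pos:end]), end


def parse_expr(text, pos=0):
    """expr ::= number ("+" number)*   Returns (sum, next position)."""
    value, pos = parse_number(text, pos)
    while pos < len(text) and text[pos] == "+":
        term, pos = parse_number(text, pos + 1)
        value += term
    return value, pos


result = parse_expr("1+20+3")
```

A handful of example inputs like the one above can only sample the input space, whereas reading each function against its rule checks the whole rule at once.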

It's also often argued that integration/unit tests that achieve 100% coverage will catch all type related errors, making static type checking redundant, since even with static type checking we'll need tests to catch the value related errors. But this is a myth. In Python, it's easy to have code that has 100% test coverage and still contains type errors. A simple example:

from datetime import date

from django.db import models
from django.test import TestCase

class Discount(models.Model):
    # nullable if there is no expiry:
    expires_on = models.DateField(null=True)

    @property
    def has_expired(self):
        return date.today() > self.expires_on

class DiscountTests(TestCase):
    def test_has_expired(self):
        d = Discount(expires_on=date(2000, 1, 1))
        self.assertEqual(d.has_expired, True)

I omitted the negative case for has_expired for brevity, but we already have 100% coverage. However, we didn't check the None case, and we'll get a TypeError at runtime for some legitimate values. In dynamically typed languages (or indeed any language that allows nullable values), unit testing is extremely unhelpful for this situation. A powerful static type system like that found in Haskell, on the other hand, will find the error at compile time, and require that the signature of has_expired changes, along with all the related code. The changes needed to get it to compile are almost impossible to get wrong, so for the case of having no expiry date, you have trivial code that does not need hand-written automated tests (that is to say, the test you would have written would have so little value, and relatively high costs, that writing one would be a failure of judgement and a waste of your current and future resources).
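The same failure can be reproduced without Django in a few lines. This is a sketch with a plain class standing in for the model; has_expired_checked is a hypothetical name for the version with the None case handled explicitly, which is roughly the branch that Haskell's Maybe type would force you to write:

```python
from datetime import date
from typing import Optional


class Discount:
    def __init__(self, expires_on: Optional[date]):
        # None means the discount never expires.
        self.expires_on = expires_on

    @property
    def has_expired(self) -> bool:
        return date.today() > self.expires_on

    @property
    def has_expired_checked(self) -> bool:
        # The "no expiry" case, spelled out instead of left to chance.
        if self.expires_on is None:
            return False
        return date.today() > self.expires_on


# This single check executes every line of has_expired: 100% coverage.
assert Discount(date(2000, 1, 1)).has_expired is True

# A perfectly legitimate value still fails at runtime.
try:
    Discount(None).has_expired
except TypeError as exc:
    failure = exc
else:
    failure = None

never_expires = Discount(None).has_expired_checked
```

The comparison between a date and None raises TypeError, while the checked version simply reports that a discount with no expiry date has not expired.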

In general, unit tests often will not catch the type of errors that a compiler can if there is any polymorphism in the code paths. And in dynamically typed code, almost every code path can have polymorphism, because you can usually pass in None (and very often this is reasonable and legitimate), or any duck-typed object, and in the code itself you simply cannot tell how it will be called.

Conclusion

Before you flame me, don't think that I'm attacking other languages. This experience with Haskell has actually proved to me that Python is still easily my favourite language for web development, especially in combination with Django. (I could do a follow-up on why that is; I have a growing list of things I dislike about Haskell, some of which are fixable.) But I often hear the Python crowd saying things about static typing and testing that come from ignorance, and from the way you would imagine things to be (often based on experience of Java/C++/C#), and not from experience of something like Haskell.

Notes

2017-05-15 - Added examples about tests not catching type errors, and opportunity cost.