As there has been discussion about not writing unit tests recently, I thought I'd use my recent experience in finishing a non-trivial Haskell program to comment on the issue of writing tests (unit tests and other automated tests) in the context of real code.
I'm especially prompted by this comment by Ned Batchelder that I came across a few weeks ago:
Since static type checking can't cover all possibilities, you will need automated testing. Once you have automated testing, static type checking is redundant.
(that's in a comment on his own blog post)
To some extent I agree with this, but I want to give some reasons why a strong and powerful static type checker really does eliminate the need for automated tests in some cases—that is to say, there are instances when the static type checking makes the automated tests redundant and not the other way around, and does a better job.
I have very few tests in my Haskell blog software. There are significantly more in the Ella library which I wrote alongside it, but still far from complete coverage. While I like test driven development, and did it for some parts of this project, many times it felt like a waste of time. In some cases it was perhaps misdirected laziness, but I'm not convinced it always was. So what are the characteristics of code that doesn't benefit from automated/unit tests?
Trivial code
If code is extremely simple, it can actually be worse to have tests than to not have them.
In defending that statement, the first thing to remember is that tests can have bugs in them too. Now, many bugs in the tests will be caught, as long as you follow the rule of making sure the test fails, then writing the code, then making sure it passes. However, bugs of omission, where the test fails to test something it ought to, are also very common, and those will not be caught.
Second, there is always a cost to writing tests. So, as the probability of making a mistake in your code tends to zero, the usefulness of tests against that code also tends to zero—and not just to zero, it can go negative. You spent x minutes writing a test for something that didn't need testing, which is lost time and money already, and you also have extra (test) code to maintain in the future, and a longer test suite to run.
Third, you can write an infinite number of tests, and still have bugs. You can have 100% code coverage, and still have bugs. (I'll leave you to do the research on code coverage if you don't believe me). So, you have to stop somewhere, and therefore you need to know when to stop.
So suppose you write a utility function that is used to sanitise phone numbers that people might enter. It removes '-' and ' ' characters. (The result will of course be validated separately, but we want to allow people to enter phone numbers in a convenient way). In Python:
def sanitise_phone_number(s):
    return s.replace("-", "").replace(" ", "")
The testing fanatics might stop to write a unit test, but not the rest of us, because:
You would mainly be testing that the built-in string library works.
If you think of the ways that the function is likely to be wrong, the test is just as likely to fail to catch it. For example, the function above might really need to strip newline chars as well, but that's not going to be tested unless I think to write a test for that.
If there actually is a bug here, or the implementation gets more complex so that it merits a test, I can cross that bridge when I come to it, and it won't cost me extra.
It's more likely that I'll forget to use this function than that I get it wrong. Therefore, an integration test would be far more useful. But in some cases, integration tests can be extremely expensive, both to write and to run, especially when testing javascript based web frontends, or GUIs that are not very testable. I'm almost certainly going to test this code by at least one manual integration test, and after that, do I really need to write an automatic one?
However, if I was writing the function in a language that was less capable than Python, I might well write a test for the above.
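To make the 'bug of omission' point concrete, here is a small sketch. The test function and the sample numbers are made up for illustration; the point is only that the obvious test stays green while the newline case slips through.

```python
def sanitise_phone_number(s):
    """Strip the separators people commonly type into phone numbers."""
    return s.replace("-", "").replace(" ", "")


def test_sanitise_phone_number():
    # The obvious test: it passes, but only covers the cases we thought of.
    assert sanitise_phone_number("01234 567-890") == "01234567890"


test_sanitise_phone_number()

# The omission: a trailing newline passes straight through, and the test
# above never notices because it never asks about newlines.
leftover = sanitise_phone_number("01234 567-890\n")
```

Here leftover still ends in a newline, so the test bought us a green light without telling us anything about the case most likely to bite.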
Declarative code
(You could argue that this is an extension of trivial code, but it feels slightly different, and the case is even stronger).
Imagine your spec says that you should have 5 news items on the front page of your web site. You are using a library that has utility code for getting the first n items, or page x of n items each. And of course you are going to use a constant for that 5, rather than code it right in. So somewhere you are going to write (assuming Python):
NEWS_ITEMS_ON_HOME_PAGE = 5
Are you going to write a test that ensures that this value stays at 5, and doesn't accidentally get changed? Then your code base violates DRY—you now have two places where you are specifying the number of news items on the home page. That is, to some extent, the nature of all tests, but it's worse in this case. With non-declarative code and tests, one instance specifies behaviour, the other implementation, and it's usually obvious which is correct. But with declarative code, if one instance is different, how do you know which is correct?
Or are you going to write a test for the actual home page having 5 items? That would be pointless, because it's just testing that you are capable of calling a trivial API, which itself belongs to thoroughly tested code. You might want a sanity check that you haven't made a typo, but checking that the page returns anything with a 200 code will often be enough.
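The two options might be sketched like this. Note that home_page_items is a hypothetical stand-in for whatever 'first n items' call your library provides, and smoke_check stands in for the 'returns anything with a 200' check.

```python
NEWS_ITEMS_ON_HOME_PAGE = 5


def test_constant_unchanged():
    # This just restates the declaration: a second place to edit, and no way
    # to tell which copy is right when the two disagree.
    assert NEWS_ITEMS_ON_HOME_PAGE == 5


def home_page_items(all_items):
    """Hypothetical stand-in for the library's 'first n items' helper."""
    return all_items[:NEWS_ITEMS_ON_HOME_PAGE]


def smoke_check():
    # The cheaper sanity check: the page logic runs and yields something.
    return len(home_page_items(list(range(20)))) > 0
```

The first test can only ever fail when someone deliberately changes the constant, at which point they will simply change the test as well.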
What about something like a Django model? Your spec says that a 'restaurant' needs to have a 'name' which is a maximum of 100 chars. You write the following code:
class Restaurant(models.Model):
    name = models.CharField("Name", max_length=100)
    # ...
Are you going to write code to test that you've typed this in correctly? It would again be violating DRY. Are you going to check that this interfaces with the database correctly? There are already hundreds of tests in Django which cover this. Are you going to write tests that are effectively checking for typos? Well, if you use this model at all, it's going to be very obvious if you've made a mistake, and some other simple integration test is going to catch it.
Haskell
Now, coming to Haskell. You can guess the point I'm going to make.
In Haskell, a lot of code is either trivial or declarative.
Further, many of the types of errors you could make are caught by the compiler. Typos and missing imports etc. are always caught, and many other errors beside.
Functional programming languages, especially pure ones, eliminate a lot of the kind of mistakes that are easy in imperative languages. Everything being an expression helps a lot—it forces you to think about every branch and return a value. In monadic code it becomes possible to avoid this, but a lot of your code is pure functional.
Example 1
Imagine a more complex function than our sanitise_phone_number above. It's going to take a list of 'transformation' functions and an input value and apply each function to the value in turn, returning the final value. In some languages, that would be just about worth writing a test for. You might have to worry about iterating over the list, boundary conditions, etc. But in Haskell it looks like this:
apply = foldl' (flip ($))
In the above definition, there is basically nothing that can go wrong. We already know that foldl' works, and isn't going to miss anything, or fail with an empty list. You can't forget to return the return value, like you can in Python. The compiler will catch any type errors. If the function doesn't do anything approaching what it's supposed to then you'll know as soon as you try to use it. I've used point-free style, so there isn't any chance of doing something silly with the input variables, because they don't even appear in the function definition!
For something like the above, you would often write your type signature first:
apply :: a -> [a -> a] -> a
Once you've done that, it's even harder to make a mistake. It's almost possible to try vaguely relevant code at random and see if it compiles. For something like this, if it compiles, and it looks very simple, it's probably correct. (There are obviously times when that will fail you, but it's amazing how often it doesn't. You often feel like you just have to keep doing what the compiler tells you and you'll get working code.)
Is the above code 'trivial' or 'declarative'? Well, that's a tough call. A lot of code in Haskell quickly becomes very declarative in style, especially when written point free.
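For comparison, here is a sketch of how the same function might look in Python. The names apply and apply_buggy are mine, chosen for illustration. The second version shows the 'forgot to return' mistake mentioned earlier, which Python accepts without complaint.

```python
from functools import reduce


def apply(value, transforms):
    """Apply each function in `transforms` to `value` in turn."""
    return reduce(lambda acc, f: f(acc), transforms, value)


def apply_buggy(value, transforms):
    """Same idea as a loop, but the final `return value` was forgotten."""
    for f in transforms:
        value = f(value)
    # No return statement: every caller silently receives None.


cleaned = apply(
    " 01234-567 890 ",
    [str.strip, lambda s: s.replace("-", ""), lambda s: s.replace(" ", "")],
)
```

Nothing flags apply_buggy until some caller trips over the None, which is exactly the kind of mistake that would justify a test in Python.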
Example 2
But what about something much bigger—say the generation of an Atom feed? With a library that makes use of a strong static type system, this can be actually quite hard to get wrong.
In my blog software, I use the feed library for Atom feeds. The code I've had to write is extremely simple—a matter of creating some data structures corresponding to Atom feeds. The data structures are defined to force you to supply all required elements. Where there is a choice of data type, it forces you to choose — for example the 'content' field has to be set with either HTMLContent "<h1>your content</h1>" or TextContent "Your content". (For those who don't know Haskell, it should also be pointed out that there is no equivalent to 'null'. Optional values are made explicit using the Maybe type).
After filling in all the values for these feeds, I wrote some very simple 'glue' functions that fed in the data and returned the result as an HTTP response. I created 4 different feeds, all of which worked perfectly first time, as soon as I got them to compile. I cannot see any value, and only cost, in adding tests for this. A check for a 200 response code and non empty content might be worth it, but would be much easier to write as a bash script that uses 'curl' on a few known URLs.
Had I written this in Python, I might have wanted tests to ensure that the HTML in the Atom feed content was escaped properly and various other things, in addition to a simple check for status 200. But the API of the feed library, combined with the type checking that the compiler has done, has made that redundant, and has tested it far more easily and thoroughly than I could have done with tests.
And it's not in general true that the simple functional test will catch any type errors, because often it will only exercise one route through the code, ignoring the fact that in many places dynamically typed code can return values of different types, which can cause type failures etc.
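A small illustration of that last point, using made-up function names: one route through the code returns an int, another returns None, and a functional test of the happy path never sees the difference.

```python
def find_member_id(username, members):
    """Return the id for `username`, or None when there is no such member."""
    for member_id, name in members.items():
        if name == username:
            return member_id
    return None  # a different type from the happy path


def next_member_id(username, members):
    # Fine when the member exists; raises TypeError when the lookup gave None.
    return find_member_id(username, members) + 1


members = {1: "alice", 2: "bob"}
happy = next_member_id("bob", members)
```

A test that only looks up "bob" passes. The TypeError only appears when someone asks for a member who isn't there.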
Example 3
One final example of reducing the need for automated tests is the routing system I've used in Ella. OK, it's really a chance to show off the only slightly clever bit of code that I wrote, but hopefully it will explain something of the power of a strong type system :-)
Consider the following bits of code/configuration in a Django project, which are responsible for matching a URL, pulling out some bits from it and dispatching it to a view function.
### myproject/urls.py
patterns = ('',
    (r'^members/(\d+)/$', 'myproject.views.member_detail'),
    # etc...
)
### myproject/views.py
def member_detail(request, memberid):
    memberid = int(memberid)
    member = get_member(memberid)
    # etc...
Now, there are a number of possible failure points in this code that you might want some regression tests for. For example, if in the future we change it so that the URL uses a string such as a user name, rather than an integer, we will need to change the URLconf, the line in member_detail that calls int, and the definition of get_member (or use a different function).
There is a DRY or OAOO failure here—the fact that we are expecting an integer is specified multiple times, either implicitly or explicitly. This is one of the causes of fragility in this chunk of code — if one is changed, the others might not be updated, introducing bugs of different kinds. Now, there are things you can do about this, with some small or large changes to how URLconfs work. But they are not complete solutions, and one solution not open to Python developers is the one I coded in Ella.
The equivalent bits of code, with type signatures and explanations of them for those who don't know any Haskell, would look like this in my system.
----- MyProject/Routes.hs
import MyProject.Views
routes = [
    "members/" <+/> intParam //-> memberDetail $ []
    -- etc...
  ]
----- MyProject/Views.hs
-- memberDetail takes an 'Int' and an HTTP 'Request' object, and returns an
-- HTTP 'Response' (or 'Nothing' to indicate a 404), doing some IO on the
-- way.
memberDetail :: Int -> Request -> IO (Maybe Response)
memberDetail memberId request = do
  member <- getMember memberId
  -- etc...
You should read <+/> as ‘followed by’ and //-> as ‘routes to’. Just ignore the $ [] bit for now (it exists to allow decorators to be applied easily in the routing configuration, but we are applying no decorators, hence the empty list).
intParam is a ‘matcher’: it attempts to pull off the next chunk of the URL (ending in a '/'), match it and parse it as an integer. If it can do so, it passes the parsed value on to memberDetail as a parameter i.e. it partially applies memberDetail with an integer.
The beauty of this system is that nothing can go wrong any more. We still have DRY violations at the moment, but they don't cause a problem, because the compiler checks for consistency.
In fact, we can even remove the DRY violation. We could change the code like this:
----- MyProject/Routes.hs
import MyProject.Views
routes = [
    "members/" <+/> anyParam //-> memberDetail $ []
    -- etc...
  ]
----- MyProject/Views.hs
memberDetail memberId request = do
  member <- getMember memberId
  -- etc...
We've replaced intParam with anyParam, which is a polymorphic version that can match any parameter of type class Param. You can define your own Param instances, so this is completely extensible (and you can also define your own matchers, for complete power). We've also removed the type signature from memberDetail. So how can anyParam know what type of thing to match?
This is where type inference comes in. The function getMember will probably have a type signature, or it will use its parameter in such a way that its type signature can be inferred. From that, the type of memberId can be inferred. From that, the type of value that anyParam must return can be inferred. And from that, finally, the instance of Param can be chosen. The compiler is using the type system to pick which method should be used to match and parse the URL parameters based on how those parameters are eventually used.
This is very nice. (At least I think so :-). We've removed the DRY violation, or, if we choose to use type signatures or explicitly specify types in routes, DRY violations don't matter because the compiler will catch them for us.
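For Python readers, the closest thing I can sketch is a rough runtime analogue, and it is emphatically not what Ella does. Everything here (PARSERS, any_param, route) is hypothetical. It picks a parser from an explicit annotation on the view, whereas the Haskell version needs no annotation at all and is checked before the program ever runs.

```python
import inspect

# Hypothetical registry playing the role of Param instances: type -> parser.
PARSERS = {int: int, str: str}


def any_param(view):
    """Pick a parser from the annotation on the view's first parameter.

    Raises KeyError if the parameter is unannotated or has no parser.
    """
    first = next(iter(inspect.signature(view).parameters.values()))
    return PARSERS[first.annotation]


def member_detail(member_id: int, request):
    return f"member {member_id}"


def route(view, chunk, request):
    """Parse one URL chunk for `view`; None stands in for a 404."""
    parse = any_param(view)
    try:
        value = parse(chunk)
    except ValueError:
        return None
    return view(value, request)
```

The lookup happens when a request arrives, so a missing or wrong annotation is only discovered at runtime, and nothing propagates the type from get_member back to the route.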
Would unit or functional tests have caught any problems? Well, they might. If they checked the happy case, they will prove whether that still works. But they're unlikely to check whether the URLconf is too permissive or not. But the compiler can do that kind of consistency check.
The end result is that there are just fewer things that can possibly go wrong. I'm not saying that you wouldn't bother to write any tests. But in this case, if memberDetail was really just glue, you might decide to only test its component parts (for example, by testing the template that it relies on). Since most of the glue has been constructed so that it can't go wrong, you can focus tests on what can go wrong. And some sections of the code sink below the threshold at which tests provide positive value.
There are many other ways in which static type checking can make automated tests redundant. Parsers are a great example — a spec might define a syntax in BNF notation. In Haskell, you might well implement that using parsec. But if you look at the code, it will have pretty much a one-to-one correspondence with the BNF definitions. Any tests you write will simply check that a few examples happen to be parsed correctly, as you cannot begin to cover the input space. It's therefore far better to spend your time manually checking that the code matches the BNF spec than writing lots of tests. Unit tests often will not catch the type of errors that a compiler can if there is any polymorphism in the code paths.
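To show what 'one-to-one correspondence with the BNF' looks like, here is a toy sketch in Python rather than parsec. The two-rule grammar is invented for the example, and each function carries its rule as a docstring.

```python
DIGITS = "0123456789"


def parse_number(text, pos):
    """<number> ::= <digit> | <digit> <number>"""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at position {pos}")
    return int(text[pos:end]), end


def parse_list(text, pos=0):
    """<list> ::= <number> | <number> "," <list>"""
    value, pos = parse_number(text, pos)
    if pos < len(text) and text[pos] == ",":
        rest, pos = parse_list(text, pos + 1)
        return [value] + rest, pos
    return [value], pos
```

Checking each function against its rule by eye covers every input the rule describes. A handful of example-based tests covers only the handful of examples.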
Conclusion
Before you flame me, don't think that I'm attacking other languages. This experience with Haskell has actually proved to me that Python is still easily my favourite language for web development, especially in combination with Django. (I could do a follow up on why that is—I have a growing list of things I dislike about Haskell, some of which are fixable). But I often hear the Python crowd saying things about static typing and testing that come from ignorance, and the way you would imagine things to be (often based on experience of Java/C++/C#), and not from experience of something like Haskell.
Comments
-Tim
Your spam filter sucks.
The "typos and missing imports" argument isn't persuasive to Python programmers; in my years using Python, I've never had any significant bug caused by a typo. Such errors are usually caught in a few seconds after trying to use whatever functionality was just added.
A better argument is that static typing automates much of the drudge-work validation that goes on in Python. For example, instead of:
The validation can be pushed off to clients, and duplication reduced:
The IO functions don't need to perform any validation, because they're guaranteed that whatever value they're provided is already valid (unless somebody's playing silly buggers with unsafeCoerce).
Could you elaborate on this? Why do you believe there is a difference between monadic and functional code?
@John:
Sure, that wasn't my main argument. It only works in the context of other arguments that are lowering the need to write any tests. Still, there are cases when you think you have tried to use the functionality added, but actually you haven't. Here's one that slipped through precisely because of a think-o like that.
Perhaps it's a faulty intuition. But take 'if' in Haskell: it's an expression, so you have to supply both 'then' and 'else', which forces a certain rigour in how you think and structure code, compared to the 'if' of other languages. But in monadic code, there are things like 'when', which get you back to being able to avoid that rigour.
Monadic code is also an expression -- 'do' notation is just a mechanical transformation of lambdas into a nicer syntax. Thus:
is unsugared/equivalent to:
Or, looking at the definition of 'when':
The only time when Haskell code might not be purely functional is when there's an unsafePerformIO lurking around.
Hey Luke, overall this was a nice article and I appreciated it.
Anyway, this is a slight tangent, but I was curious if you had looked at HAppS and/or Happstack before writing your Ella library?
If you have, how would you compare them?
(In my limited knowledge of the Haskell world, these are the HTTP libraries that stand out.)
Quoting SPJ's word, types are like specifications, type checking is just like validating your specification/design. It's really doing a "universal" check for most of the static properties that you want to have in your design.
Unit testing is output oriented, it helps to guarantee that the set/subset of output we care about is correct.
For the sake of those not wanting to trawl through the related posts, I'll answer Max's question: I didn't look much at HAppS. I couldn't use it on my budget host (long running process), and I didn't want to buy in to its whole approach to storage, as I want to be able to access my data in more traditional ways. And two years ago, when I started this, HAppS wasn't exactly well documented, and Happstack didn't exist. It wasn't a good horse to bet on at the time!
I was also curious to see how Haskell fared when competing under the same terms as other languages. HAppS is written in a way which really makes the most of Haskell's features, which is of course a very sensible thing to do, but often the real world has constraints that we can't just ignore because they are awkward.
If you have a spec and generate a test plan from that, then you should, for example, be checking for limits. There are languages which limit the length of strings, for example, or don't allow zero bytes in a string. Sufficient testing will find this type of problem. You could write a routine with correct static typing to sum numbers, but the testing may find that you lose too much precision with the number representation used.
Do you really advocate shipping on successful compilation?
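The precision point can be made concrete in Python, whose floats are IEEE 754 doubles. This is a small sketch and not taken from the comment. Both totals have the same type, so no type checker would distinguish them.

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation: each addition rounds to the nearest
# double, and the rounding errors accumulate.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum keeps track of the lost low-order bits and rounds once at the end.
accurate_total = math.fsum(values)
```

Here naive_total falls just short of 1.0 while accurate_total is exactly 1.0, which is the sort of difference only a test (or a careful reading) would reveal.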
In other words there are things that the logic of the compiler's type checker can prove that cannot be proved in the logic of the language itself.
I'll just add that another reason I don't do TDD is because, often, what I want to get is not very clear until I have gone reasonably far into the design. It could be said that the program evolves along the design in a way that by the time I know against what I should be testing my code, the code is already done.
If anything, tests are only necessary during refactoring, and the fixtures of the test for the refactoring can only be reliably dictated by the prior version of the program. In a way, the only real tests you need are:
for inputs in inputlist: assert oldcode(inputs) == newcode(inputs).
About static vs dynamic typing, I must insist that I mainly complain about typing in the way Java does. Even inference typing in the way of C# feels more like a crutch than a help. So I prefer dynamic typing always.
But the way you bubbled the format of memberId from "somewhere inside getMember" to the router was beautiful! I'll take a look at Haskell again thanks to this post!
@Paddy3118:
Nowhere did I say that I don't use any tests. I will almost always run the test suite before releasing code, if that is feasible.
But in some cases, I will indeed make changes to a bit of code that has no specific tests against it, and will 'ship' if the code compiles without warnings. Normally that will only be for trivial/declarative code.
With regards to your other examples, like needing tests to check for loss of precision or limits being exceeded, I've got two responses.
Often that's not necessary: for example, if it's integer arithmetic and you choose something like Haskell's Integer, then you have (effectively) infinite precision. (You are bounded by the amount of memory available).
A general solution that fixes the problem universally is usually much better than trying to catch individual problems. For example, in my Django projects I have no tests that check for SQL injection attacks, XSS or CSRF, because the libraries included have taken care of that for me. Similarly, you can often use a strong type system to eliminate huge classes of other bugs.
‘Enough’ testing will certainly catch all bugs in the end, by the definition of ‘enough’. But what about the cost of those tests? Often they will be prohibitively expensive, and you are not going to write them until you need them. (If they are part of your spec, then you probably need them from the beginning, of course. Many specs don't think about that kind of thing).
Type checkers are conservative, so any program they prove valid is valid, but there are valid programs they can't prove are valid. That's why Gödel's theorem doesn't render them useless.
---
About testing a method like this:
def sanitise_phone_number(s):
    return s.replace("-", "").replace(" ", "")
"You would mainly be testing that the built-in string library works."
Sure, making a unittest for this method alone is waste. But surely you will have a more complete test, that involves sanitise_phone_number code. That's how you get test coverage, not making tests for every method, but testing towards the desired final output.
"If you think of the ways that the function is likely to be wrong, the test is just as likely to fail to catch it. For example, the function above might really need to strip newline chars as well, but that's not going to be tested unless I think to write a test for that."
The beautiful thing with tests is that, you can test against live data as well. While the idea of unit tests is testing against specification (and that's why you create stubs/fakes), testing against live data is an easy way of pin-pointing unexpected results.
---
About the URL routing with strong typing:
You're focusing more on the problem of type inference than with the problem of URL matching itself. The reasoning that you need type inference to get a robust implementation of URL matching is weak.
The nature of Django URL's regex matching is so that you can have something like:
(r'^members/(?P<pk>\d+)/$', 'myproject.views.member_detail'),
(r'^members/(?P<name>\w+)/$', 'myproject.views.member_detail'),
This way we know which URL matched by looking at the available parameters and do the necessary type castings, as we know what is the type of each lookup field.
def member_detail(request, **kwargs):
    try:
        if 'pk' in kwargs.keys():
            kwargs['pk'] = int(kwargs['pk'])
        m = Member.objects.get(**kwargs)
    except Member.DoesNotExist:
        raise Http404
You're just exchanging compile-time type inference with runtime type inference. Works the same still.
Also it's naive to say that any parameter that matches '\d+' should be cast to int (I can have a CharField that expects a number representation as a string), so your statically typed implementation is not really that flexible.
@Henrique:
That's the point, it doesn't work the same. With the compile-time type inference, it's impossible for anything to go wrong, as the compiler guarantees consistency. If you do it at run-time, it can go wrong. And it's not possible to automatically propagate the type information from one part of the code to the other.
I didn't say that...
The implementation is flexible enough to cope with that kind of requirement. You would create another type, say DigitString, with constructors that only allow digits to be entered. You make an appropriate instance of Param, and now you can use DigitString wherever you need it, including in URL routing, and the compiler ensures consistency.
I'm not saying that you can't do runtime type checks. Of course you can, Django/Python work very well that way. But the static type checks do have their advantages.
" - You're just exchanging compile-time type inference with runtime type inference. Works the same still.
That's the point, it doesn't work the same. With the compile-time type inference, it's impossible for anything to go wrong, as the compiler guarantees consistency. If you do it at run-time, it can go wrong. And it's not possible to automatically propagate the type information from one part of the code to the other."
My runtime example still won't show any unexpected 500's. It will just throw a 400 if the URL can't be matched, or if the parameters don't match anything. That isn't what I would call "going wrong", it's failing gracefully. The only difference is that static type inference will avoid having 400's when your own code asks for the method with wrong parameters - but still has to fail with a 400 for user requests. In this case I don't see my own code requests to be any different from user requests, so I just expect it to fail with a 400. The good thing here, is that what happens with unexpected parameters is the same - a 400 - so it's easy to test the behavior with unittests, be it for my own code or user requests. So, indeed, I think it makes static type inference redundant.
@Henrique,
The difference is that you have to manually program that behaviour in every time you want it. Of course it is possible to get dynamically typed code to behave (externally) the same as statically typed code. The question is how easy is it to avoid mistakes? You have a load of boilerplate to insert in every view function to effectively do type checking/coercion. I'm not denying that it works. But what happens if you forget it? What happens if you include it but make mistakes that happen to work with the happy case? Or if one part of it is changed, and the other not synced to match the change? With the statically typed method, your code can be much shorter, and it's machine checked, and machines don't get bored. With Haskell you can (effectively) get the compiler to write that code for you.
---
NEWS_ITEMS_ON_HOME_PAGE = 5
Are you going to write a test that ensures that this value stays at 5, and doesn't accidentally get changed?
---
No. I'm going to write a SPECIFICATION. My customer said "the number of items on the home page should be five (or less)". This gets written in code so I don't forget it. I'm NOT testing that the value doesn't get accidentally changed - I am writing down what the customer asked for so I won't forget and ignore it.
Note that if the above value gets changed (for whatever reason) to 4, my "test" won't break. The customer only said "five or less". If, after the change, the customer refines his requirement, I can add another specification saying "if the second page has items, the first one must have exactly five items"; this gets added to my written specifications as code.
I don't think you can get the same effect (a list of specifications written in code) just from static typing, and I submit that having such a list is a net plus.
The important part about writing down specs was "this gets written IN CODE". Don't think of it as TDD - think of it as specification-driven coding, even if it doesn't make for a nice acronym.
Tests are something you do after coding. Integration tests, smoke tests and so on - these are tests. What you do before coding is writing specifications - stuff that allows you to see how your implementation gets closer and closer to what it should be.
I find Misko Hevery a good writer about TDD-related stuff (I'll keep using the acronym because it's what everyone else says, but think "specifications" instead of tests). This article - http://misko.hevery.com/2009/09/02/it-is-not-about-writing-tests-its-about-writing-stories/ - is a good example of what I mean.
I'm surprised no one has mentioned QuickCheck, as it's a prime example of getting extra testing mileage out of the type system. And unlike BDD-style testing, you can write these specifications without having to cook up arbitrary testing data for each case (which will tend not to exercise the edge cases you've overlooked).
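For readers who haven't met QuickCheck, the idea can be imitated in a few lines of Python. This is a hand-rolled sketch, not QuickCheck's API: check_property, the alphabet and both properties are invented for illustration, and real QuickCheck derives its generators from the types.

```python
import random


def sanitise_phone_number(s):
    return s.replace("-", "").replace(" ", "")


def check_property(prop, trials=200, seed=0):
    """Run `prop` on random strings; return the first counterexample or None."""
    rng = random.Random(seed)
    alphabet = "0123456789- +()\n"
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        if not prop(s):
            return s
    return None


def no_separators_left(s):
    out = sanitise_phone_number(s)
    return "-" not in out and " " not in out


def only_digits_left(s):
    # A stronger claim than the function actually promises.
    return all(c in "0123456789" for c in sanitise_phone_number(s))
```

The first property holds for every generated input. The second is quickly refuted by inputs containing characters such as newlines or brackets, which are just the edge cases a hand-picked example would tend to miss.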
Overall I'm pretty much in agreement with Luke, but I feel compelled to play Devil's Advocate with his first example. I tend to think that tiny Python functions like sanitise_phone_number are actually ideal for a TDD approach, especially as expressed in the form of doctests. Forcing yourself to write a simple description of the function ahead of time can inspire better tests or a more robust implementation. These sorts of functions are the low-level building blocks for the rest of the code, so we really don't want them causing any trouble.
This is what my approach to sanitise_phone_number would probably look like.
Then replace 'pass' with code until the tests (which get run every time I hit "save") go green. Notice how writing the description leads to a more general interpretation of the function's role. It's just not a name for a piece of code, sanitise_phone_number has a job.
Another nice thing about doctests for simple functions (as opposed to xUnit-style tests) is how they provide a visual anchor for the function's behavior. When I come back to this code in a month, I can see at a glance what I'm expecting from sanitise_phone_number. I'd argue that this is a sufficient ROI to justify the minute or so spent writing a one-line summary and a couple of examples.
Nitpicking aside, good article.
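As an illustration of the doctest style described in this comment, here is a minimal sketch. It is not the commenter's original snippet: the docstring wording and the example numbers are assumptions. Normally you would run the examples with python -m doctest on the file; the explicit runner below is only there to keep the sketch self-contained.

```python
import doctest


def sanitise_phone_number(s):
    """Return `s` with the separator characters people type removed.

    >>> sanitise_phone_number("01234 567-890")
    '01234567890'
    >>> sanitise_phone_number("0123-456 789")
    '0123456789'
    """
    return s.replace("-", "").replace(" ", "")


# Parse the docstring into a doctest and run it directly.
test = doctest.DocTestParser().get_doctest(
    sanitise_phone_number.__doc__,
    {"sanitise_phone_number": sanitise_phone_number},
    "sanitise_phone_number",
    None,
    0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

The docstring doubles as the 'visual anchor': the summary line and the two examples say what the function is for before you read the body.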