Luke Plant's home page (Posts about Security)

6 digit OTP for Two Factor Auth (2FA) is brute-forceable in 3 days

2019-05-11T11:50:00+01:00

It is common these days to use “TOTP” as an additional factor in 2FA (Two Factor Auth) / multi-factor auth. If you have used Google Authenticator to log in to a site (you can do this with GitHub, for example), then you have used it, and many other apps and sites use the same scheme, and some SMS based 2FA systems may be based on the same concept. TOTP stands for Time-Based One-Time Password, and is specified in RFC 6238. It is based on HOTP, HMAC-Based One-Time Password.

What the RFC for TOTP does not mention at all, and the RFC for HOTP mentions with very little detail, is that the security of these methods depends critically on how you throttle request attempts and/or lock user accounts for repeated failures.

Some systems already have adequate throttling/locking in place, but some certainly don't, and this post is aimed at the latter. Getting the throttling right can be quite tricky too.

(I should mention that this post is not really original. I got the insight from Why You don't Need 2 Factor Authentication; I am just presenting part of that page in a more detailed way, doing the maths for you, and discussing the consequences, without necessarily agreeing with the conclusions of that page.)

To put it simply, with conservative assumptions and common defaults, without account locking (or something similar) an attacker can brute-force a TOTP password in just 3 days. In fact quite a bit faster might be possible.

The attack scenario here is that you have set up 2FA using Google Authenticator (or similar), and an attacker already has your username and password. After getting past the username/password dance they are presented with a screen asking for an OTP.

(If you had set up SMS instead, you will at least get an unexpected text that will alert you that someone has your password, but not with Google Authenticator.)

The whole point of 2FA is that it is supposed to stop an attacker from getting any further. For a high value account, a motivated attacker can and will continue at this point. (And if you don't consider your accounts high value, why are you bothering with 2FA?).

Now the attacker has to try to guess your OTP. How likely is that to succeed? Well, Google Authenticator provides a 6 digit code, giving one million possibilities, and it has a 30 second timeout. Let's assume the attacker can make 10 requests per second. (This is completely reasonable in many scenarios, and significantly higher might be possible). Since we don't have time to try all the possibilities, the chance of success is (30 × 10)/1000000 = 0.0003 = 0.03%, which seems pretty good. Right? Wrong.

We must remember that an attacker does not need to have 100% guarantee of success to attempt something. An attacker will try it if they think they have a 'good' chance of success. Let's assume that is 90%.

Without a timeout, the time to get to 90% chance of success is 0.9 × 1000000 / 10 requests/second = 90000 seconds = 1.04 days.

Now we add a timeout, say an hour, or 90 seconds, or whatever. What happens when the first password times out? According to the TOTP scheme, you can just try the next one. The timeout therefore stops an attacker from being able to try all the possibilities, and rules out a 100% effective attack. But they don't care about that, they just care about having a good chance of success.

Guessing randomly is pretty much our best strategy now. The probabilities look like this:

\begin{equation*} chanceOfSuccess = 1 - chanceOfFailure \end{equation*}

\begin{equation*} chanceOfFailure = {chanceOfFailingOnce}^{numberOfAttempts} \end{equation*}

This last step is a critical part – if you succeed once, you succeed, so you have to fail every time to fail overall. The chance of failing N times is the chance of failing the first time, times the chance of failing the second time.... etc. times the chance of failing the Nth time.

\begin{equation*} chanceOfFailingOnce = 1 - \frac{1}{numberOfPossibilities} \end{equation*}

\begin{equation*} numberOfAttempts = timeInSeconds \times requestsPerSecond \end{equation*}

Re-arranging for \(timeInSeconds\) and substituting:

\begin{equation*} timeInSeconds = \frac{\ln(1 - chanceOfSuccess)}{\ln(1 - \frac{1}{numberOfPossibilities}) \times requestsPerSecond} \end{equation*}

Putting in \(chanceOfSuccess = 0.9\), \(numberOfPossibilities = 1000000\) (6 digit code) and 10 requests per second we get 230258 seconds or 2.67 days. (To check, put that number of seconds back into the formulas and you'll see the probability of success does come out at 90%).
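The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula above; the function name `seconds_to_reach` is made up for illustration, and `math.log1p` is used simply because it is more accurate for values very close to 1.

```python
import math


def seconds_to_reach(chance_of_success, number_of_possibilities, requests_per_second):
    """Time needed for random guessing to reach the given chance of success."""
    # ln(1 - 1/numberOfPossibilities), i.e. ln(chanceOfFailingOnce)
    log_fail_once = math.log1p(-1 / number_of_possibilities)
    number_of_attempts = math.log(1 - chance_of_success) / log_fail_once
    return number_of_attempts / requests_per_second


seconds = seconds_to_reach(0.9, 1_000_000, 10)
days = seconds / (60 * 60 * 24)
print(f"{seconds:.0f} seconds = {days:.2f} days")

# Check: put the time back into the forward formulas.
attempts = seconds * 10
chance = 1 - (1 - 1 / 1_000_000) ** attempts
print(f"chance of success after that many attempts: {chance:.3f}")
```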

Note 1: The timeout does not appear in that formula! Reducing your timeout could make a big difference to usability, but makes zero difference to security. That may be counter-intuitive, but consider:

Reducing the timeout from infinity to a few minutes only increased the attack time from 1 day to 2.67 days (aiming for 90% success rate). Clearly the timeout isn't that critical.
Say you are thinking of a number between 1 and 10,000 and give me one thousand attempts to guess it. To make it harder, you change the number every 100 guesses. To make it harder still, you are thinking of changing it every 50 guesses. Would that work? Well, in the first case I get 100 guesses at 10 different numbers, in the second I get 50 guesses at 20 different numbers, but that makes no (practical) difference – I get the same number of guesses, all of them unlikely whether I guess randomly or in sequence, and I only have to guess correctly once to succeed. Mathematically, it boils down to the fact that \({(x^a)}^b = x^{ab}\).
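The guessing game can be put into numbers. The sketch below assumes purely random guessing (so guesses may repeat) and splits one thousand guesses into windows of 100 or of 50; the function name is hypothetical.

```python
def chance_of_success(possibilities, guesses_per_window, windows):
    """Chance of at least one correct guess when the secret changes every window."""
    fail_once = 1 - 1 / possibilities
    fail_whole_window = fail_once ** guesses_per_window
    # (x^a)^b == x^(a*b): only the total number of guesses matters.
    return 1 - fail_whole_window ** windows


slow_rotation = chance_of_success(10_000, guesses_per_window=100, windows=10)
fast_rotation = chance_of_success(10_000, guesses_per_window=50, windows=20)
print(f"{slow_rotation:.4f} vs {fast_rotation:.4f}")
```

Both come out at roughly a 9.5% chance, because both are one thousand guesses in total.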

Security is often counter-intuitive, and some security policies can often be nothing more than security theatre. Timeouts are a common target for “tightening” measures, because they seem to be easily understandable by the lay-person.

A while back there was a Django ticket filed that asked for the ability to reduce the password reset timeout to less than 1 day, because “In many applications a day is far too long and doesn't meet security requirements”. I explained that due to the way our password reset is implemented (very differently from HOTP/TOTP), changing the timeout makes precisely zero difference to the ability of an attacker to brute force, and with no timeout at all, or throttling, the mechanism is many millions of times stronger than many of the mechanisms that do indeed need timeouts for security. But I don't think anyone believed me, a short timeout seems better.

Note 2: 3 days is not very long, and entirely feasible for many attackers. If you don't have mitigation measures in place, your 2FA is broken.

In fact, TOTP also has a tolerance factor to allow for delay in transmission, that allows n previous tokens to be used, with a recommended default of n == 1. This effectively doubles your request rate (you are guessing two numbers at once, either will count), reducing the time required to less than 36 hours.

If you go for even odds (50%) rather than 90%, it comes down even further. Using an out-of-the-box installation of django-two-factor-auth, which builds on django-otp, on my development machine I was able to get 20 requests/sec for the 2FA handler without trying hard. I set up a Google Authenticator device for an account and achieved a brute-force in under 8 hours. An attacker could start after you went to bed and might be done by the time you were out of the shower.
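To see how the tolerance window and the target odds change the picture, here is a rough sketch. It models the tolerance by treating two codes as valid for each guess (the `valid_codes` parameter is an assumption for illustration, as is the function name); the request rates are the ones quoted above.

```python
import math


def seconds_to_reach(chance_of_success, possibilities, requests_per_second, valid_codes=1):
    """Time for random guessing to reach the target chance of success."""
    log_fail_once = math.log1p(-valid_codes / possibilities)
    attempts = math.log(1 - chance_of_success) / log_fail_once
    return attempts / requests_per_second


# 90% chance at 10 requests/sec, current or previous token accepted
hours_90 = seconds_to_reach(0.9, 1_000_000, 10, valid_codes=2) / 3600
# even odds at 20 requests/sec, current or previous token accepted
hours_50 = seconds_to_reach(0.5, 1_000_000, 20, valid_codes=2) / 3600
print(f"90% at 10 req/s: {hours_90:.1f} hours")
print(f"50% at 20 req/s: {hours_50:.1f} hours")
```

These are expected figures for the stated odds, not a prediction of how long any one attack will take.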

One mitigation technique is attempt throttling. However, this has got to be done carefully. It clearly can't be done globally for the 2FA handlers or you could easily DOS yourself. It has to be done carefully on a per-user basis, which takes up storage and would be easy to get wrong. Doing it by IP address also doesn't work well, because attackers can easily hire a large number of IP addresses these days. Even if we reduced the number of attempts per second by a factor of 10, that attack time would only go up to 30 days, which would still be worthwhile for some attackers.

An alternative is account locking after a number of failures, which is much better. However it also brings problems. It means that your 2FA must only be accessible for people who already have passed one level of security, otherwise you have a denial of service vulnerability. Plus you need all the account unlocking procedures etc, and need to make sure they are secure, and not actually effectively another attack route.

Another option is to use some kind of back-off for failure attempts, which is what the HOTP RFC recommends briefly. For example you could use exponential back-off – you add enforced delays for attempting a token check, requiring the user to wait 1, 2, 4, 8, 16s etc. after each successive failure. This has a number of advantages: it doesn't slow down a legitimate user who just mis-typed once or twice; it requires very little storage; it is highly effective in terms of throttling, to the point of being a kind of account locking [1]; accompanied with appropriate error messages (“You've been temporarily blocked due to X successive failures”) it could alert the real account owner to the presence of an attacker; the soft “account locking” automatically expires, rather than requiring manual intervention of any kind, so that you don't get DOS for the case of unresponsive support staff.
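A minimal sketch of exponential back-off, to show how little state it needs: a failure count and the time of the last failure, per user or device. This is not the django-otp implementation; the class and method names are invented for illustration, and a real system would persist the two fields in the database.

```python
import hmac
import time


class BackoffThrottle:
    """Per-user back-off state: a failure count and a timestamp."""

    def __init__(self, base_delay=1.0, clock=time.monotonic):
        self.base_delay = base_delay
        self.clock = clock  # injectable so the behaviour can be tested
        self.failures = 0
        self.last_failure = None

    def seconds_until_allowed(self):
        if self.failures == 0:
            return 0.0
        # 1, 2, 4, 8, ... seconds after each successive failure
        delay = self.base_delay * 2 ** (self.failures - 1)
        elapsed = self.clock() - self.last_failure
        return max(0.0, delay - elapsed)

    def verify(self, submitted, expected):
        """Return True/False for a checked token, or None if still throttled."""
        if self.seconds_until_allowed() > 0:
            return None  # refused without checking the token at all
        if hmac.compare_digest(submitted, expected):
            self.failures = 0  # a success resets the back-off
            return True
        self.failures += 1
        self.last_failure = self.clock()
        return False


# Demonstration with a fake clock
now = [0.0]
throttle = BackoffThrottle(clock=lambda: now[0])
first = throttle.verify("000000", "123456")    # wrong token
blocked = throttle.verify("123456", "123456")  # right token, but too soon
now[0] = 1.0
accepted = throttle.verify("123456", "123456")
print(first, blocked, accepted)
```

Note that a throttled request is rejected even when the token is correct, otherwise the attacker could ignore the delays.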

(For the curious/worried regarding django-two-factor-auth/django-otp, a few weeks ago I implemented exactly this for the HOTP and TOTP backends in django-otp, and the fix is available in version 0.6.0).

SMS-based 2FA may or may not be better than TOTP, depending on how they are implemented, throttled etc. – some SMS systems just use TOTP and use an SMS message to send the current token, in which case they are equivalent.

SMS does have the advantage that at least the genuine account holder will probably realise that something is going on (although that isn't how the security of 2FA is supposed to work primarily). However, SMS does also have a ton of problems, so maybe it's not any better overall.

Increasing the length of the token does help, as does increasing the alphabet of characters used (although apparently that may have usability issues on phones). Every factor of 10 in the number of possibilities for the token results in a factor of 10 in the time required to brute force. But some apps (e.g. Google Authenticator) only support 6 digit tokens.

Given the move towards 2FA, the disappointing thing is how little info there is about this. I found this stackexchange question complete with some misguided answers, and some good advice, but little by way of rigorous best practice.

The RFCs have plenty of details about the crypto and algorithms, possibly because those are pretty easy to define and implement in a small chunk of code, but little on the other security critical requirements which are much harder to pin down. This is a security problem in itself – programmers are attracted to something like TOTP because it looks like a properly thought out, defined solution, and the core of it is a nice programming exercise. You can get it ‘working’ relatively easily – but without the critical and more fiddly parts that need to be in place.

[Update 2019-06-01 - added footnote calculation showing effective rate for exponential backoff throttling]

[1]

Exponential increases very rapidly. The limiting factor for an attacker is how often the throttling can get reset by a successful attempt. Let's assume everything is in the attacker's favour:

The legitimate user does a successful 2FA login every day, at a predictable time.
The attacker times their guesses so that the backoff time expires just before this point, so the legitimate user doesn't see an error message and just enters the 2FA code successfully, resetting the backoff for the attacker.

Since we are starting with a 1 second delay and doubling for each failure, this gives the attacker approx \(log_2(secondsInADay)\) \(= log_2(60 \times 60 \times 24)\) \(\approx 16\) attempts per day.

This is an effective requests per second of \(0.000185\). At this rate, using our formula above (and accounting for the TOTP tolerance factor which doubles your effective rate, as above), it would require 59 years to get to even odds of achieving a brute force on a 6 digit token, which is probably OK.

By contrast, if we go with straight “N requests per second” type rate limiting, for the sake of usability for legitimate users you probably wouldn't want to throttle more aggressively than 1 request every 10 seconds per user i.e. 0.1 requests per second. In this scenario it takes just 40 days to get to even odds of a brute force, which is certainly realistic for some attackers.
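The footnote figures can be reproduced as follows. This is just the same formula again, with the number of attempts per day rounded down to 16 as in the text; the function name and the `tolerance_factor` parameter are illustrative.

```python
import math

SECONDS_PER_DAY = 60 * 60 * 24

# Delays of 1, 2, 4, ... seconds: about log2(seconds in a day) attempts fit in a day
attempts_per_day = math.floor(math.log2(SECONDS_PER_DAY))
backoff_rate = attempts_per_day / SECONDS_PER_DAY  # effective requests per second


def seconds_to_even_odds(requests_per_second, possibilities=1_000_000, tolerance_factor=2):
    attempts = math.log(0.5) / math.log1p(-1 / possibilities)
    return attempts / (requests_per_second * tolerance_factor)


years_with_backoff = seconds_to_even_odds(backoff_rate) / (SECONDS_PER_DAY * 365.25)
days_with_rate_limit = seconds_to_even_odds(0.1) / SECONDS_PER_DAY
print(f"back-off: {backoff_rate:.6f} req/s, {years_with_backoff:.0f} years")
print(f"fixed 0.1 req/s limit: {days_with_rate_limit:.0f} days")
```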

A simple password-less, email-only login system

2016-07-17T16:52:10+01:00

This post is about a simple password-less login system I created for one web site which can be useful in some use cases. I’ll describe the basic process, the rationale, and the advantages and disadvantages of the system. Then I’ll outline some implementation considerations, and link to my source code which implements it.

Outline

The authentication system is simply this:

To log in, a user enters their email address. The web site sends them an email containing a unique link which will directly log them in to the site. There is no option to use a password. If they have used the site in the past with the same email address, upon logging in they will be using the same account as before, otherwise a new account will be created. Every time the user wants to log in, they must go through the same process (so usually you will make the login session last a significant period of time). For this reason the method is particularly suited to sites where people do not log in very often.

This is not a new idea, although I don't think I was conscious of other implementations when I created mine. In this post I'm presenting my rationale for it, listing some advantages that I haven't seen elsewhere, and other implementation pointers.

Rationale

Many systems really need a working email address, because you need to be able to contact users. In this case you have to do some kind of email verification step at some point anyway (some systems do it at the beginning of the process, others try to fit it in somewhere else and nag users until they have done it). If you fail to have email verification, then people can easily get locked out of the site because password reset usually relies on sending an email, and you don’t have contact details when you need them.

With this system, email verification and login are combined.
In terms of security from the user’s point of view, no-one can hack their account by guessing their password, because they don’t have a password. They can hack it only by gaining access to their email, but given the password reset mechanism most sites have, this is no different to normal. We’ve simply eliminated one source of getting hacked.
For the site implementation, not having a password to store is even better — there is no way you can mess up password hashing and storage, no possibility of a password database being stolen, because you simply do not have passwords.
Not having a password to enter the first time reduces friction for most users.
In terms of user experience when coming back to a site, many people end up doing something similar to the above process anyway, because they forget their passwords. This is especially true for sites that people are not going to use very often – for example, a booking process for a conference that might happen once a year. In this case, people either:
1. choose weak passwords that they can remember easily, which is bad for security,
2. re-use a password so they can remember it easily, again bad for security, or,
3. forget their password.
However, with a password reset, the process is much, much worse:
1. First they have to remember if they signed up for the site in the past, to work out if they should “log in” or “create account”.
2. Then they have to make several attempts at remembering their password.
3. Then they’ve got to use the password reset feature (hopefully it isn’t hidden, but I’ve seen users struggle with this when literally the only things on the page were the login form and a “Forgotten your password?” link).
4. They then have to check their email and click the link.
5. Now they have to negotiate a new password form, possibly including a strength monitor that won’t allow them to choose a weak password.
6. Having finally set a new password, they now have to navigate to the login form again (because sites very rarely integrate password reset with log in, also usually for some good reasons), and re-type their email address (often for the third time by now), and their password (again, typically for the third time, not including all the failed password attempts).
By removing the password entirely, most of these steps are eliminated. Steps 1, 2 and 3 are replaced by a single method for logging in – “Enter your email address”. Step 4 is the same, steps 5 and 6 are eliminated.

There are some additional advantages:

By doing email verification every time, we ensure that we still have a working email address. If we use some email/username + password combination for login, we have to add some kind of regular “Is this still your email address?” feature, or find ourselves unable to contact our users.
For any prompting or promotional emails that we send to a user, we can log them in straight away using this mechanism. As already discussed, this is not a reduction in security in the typical case. If we implement the system using a query string parameter containing a token and a generic middleware that checks the token, we can use this system on any page on the site with no extra work.

So, for example, if we send an email asking for payment, the link can take them straight to the payment page, already logged in. This is the ideal situation, and we can do it with the tiniest amount of work (adding a query parameter to a URL in an email), because we can just re-use the existing login mechanism.
There are significant improvements for privacy concerns.

A typical email + password login system has some problems when it comes to privacy, because it is often possible for an attacker to determine that a certain person has an account with a web site. This can often be done from several pages on the site:
1. The account creation form
2. The log in form
3. The password reset form
And it can be done in a number of ways:
1. By looking at the different error/validation messages that are returned by these pages, for the cases of existing or non-existing accounts.
2. Even if the messages returned are identical, by doing timing attacks on the pages.
Fixing method 1 often results in UX problems – e.g. if a user doesn’t have an account and is trying to log in, we can no longer tell a user that they don’t have an account and need to create one, we can only tell them their email/password combination is incorrect, and leave them to struggle. Similarly with password reset. Our user encountering scenario 5 above now feels like this:

Fixing method 2 can be very hard. The use of strong password hashing makes a timing attack on the login page trivial if no precautions are taken. Django, for instance, was vulnerable to this for a long while. It now has rudimentary mitigation, which fixes trivial attacks, but a complete fix is very hard. Making the code paths for “yes we found a user record” and “no we didn’t” take exactly the same amount of time would be very hard, and an attacker who was in the same data centre as your server (where network transit noise is much reduced) would probably not have a hard time doing a timing attack on the current code.

However, with the system described in this post, these attacks, and the UX problems, are all completely mitigated. We send the verification email whether there is already an account or not, with exactly the same message (which doesn’t confuse the user), without looking up the account in the database first. We can check whether we need to create a new account or retrieve the old one when the email has been verified, so there is no timing attack possible on this part of the code.
On a code level, the amount of code required for this is very small. Compared to the typical alternative (email/username+password, all the forms to manage passwords, password reset etc.), it is tiny. That alone gives big maintenance and security advantages.

Disadvantages

There are of course some disadvantages:

Not all users have secure email systems, and emails could be intercepted, allowing an attacker to use someone else’s account. As noted already, you are already living with the same issue if you have a password-reset-via-email-link feature.
Users have to go through the “check your email” cycle every time they log in. For the kind of site that people are using daily, and if the login session is configured to expire relatively quickly, this will be annoying. But for use cases where users don’t visit the site often (e.g. occasional conference booking), this won’t be a problem.
If someone’s email address changes, this system has more problems, because in essence it uses an email address as the primary key for the account. To deal with this, you would need to store some other personal info or communication mechanism that could be used to verify the person is the same person, and then have some automatic or manual process for merging accounts etc.

Alternatively, you can live with the fact that if their email address changes, they no longer have access to their old account. For the site I built (a booking system for yearly summer camps), this has not been a problem – it just means that people don’t have the shortcut of being able to re-use information from previous years.

Implementation issues

There are some implementation issues to be aware of, especially security related:

1. You need a correct and secure way of creating the unique login links. They need to contain some kind of token that verifies an email address, a token which cannot be guessed by an attacker.
2. The login links should expire – so that a temporary breach of someone’s email account, or accidentally sharing the link, doesn’t give an attacker login access forever.
3. When comparing the token, you need to be aware of timing attacks.
4. Security tokens in URLs are a dangerous thing, as they can easily be pinched. It can happen when a user copy-pastes or shares a URL, and it can happen if the page the link points to has any third party resources, which will then be able to see the URL (and the token) via the Referer header.

Because of this, the token should be checked before any page is rendered, and you should redirect immediately, either to a failure page if it doesn’t match, or to a URL without the token. If you use a query string parameter for the token, this is easy – for the success case you just return an HTTP redirect response to the same URL but without the token query parameter (and with a login cookie attached to the response).
5. You should do case insensitive comparison on email address when looking for an existing account – people don’t always type their email addresses with the same case.
6. As with all login mechanisms, you should give attention to providing a “log this account out on every device that is logged in to it” feature. This is often linked to a password change mechanism (which we don't have). It can require additional work to ensure that we are removing a session, not just removing the session cookie from the browser.

In my implementation, I use Django’s TimestampSigner to sign the email address. This takes care of 1 (Django uses a HMAC on the string), 2 (you just pass the max_age parameter to unsign) and 3 (Django’s Signer uses a constant_time_compare function internally).

I then base64 encode the result to produce a tidy URL. This results in a longish URL, but not too long to be impractical. I created a small class to wrap up the encoding and decoding.
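For readers not using Django, the same idea can be sketched with only the standard library. This is a stand-in for illustration, not how TimestampSigner is implemented: the token format, the `SECRET_KEY` value and the function names are all assumptions. It signs the email address plus a timestamp with an HMAC, base64 encodes the result, and checks signature and age on the way back using a constant-time comparison.

```python
import base64
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical; keep out of source control


def make_token(email, now=None):
    timestamp = str(int(time.time() if now is None else now))
    payload = f"{email}:{timestamp}".encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    return base64.urlsafe_b64encode(payload + b":" + signature).decode()


def read_token(token, max_age, now=None):
    """Return the verified email address, or None if invalid or expired."""
    try:
        decoded = base64.urlsafe_b64decode(token.encode())
        payload, signature = decoded.rsplit(b":", 1)
        email, timestamp = payload.decode().rsplit(":", 1)
        issued_at = int(timestamp)
    except ValueError:  # also covers bad base64 and bad UTF-8
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    current = time.time() if now is None else now
    if current - issued_at > max_age:
        return None
    return email


token = make_token("someone@example.com", now=1000)
print(read_token(token, max_age=3600, now=2000))  # within max_age
print(read_token(token, max_age=3600, now=5000))  # expired
```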

(An alternative would be to create a nonce and store it in a database on the server, associated with the email address. The implementation above has the advantage that it doesn't require server side resources, but the disadvantage of requiring a longer URL).

I do the checking in a middleware, including the redirect to handle item 4 above. I currently use Django’s signed cookies for implementing login. If a server side session was used, then it would be easier to implement “log out from all devices”.
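The URL handling on both ends is small. The sketch below is framework-neutral and uses only the standard library; the parameter name `login_token` and the function names are hypothetical. One function adds the token to any link that goes into an email, the other is what a middleware would use to pull the token out and compute the clean URL to redirect to.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TOKEN_PARAM = "login_token"  # hypothetical query parameter name


def add_token(url, token):
    """Return the URL with the login token added as a query parameter."""
    parts = urlsplit(url)
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != TOKEN_PARAM]
    pairs.append((TOKEN_PARAM, token))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def split_token(url):
    """Return (token or None, same URL with the token parameter removed)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    tokens = [v for k, v in pairs if k == TOKEN_PARAM]
    remaining = [(k, v) for k, v in pairs if k != TOKEN_PARAM]
    clean_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    return (tokens[0] if tokens else None), clean_url


email_link = add_token("https://example.com/pay/?invoice=42", "abc123")
token, redirect_to = split_token(email_link)
print(email_link)
print(token, redirect_to)
```

In the middleware, a valid token would lead to a redirect to `redirect_to` with the login cookie set, so the token never stays in the address bar or reaches a Referer header.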

I’m using a custom model for this account, which does not have a password field, and I’m also using a normal User model for other purposes, so it doesn’t make sense for me to release this as a standalone Django authentication library. But feel free to take the code and do so, or borrow in any other way.

There are other variations on this that could be used, but I think the basic pattern is very useful for some use cases, eliminating a lot of the user hassles and programmer headaches often found with passwords.

This is also not meant to be an alternative to things like OAuth. It is meant to be an alternative to email+password logins. If OAuth is used as well (should you venture down that somewhat dubious path), then enhancements are possible – for instance, for people who create accounts via OAuth, there could be the option to disable login by email link. This would mitigate the risk of account takeover due to people with insecure email providers.

Updates:

2016-07-18: Note about alternatives like OAuth2
2016-07-18: Note about implementing “log out from all devices”
2016-07-19: Note about security of email services.
2016-07-19: Paragraph about prior art

(Thanks to reddit comments for the prompts that pointed some of these issues out).

Why escape-on-input is a bad idea

2012-08-06T20:59:01+01:00

The right way to handle issues with untrusted data is:

Filter on input, escape on output

This means that you validate or limit data that comes in (filter), but only transform (escape or encode) it at the point you are sending it as output to another system that requires the encoding. It has been standard best practice since just about forever [citation required].

An alternative is “escape on input”: at the point that data enters your system, you apply a transformation to it to avoid a problem further down the line when the data is used.

It's come to my attention that some serious web developers (or at least, they take themselves seriously and are taken seriously by others) are still suggesting the practice of escape-on-input.

For example, with escape-on-input, to avoid XSS any data that enters your system has HTML escaping applied to it immediately, before your application code touches it.

I chose that example deliberately, because people are actually recommending it:

in some recent “PHP sucks” debate.
which, in turn, linked to a page by Rasmus Lerdorf recommending escape-on-input as a sensible way to deal with XSS. The page, admittedly, is describing a ‘toy’, a ‘no-framework PHP framework’, yet he does seem to be serious about the usefulness of escape-on-input.

The page is from 2006, and uses the pecl/filter extension, but the extension has since made it into core (PHP 5.2), and the docs for it suggest a configuration that is clearly intended for XSS prevention. As recently as 2008, and probably to this day, Lerdorf is still defending and recommending this approach, and it appears to be part of his reason for thinking that PHP templating doesn't need an autoescape mechanism.
Just as significantly, Etsy are using and recommending escape-on-input (slide 18 onward). As a very successful modern company using PHP, people will look up to them and copy them.

So, this approach, unfortunately, is popular amongst some, and I can't find a decent post explaining why it's such a terrible idea both in theory and practice. Here is my attempt. It should be applicable to almost any system and any language, although I'll mainly be using examples from web development.

In theory

First of all, escape-on-input is just wrong – you've taken some input and applied some transformation that is totally irrelevant to that data. If, taking our example, you have some data collected by HTTP POST or GET parameters, applying HTML escaping to it is a layering violation – it mixes an output formatting concern into input handling. Layering violations make your code much harder to understand and maintain, because you have to take into account other layers instead of letting each component and layer do its own job.

Doing things ‘right’ is very important, even if doing them ‘wrong’ seems to work and you are tempted to be dismissive of ‘theoretical’ concerns about purity etc. When you have to maintain code, you will be very glad if things are in the right place, and not full of hacks and surprises.
You have corrupted your data by default. The system (or the most convenient API) is now lying about what data has come in. As you have applied a transformation to the data itself, the layering violation is not an isolated problem in one part of the code, but infects every part of your code, especially if you store the corrupted data in a database.

Your data is everything. As I read recently, “data matures like wine, applications like fish”. You can always rewrite your application, but if you corrupt your data, you've done the worst thing you can to your system.
This is exacerbated by the fact that many encodings are one-way – you cannot losslessly or unambiguously convert them back. If at a later point you need the original data, you might be in a pickle.
Escaping your data for one output backend will only deal with that output. A typical web app might deal with at least the following backends, which have different characters that are dangerous, and have different requirements for dealing with them:
- HTML: ' < > " &
- URLs: / : & ? # text starting 'javascript:'
- Javascript: " '
- SMTP and HTTP: ; : newlines
- SQL: '
- JSON: "
- Shell: space, quotes and various other characters
Any number of others could be added, and all could have security implications. Using escape-on-input will only fix one of these - apart from happy coincidences where it might fix more than one. Security should not rely on happy coincidences, and for the other outputs you will still need a sensible solution to the problem. Why not have a sensible solution for all of them?
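A quick illustration using Python's standard library, with an arbitrary sample string: the same input needs a different transformation for each backend, so no single transformation applied on input can be right for all of them.

```python
import html
import json
import shlex
from urllib.parse import quote

raw = 'Jack & "Jill"'

for_html = html.escape(raw)    # for HTML element or attribute text
for_url = quote(raw, safe="")  # for a URL path segment or query value
for_json = json.dumps(raw)     # for a JSON string
for_shell = shlex.quote(raw)   # for a POSIX shell argument

for name, value in [("HTML", for_html), ("URL", for_url),
                    ("JSON", for_json), ("Shell", for_shell)]:
    print(f"{name:6} {value}")
```

SQL is left out because the sensible solution there is parameterised queries rather than string escaping.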
Escaping for one output may not deal with even that single output correctly, because escaping can be context dependent.

Various outputs can be embedded in others, and they have different escaping rules. So, you can embed URLs in HTML. And URLs in CSS. And CSS in HTML. And Javascript in URLs. And Javascript in HTML...

If you prepared something for HTML, did you prepare it for HTML element body context, or HTML attribute context, or URLs in HTML attributes, or CSS in HTML? Or URLs in CSS in HTML? If someone passes in a value for a URL which is then used in an href attribute in HTML, HTML escaping of < > & ; " ' won't adequately protect you from XSS. Interactions between CSS/Javascript parsers and HTML parsers make things even more complex. So “escape at the beginning and then forget about it” does not work even for the single output of ‘HTML’, because it is not a single output.
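The href case is easy to demonstrate. In this sketch (Python's `html.escape` standing in for whatever HTML escaping is applied on input, and a made-up malicious value), the escaping changes nothing because the value contains none of the characters it looks for:

```python
import html

user_supplied_url = "javascript:alert(document.cookie)"

escaped = html.escape(user_supplied_url)
link = f'<a href="{escaped}">my homepage</a>'

print(escaped == user_supplied_url)  # the escaping was a no-op
print(link)
```

The resulting link is well-formed HTML and still runs script when clicked; what is needed here is validation of the URL scheme, which is a different check from HTML escaping.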
Escaping on input will not only fail to deal with the problems of more than one output, it will actually make your data incorrect for many outputs.

Suppose you decide to do HTML escaping, and someone enters Jack & Jill as a title for something. Your escape-on-input turns this to Jack &amp; Jill and that goes in the DB. Suppose you want to email people and put this title in the subject line. You now have to apply the reverse transformation to get a sensible subject line in the email, and you have to remember to do this for every output that is not HTML.

Sometimes, the bug is significantly more annoying than an email with an incorrect title:

One ruined sweatshirt, however, is tame compared to the hassle many people suffer due to having a name that a computer won’t accept or mangles. Looking through that article, it’s clear that often the software is escaping on input, resulting in escaped versions being stored in the database (e.g. a woman with an apostrophe in her name is recorded as “Leah D&andrea”), which then causes no end of problems.

You also have daft bugs like the fact that doing a search on that field for the string ‘amp’ (or ‘quot’, ‘apos’, ‘lt’, ‘gt’ etc. or any substrings) will get various false matches.
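A small sketch of the false-match problem, using Python's standard `html.escape` as a stand-in for whatever escape-on-input function a framework might apply. The list of titles is made up for the example.

```python
import html

# Titles as users typed them (invented sample data).
raw_titles = ["Jack & Jill", "Lamp repair", "Plain title"]

# What an escape-on-input system would store in the database instead.
stored_titles = [html.escape(t) for t in raw_titles]
print(stored_titles[0])  # Jack &amp; Jill

# A naive substring search for "amp" against each version of the data.
matches_raw = [t for t in raw_titles if "amp" in t]
matches_stored = [t for t in stored_titles if "amp" in t]

print(matches_raw)     # only the title that really contains "amp"
print(matches_stored)  # also matches the escaped ampersand
```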

I have seen some people respond to this by saying “it's better to have the occasional double-encoding bug or incorrect query result than an XSS exploit”. Well, first, that depends on your business. XSS is a problem because it costs time and money, and so does corrupting your data. Many people have data that actually matters, and corrupt data is a big deal, and much harder to cope with than an XSS bug, because data lives on and on. If we took just the example above of storing people’s names incorrectly, the grief caused by escape-on-input is massive.

Second, this decision affects frameworks that are used to handle data of all kinds, and the decision affects the entire code base of your application and beyond, as described below. Data-handling frameworks that work on the assumption that your data is not important are insanity. If the foundations be destroyed, what can the righteous do?

Third, it's entirely unnecessary. XSS is not hard to fix given decent programming tools.
At what point does data ‘enter’ your system?

It might sound like a simple question, but it's tricky in reality, and I'll illustrate using an HTTP request.

In most web apps, the GET and POST parameters are your ‘raw input’. However, using most normal web framework APIs, data in GET and POST parameters has already been interpreted. The ‘raw’ data is really the bytes that make up the HTTP request, which typically will use URL encoding for GET query parameters and a choice of encodings for POST data (URL encoding or MIME multipart attachment format).

The framework may also do another level of decoding – interpreting the series of bytes as a series of unicode code points.

Both parts of this initial transformation make sense and are appropriate, because they are reversing the encoding already applied to the data by the protocol involved. The web browser takes the data you type in – unicode code points – and applies a series of transformations to it, according to the HTTP protocol, and your web framework reverses these to get the data back.
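The two decoding layers can be sketched with the Python standard library. This is an illustration of the idea rather than what any particular framework does internally, and the sample value and the UTF-8 charset are assumptions.

```python
from urllib.parse import unquote_to_bytes

# A URL-encoded query value as it appears on the wire (assumed sample).
wire_value = b"caf%C3%A9%20%26%20bar"

# Layer 1: undo the URL encoding applied by the protocol, giving bytes.
decoded_bytes = unquote_to_bytes(wire_value)

# Layer 2: interpret those bytes as unicode code points (UTF-8 assumed).
text = decoded_bytes.decode("utf-8")

print(decoded_bytes)
print(text)
```

Both steps only reverse what the browser did, so `text` is the data the user actually typed.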

Now, if you want to avoid XSS problems, you have to apply the escaping after this initial decoding has been done. But this highlights another possibility. What if the data requires further decoding before you get the ‘real’ raw data? For example, some data might be sent base64 encoded for a variety of reasons, or any other type of encoding.

This extra level of encoding gives two problems:
- your automatic HTML escaping may have corrupted the encoded data so that it now cannot be decoded. For example, you had a GET parameter that held a URL, which itself had parameters in the query string:
```
GET /foo?bar=1&url=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26y%3D2 HTTP/1.1
```
  Your framework's HTTP handling will produce a query dictionary that looks something like the following:
```
{"bar": 1,
 "url": "http://example.com/?x=1&y=2"
 }
```
  But your automatic escaping turns that into:
```
{"bar": 1,
 "url": "http://example.com/?x=1&amp;y=2"
 }
```
  If you want to extract the y parameter from url, you are stuck. You can't correctly interpret the data in the url parameter, because it has been corrupted. You're going to have to unescape the input, and you might not even notice this problem.
  
  A better example might be handling the ‘Referer’ header. (Which you have presumably applied the same HTML encoding to, right? If you did, you have this problem; if you didn't, you have to remember to do it manually, which is a potential XSS vulnerability.)
- Even if the data comes through your automatic escaping unscathed (e.g. base64 under HTML escaping), or you can undo the corruption and get it properly decoded, after decoding you will have to manually apply HTML escaping to make it match all the other automatically escaped data. If you don't, you've potentially got a bug and an XSS exploit.
  
  So your automatic escape-on-input has missed data, and this happens because you can't really define the point at which the data has ‘entered’ your system and needs the escaping applied.
This problem means that the escape-on-input approach is inherently flawed and cannot be fixed. You just have to patch it up on a case-by-case basis, which is exactly what escape-on-input is supposed to avoid.
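The nested URL example above can be reproduced with the standard library. In this sketch `html.escape` stands in for the automatic escape-on-input step, and the behaviour shown assumes a recent Python 3 in which `parse_qs` splits on `&` only.

```python
import html
from urllib.parse import parse_qs, urlsplit

query_string = "bar=1&url=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26y%3D2"

# What the framework's HTTP handling gives you: decoded parameter values.
params = {k: v[0] for k, v in parse_qs(query_string).items()}

# What an automatic escape-on-input layer would hand you instead.
escaped_params = {k: html.escape(v) for k, v in params.items()}

# Parsing the inner URL works on the real data...
inner_ok = parse_qs(urlsplit(params["url"]).query)

# ...but not on the escaped data, where the y parameter has been mangled.
inner_bad = parse_qs(urlsplit(escaped_params["url"]).query)

print(inner_ok)
print(inner_bad)
```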

And then, what about other sources of data – data on the file system, in a cache etc.? Are these entry points? Well, it depends on how the data was put there. You have to manually follow this all the way through your app; get it wrong and you've got double escaping bugs or security flaws.

(By contrast, escape on output always works, because you apply it at the point where you know it is needed – in the backend that knows the escaping rules.)
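As a sketch of what escape-on-output looks like in practice, the following keeps one unmodified value and applies a different standard library encoder at each output. The sample string is made up, and each call covers just one context of its output (for example, `html.escape` is suitable for element bodies and quoted attributes, not for every HTML context discussed earlier).

```python
import html
import json
import shlex
from urllib.parse import quote

# The data is stored exactly as the user entered it.
raw = "Jack & Jill's <pail>"

# Each backend applies its own rules at the point of output.
as_html = html.escape(raw)              # HTML element body / quoted attribute
as_url_component = quote(raw, safe="")  # a single URL path or query component
as_json = json.dumps(raw)               # a JSON string literal
as_shell_arg = shlex.quote(raw)         # one POSIX shell argument

for rendered in (as_html, as_url_component, as_json, as_shell_arg):
    print(rendered)
```

Because `raw` is never modified, adding a new output later means adding one more encoder, not migrating the stored data.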
Other systems putting data into your database, or getting data out, have to abide by your data transformation rules.

These systems might have nothing to do with your primary domain (e.g. a web site). Making them understand and obey rules that have nothing to do with the data itself is insanity and extremely short sighted.

You can't deal with this problem when you come to it, because you don't have to just fix your code, you've got to fix all your data too, and by the time you cross this bridge you might have a lot of data and might need a very delicate database migration to get it right. The data may even have escaped your control (e.g. been copied into other systems), or backwards compatibility concerns might stop you from making the change you need to make.
Within your main application, the decision to escape on input affects your whole code base.

If you want to use any libraries, you need to make sure that they are using all the same assumptions that you have in your main code base.

For example, if you've got a form/widget library in your web app, it will very often need to echo user input back to them in the case of a form that has validation errors. This library has to know if you already escaped the input.

Writing the library to work in two modes is asking for trouble. Rather, you need it to have been written from the beginning to assume the same escaping rules.

This kills code re-use – you can only use code that assumes the same input escaping – or it means that you will end up with tons of bugs due to incompatibilities between the assumptions made in your application code and the library.

Essentially, this is the problem of a global configuration setting, but worse since it affects the operand of your entire application (the data going through it), not just the functionality of various operators.

Another example might be a cron job that sends out emails, using data from the database. If the data comes from a web form that applied "escape on input" to avoid XSS, then the code will need to apply HTML unescaping - despite the fact that this script has absolutely nothing to do with the web (it reads a database and sends plain text emails).
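A sketch of what that cron job ends up looking like, assuming the title was stored HTML-escaped. The `build_notification` function, the addresses and the message text are all hypothetical. The `html.unescape` calls are the part that has nothing to do with sending plain text email.

```python
import html
from email.message import EmailMessage


def build_notification(stored_title, recipient):
    """Build a plain text email from a title that was HTML-escaped on input."""
    # The script must know about, and undo, a web-only transformation.
    title = html.unescape(stored_title)
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = title
    msg.set_content("A new item was posted: " + title)
    return msg


msg = build_notification("Jack &amp; Jill", "reader@example.com")
print(msg["Subject"])
```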

Effectively, this means that the XSS solution, far from being a solution applied at a single point, is in fact spread out over the entire code base, as it includes every time that pre-escaped data has to be un-escaped.
The confusion caused by the above is likely to increase security problems. “Keep It Simple, Stupid” remains a very good maxim for developers.

To continue an example used above: you want to send an email with some data that has already been HTML escaped, and so you need to unescape the data to avoid emails with the subject “Jack &amp;amp; Jill” when the user entered “Jack & Jill”. You decide it's not sensible for the mail sending functions to do this internally, (or maybe they're provided by a third party who made that decision for you), so the calling code does the unescaping.

You later decide to switch to HTML emails, and the developer who implements it thinks that since data is already escaped, there is no problem including it without extra escaping in the body of the HTML email, leading to a vulnerability (not classic XSS in this case, but still a problem).

There is also the example I gave above where an extra layer of encoding/decoding in the raw data makes it likely you'll forget to apply the escaping.

The confusion caused by escape-on-input means your entire code base becomes a potential source not only of double-escaping bugs but of security problems as well.

In practice

Thankfully, we don't just have to rely on the above analysis to conclude that escape-on-input is a terrible idea. PHP, always willing to help when it comes to “examples of how not to do it”, provides us with a perfect case study.

Magic quotes

PHP used to have a feature called magic quotes. It was an escape-on-input feature that escaped single quotes (') with backslashes. This was to protect you from SQL injection attacks, by making the data safe for interpolation into a SQL query.

This caused all kinds of problems.

First, if you are not first passing something through a database, and using string interpolation to build up SQL queries, you have to remember to strip those slashes using the function stripslashes().

If you don't, you get double encoding. It looks like \'this\', you\'ve almost certainly seen it across the web, though it seems we\'re thankfully past the worst of it.

Second, even if you remember, you've added some hideous cruft to your code. In the bit of code which is handling form validation (and is therefore echoing user input back to the user without the database being involved), you've got these bizarre stripslashes() calls. What on earth does ‘reverse transforming a string for SQL statement preparation’ have to do with the task of input validation?

Third, it turns out that different databases need different escaping mechanisms to do things fully correctly. So you now have to do stripslashes() on data even if you are passing it to a database using string-interpolated queries!

Then, since the above problems are common (building up SQL queries by string interpolation was always a bad idea, and very often you pass on the data to outputs that don't want SQL escaping at all), it's desirable to have a way to turn this behaviour off completely.
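To make the contrast concrete, here is a Python sketch. The `magic_quote` function is a simplified, hypothetical stand-in for what magic quotes did to incoming data (it is not PHP's actual implementation), and the parameterised query uses the standard `sqlite3` module to show the alternative to string interpolation.

```python
import sqlite3


def magic_quote(value):
    """Simplified stand-in for magic quotes: backslash-escape single quotes."""
    return value.replace("'", "\\'")


name = "O'Reilly"

# Escape-on-input: every consumer now sees the mangled value.
quoted = magic_quote(name)
print(quoted)

# Parameterised query: the driver handles quoting, the data stays intact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM people").fetchone()[0]
print(stored)
```

With the placeholder form, nothing outside the database layer ever needs to know about SQL quoting rules.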

To handle this, there is a php.ini setting to turn it on/off.

And there were more complications, for example:

- do you apply magic quotes to ‘all input’ (magic_quotes_runtime) or just to GET/POST/COOKIE data (magic_quotes_gpc)? (This is the problem of defining what exactly is ‘input’)
- attempts to fix some of the above with yet more configuration options like magic_quotes_sybase.

And so now you've got even more problems. Since these are global settings, you can't have library code mess with them, since other code might set the global to a different value or assume a certain value.

You could try making all code detect the current setting and have different code paths depending on the result. This works very badly – having multiple code paths is a recipe for code duplication and bug proliferation. It's extremely easy to forget to do it, or get one of the paths wrong, since you will likely only test one configuration value and one set of code paths in reality.

Alternatively, you can make one bit of code responsible for fixing the setting to a sensible value (the only one being 'off'), and then make all code assume that from then on. (If you can't turn it off, you can use the code included here as a horrible kludge to reverse its behaviour).

Eventually, this final approach was the one taken by all significant projects. Turn the whole feature off, and assume it is off from then on. (Which means the feature is useless, of course).

And of course, thankfully, the PHP developers realised that this entire thing was a huge mistake that caused nothing but a vast amount of confusion and bugs, and removed the whole thing for good in PHP 5.4.

Magic quotes, as eevee put it, were “so close to secure-by-default, and yet so far from understanding the concept at all.”

To digress for a moment: we keep getting told that PHP is improving, and the community has learnt from its mistakes. Unfortunately it seems the leaders in the community are bent on recreating old mistakes.

According to Lerdorf, the much newer PHP filter extension is “magic_quotes done right”. But it still suffers from almost all the problems described here, for all the reasons described. Global HTML escaping on input is essentially the same as magic quotes, and just as tragically bad.

Elgg

In researching for this post, I came across this ticket for Elgg, an open source social networking engine. Just read through the ticket and see the mess they are in. It's clear they strongly regret the decision they made to escape-on-input, and, in their own words, have created “horrendous” problems for themselves, especially as their application has grown to include other interfaces such as JSON REST APIs.

However, fixing it is very hard. They have to coordinate many changes across their code base with a big database migration. If data has leaked from the databases and tables they control into other systems, such as denormalised tables, other databases, caches etc., or if there is other code by third parties that makes the old assumptions about encoded data, they are in even more of a pickle. And both of those things are probably inevitable in something like an open source framework, which is designed for other people to build on and extend.

This is the pain that comes from mixing input handling and output encoding, and from corrupting the data in your database.

Etsy

According to their security presentation, Etsy are using escape-on-input for XSS protection.

They claim that this is a much more secure option, as it is secure by default. (They do note, however, the problem with input that is encoded in some other way, like base64, so they are aware of the problems.)

Their presentation goes on to describe an elaborate system for detecting and fixing XSS attacks (the slides don't give enough detail for me to understand what exactly they are doing, but it's clearly a lot of work).

And their system does indeed catch XSS bugs in the wild and allow them to fix them within hours.

Wait, what?

They've corrupted their database by doing escape-on-input, they've inflicted themselves with all the development pain described above, and they've still got XSS bugs?

Granted, they've got impressive ways of dealing with these problems. But it's like virus checkers on voting machines. Needing advanced ways of dealing with problems that shouldn't even be possible tells you that you are doing it wrong. They've become very fast at re-tying their shoelaces, instead of working out how to tie shoelaces so they don't come undone.

They claim that with escape-on-input, XSS problems are now greppable, but it doesn't sound like it. If they were, code audits would be a massively more efficient way to find XSS problems than the methods they are using.

The main problem is almost certainly that they are using an output system for HTML that doesn't do HTML escaping by default (I'm guessing they are using PHP as their template language). If the backend that deals with HTML actually deals with HTML then you eliminate the vast majority of these problems overnight.

I'm willing to bet that large sites that use Django (or other frameworks that have basically solved the XSS problem by HTML escaping on output by default) don't have teams and automated systems dedicated to this problem, and don't need them. In Django apps, XSS problems are greppable - you grep for mark_safe in Python and the |safe filter in templates (and then, obviously, you may have to recursively grep for any functions that call mark_safe on inputs). Since all data which isn't mark_safe() gets escaped by the templating engine, and all HTML comes out of the template engine, that's basically all you need to do.
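The idea behind this can be sketched in a few lines of plain Python. This is not Django's implementation: `SafeText`, `mark_trusted` and `render` are hypothetical names invented for the illustration, and the 'template engine' is just `str.format`.

```python
import html


class SafeText(str):
    """Marker type for strings that are already known to be safe HTML."""


def mark_trusted(value):
    """The one greppable escape hatch (hypothetical name)."""
    return SafeText(value)


def render(template, **context):
    """Escape every value on output unless it was explicitly marked."""
    escaped = {
        key: value if isinstance(value, SafeText) else html.escape(str(value))
        for key, value in context.items()
    }
    return template.format(**escaped)


page = render(
    "<h1>{title}</h1>{body}",
    title="Jack & Jill <script>",        # untrusted: escaped automatically
    body=mark_trusted("<p>Hello</p>"),   # trusted: passed through as-is
)
print(page)
```

Auditing then reduces to searching for calls to the escape hatch, which is the property described above for `mark_safe` and `|safe`.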

Now for some flame bait

How did this happen to Etsy?

Are the Etsy devs stupid? I suspect not. Etsy is clearly doing well, and I imagine they have enough money to hire top-notch developers. Some of their careers pages show they are happy using a variety of languages and technologies, and their engineering blog seems to be sane and competent. Even their security presentation showed considerable ingenuity and technical ability in dealing with security problems (in entirely the wrong way, unfortunately, but still impressive).

I doubt they are low quality developers. Rather, I suspect that use of PHP has addled their brains. They have become far too accustomed to working in an environment in which insanity reigns – an environment in which the less than operator pretends to work correctly with strings but it's just a trap.

When I programmed in a Windows environment, I theorised that use of Windows itself contributed to the poor quality of the programming in the code base, and the fact that developers thought nothing of writing tons of tedious code. Because Windows was so unscriptable, I imagined that Windows programmers developed a high tolerance for tedium and repetition, which is exactly the opposite of the qualities needed by a programmer to make a computer do everything efficiently and reliably. (Since then, I've found that Sturgeon's law was probably a better explanation for the quality of the code, but I still think the fundamental idea applies).

With PHP, the fact that it comes with a template language that is simply not fit for purpose – because it doesn't do HTML escaping by default, or even easily – has somehow made the Etsy developers believe that it is normal to struggle with XSS, that it is perfectly reasonable that even after taking the drastic action of corrupting their entire database by HTML escaping it, they should still need elaborate XSS-catching systems.

Instead of trying to fix XSS, they should just fix it. Like this in Django. Or this in Turbogears and Jinja. Or this in Yesod. Or even this in PHP (though due to limitations of the language you won't be able to have the convenience of things like mark_safe in Django). But living with an environment of pain and madness makes you think that it ought to be hard.

Right the way up to Rasmus Lerdorf at the top, many people in the PHP community live with the insanity of their tools, and add more insanity to cope with it, rather than fix their tools or choose better ones.

A lesson for Pythonistas

Bashing other languages is fun, but when I do so I always try to get something more valuable out of it by using the opportunity to examine myself. The problem I discussed in the last section (which is just a manifestation of the broken windows theory) applies to other communities, and I'll attempt to apply it to the Python community.

Refusing to live with stupidity is one of the reasons that Python 3 is really important.

Python 3 does not represent a massive leap forward in terms of additions to the language. Mainly it just fixes a bunch of mistakes in Python 2, and introduces a whole lot of backwards incompatibilities in the process. One of the biggest is unicode/bytes. Python 2 was stupid here – it went directly against the Zen of Python, and said “in the face of ambiguity about what encoding to use, guess.” This caused a world of pain.

Now, you can work around it in most cases by some sensible conventions and a certain amount of discipline. You can also cope with the fact that "a" < 1 doesn't raise an exception. You can live with next() being a method in the iterator protocol, when it should be a method called __next__() and a builtin function next(). You can live with the fact that print is a totally unnecessary keyword, since it should just be a builtin function. You can get used to the fact that class Foo: means something subtly but significantly different from class Foo(object):. You can work around or ignore dozens of other little niggles, gotchas and inconsistencies.
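For comparison, this short sketch shows how Python 3 behaves for each of the warts listed above. It is meant to be run under Python 3 only.

```python
# Comparing a str with an int is an error rather than a silent guess.
comparison_error = None
try:
    "a" < 1
except TypeError as exc:
    comparison_error = str(exc)
print(comparison_error)

# The iterator protocol uses __next__(), and next() is a builtin function.
numbers = iter([1, 2, 3])
first = next(numbers)
second = numbers.__next__()

# print is an ordinary function that takes keyword arguments.
print("no", "keyword", sep="-")


# Every class is a new-style class, with or without an explicit (object).
class Foo:
    pass


print(Foo.__mro__ == (Foo, object))
```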

But all the while, you are training yourself to tolerate stupidity, inconsistency and brokenness. Removing these warts is really important, and worth all the pain of the migration. The alternative is for Python to become the next PHP.

On top of these things, there are other types of brokenness in Python that people in the community seem less willing to acknowledge or tackle. For some of these I think we need exposure to completely different languages – languages where you can spawn thousands of ‘threads’ easily and get performance benefits, for example, or languages where you can write code that is both very high level and extremely fast. If we live entirely with Python and its set of limitations, we'll think that those problems are normal and unavoidable.

Main updates:

- 2012/08/07 - corrections about turning magic_quotes_gpc off at runtime.
- 2012/10/08 - noted bug with queries returning false matches.
- 2014/05/05 - added info about different contexts in HTML.

Updated validator and CsrfMiddleware

2005-12-14T23:45:01Z

I've released some small updates to my 'Django validator app' and 'CsrfMiddleware'. The main changes are:

- added a setup.py to both of them, after working out how these work and a lot of fiddling around.
- added support for mod_python to the validator app (thanks nesh)
- added a setting to allow the validator to ignore certain paths.

Get them here:

I've also discovered that my CsrfMiddleware is currently number 6 in a google search for Cross Site Request Forgery, which is rather pleasing, or perhaps it just tells you how little there is on the web about this exploit.