Why escape-on-input is a bad idea

by Luke Plant

— August 6, 2012 20:59

The right way to handle issues with untrusted data is:

Filter on input, escape on output

This means that you validate or limit data that comes in (filter), but only transform (escape or encode) it at the point you are sending it as output to another system that requires the encoding. It has been standard best practice since just about forever [citation required].
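As a minimal sketch of the principle in Python (the function names and the length limit are invented for illustration, not taken from any framework): input handling only validates, storage keeps the original text, and escaping happens at the moment HTML is produced.

```python
import html

MAX_TITLE_LENGTH = 100  # hypothetical limit, chosen only for this sketch


def filter_title(raw):
    """Filter on input: validate and limit, but do not transform for any output."""
    title = raw.strip()
    if not title or len(title) > MAX_TITLE_LENGTH:
        raise ValueError(
            "title must be between 1 and {} characters".format(MAX_TITLE_LENGTH)
        )
    return title


def render_title_html(title):
    """Escape on output: HTML escaping happens only where HTML is produced."""
    return "<h1>{}</h1>".format(html.escape(title))


stored = filter_title("  Jack & Jill  ")  # what goes in the database
page = render_title_html(stored)          # what goes to the browser
```

The stored value is still exactly what the user meant, so any other output (email, JSON, a report) can apply its own rules to it.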

An alternative is “escape on input”: at the point that data enters your system, you apply a transformation to it to avoid a problem further down the line when the data is used.

It's come to my attention that some serious web developers (or at least, they take themselves seriously and are taken seriously by others) are still suggesting the practice of escape-on-input.

For example, with escape-on-input, to avoid XSS any data that enters your system has HTML escaping applied to it immediately, before your application code touches it.

I chose that example deliberately, because people are actually recommending it:

in some recent “PHP sucks” debate.
which, in turn, linked to a page by Rasmus Lerdorf recommending escape-on-input as a sensible way to deal with XSS. The page, admittedly, is describing a ‘toy’, a ‘no-framework PHP framework’, yet he does seem to be serious about the usefulness of escape-on-input.

The page is from 2006, and uses the pecl/filter extension, but the extension has since made it into core (PHP 5.2), and the docs for it suggest a configuration that is clearly intended for XSS prevention. As recently as 2008, and probably to this day, Lerdorf is still defending and recommending this approach, and it appears to be part of his reason for thinking that PHP templating doesn't need an autoescape mechanism.
Just as significantly, Etsy are using and recommending escape-on-input (slide 18 onward). As a very successful modern company using PHP, people will look up to them and copy them.

So, this approach, unfortunately, is popular amongst some, and I can't find a decent post explaining why it's such a terrible idea both in theory and practice. Here is my attempt. It should be applicable to almost any system and any language, although I'll mainly be using examples from web development.

In theory

First of all, escape-on-input is just wrong – you've taken some input and applied some transformation that is totally irrelevant to that data. If, taking our example, you have some data collected by HTTP POST or GET parameters, applying HTML escaping to it is a layering violation – it mixes an output formatting concern into input handling. Layering violations make your code much harder to understand and maintain, because you have to take into account other layers instead of letting each component and layer do its own job.

Doing things ‘right’ is very important, even if doing them ‘wrong’ seems to work and you are tempted to be dismissive of ‘theoretical’ concerns about purity etc. When you have to maintain code, you will be very glad if things are in the right place, and not full of hacks and surprises.
You have corrupted your data by default. The system (or the most convenient API) is now lying about what data has come in. As you have applied a transformation to the data itself, the layering violation is not an isolated problem in one part of the code, but infects every part of your code, especially if you store the corrupted data in a database.

Your data is everything. As I read recently, “data matures like wine, applications like fish”. You can always rewrite your application, but if you corrupt your data, you've done the worst thing you can to your system.
This is exacerbated by the fact that many encodings are one-way – you cannot losslessly or unambiguously convert them back. If at a later point you need the original data, you might be in a pickle.
Escaping your data for one output backend will only deal with that output. A typical web app might deal with at least the following backends, which have different characters that are dangerous, and have different requirements for dealing with them:
- HTML: ' < > " &
- URLs: / : & ? # text starting 'javascript:'
- Javascript: " '
- SMTP and HTTP: ; : newlines
- SQL: '
- JSON: "
- Shell: space, quotes and various other characters
Any number of others could be added, and all could have security implications. Using escape-on-input will only fix one of these - apart from happy coincidences where it might fix more than one. Security should not rely on happy coincidences, and for the other outputs you will still need a sensible solution to the problem. Why not have a sensible solution for all of them?
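To make the point concrete, here is a small Python sketch using only the standard library. The sample string is invented; the point is simply that one raw value needs a different treatment for each output, and none of the results is usable for the others.

```python
import html
import json
import shlex
import sqlite3
from urllib.parse import quote

raw = "Jack & Jill's \"big\" day"

as_html = html.escape(raw)         # for an HTML element body or quoted attribute
as_url_part = quote(raw, safe="")  # for a single URL path segment or query value
as_json = json.dumps(raw)          # for a JSON string literal
as_shell = shlex.quote(raw)        # for one argument on a POSIX shell command line

# For SQL, the sensible approach is not to escape at all but to pass parameters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT)")
conn.execute("INSERT INTO events (title) VALUES (?)", (raw,))
from_db = conn.execute("SELECT title FROM events").fetchone()[0]
```

Each transformation is applied by the code that is about to talk to that particular backend, and the raw value stays untouched for the next one.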
Escaping for one output may not deal with even that single output correctly, because escaping can be context dependent.

Various outputs can be embedded in others, and they have different escaping rules. So, you can embed URLs in HTML. And URLs in CSS. And CSS in HTML. And Javascript in URLs. And Javascript in HTML...

If you prepared something for HTML, did you prepare it for HTML element body context, or HTML attribute context, or URLs in HTML attributes, or CSS in HTML? Or URLs in CSS in HTML? If someone passes in a value for a URL which is then used in an href attribute in HTML, HTML escaping of < > & ; " ' won't adequately protect you from XSS. Interactions between CSS/Javascript parsers and HTML parsers make things even more complex. So “escape at the beginning and then forget about it” does not work even for the single output of ‘HTML’, because it is not a single output.
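The href case can be sketched in a few lines of Python. The `is_safe_link` helper is hypothetical and deliberately strict (it rejects relative URLs too); a real check for this context would need more thought. What matters is that HTML escaping leaves a `javascript:` URL completely unchanged.

```python
import html
from urllib.parse import urlsplit

user_url = "javascript:alert(document.cookie)"

# HTML escaping has nothing to change here, so the value passes through intact.
escaped = html.escape(user_url)
link = '<a href="{}">my homepage</a>'.format(escaped)


def is_safe_link(url):
    """Hypothetical check for the 'URL in an href' context: allow only http(s)."""
    return urlsplit(url).scheme in ("http", "https")
```

The check that is actually needed belongs to the code that builds the link, because only that code knows the value is going to be used as a URL inside an attribute.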
Escaping on input will not only fail to deal with the problems of more than one output, it will actually make your data incorrect for many outputs.

Suppose you decide to do HTML escaping, and someone enters Jack & Jill as a title for something. Your escape-on-input turns this into Jack &amp; Jill and that goes in the DB. Suppose you want to email people and put this title in the subject line. You now have to apply the reverse transformation to get a sensible subject line in the email, and you have to remember to do this for every output that is not HTML.
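A short Python sketch of that situation, using the standard library `email` package. The escape-on-input step is simulated with `html.escape`; nothing is actually sent.

```python
import html
from email.message import EmailMessage

user_input = "Jack & Jill"

# Escape-on-input: the transformed value is what gets stored.
stored = html.escape(user_input)

# Mail code that knows nothing about HTML uses the stored value as it is.
bad = EmailMessage()
bad["Subject"] = stored

# To get it right, non-HTML code has to remember to undo the HTML escaping.
good = EmailMessage()
good["Subject"] = html.unescape(stored)
```

The `html.unescape` call is the part that is easy to forget, and it has to be repeated in every piece of code that produces anything other than HTML.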

Sometimes, the bug is significantly more annoying than an email with an incorrect title:

One ruined sweatshirt, however, is tame compared to the hassle many people suffer due to having a name that a computer won’t accept or mangles. Looking through that article, it’s clear that often the software is escaping on input, resulting in escaped versions being stored in the database (e.g. a woman with an apostrophe in her name is recorded as “Leah D&andrea”), which then causes no end of problems.

You also have daft bugs like the fact that doing a search on that field for the string ‘amp’ (or ‘quot’, ‘apos’, ‘lt’, ‘gt’ etc. or any substrings) will get various false matches.
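The false matches are easy to reproduce. In this Python sketch the names are invented and a plain substring test stands in for something like a SQL `LIKE` query.

```python
import html

names_entered = ["Jack & Jill", "Tom <3 Ann", "Samplitude"]

# What an escape-on-input system would have stored.
stored = [html.escape(name) for name in names_entered]


def search(rows, needle):
    """Naive substring search, standing in for something like SQL LIKE."""
    return [row for row in rows if needle in row]


hits_on_stored = search(stored, "amp")
hits_on_original = search(names_entered, "amp")
```

Searching the original data for "amp" finds only the name that really contains it, while searching the escaped data also returns the row that merely contained an ampersand.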

I have seen some people respond to this by saying “it's better to have the occasional double-encoding bug or incorrect query result than an XSS exploit”. Well, first, that depends on your business. XSS is a problem because it costs time and money, and so does corrupting your data. Many people have data that actually matters, and corrupt data is a big deal, and much harder to cope with than an XSS bug, because data lives on and on. If we took just the example above of storing people’s names incorrectly, the grief caused by escape-on-input is massive.

Second, this decision affects frameworks that are used to handle data of all kinds, and the decision affects the entire code base of your application and beyond, as described below. Data-handling frameworks that work on the assumption that your data is not important are insanity. If the foundations be destroyed, what can the righteous do?

Third, it's entirely unnecessary. XSS is not hard to fix given decent programming tools.
At what point does data ‘enter’ your system?

It might sound like a simple question, but it's tricky in reality, and I'll illustrate using an HTTP request.

In most web apps, the GET and POST parameters are your ‘raw input’. However, using most normal web framework APIs, data in GET and POST parameters has already been interpreted. The ‘raw’ data is really the bytes that make up the HTTP request, which typically will use URL encoding for GET query parameters and a choice of encodings for POST data (URL encoding or MIME multipart attachment format).

The framework may also do another level of decoding – interpreting the series of bytes as a series of unicode code points.

Both parts of this initial transformation make sense and are appropriate, because they are reversing the encoding already applied to the data by the protocol involved. The web browser takes the data you type in – unicode code points – and applies a series of transformations to it, according to the HTTP protocol, and your web framework reverses these to get the data back.
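Those two decoding steps can be sketched with Python's `urllib.parse`. The query string is a made-up example, and a real framework does more than this (multipart bodies, charset handling and so on).

```python
from urllib.parse import parse_qs, unquote_to_bytes

# Bytes as they might appear in the query string of a request line.
raw_query = b"title=Jack+%26+Jill&city=Z%C3%BCrich"

# Both steps together: undo URL encoding, then decode UTF-8 bytes to text.
params = parse_qs(raw_query.decode("ascii"), encoding="utf-8")

# The same two steps done by hand for a single value.
city_bytes = unquote_to_bytes("Z%C3%BCrich")
city_text = city_bytes.decode("utf-8")
```

Both steps simply undo what the browser did, which is why they are legitimate in a way that adding HTML escaping at this point is not.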

Now, if you want to avoid XSS problems, you have to apply the escaping after this initial decoding has been done. But this highlights another possibility. What if the data requires further decoding before you get the ‘real’ raw data? For example, some data might be sent base64 encoded for a variety of reasons, or any other type of encoding.

This extra level of encoding gives two problems:
- your automatic HTML escaping may have corrupted the encoded data so that it now cannot be decoded. For example, you had a GET parameter that held a URL, which itself had parameters in the query string:
```
GET /foo?bar=1&url=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26y%3D2 HTTP/1.1
```
  Your framework's HTTP handling will produce a query dictionary that looks something like the following:
```
{"bar": 1,
 "url": "http://example.com/?x=1&y=2"
 }
```
  But your automatic escaping turns that into:
```
{"bar": 1,
 "url": "http://example.com/?x=1&amp;y=2"
 }
```
  If you want to extract the y parameter from url, you are stuck. You can't correctly interpret the data in the url parameter, because it has been corrupted. You're going to have to unescape the input, and you might not even notice this problem.
  
  A better example might be handling the ‘Referer’ header. (Which you have presumably applied the same HTML encoding to, right? If you did, you have this problem, if you didn't, you have to remember to do it manually, which is a potential XSS vulnerability).
- Even if the data comes through your automatic escaping unscathed (e.g. base64 under HTML escaping), or you can undo the corruption and get it properly decoded, after decoding you will have to manually apply HTML escaping to make it match all the other automatically escaped data. If you don't, you've potentially got a bug and an XSS exploit.
  
  So your automatic escape-on-input has missed data, and this happens because you can't really define the point at which the data has ‘entered’ your system and needs the escaping applied.
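Both problems can be reproduced in Python with the standard library. The escape-on-input layer here is a hypothetical one-liner that HTML-escapes every parameter, and the parsing behaviour shown is that of current Python 3 versions, where only `&` separates query parameters.

```python
import base64
import html
from urllib.parse import parse_qs, urlsplit

query = "bar=1&url=http%3A%2F%2Fexample.com%2F%3Fx%3D1%26y%3D2"
params = {key: values[0] for key, values in parse_qs(query).items()}

# Hypothetical escape-on-input layer: HTML-escape every incoming value.
escaped_params = {key: html.escape(value) for key, value in params.items()}

# Problem 1: the nested query string can no longer be parsed correctly.
inner_ok = parse_qs(urlsplit(params["url"]).query)
inner_broken = parse_qs(urlsplit(escaped_params["url"]).query)

# Problem 2: base64 passes through HTML escaping untouched, so the decoded
# value has never been escaped at all.
payload = base64.b64encode(b"<script>alert(1)</script>").decode("ascii")
after_input_escaping = html.escape(payload)
decoded = base64.b64decode(after_input_escaping).decode("utf-8")
```

In the first case the `y` parameter has turned into something called `amp;y`; in the second, the decoded value is raw markup that the automatic escaping never saw.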
This problem means that the escape-on-input approach is inherently flawed and cannot be fixed. You just have to patch it up on a case-by-case basis, which is exactly what escape-on-input is supposed to avoid.

And then, what about other sources of data – data on the file system, in a cache etc.? Are these entry points? Well, it depends on how the data was put there. You have to manually follow this all the way through your app; get it wrong and you've got double escaping bugs or security flaws.

(By contrast, escape on output always works, because you apply it at the point where you know it is needed – in the backend that knows the escaping rules.)
Other systems putting data into your database, or getting data out, have to abide by your data transformation rules.

These systems might have nothing to do with your primary domain (e.g. a web site). Making them understand and obey rules that have nothing to do with the data itself is insanity and extremely short sighted.

You can't deal with this problem when you come to it, because you don't have to just fix your code, you've got to fix all your data too, and by the time you cross this bridge you might have a lot of data and might need a very delicate database migration to get it right. The data may even have escaped your control (e.g. been copied into other systems), or backwards compatibility concerns might stop you from making the change you need to make.
Within your main application, the decision to escape on input affects your whole code base.

If you want to use any libraries, you need to make sure that they are using all the same assumptions that you have in your main code base.

For example, if you've got a form/widget library in your web app, it will very often need to echo user input back to them in the case of a form that has validation errors. This library has to know if you already escaped the input.

Writing the library to work in two modes is asking for trouble. Rather, you need it to have been written from the beginning to assume the same escaping rules.

This kills code re-use – you can only use code that assumes the same input escaping – or it means that you will end up with tons of bugs due to incompatibilities between the assumptions made in your application code and the library.
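Here is a sketch of that incompatibility in Python. `render_text_input` is a hypothetical widget function, written the way a sensible library would be, with escaping at the point of output.

```python
import html


def render_text_input(name, value):
    """Hypothetical form widget that, sensibly, escapes on output."""
    return '<input type="text" name="{}" value="{}">'.format(
        html.escape(name), html.escape(value)
    )


typed_by_user = "Jack & Jill"

# Application without escape-on-input: the widget does the right thing.
correct = render_text_input("title", typed_by_user)

# Application with escape-on-input: the same widget now double-escapes.
pre_escaped = html.escape(typed_by_user)
double_escaped = render_text_input("title", pre_escaped)
```

The library is not at fault here. It is the application's global assumption about its data that makes a correct library produce wrong output.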

Essentially, this is the problem of a global configuration setting, but worse since it affects the operand of your entire application (the data going through it), not just the functionality of various operators.

Another example might be a cron job that sends out emails, using data from the database. If the data comes from a web form that applied "escape on input" to avoid XSS, then the code will need to apply HTML unescaping - despite the fact that this script has absolutely nothing to do with the web (it reads a database and sends plain text emails).

Effectively, this means that the XSS solution, far from being a solution applied at a single point, is in fact spread out over the entire code base, as it includes every time that pre-escaped data has to be un-escaped.
The confusion caused by the above is likely to increase security problems. “Keep It Simple, Stupid” remains a very good maxim for developers.

To continue an example used above: you want to send an email with some data that has already been HTML escaped, and so you need to unescape the data to avoid emails with the subject “Jack &amp; Jill” when the user entered “Jack & Jill”. You decide it's not sensible for the mail sending functions to do this internally, (or maybe they're provided by a third party who made that decision for you), so the calling code does the unescaping.

You later decide to switch to HTML emails, and the developer who implements it thinks that since data is already escaped, there is no problem including it without extra escaping in the body of the HTML email, leading to a vulnerability (not classic XSS in this case, but still a problem).
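A Python sketch of how that mistake might look. All function names are invented, and the "database" value is simulated with `html.escape`.

```python
import html

user_input = "<img src=x onerror=alert(1)> sale"
stored = html.escape(user_input)  # escape-on-input value in the database


def build_email_naive(stored_title):
    """Calling code that unescapes for the subject, then reuses the value."""
    title = html.unescape(stored_title)  # added long ago for plain-text email
    subject = "New listing: " + title
    # Added later for HTML email, on the belief that stored data is pre-escaped.
    body = "<p>New listing: {}</p>".format(title)
    return subject, body


def build_email_escaped(stored_title):
    """Same email, but the HTML body is escaped at the point of output."""
    title = html.unescape(stored_title)
    subject = "New listing: " + title
    body = "<p>New listing: {}</p>".format(html.escape(title))
    return subject, body


naive_subject, naive_body = build_email_naive(stored)
safe_subject, safe_body = build_email_escaped(stored)
```

The naive version puts live markup into the email body even though every value in the database was escaped on the way in.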

There is also the example I gave above where an extra layer of encoding/decoding in the raw data makes it likely you'll forget to apply the escaping.

The confusion caused by escape-on-input means your entire code base becomes a potential source not only of double-escaping bugs but of security problems as well.

In practice

Thankfully, we don't just have to rely on the above analysis to conclude that escape-on-input is a terrible idea. PHP, always willing to help when it comes to “examples of how not to do it”, provides us with a perfect case study.

Magic quotes

PHP used to have a feature called magic quotes. It was an escape-on-input feature that escaped single quotes (') with backslashes. This was to protect you from SQL injection attacks, by making the data safe for interpolation into a SQL query.

This caused all kinds of problems.

First, if you are not first passing something through a database, and using string interpolation to build up SQL queries, you have to remember to strip those slashes using the function stripslashes().

If you don't, you get double encoding. It looks like \'this\', you\'ve almost certainly seen it across the web, though it seems we\'re thankfully past the worst of it.
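The effect can be imitated in Python. These two helpers are rough stand-ins written for this sketch, not PHP's own functions, and they only cover quotes and backslashes.

```python
def add_slashes(value):
    """Rough Python imitation of the escaping magic quotes applied."""
    # Backslashes must be handled first so the ones added below are not doubled.
    for char in ("\\", "'", '"'):
        value = value.replace(char, "\\" + char)
    return value


def strip_slashes(value):
    """Rough inverse of add_slashes for the cases shown here."""
    out = []
    chars = iter(value)
    for char in chars:
        if char == "\\":
            char = next(chars, "")  # drop the backslash, keep what follows
        out.append(char)
    return "".join(out)


typed = "you've"
after_input = add_slashes(typed)          # what the application code receives
round_tripped = add_slashes(after_input)  # escaped again further down the line
```

Every path that does not end in an interpolated SQL string needs a matching `strip_slashes` call, and every extra escaping pass multiplies the backslashes.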

Second, even if you remember, you've added some hideous cruft to your code. In the bit of code which is handling form validation (and is therefore echoing user input back to the user without the database being involved), you've got these bizarre stripslashes() calls. What on earth does ‘reverse transforming a string for SQL statement preparation’ have to do with the task of input validation?

Third, it turns out that different databases need different escaping mechanisms to do things fully correctly. So you now have to do stripslashes() on data even if you are passing it to a database using string-interpolated queries!

Then, since the above problems are common (building up SQL queries by string interpolation was always a bad idea, and very often you pass on the data to outputs that don't want SQL escaping at all), it's desirable to have a way to turn this behaviour off completely.
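For comparison, here is what the alternative to string interpolation looks like, sketched with Python's built-in `sqlite3` module. The table and the name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

name = "Leah D'Andrea"  # illustrative value, stored exactly as entered

# The driver keeps the statement and the data separate, so the value needs
# no escaping on the way in and nothing has to be undone on the way out.
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

row = conn.execute(
    "SELECT name FROM people WHERE name = ?", (name,)
).fetchone()
```

The quoting concern is handled by the layer that talks to the database, which is the only layer that knows that database's rules.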

To handle this, there is a php.ini setting to turn it on/off.

And there were more complications, for example:

- do you apply magic quotes to ‘all input’ (magic_quotes_runtime) or just to GET/POST/COOKIE data (magic_quotes_gpc)? (This is the problem of defining what exactly is ‘input’)
- attempts to fix some of the above with yet more configuration options like magic_quotes_sybase.

And so now you've got even more problems. Since these are global settings, you can't have library code mess with them, since other code might set the global to a different value or assume a certain value.

You could try making all code detect the current setting and have different code paths depending on the result. This works very badly – having multiple code paths is a recipe for code duplication and bug proliferation. It's extremely easy to forget to do it, or get one of the paths wrong, since you will likely only test one configuration value and one set of code paths in reality.

Alternatively, you can make one bit of code responsible for fixing the setting to a sensible value (the only one being 'off'), and then make all code assume that from then on. (If you can't turn it off, you can use the code included here as a horrible kludge to reverse its behaviour).

Eventually, this final approach was the one taken by all significant projects. Turn the whole feature off, and assume it is off from then on. (Which means the feature is useless, of course).

And of course, thankfully, the PHP developers realised that this entire thing was a huge mistake that caused nothing but a vast amount of confusion and bugs, and removed the whole thing for good in PHP 5.4.

Magic quotes, as eevee put it, were “so close to secure-by-default, and yet so far from understanding the concept at all.”

To digress for a moment: we keep getting told that PHP is improving, and the community has learnt from its mistakes. Unfortunately it seems the leaders in the community are bent on recreating old mistakes.

According to Lerdorf, the much newer PHP filter extension is “magic_quotes done right”. But it still suffers from almost all the problems described here, for all the reasons described. Global HTML escaping on input is essentially the same as magic quotes, and just as tragically bad.

Elgg

In researching for this post, I came across this ticket for Elgg, an open source social networking engine. Just read through the ticket and see the mess they are in. It's clear they strongly regret the decision they made to escape-on-input, and, in their own words, have created “horrendous” problems for themselves, especially as their application has grown to include other interfaces such as JSON REST APIs.

However, fixing it is very hard. They have to coordinate many changes across their code base with a big database migration. If data has leaked from the databases and tables they control into other systems, such as denormalised tables, other databases, caches etc., or if there is other code by third parties that makes the old assumptions about encoded data, they are in even more of a pickle. And both of those things are probably inevitable in something like an open source framework, which is designed for other people to build on and extend.

This is the pain that comes from mixing input handling and output encoding, and from corrupting the data in your database.

Etsy

According to their security presentation, Etsy are using escape-on-input for XSS protection.

They claim that this is a much more secure option, as it is secure by default. (They do note, however, the problem with input that is encoded in some other way, like base64, so they are aware of the problems.)

Their presentation goes on to describe an elaborate system for detecting and fixing XSS attacks (the slides don't give enough detail for me to understand what exactly they are doing, but it's clearly a lot of work).

And their system does indeed catch XSS bugs in the wild and allow them to fix them within hours.

Wait, what?

They've corrupted their database by doing escape-on-input, they've inflicted on themselves all the development pain described above, and they've still got XSS bugs?

Granted, they've got impressive ways of dealing with these problems. But it's like virus checkers on voting machines. Having advanced ways of dealing with problems that shouldn't even be possible tells you that you are doing it wrong. They've become very fast at re-tying their shoelaces, instead of working out how to tie shoelaces so they don't come undone.

They claim that with escape-on-input, XSS problems are now greppable, but it doesn't sound like it. If they were, code audits would be a massively more efficient way to find XSS problems than the methods they are using.

The main problem is almost certainly that they are using an output system for HTML that doesn't do HTML escaping by default (I'm guessing they are using PHP as their template language). If the backend that deals with HTML actually deals with HTML then you eliminate the vast majority of these problems overnight.

I'm willing to bet that large sites that use Django (or other frameworks that have basically solved the XSS problem by HTML escaping on output by default) don't have teams and automated systems dedicated to this problem, and don't need them. In Django apps, XSS problems are greppable - you grep for mark_safe in Python and the |safe filter in templates (and then, obviously, you may have to recursively grep for any functions that call mark_safe on inputs). Since all data which isn't mark_safe() gets escaped by the templating engine, and all HTML comes out of the template engine, that's basically all you need to do.
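The idea behind escape-by-default can be sketched in a few lines of plain Python. This is not Django's implementation: `SafeText`, `mark_trusted` and `render` are hypothetical names, and `str.format` stands in for a real template engine.

```python
import html


class SafeText(str):
    """Marker type for strings the developer has declared to be trusted HTML."""


def mark_trusted(value):
    """Hypothetical analogue of an explicit 'this is safe' marker; greppable."""
    return SafeText(value)


def render(template, **context):
    """Tiny stand-in for a template engine that escapes by default."""
    prepared = {
        key: value if isinstance(value, SafeText) else html.escape(str(value))
        for key, value in context.items()
    }
    return template.format(**prepared)


page = render(
    "<h1>{title}</h1>{badge}",
    title="<script>alert(1)</script>",
    badge=mark_trusted("<b>new</b>"),
)
```

Anything that reaches the output unescaped must have gone through `mark_trusted`, so an audit only has to look at the call sites of that one function.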

Now for some flame bait

How did this happen to Etsy?

Are the Etsy devs stupid? I suspect not. Etsy is clearly doing well, and I imagine they have enough money to hire top-notch developers. Some of their careers pages show they are happy using a variety of languages and technologies, and their engineering blog seems to be sane and competent. Even their security presentation showed considerable ingenuity and technical ability in dealing with security problems (in entirely the wrong way, unfortunately, but still impressive).

I doubt they are low quality developers. Rather, I suspect that use of PHP has addled their brains. They have become far too accustomed to working in an environment in which insanity reigns – an environment in which the less than operator pretends to work correctly with strings but it's just a trap.

When I programmed in a Windows environment, I theorised that use of Windows itself contributed to the poor quality of the programming in the code base, and the fact that developers thought nothing of writing tons of tedious code. Because Windows was so unscriptable, I imagined that Windows programmers developed a high tolerance for tedium and repetition, which is exactly the opposite of qualities needed by a programmer to make a computer do everything efficiently and reliably. (Since then, I've found that Sturgeon's law was probably a better explanation for the quality of the code, but I still think the fundamental idea applies).

With PHP, the fact that it comes with a template language that is simply not fit for purpose – because it doesn't do HTML escaping by default, or even easily – has somehow made the Etsy developers believe that it is normal to struggle with XSS, that it is perfectly reasonable that even after taking the drastic action of corrupting their entire database by HTML escaping it, they should still need elaborate XSS-catching systems.

Instead of trying to fix XSS, they should just fix it. Like this in Django. Or this in Turbogears and Jinja. Or this in Yesod. Or even this in PHP (though due to limitations of the language you won't be able to have the convenience of things like mark_safe in Django). But living with an environment of pain and madness makes you think that it ought to be hard.

Right the way up to Rasmus Lerdorf at the top, many people in the PHP community live with the insanity of their tools, and add more insanity to cope with it, rather than fix their tools or choose better ones.

A lesson for Pythonistas

Bashing other languages is fun, but when I do so I always try to get something more valuable out of it by using the opportunity to examine myself. The problem I discussed in the last section (which is just a manifestation of the broken windows theory) applies to other communities, and I'll attempt to apply it to the Python community.

Refusing to live with stupidity is one of the reasons that Python 3 is really important.

Python 3 does not represent a massive leap forward in terms of additions to the language. Mainly it just fixes a bunch of mistakes in Python 2, and introduces a whole lot of backwards incompatibilities in the process. One of the biggest is unicode/bytes. Python 2 was stupid here – it went directly against the Zen of Python, and said “in the face of ambiguity about what encoding to use, guess.” This caused a world of pain.

Now, you can work around it in most cases by some sensible conventions and a certain amount of discipline. You can also cope with the fact that "a" < 1 doesn't raise an exception. You can live with next() being a method in the iterator protocol, when it should be a method called __next__() and a builtin function next(). You can live with the fact that print is a totally unnecessary keyword, since it should just be a builtin function. You can get used to the fact that class Foo: means something subtly but significantly different from class Foo(object):. You can work around or ignore dozens of other little niggles, gotchas and inconsistencies.

But all the while, you are training yourself to tolerate stupidity, inconsistency and brokenness. Removing these warts is really important, and worth all the pain of the migration. The alternative is for Python to become the next PHP.
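For reference, this is how the warts listed above behave once they are fixed. The snippet is a small sketch that runs on Python 3 only; the helper `raises_type_error` exists just for this example.

```python
def raises_type_error(func):
    """Return True if calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# Ordering comparisons between unrelated types are an error, not a silent guess.
comparison_rejected = raises_type_error(lambda: "a" < 1)

# Text and bytes are never mixed implicitly; an encoding must be named.
mixing_rejected = raises_type_error(lambda: "caf\u00e9" + b"!")
encoded = "caf\u00e9".encode("utf-8")

# The iterator protocol uses __next__(), driven by the builtin next().
letters = iter("ab")
first = next(letters)
has_dunder_next = hasattr(letters, "__next__")

# print is an ordinary builtin function rather than a keyword.
print_is_function = callable(print)


# A bare class statement gives the same kind of class as class Foo(object).
class Foo:
    pass


foo_is_new_style = issubclass(Foo, object)
```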

On top of these things, there are other types of brokenness in Python that people in the community seem less willing to acknowledge or tackle. For some of these I think we need exposure to completely different languages – languages where you can spawn thousands of ‘threads’ easily and get performance benefits, for example, or languages where you can write code that is both very high level and extremely fast. If we live entirely with Python and its set of limitations, we'll think that those problems are normal and unavoidable.

Main updates:

2012/08/07 - corrections about turning magic_quotes_gpc off at runtime.
2012/10/08 - noted bug with queries returning false matches.
2014/05/05 - added info about different contexts in HTML