WordPress 4.7.2 post mortem

by Luke Plant

Posted in:

— February 17, 2017 15:49

A few weeks ago, WordPress released version 4.7.2 to address several security vulnerabilities, including one critical one. This vulnerability allowed a remote, unauthorised attack to update web pages via the REST API. Since then, hundreds of thousands of unpatched installations have been defaced. In addition, in some cases this can lead to remote code execution, and that has been seen in the wild.

The vulnerability was found by Sucuri, and they have detailed the issue on their blog.

I haven't found much by way of deeper analysis, and so this post is my take on some of the deeper coding/development process issues behind such a serious security problem. It is meant to be constructive – that is, what positive lessons can be learned for avoiding this kind of thing in our own projects.

(Upfront note about my biases: I'm not a fan of PHP, though I used to be a long time ago, or WordPress, especially when it comes to security; after PHP I've worked mainly with Python, and I'm a (rather inactive) Django core developer; I've used various other languages, and I'm drawn to languages like Haskell and similar.)

Summary of coding error

The primary error was a piece of code that failed to check a returned value to see if it was an error value. In particular, the code looked like the following (removing irrelevant details, as I will for all the samples in this post):

public function update_item_permissions_check( $request ) {
    $post = get_post( request['id'] );

    if ($post && ! $this->check_update_permissions( $post ) {
        return new WP_Error(...);
    }

    # Various other conditions checked

    return true;
}

In this case, get_post could return an error value, but this was never checked for, so the permissions check ends up returning true. Therefore, the permissions check was inadequate, allowing the privilege escalation – i.e. an unauthorised user could change any post by forcing the error condition in get_post, which turns out to be easy to do.

There are various other bits of code that have issues, particularly failure to properly validate the incoming post ID, and code which did it different ways. If any one of them been done differently, the vulnerability might have been avoided. You can look at Securi's posts above to see the full details, and I'll discuss more below.

Causes of the vulnerability

Now I'll come to a list of issues that I think were, or might have been, behind this vulnerability.

I'm aware that some of these may annoy you if your framework/language of choice is implicated as inferior – but if you react by dismissing that point, and saying it was all the others that are to blame, you've missed the whole point of doing a post mortem on someone else's misfortune.

Dynamic typing/null values.

The get_post method was able to return a null value, instead of an object, without forcing that value to be checked. This is a flaw in all dynamically typed languages, and many statically typed ones. In statically typed languages that do not allow null values (e.g. Haskell, Rust in normal code), you'd be forced to choose an explicit error handling mechanism that would be much less prone to this kind of issue.

There are multiple other ways that these kinds of statically typed languages would have prevented an invalid string from being passed through multiple levels of code – the design of such languages generally forces you to sanitise earlier, eliminating the confusion that caused this bug.
PHP flaw – very poor error handling mechanisms.

Builtin PHP functions, and therefore any PHP project, has a whole range of error handling mechanisms – errors, warnings, returning error values, and exceptions. At every point, calling code needs to know which system will be used to handle errors. The calling code in update_item_permissions_check above would have been fine if get_post had thrown an exception for an invalid post, but it didn't. To review the code, you need to know the implementation and conventions of the code that is being called, which is seriously impeded by having multiple options.
Poor choice of error handling.

While PHP has flawed error handling (as above), it is still possible to do your best. Since PHP has no method to ensure that calling code checks for returned errors, using returned error values (whether null values, like null, false etc., or custom error objects) is one of the worst kinds of error handling mechanisms in PHP, and WordPress made the mistake of using it.

In fact, doing it correctly in WordPress is harder, in that they use not only builtin false-y values to indicate errors, but they also have a custom error class WP_Error, which is not false-y (and probably can't be false-y due to PHP limitations), so that properly checking for a null/error condition is either very verbose, or requires you to remember which convention is used. (e.g. get_post returns null for error conditions, but WP_REST_Posts_Controller::get_post returns WP_Error).

This contrasts with languages like Go, for example. Go returns a tuple for every method that can fail, and if you don't check/use the error value from the calling code, that's a compile time error, so it is much, much harder to get this wrong. Plus, the language design means that basically all Go code will work the same, so you know what correct code looks like.

It also contrasts with conventions in Python and frameworks like Django, which will typically raise exceptions in situations like this – Post.objects.get(id='invalidid') will raise Post.DoesNotExist, rather than return None etc.
Very poor framework design decision – merging of parameters from URL path component and query string parameters.

The WordPress code for the REST API has some routing methodology that allows parameters to be defined in the paths of URLs e.g.:
```
"/(?P<id>[\d+]+)"
```
This regex matches part of the path component in a URL, and the name id is used as a key in a dictionary of parameters passed to the handling function. (This is pretty much exactly like Django URL routing). These matched parameters are available in the request object passed to the functions above.

However, that object also includes GET (and perhaps POST) parameters, and as it happens, when you do the simple dictionary access on that object, you get a merged version in which GET takes priority over the URL path components. This means that it is possible to pass an id value that doesn't match the regex:
```
/123/?id=123somethingelse
```
...and it is 123somethingelse, not 123 that will be returned by request['id'] shown in the pseudo-code above. While there are more specific methods available for getting URL path or query string parameters, the dictionary/array access shown is the most convenient (and therefore most encouraged) way to get data out of it.

This feature seems to be inspired by PHP's $_REQUEST, but is worse, because $_REQUEST is no more convenient than $_POST or $_GET, and in fact requires more characters to type. Here, however, not only is the merged dictionary more convenient to access than the un-merged ones, it also includes URL path components to confuse things even more.

This meant that the checking of the format of the ID parameter that appeared to take place in the routing code (via a regex that limited to numeric IDs) was ineffective.

Lesson: Merging data from multiple distinct sources into a single bag with the same namespace is very often a bad idea. In particular, if the different sources are not subject to the same validation rules, or correct usage might rely on users knowing which source has precedence, you are asking for trouble.

Historic Django note: long ago before 1.0 Django had a very similar feature — dictionary access on the request object did a merge of request.GET and request.POST parameters. It was realised this was a bad idea, and for the 1.0 release this was replaced with a more explicit and discouraged request.REQUEST object. This too was eventually deprecated in 1.7 and later removed. I heard some protests about this, but I'm glad it's gone – it is too much of a liability to justify its very occasional usefulness.

(Update – just hours after writing this post, while reviewing deprecation warnings for a Django 1.8 project I maintain, a usage of request.REQUEST was flagged and – surprise, surprise – it haboured a security vulnerability).
Multiple calls to retrieve the same thing from the database.

You might think that since get_post returns a null object (of some kind), then there won't be a post object to update, and therefore no vulnerability. However, it turns out that get_post is called in multiple places within the handler – once to check permissions, and again to actually do the update, something like this:
```
public function update_item( $request ) {
    $id = (int)request['id'];
    $post = get_post( $id );

    # Some checks and processing...then:

    wp_update_post( $post, $some_other_data );
}
```
When this code is run, the permission checks should have already been run (in a separate code path). This is a performance bug in itself – it is doing multiple DB queries (assuming no caching) to get the same thing.

There is a bigger problem, which is that it is possible for the different callees to call get_post with different data, and as it turns out they do. The difference allowed the attack to proceed – the code checking for permission to edit the object did not find an object to edit, but the code doing the editing did. This is a flawed permission checking API.

Lesson: if your permission checking code is separate from the main path that retrieves the data, it might be doing something critically different.

Lesson: if you are loading the same data from the database multiple times, this is a code smell that you've got duplication that might be harbouring mistakes.
Multiple attempts at sanitising.

Sanitising is good, but it should be done in the right place. In update_item above, when it called get_post it made a half-hearted attempt to sanitise the post ID value it passed in, by converting to an integer. This made it different from the other call to get_post, with the result described above – this code was able to retrieve a post object to update, while the permissions checking code was not.

The problem is that just sanitising what you pass to a call doesn't sanitise where you got it from, or sanitise other people's usage of it which might have happened earlier. Sanitising is in the wrong place if it does not make it harder for any other code to get the unsanitised data.

Lesson: If you're sanitising data all over the place, it's a sign you don't actually know how to clean input data up, and you could in fact be making things worse.
PHP weak typing flaw – ‘casting’ (as it is called in PHP) doesn't raise errors for invalid values, and in fact sometimes just ignores bad data.

Namely, in PHP:
```
(int)"123abc" === 123
```
However:
```
is_numeric("123abc") === false
```
So, in update_item above, ‘casting’ to an int just ignores the data that couldn't be converted. If PHP had done something more strict (e.g. raise an error, or even return null), this vulnerability would not have happened. In fact, if these two methods had simply agreed with each other, the vulnerability would not have happened either. The difference between them meant that one bit of code thought “There isn't a valid ID that I could even use to do a DB lookup”, and another was able to find a post just fine, by ignoring the data that didn't look like a number.
A permissions framework that defaults to “allow”.

The permission method responsible for the failure basically works like this:
```
if error_condition_1:
    return Error(...)

if missing_permission_1:
    return Error(...)

if missing_permission_2:
    return Error(...)

return True
```
It's basically impossible to look at the code and guess whether they missed anything (in this case, checking for an error code in one of the called functions was missing). A lot of permission checking code looks like this, or has the same “default to allow” construction. For example, in Django code, views are all public by default, and you have to remember to add a decorator to limit access.

Many database applications also work the same way – often any code in the application can access and update any record of any table, and you have to specifically remember every single limitation you want to apply if you don't want the “allow all” behaviour. Fixing this can be hard, requiring big changes in architecture.

One way to do it is to have allowed functionality encapsulated into some kind of code object, that is only returned when correct credentials are passed. Or, permissions check methods should return some credential object that must then be passed to every model layer function that wants to take some action.

At the very least, rather than the REST interface and the normal admin interface doing the same checks, the model layer should do this. This results in less code surface area to attack.
A “don't let it crash” mentality.

The design of PHP has this mentality – most builtin functionality will just try to continue and return something rather raise an error that might terminate execution of the program. This is due to its history as a template language. Unfortunately, while this might sometimes be a reasonable design choice for templating (though that is highly debatable), it is always a terrible choice for a general purpose language. It is better to crash than to do the wrong thing.

The same mentality can be seen in some of the code involved in this vulnerability.

For example, update_item has:
```
public function update_item( $request ) {
    $id   = (int) $request['id'];
    $post = get_post( $id );
```
Converting to integer presumably is designed to makes sure that get_post doesn't crash – we will at least be passing it something of the right form. It is a misguided and misimplemented form of defensive programming. If it had been omitted, the vulnerability would have been avoided.

In update_item_permissions_check, we have this code:
```
public function update_item_permissions_check( $request ) {

    $post = get_post( $request['id'] );

    if ( $post && ! $this->check_update_permission( $post ) ) {
        ...
```
The code checks that $post is not null or false-y before attempting to call a method on it. This is again defensive programming, and seems reasonable, but in fact is only designed to make sure the code the doesn't crash here – “I must remember not to call a function that might crash with null parameters”. However, the big problem is that the code doesn't actually deal with the possibility that $post == null. If instead the code had avoided the null check, then for the exploit scenario check_update_permission would simply have caused a crash – which would have been harmless (you just get a 500 error, which is an error for that user, but doesn't affect anyone else).

In fact the 4.7.2 code still has the same structure, but now looks like this:
```
$post = $this->get_post( $request['id'] );
if ( is_wp_error( $post ) ) {
    return $post;
}

if ( $post && ! $this->check_update_permission( $post ) ) {
    ...
```
The last line here makes no sense now – unlike the global function get_post, $this->get_post never returns a false-y value, it returns either a WP_Post or a WP_Error. The rest of the codebase is littered with this kind of thing, and in many cases it is the right thing (at least locally). This pattern can easily become a habit, and so it becomes hard to spot that here it is a vulnerability (4.7.1 code), or makes no sense (4.7.2).

Lesson: the aim should be that your program does the right thing, and if it can't do that, it should do nothing. The worst thing you can do is have a philosophy of “whatever happens, the program should not crash in the line of code I'm currently writing”.
Code-base compatibility constraints.

Looking through the code, it becomes clear that there are multiple different ways to do everything.

The REST API uses code that is Ruby-on-Rails-ish – controller classes with request objects being passed in to methods. This feels modern-ish (despite various design flaws I've mentioned and being very verbose) – anyone using a recent web framework would understand roughly how it works. Most of the rest of the code base uses the classic “Spaghetti PHP files” anti-pattern – a chain of includes that you have to follow to work out where things are actually done, request handling code executing at the top level rather than just function and class definitions.

As noted above, some functions use false-y values to indicate errors, others use WP_Error (e.g. WP_REST_Posts_Controller::get_post). You've just got to remember which is which to get it right. In fact, of the first variety, some code uses false (e.g. WP_Post::get_instance), while some use null (e.g. global get_post), which is the kind of inconsistency that is asking for trouble.

Why hasn't any of this been cleaned up? My guess is backwards-compatibility. WordPress's crown jewel is its installation base, which is huge. Fixing their code to be more consistent and secure would involve breaking tons of plugins, so they are not going to do that. Their only option is to try to write new code using better patterns, but this is itself a problem – the classic Lava flow anti-pattern.

Sometimes people complain about the rate at which Django deprecates things, but I've already given one example above of why it is very important that we do so. Django has a pretty good track record on security only because we are willing to sometimes break things at the API level, rather than just leave sharp edges everywhere and tell people not to cut themselves.
Volume of code. The more verbose your code, the easier it will be for code reviews to miss things like this – both specific errors, and poor internal API design.

Look at the WordPress code, and you will find there is tons of it. It is extremely verbose – it's like you are reading code designed to mimic the bad points of Java with none of its good points. Every class is so big it needs its own file, despite having extremely little meat in it.

WordPress is currently about 500,000 LOC (PHP+JS+CSS, including comments, not including docs, tests and bundled themes, and not including other text that does not need to be maintained – I tried to come up with some criteria that would be roughly fair to several projects, knowing that they are different in nature).

Compare this (with the caveats already mentioned, and using similar criteria for counting) to:
- the web framework framework Django – about 135,000
- the Django based CMS Mezzanine – 43,000
- the Django based CMS WagTail – 57,000
- the Django based CMS django-fiber – 18,000
Or, within a CMS, compare the REST backend in WordPress which is about 9,000 LOC (not including an additional 3,000 in the base classes and other supporting code), with the REST backend in django-fiber – about 300 LOC (it uses django-rest-framework to do the heavy lifting).

If we decide that's not fair, we could also compare to WagTail's REST API which doesn't use a framework (other than Django), and therefore includes its own router and serialization classes, and still only weighs in at 1,400 LOC — about one tenth the size of the equivalent code in WordPress.

Yes, django-fiber uses django-rest-framework, and all the above use Django and other projects, but those other projects provide re-usable code, code that has many features unused by the above CMS's, and code that is therefore very generic. WordPress code on the other hand, despite its 'framework' code being written for internal use, and not really being usable outside WordPress, manages to be many times more verbose. You could put Django, django-rest-framework (15k), Mezzanine, Wagtail, django-fiber, Pillow (image library used by many Python CMSs, 22k), django-mptt and django-treebeard (Django database tree libraries used by these CMSes, 4k and 6k), django-compressor (build tool used by several, 5k) all together and still be nowhere near matching WordPress for LOC.

How does WordPress manage to be so big? Yes, WordPress code contains more comments, but that's still code you have to scroll past, or read to understand. Whatever the reason, one of the consequences is that when reviewing code, you are just going to fall asleep that much sooner. I couldn't find much by way of detailed review for the ticket that introduced the WordPress REST infrastructure given it is such a big patch (but I might be looking in the wrong places).

I suspect that the verbosity of WordPress is likely triggered by the poor quality of language features and/or poor internal design.

The linked Python code is generally much more compact, and this is probably mainly due to the design of Python, especially things like first-class classes and decorators. In some cases, there are definitions of REST API endpoints that have no methods defined in them at all. Declarative code was sufficient in many cases, and the code is easy to read and very free of noise.
Not learning from other systems and frameworks.

I've highlighted a few places in the design of WordPress's REST infrastructure that look like very poor design decisions (e.g. merging of parameters from different sources). These things are are known to be poor design decisions in other parts of the web development world (e.g. Django removed or discouraged them years ago), yet experienced PHP developers didn't see them when it came to code review time. This could have been because they fell asleep due to the volume, but I'd tentatively suggest that another factor might have been a narrow experience of the world, and one particularly dominated by the norms of the PHP world, where the whole language and framework design normalises poor design decisions.

This may be unfair – a lot of the new code for the REST API looks like it is inspired by other frameworks, which presumably means the developers have had some exposure to other systems.

But either way – the lesson is that a failure to learn from other programming sub-cultures and other languages means we might be walking into traps that we could easily avoid. For myself, I think that in the Python community there can be a snobbishness about other languages (e.g. the way I look down on PHP) that can lead to a complacent lack of knowledge of other languages and systems that are way ahead in some areas.

Well, that's all. Hopefully there is something we can learn from this nasty vulnerability!

Summary of coding error

Causes of the vulnerability

You may also like: §

Comments §