I recently used django-fiber for a simple project.
It’s a nice CMS, a bit more lightweight than django-cms, with a slightly slicker frontend editing experience, and a bit easier when it comes to sharing content between different pages.
I found, however, that it ran a rather large number of queries to render pages, some of which pulled back lots of data, especially when you were logged in (which enables frontend editing).
So, I set to work and created a query count reduction branch. Below are the results so far, and some lessons.
I used the example django-fiber project for testing my changes (as well as my own project), and tried them out on both the home page and a more deeply nested page.
- URL: /
  - Anonymous user:
    - Original: 30 queries
    - Optimised: 15 queries - factor 2 reduction
  - Staff user:
    - Original: 103 queries
    - Optimised: 28 queries - roughly a factor 3.7 reduction
- URL: /products/product-b/downloads/
  - Anonymous user:
    - Original: 64 queries
    - Optimised: 16 queries - factor 4 reduction
  - Staff user:
    - Original: 150 queries
    - Optimised: 31 queries - roughly a factor 4.8 reduction
As you can see, there was a lot of unnecessary work. Django makes it extremely easy to generate database queries, which is both a good and bad thing, and here it is bad.
Here are some lessons for avoiding this unnecessary work:
Use django-debug-toolbar when developing, right from the beginning.
Seriously, use django-debug-toolbar.
Use django-debug-toolbar and keep it turned on. And look at it regularly. OK, point made.
This project was quite a lot easier to fix than django-cms, because it is much newer, and so has less code, and - more importantly - far fewer compatibility issues to worry about with 3rd parties who depend on certain features.
You should think about big O scaling issues fairly early, because you can easily put yourself into a situation where things are hard to fix, due to:
- schema design.
- promises you’ve made regarding functionality to 3rd parties.
For example, django-fiber has a concept of ‘current’ pages (pages which would form part of the bread crumb for the page you are on) and, in addition to the obvious ones (the ‘ancestors’ of the active page in the tree of pages), it has a feature which allows any page in the database to be a candidate ‘current page’ for any other page, based on a regex field. And so you have to check all these pages when rendering any page.
This does not scale well. Thankfully, it’s not too much of a problem if you don’t use this feature, since you can do DB level filtering to eliminate most of the pages as potential candidates for this.
But if you do use it, you have a scaling problem: each time you use it, the amount of work needed to render any page increases. (And if the DB filtering isn’t efficient, you may still pay an increasing penalty for every page added to the system, even if you never use the feature.)
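To make the scaling argument concrete, here is a minimal sketch using only the standard library `sqlite3` module, not django-fiber's real schema. The `page` table, the `current_regex` column, and the choice of an empty string to mean "feature not used" are all assumptions made for illustration.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: an empty current_regex means the feature is unused.
conn.execute(
    "CREATE TABLE page ("
    "id INTEGER PRIMARY KEY, url TEXT, current_regex TEXT NOT NULL DEFAULT '')"
)
# 100 ordinary pages that do not use the regex feature.
conn.executemany(
    "INSERT INTO page (id, url) VALUES (?, ?)",
    [(i, f"/page-{i}/") for i in range(100)],
)
# Two pages that do use it.
conn.executemany(
    "INSERT INTO page (id, url, current_regex) VALUES (?, ?, ?)",
    [(1000, "/shop/", r"^/products/"), (1001, "/news/", r"^/blog/")],
)


def extra_current_pages(conn, request_path):
    """Return (ids of matching pages, number of candidates checked in Python)."""
    # The database discards every page with an empty regex, so the Python
    # loop only grows with the number of pages that actually use the feature.
    rows = conn.execute(
        "SELECT id, current_regex FROM page "
        "WHERE current_regex != '' ORDER BY id"
    ).fetchall()
    matched = [pid for pid, pattern in rows if re.search(pattern, request_path)]
    return matched, len(rows)
```

With this data, a request for `/products/product-b/downloads/` checks only the two pages that define a regex, not all 102. The per-request cost still grows with every page that uses the feature, which is the scaling problem described above.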
EDIT: I should have mentioned on the positive side that django-fiber has obviously given thought to general scaling issues, and used django-mptt for the tree structure of their Page model. This made it relatively easy to fix the ‘show_menu’ template tag to do everything in 2 queries. Otherwise it would have been a nightmare.
Create tests with assertNumQueries, and add them fairly early. You’ll then be alerted to performance regressions that affect scaling.
The tests will need to include tests for whole pages, not just bits, if you are going to analyse Big O scaling correctly, because a set of objects might be retrieved from the DB efficiently, but each one can easily make more database calls when it is actually used.
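The following is a rough, standard-library-only sketch of why whole-page query counts matter. It is not Django code: `CountingConnection`, the `page` and `content` tables, and the two render functions are hypothetical stand-ins. In a real Django project the same check would be written with `assertNumQueries` around a request for the whole page.

```python
import sqlite3


class CountingConnection:
    """Hypothetical wrapper that counts every statement sent to the database."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


def make_db(num_pages):
    db = CountingConnection()
    db.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE content (page_id INTEGER, body TEXT)")
    for i in range(num_pages):
        db.execute("INSERT INTO page (id, title) VALUES (?, ?)", (i, f"Page {i}"))
        db.execute(
            "INSERT INTO content (page_id, body) VALUES (?, ?)", (i, f"Body {i}")
        )
    db.count = 0  # only count the queries made while "rendering"
    return db


def render_naive(db):
    # The list of pages is fetched efficiently in one query...
    pages = db.execute("SELECT id, title FROM page ORDER BY id").fetchall()
    out = []
    for page_id, title in pages:
        # ...but using each page triggers one more query.
        body = db.execute(
            "SELECT body FROM content WHERE page_id = ?", (page_id,)
        ).fetchone()[0]
        out.append((title, body))
    return out


def render_joined(db):
    # Same output, fetched in a single query regardless of page count.
    return db.execute(
        "SELECT page.title, content.body FROM page "
        "JOIN content ON content.page_id = page.id ORDER BY page.id"
    ).fetchall()


def queries_used(render, num_pages):
    db = make_db(num_pages)
    render(db)
    return db.count
```

A test that only covered the first query in `render_naive` would report one query and miss the per-page cost. Counting over the whole render at two different sizes shows whether the count is constant or grows with the data.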
Some more specific lessons:
Read and understand the Django docs on optimizing DB access.
Understand when QuerySets are evaluated (which is part of the above, but worth mentioning).
There were some examples in the fiber code of really inefficient use of QuerySets, e.g. if obj in MyModel.objects.filter(foo=bar) (example). This code will load all the MyModel records with foo=bar and create MyModel instances. It does not do MyModel.objects.filter(foo=bar, pk=obj.pk).exists(), and it certainly does not do if obj.foo == bar.
Although it could do the first of these, Django deliberately does not make this optimisation. Django’s QuerySets are deliberately dumb. The rule of thumb is this: a QuerySet will only do one query - the query you have told it to do using methods such as filter(), order_by() and slicing. It does not respond ‘intelligently’ to any Python builtins such as len(), bool() or the in operator - these simply force the QuerySet to be evaluated. There is efficiency in the way it uses its cache, but there is no ‘cleverness’.
(BTW, I remember the discussions we had about this a long time ago on django-dev, and I’m convinced with hindsight we made the right decision. It might seem like a nice opportunity to do some clever queries, but since the cleverness does not extend to actual mind-reading, it will fail and it will get in the way. For instance, in the template example in the docs, cleverness on the part of QuerySet would only result in unnecessary work, and it would be much harder to get it to do the right thing).
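The three ways of answering the membership question above can be compared in a small sketch. It uses plain `sqlite3` rather than Django so that it runs on its own; the `mymodel` table, the dictionary standing in for a model instance, and the function names are illustrative assumptions. Each function returns the answer together with the number of rows pulled back from the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, foo TEXT)")
conn.executemany(
    "INSERT INTO mymodel (id, foo) VALUES (?, ?)",
    [(i, "bar" if i % 2 == 0 else "baz") for i in range(1000)],
)

# Stand-in for an already-loaded model instance.
obj = {"id": 42, "foo": "bar"}


def in_set_by_loading_everything(obj, foo):
    # Roughly what `obj in MyModel.objects.filter(foo=bar)` amounts to:
    # fetch every matching row, then search the results in Python.
    rows = conn.execute(
        "SELECT id, foo FROM mymodel WHERE foo = ?", (foo,)
    ).fetchall()
    return any(row[0] == obj["id"] for row in rows), len(rows)


def in_set_by_exists(obj, foo):
    # Roughly what `.filter(foo=bar, pk=obj.pk).exists()` amounts to:
    # the database answers yes or no and returns a single row.
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM mymodel WHERE foo = ? AND id = ?)",
        (foo, obj["id"]),
    ).fetchone()
    return bool(row[0]), 1


def in_set_without_query(obj, foo):
    # The `obj.foo == bar` case: the data is already in memory.
    return obj["foo"] == foo, 0
```

All three give the same answer here, but the first transfers 500 rows to do so, the second transfers one, and the third does not touch the database at all.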
Don’t do queries when you’ve already got the information you need. There were multiple examples of this.
This will often mean that you will have some duplication of logic - a manager method that defines some filtering for a set of objects, and an instance level method which answers the question ‘do I belong to that set of objects’.
This duplication is often unavoidable if you want any kind of performance. Just put a comment on the two methods indicating that they need to be synced with each other, and never use one when the other is what you need.
(I can conceive of a cleverer system that would allow one of these to be automatically created from the other, but it would be limited to the subset of what can be expressed in both SQL and Python, and it doesn’t exist in Django).
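As a sketch of what this paired duplication might look like, here is a standard-library example with a set-level filter and an instance-level predicate that must be kept in sync. The `Page` class, the `VISIBLE_WHERE` rule, and the table layout are invented for illustration; in Django the set-level half would typically be a manager or QuerySet method.

```python
import sqlite3
from dataclasses import dataclass

# NOTE: keep in sync with Page.is_visible() below.
VISIBLE_WHERE = "is_public = 1 AND title != ''"


@dataclass
class Page:
    id: int
    title: str
    is_public: bool

    def is_visible(self):
        # NOTE: keep in sync with VISIBLE_WHERE above.
        # Answers "do I belong to the visible set?" without a query.
        return self.is_public and self.title != ""


def _to_pages(rows):
    return [Page(r[0], r[1], bool(r[2])) for r in rows]


def visible_pages(conn):
    # Set-level filtering, done by the database.
    return _to_pages(conn.execute(
        f"SELECT id, title, is_public FROM page WHERE {VISIBLE_WHERE} ORDER BY id"
    ).fetchall())


def all_pages(conn):
    return _to_pages(conn.execute(
        "SELECT id, title, is_public FROM page ORDER BY id"
    ).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, is_public INTEGER)")
conn.executemany(
    "INSERT INTO page (id, title, is_public) VALUES (?, ?, ?)",
    [(1, "Home", 1), (2, "", 1), (3, "Draft", 0), (4, "About", 1)],
)
```

A cheap guard against the two drifting apart is a test asserting that filtering `all_pages(conn)` with `is_visible()` gives the same result as `visible_pages(conn)`.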
When defining special values that you need to search for, ensure that you can do efficient DB filtering.
Beware this common bug - writing:
if not foo:
when what you mean is:
if foo is None:
These are completely different. If None is being used as a sentinel value, the first will also fire for other false-y values such as the empty list or empty dictionary, which is incorrect. If you mean if foo is None or if foo is not None, always write exactly that; never assume that no other false-y values will be passed in.
This is not just a performance-related bug, but it can cause a massive amount of repeated work where None is a sentinel value meaning “the work has not been done yet”, which is very common. This bug resulted in dozens of unneeded queries (including database UPDATEs being made for every request) in django-fiber.
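Here is a minimal, self-contained sketch of how this bug causes repeated work. The `Menu` class and its loader are hypothetical and are not taken from django-fiber; the loader stands in for a database query whose legitimate result happens to be empty.

```python
class Menu:
    def __init__(self, loader):
        self._loader = loader
        self._items = None  # None is the sentinel: "not loaded yet"

    def items_buggy(self):
        # BUG: an empty list is false-y too, so an empty result looks
        # like "not loaded yet" and the loader runs on every call.
        if not self._items:
            self._items = self._loader()
        return self._items

    def items_fixed(self):
        # Only the sentinel itself triggers the load.
        if self._items is None:
            self._items = self._loader()
        return self._items


def count_loads(method_name, calls=3):
    """Call the named accessor `calls` times and count loader invocations."""
    loads = 0

    def loader():
        nonlocal loads
        loads += 1
        return []  # a legitimately empty result

    menu = Menu(loader)
    for _ in range(calls):
        getattr(menu, method_name)()
    return loads
```

With an empty result, the buggy accessor runs the loader on every call, while the fixed one runs it once. If the loader were a database query made during page rendering, the buggy version would add one query per use on every request.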