A Django PAGNI: efficient bulk properties

Adding to my “Probably Are Gonna Need It” list (started by my YAGNI exceptions post a few months back, with follow ups by Simon Willison and Jacob Kaplan-Moss), this post is about a pattern that often crops up in Django applications. It probably applies to many other database-driven applications too, but I’m more confident about saying it is a PAGNI in Django – namely, a situation where you are usually better off taking the risk of doing the extra work up front.

To state it briefly:

If you have a calculated property that relates to a Django model and requires a database query (or other expensive work), consider making it efficient in bulk even if you don't need it in bulk right now.

Example – initial requirement

Suppose we are writing an internal task management app for our team. A requirement comes along: the user’s dashboard page should have some text that indicates how many “in progress” tasks they have.

We already have a Task model associated with our User model:

class User(AbstractBaseUser):
    pass

class Task(models.Model):
    assigned_to = models.ForeignKey(
        'myproject.User', on_delete=models.CASCADE, related_name='assigned_tasks'
    )
    state = models.CharField(max_length=20, choices=[
        ("ready_for_work", "Ready for work"),
        ("in_progress", "In progress"),
        ("done", "Done"),
    ])

We might already have a custom QuerySet with an in_progress method that does the appropriate filtering:

class TaskQuerySet(models.QuerySet):
    def in_progress(self):
        return self.filter(state="in_progress")  # or perhaps something more complex

We could then do the calculation with just the following code:

Task.objects.in_progress().filter(assigned_to=user).count()

# or, from a ``User`` instance it might be:

user.assigned_tasks.in_progress().count()

At a SQL level, this is doing a simple SELECT COUNT() with some filtering e.g.:

SELECT COUNT(*) FROM myproject_task WHERE assigned_to_id = 123 AND state = 'in_progress';

We should probably wrap that up in a method or property, which we could easily add to our User model like this:

from django.utils.functional import cached_property

class User(AbstractBaseUser):
    ...

    @cached_property
    def in_progress_tasks_count(self):
        return self.assigned_tasks.in_progress().count()

You can now use this property in a template very easily:

<p>Tasks in progress: {{ request.user.in_progress_tasks_count }}</p>

You may or may not like putting properties on the user model like that, but the code is simple and gets the job done, and seems to be fine. An alternative structure would put this code in a utility function in the model layer somewhere, which would require a bit more work to make the data available in our template, but either way it doesn’t affect how this example will unfold.

Example – new requirement

Some time later, a request comes up like this:

You know that "in progress tasks count" on the user dashboard? Can we add that as a column to the admin screen that shows the list of users?

This sounds very simple – they just want a piece of information we already know how to calculate to appear in another place. What could be easier?

If you are using the Django admin for the admin screen, and you coded the first part as above, the solution could indeed be very simple to execute – as simple as adding "in_progress_tasks_count" to the list_display property – a five minute job maximum.

It will work fine. But you've hit the dreaded N+1 queries problem.

For each user instance displayed in the list, we will end up executing a separate SQL query to get the task count. This is completely unnecessary – there are multiple ways to do this much more efficiently in SQL:

  • Using a SQL COUNT with a sub-query added to the main user query.

  • Using a SQL COUNT FILTER and a join added to the main user query.

  • Using a second query like this:

    SELECT assigned_to_id, COUNT(*) FROM myproject_task
    WHERE assigned_to_id IN (1, 2, 3) AND state = 'in_progress'
    GROUP BY assigned_to_id;
    

    where our list of user ID values comes from the first query we executed.
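As a runnable illustration of that third option, here is the grouped query against an in-memory SQLite database. The table layout and names are a simplified, hypothetical stand-in for the schema Django would generate:

```python
import sqlite3

# Simplified, hypothetical stand-in for the Django-generated task table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (assigned_to_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO task VALUES (?, ?)",
    [(1, "in_progress"), (1, "in_progress"), (2, "done"), (3, "in_progress")],
)

# One grouped query answers the question for every user at once.
rows = conn.execute(
    "SELECT assigned_to_id, COUNT(*) FROM task"
    " WHERE assigned_to_id IN (1, 2, 3) AND state = 'in_progress'"
    " GROUP BY assigned_to_id"
).fetchall()

counts = dict(rows)  # users with no in-progress tasks simply get no entry
```

Users missing from `counts` have zero in-progress tasks, which is why the decoration step later uses a default of 0.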

Instead of those, we’ve ended up with a very slow method that is going to hurt us quite quickly in terms of performance. We might not notice the problem on our development machines, but it will quickly add up in production.

Even if we do notice it, we’re going to have a problem doing it the right way at this point:

  • The weight of the existing code pushes us in the wrong direction. The easy thing is slow.

    Also remember that by the time we go to do this, in addition to in_progress_tasks_count, we might also have completed_tasks_count, deferred_tasks_count etc.

  • The low estimate we probably made for this new requirement, either explicitly or just internally in the time we’ve allowed for it, pushes us to find a quick solution.

  • “One way to do it”, and “Once And Only Once” push against us implementing a second way to do the same calculation.

So the result will be:

  • either the wrong way, which will be slow and contribute further to patterns that will make us even slower in the future,

  • or, doing rework which will be an unexpected and unwelcome cost at this point.

Implementation tips

So, if we want to do this in a bulk-efficient way, what are our options?

  1. We can load our User objects in a query with an annotation that does the calculation in the database, as part of the main query.

    For the case above, we could use a custom User QuerySet method something like:

    class UserQuerySet(models.QuerySet):
        def with_in_progress_tasks_count(self):
            return self.annotate(in_progress_tasks_count=models.Count(
                'assigned_tasks',
                filter=models.Q(assigned_tasks__state='in_progress')
            ))
    

    This produces pretty nice SQL, although one downside of this code is that we have duplicated some logic from our TaskQuerySet.in_progress filter.

    You then need to load your user objects with User.objects.with_in_progress_tasks_count(), which would require overriding ModelAdmin.get_queryset if you are using the Django admin for example.

    To keep everything happy, it is often useful to still define a property on the model like this:

    class User(AbstractBaseUser):
    
        @cached_property
        def in_progress_tasks_count(self) -> int:
            raise AssertionError("Use User.objects.with_in_progress_tasks_count() if you want to use this")
    
  2. We can have a separate query which does the count, and then decorates the list of user objects with the value of in_progress_tasks_count.

    One of the advantages of this method is that the queries can be easier to write, whether you are using the ORM or dropping down to raw SQL, and they don’t complicate the main query at all, which can be a big bonus. Also, sometimes it can be easier to re-use existing custom QuerySet methods this way.

    One of the disadvantages is that it can be difficult to insert this extra bit of work at the right point, especially if you are in the context of framework code (like the Django admin or DRF), where the full QuerySet is built up and evaluated outside of your control.

    For this situation, in a number of projects I’ve started using a mechanism for adding callbacks that run immediately after a QuerySet is evaluated.

    Usage in this case would look like this:

    class UserQuerySet(AfterFetchQuerySetMixin, models.QuerySet):
        def with_in_progress_tasks_count(self):
            def add_in_progress_tasks_count(user_list):
                user_ids = [u.id for u in user_list]
                counts = (
                    Task.objects
                    .in_progress().order_by()
                    .filter(
                        assigned_to__id__in=user_ids
                    )
                    .values('assigned_to_id').annotate(count=models.Count('id'))
                    .values_list('assigned_to_id', 'count')
                )
                counts_dict = {user_id: c for user_id, c in counts}
                # Decorate user list
                for user in user_list:
                    user.in_progress_tasks_count = counts_dict.get(user.id, 0)
            return self.register_after_fetch_callback(add_in_progress_tasks_count)
    

    This pattern has come up often enough for me that I’m wondering whether something like this should be included in Django itself. AfterFetchQuerySetMixin has to depend on some internals to work.
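The mixin itself is not shown here, but the core idea can be sketched in plain Python using a fake stand-in for a QuerySet. All names below are hypothetical; a real mixin has to hook into Django's private `_fetch_all`/`_result_cache` machinery:

```python
# Pure-Python sketch of the after-fetch-callback idea. FakeQuerySet is a
# hypothetical stand-in; a real mixin must hook Django's private internals.
class FakeQuerySet:
    def __init__(self, rows):
        self._rows = rows
        self._result_cache = None
        self._callbacks = []

    def register_after_fetch_callback(self, callback):
        self._callbacks.append(callback)
        return self

    def _fetch_all(self):
        # Evaluate once, then run the callbacks exactly once on the results.
        if self._result_cache is None:
            self._result_cache = list(self._rows)
            for callback in self._callbacks:
                callback(self._result_cache)

    def __iter__(self):
        self._fetch_all()
        return iter(self._result_cache)


class FakeUser:
    def __init__(self, id):
        self.id = id


def add_in_progress_tasks_count(user_list):
    counts = {1: 4, 3: 2}  # stand-in for the grouped COUNT query results
    for user in user_list:
        user.in_progress_tasks_count = counts.get(user.id, 0)


qs = FakeQuerySet([FakeUser(1), FakeUser(2), FakeUser(3)])
qs.register_after_fetch_callback(add_in_progress_tasks_count)
users = list(qs)  # evaluation triggers the callback
```

The key property is that the callback runs whenever and wherever the QuerySet happens to be evaluated, which is exactly what you need when framework code (the admin, DRF) controls evaluation.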

For this kind of work, in general you might need to get good at using Django’s aggregation features, which are not the easiest in my opinion. Haki Benita’s guide to Group By in Django is invaluable!

What about bulk write operations?

Above we’ve addressed bulk reads, but what about doing bulk writes in a similarly efficient manner? In my experience, this depends much more on the task in hand. In many cases, going from operating on a single row to needing operations to be efficient in bulk is less common. Often, even if we do need bulk operations, we can afford to do it more slowly in a background task. Your mileage may vary etc.

Discussion – why YAGNI fails

YAGNI is based on a few main observations, which I think are normally true:

  • The time to develop a feature later is (approximately) the same as the time to develop it earlier.

  • Life is full of surprises, and you might not need the feature later even when you suspect you will.

  • Even if you can correctly guess what features will be needed eventually, there is always an opportunity cost of delivering something before you need to (“cost of delay” as Martin Fowler describes it). There is almost certainly an endless list of features/improvements you do need, so if you have implemented a feature that is not used (yet), then that’s a planning mistake that has cost you in terms of features needed more urgently.

  • In addition to the cost of delay, Martin Fowler also points out the cost of carry – the complexity burden you carry for having added something unneeded.

Something counts as a YAGNI exception not if you just correctly predict the future, but if implementing it before you need to ends up with lower costs overall.

I’m claiming these arguments fail in this case, but why?

First, there is always some cost associated with re-work, and for a relatively small feature like this the overheads are significant. In particular, implementing the bulk-inefficient way then re-working and implementing the bulk-efficient way is always going to take longer than just implementing the bulk-efficient way, especially once you’ve added the desire to not repeat logic.

In this case, it’s true that you may not need the efficiency later on, but often you do, and the bulk-efficient way also works fine for non-bulk usage, without introducing that much complexity, and without a large opportunity cost because it doesn’t take that long, especially if you set up the patterns at the beginning.

In addition, the attitude of “I’ll just re-design when I need it” fails to take into account some powerful forces:

  • The disproportionate effect that existing code structure has on code that follows it. Overwhelmingly, coders will try to make new code fit the existing pattern, which is not a bad instinct, but isn’t always the right thing.

    Every bit of code you write is actually setting a fairly powerful precedent that requires significant work to overcome.

    This actually means that the “cost of carry” argument may go the other way – if you establish a pattern of bulk-efficient operations, it makes it easier to implement every other (unrelated) feature that might also need bulk-efficient operations, whether from the beginning or later.

  • The pressures of time-management and estimates you’ve already made. It can be very hard to revisit an estimate of 30 minutes for a simple task and say “actually it’s going to take 2 days”.

Up-front thinking about performance like this is not premature optimization — these are not small nanosecond or microsecond differences, but milliseconds that often add up to seconds, and the patterns you choose at the beginning will have big consequences for your performance down the line.
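To put rough numbers on that (the figures below are assumptions for illustration, not measurements):

```python
# Back-of-envelope arithmetic with assumed figures: even a fast per-row
# COUNT query adds up once it runs for every row on a page.
rows_per_page = 100
per_query_ms = 5  # assumed latency of one COUNT query

n_plus_1_ms = (1 + rows_per_page) * per_query_ms  # main query + one per row
bulk_ms = 2 * per_query_ms                        # main query + one grouped query
```

With these assumptions the N+1 version costs around half a second per page load, against about 10 ms for the bulk version, and the gap widens with the page size.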
