Avoid Django's GenericForeignKey

In Django, a GenericForeignKey is a feature that allows a model to be related to any other model in the system, as opposed to a ForeignKey which is related to a specific one.

This post is about why GenericForeignKey is usually something you should stay away from. I haven't seen any other articles describing why that is, or what the alternatives are, so this is my attempt at “GenericForeignKey considered harmful”.

Legitimate uses

Before I get going, I think that there are some legitimate cases, where most of the problems I'll highlight below aren't an issue. In particular, the following spring to mind:

  • generic auditing, where changes to DB rows are tracked in a separate table – for this case, some of the disadvantages below are not so important, and might even be advantages (e.g. being able to refer to deleted rows),

  • generic tagging apps,

  • other generic applications where you have no real alternative, because you really don't know what models, or even how many different models, you might want to refer to.

However, I think there are many situations that don't fit the above, but where people are tempted to use GenericForeignKey:

  • You have a case where each object of a given model needs to be connected to one, and only one, of a known set of other models.

  • You are developing a generic app in which a model is designed to relate to one other model, but you don't know which model yet.

Most of this post is focused on the first of these situations, but I'll also address the second briefly. First, to make things easier to talk about, I'll introduce an example.

Example application

Our example application handles “tasks”. Tasks can be “owned” by either an individual or a group – but not both. You might be tempted to use a GenericForeignKey for this, such as below:

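from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models


class Person(models.Model):
    # max_length is required here; 100 is an arbitrary choice
    name = models.CharField(max_length=100)


class Group(models.Model):
    name = models.CharField(max_length=100)
    # for a later example; on_delete is required since Django 2.0,
    # and CASCADE is just one reasonable choice
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)


class Task(models.Model):
    description = models.CharField(max_length=200)

    # owner_id and owner_type are combined into the GenericForeignKey
    owner_id = models.PositiveIntegerField()
    owner_type = models.ForeignKey(ContentType, on_delete=models.PROTECT)

    # owner will be either a Person or a Group (or perhaps
    # another model we will add later):
    owner = GenericForeignKey('owner_type', 'owner_id')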

In this case there are just two options for owner, for simplicity, but most of what follows will apply just as well if there are more than two.

Please be clear – the pattern above is what I'm NOT recommending! And here's why:

Why it's bad

Database design

The database schema resulting from use of GenericForeignKey is not great. I've heard it said, “data matures like wine, application code matures like fish”. Your database will likely outlast the application in its current incarnation, so it would be nice if it makes sense on its own, without needing the application code to understand what it is talking about.

(If this doesn't sound very convincing, you might still want to read this section – the things explained here are important for the rest of this post).

In general, helpfully named tables and columns (which Django produces), and foreign key constraints (which Django also produces), make databases largely self-explanatory. GenericForeignKey breaks that.

For the above example, this is what your database looks like (using SQLite syntax, because that's what I'm using for the demo app for this post):

CREATE TABLE "gfks_task" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "description" varchar(200) NOT NULL,
    "owner_id" integer unsigned NOT NULL,
    "owner_type_id" integer NOT NULL REFERENCES "django_content_type" ("id")
);
CREATE INDEX "gfks_task_618598c8"
    ON "gfks_task" ("owner_type_id");

So, owner_id is just an integer – any integer – with no obvious way to work out what it refers to. owner_type_id is better – we get another table to look at. This is what it looks like:

CREATE TABLE "django_content_type" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "app_label" varchar(100) NOT NULL,
    "model" varchar(100) NOT NULL
);
CREATE UNIQUE INDEX "django_content_type_app_label_76bd3d3b_uniq"
    ON "django_content_type" ("app_label", "model");

Taking a look at the contents of this table for my demo app:

id  app_label     model
--  ------------  -----------
1   admin         logentry
2   auth          group
3   auth          user
4   auth          permission
5   contenttypes  contenttype
6   sessions      session
7   gfks          group
8   gfks          person
9   gfks          task

With some good guesses, someone in the future looking at the data might be able to guess how this works, which is as follows:

  • gfks_task.owner_type_id refers us to a row in django_content_type (this is clear from the constraints).

  • By putting together the app_label and model from this row, we can work out the table name by adding underscores e.g. if gfks_task.owner_type_id == 8, we need to look at the gfks_person table;

    (In fact this is incorrect. To do it correctly, we actually need to look at the model i.e. we need to import gfks.models.Person, and look at its ._meta.db_table attribute. This is a rather nasty little gotcha which will catch you out if the Meta.db_table attribute was set explicitly for a model, and means that we have a rather ugly dependence on being able to import our Python application in order to make sense of the database).

  • We now have a table name, in which we can look up the record whose PK matches the owner_id value.
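To make the steps above concrete, here is a minimal sketch of doing that lookup by hand, using Python's sqlite3 module and a cut-down copy of the demo tables. The resolve_owner helper and the sample rows are made up for illustration, and the table name is built with the naive app_label + "_" + model rule, which is exactly the assumption that breaks if Meta.db_table has been overridden.

```python
import sqlite3

# In-memory copy of the relevant tables, with illustrative sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE django_content_type (
        id INTEGER PRIMARY KEY, app_label TEXT NOT NULL, model TEXT NOT NULL);
    CREATE TABLE gfks_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL REFERENCES django_content_type (id));
    INSERT INTO django_content_type VALUES (7, 'gfks', 'group'), (8, 'gfks', 'person');
    INSERT INTO gfks_person VALUES (1, 'Alice');
    INSERT INTO gfks_group VALUES (1, 'Admins');
    INSERT INTO gfks_task VALUES (1, 'Write report', 1, 8), (2, 'Rotate keys', 1, 7);
""")


def resolve_owner(conn, task_id):
    """Follow a task's generic reference by hand; returns (table_name, row)."""
    # Steps 1 and 2: read the task and the content type row it points at.
    owner_id, app_label, model = conn.execute(
        """SELECT t.owner_id, ct.app_label, ct.model
           FROM gfks_task AS t
           JOIN django_content_type AS ct ON ct.id = t.owner_type_id
           WHERE t.id = ?""",
        (task_id,),
    ).fetchone()
    # ASSUMPTION: default table naming. This is wrong for any model that
    # sets Meta.db_table explicitly.
    table = f"{app_label}_{model}"
    # Step 3: the table name is a computed value, so it cannot be passed as
    # a bound parameter and has to be spliced into the SQL text.
    row = conn.execute(
        f'SELECT id, name FROM "{table}" WHERE id = ?', (owner_id,)
    ).fetchone()
    return table, row


person_result = resolve_owner(conn, 1)
group_result = resolve_owner(conn, 2)
```

Note that both tasks have owner_id 1, yet they resolve to rows in different tables. Nothing in the schema says so; only the lookup procedure does.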

There are some obvious things to comment on:

  • This is clearly much more complex than just doing a foreign key lookup to a table.

  • The above mechanism makes writing custom SQL to query this data much harder — the join condition has become very nasty because the table name itself has become a value that has to be calculated.

But worse than these is that the database schema no longer describes your data very well.

Referential integrity

We have a big problem with referential integrity – namely, you have none.

This is perhaps the biggest and most important problem. The consistency and integrity of data in a database is of first importance, and with GenericForeignKey you lose out massively compared to database foreign keys.

Because owner_id is just an integer, it can have junk in there which means it doesn't refer to any real data. This could happen if the field is manually edited, or if the row it referred to is deleted, or if various other things happen – things that your database will protect you from if you use a normal foreign key.
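The difference is easy to demonstrate outside Django. The following is a small sketch using Python's sqlite3 module, with made-up table names (person, fk_task, generic_task) standing in for the real ones. It assumes foreign key enforcement is switched on, which SQLite requires you to do per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- A task with a real foreign key to person.
    CREATE TABLE fk_task (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL REFERENCES person (id));
    -- A task in the GenericForeignKey style: owner_id is a bare integer.
    CREATE TABLE generic_task (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL);
    INSERT INTO person VALUES (1, 'Alice');
""")

# The generic-style table happily accepts a reference to a missing row.
conn.execute("INSERT INTO generic_task (owner_id, owner_type_id) VALUES (999, 8)")
dangling_rows = conn.execute("SELECT COUNT(*) FROM generic_task").fetchone()[0]

# The real foreign key rejects the same junk value.
try:
    conn.execute("INSERT INTO fk_task (owner_id) VALUES (999)")
    junk_rejected = False
except sqlite3.IntegrityError:
    junk_rejected = True

# It also refuses to delete a row that is still referenced.
conn.execute("INSERT INTO fk_task (owner_id) VALUES (1)")
try:
    conn.execute("DELETE FROM person WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

After this runs, generic_task holds a row pointing at nothing, while both attempts to corrupt fk_task were refused by the database itself.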

Performance

A major issue with GenericForeignKey is performance.

To get an object with its generic related object, we have to do multiple lookups:

  1. Get the main object (e.g. a Task above).

  2. Get the ContentType object that is pointed at by Task.owner_type (this table is usually cached by Django).

  3. From the ContentType object we can find the model and therefore the table name.

  4. Knowing the table name from part 3, and the object ID from part 1, we can get the related object.

This is a more complex and expensive process than a normal foreign key, and it also resists optimisation, especially when you are getting a batch of objects.
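For a batch of objects, the best a generic fetch can hope for is one query per distinct target table, rather than a single join. Here is a rough pure-Python sketch of that grouping step. It is an illustration of the idea only, not Django's actual implementation, and group_ids_by_type is a hypothetical helper name.

```python
from collections import defaultdict


def group_ids_by_type(references):
    """Group (owner_type_id, owner_id) pairs into {owner_type_id: sorted ids}."""
    grouped = defaultdict(set)
    for type_id, object_id in references:
        grouped[type_id].add(object_id)
    return {type_id: sorted(ids) for type_id, ids in grouped.items()}


# (owner_type_id, owner_id) pairs as read from a batch of Task rows;
# 7 and 8 are the content type ids for Group and Person in the demo table.
references = [(8, 1), (7, 1), (8, 2), (8, 1)]
batches = group_ids_by_type(references)

# Each key needs its own "SELECT ... WHERE id IN (...)" against a different
# table, so the query count grows with the number of distinct owner types.
query_count = len(batches)
```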

For a start, you cannot use select_related, because that would require knowing what table to join on. For prefetch_related there is some limited support. For example, you can do:

Task.objects.all().prefetch_related('owner')

Django tries to be smart about this case and reduces the number of queries as much as it can. However, if, for example, you wanted to do:

Task.objects.all().prefetch_related('owner__creator')

then you will get an exception, because only Group has the attribute creator, and not Person.

Django code

In addition, in my experience, usage of GFKs will generally make your Django code worse, not better. It can be tempting to think that having a single Task.owner attribute which behaves polymorphically is an attractive option, but it soon breaks down.

First, filtering using the Django ORM works badly – the ORM cannot create joins to the right table, pushing the burden of doing DB level filtering onto you.

For example, if you want to get only tasks assigned to groups, and filter them further on their own, you can't do:

Task.objects.filter(owner__creator=foo)

Instead, you have to do:

group_ct = ContentType.objects.get_for_model(Group)
groups = Group.objects.filter(creator=foo)
tasks = Task.objects.filter(owner_type=group_ct,
                            owner_id__in=groups.values_list('id'))

There are other more efficient options, but you need to be willing to get your hands dirty creating SQL joins manually.
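As an idea of what such a manual join looks like, here is a sketch against a cut-down copy of the demo schema, using Python's sqlite3 module rather than the ORM. The helper name, the sample data, and the hard-coded content type id of 7 for Group are all assumptions for illustration; in a real project you would look the id up rather than hard-code it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gfks_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        creator_id INTEGER NOT NULL REFERENCES gfks_person (id));
    CREATE TABLE gfks_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL);
    INSERT INTO gfks_person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO gfks_group VALUES (1, 'Admins', 1), (2, 'Editors', 2);
    INSERT INTO gfks_task VALUES
        (1, 'Rotate keys', 1, 7),    -- owned by group 1 (created by Alice)
        (2, 'Review drafts', 2, 7),  -- owned by group 2 (created by Bob)
        (3, 'Write report', 1, 8);   -- owned by person 1, same id as group 1
""")

# ASSUMPTION: the django_content_type id for Group, hard-coded for brevity.
GROUP_CONTENT_TYPE_ID = 7


def tasks_for_groups_created_by(conn, creator_id):
    """Tasks owned by groups that were created by the given person."""
    return conn.execute(
        """SELECT t.id, t.description
           FROM gfks_task AS t
           JOIN gfks_group AS g
             ON g.id = t.owner_id
            AND t.owner_type_id = ?   -- without this, person-owned tasks leak in
           WHERE g.creator_id = ?
           ORDER BY t.id""",
        (GROUP_CONTENT_TYPE_ID, creator_id),
    ).fetchall()


alice_group_tasks = tasks_for_groups_created_by(conn, 1)
bob_group_tasks = tasks_for_groups_created_by(conn, 2)
```

The extra owner_type_id condition is the part that is easy to forget: task 3 has the same owner_id as task 1, and only the content type check keeps it out of Alice's results.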

Second, a polymorphic object rarely works out as nicely as it sounds. In my experience, you will very often have to branch on type:

if isinstance(task.owner, Group):
    pass  # do group things
else:
    pass  # do person things

…either in Python code, or in your templates, at which point it doesn't seem so neat any more. This is especially the case when the models you are pointing to aren't under your control, so it's harder to make them all have the same interface.

As a necessary consequence of their design, GFKs are just more awkward to deal with, and this is also reflected in the level of support that they have from other Django features:

Deleting

By default, if you delete a Group or Person (the target object), for example in the admin interface, or from code, the object that refers to it won't be updated/deleted. The admin interface won't trace through GenericForeignKeys that might refer to that object. You will simply be left with corrupt data.

You can, however, add a GenericRelation to the Group and Person models, which will fix the ORM and admin to do the deleting. But note that this is not the default, and is attempting to ensure at the application level something that would be ensured at the database level for a normal foreign key.

Admin interface

For a GenericForeignKey field, the admin will show you only what you would expect for owner_id and owner_type_id: an integer field and a content type drop-down, which is not very helpful. And yes, you can change the integer value to anything, resulting in dangling rows, i.e. corrupt data. There are some third-party attempts to get a better interface, e.g. see http://stackoverflow.com/questions/13907211/genericforeignkey-and-admin-in-django

And as mentioned above, objects referred to via GFKs don't (by default) get included in the “collect and display objects for deletion” logic of the Django admin delete page.

There are various other gotchas: for example, they work badly with the admin's list filters, so you will have to write extra code to support them, and they don't work nicely with ModelForms. You will have to patch up a lot of things at the interface level yourself.

Alternatives

Having hopefully persuaded you to find another solution, let's look at some of the options available.

Alternative 1 - nullable fields on source table

This is perhaps the simplest solution. We make an owner field for each type of possible owner there is. That requires making the fields nullable, and doing application level checks to ensure we have one-and-only-one not null in practice.

It looks like this:

class Task(models.Model):
    owner_group = models.ForeignKey(Group, null=True, blank=True,
                                    on_delete=models.PROTECT)
    owner_person = models.ForeignKey(Person, null=True, blank=True,
                                     on_delete=models.PROTECT)

So we have restored proper foreign keys, and all the goodness that goes with them. You will need to do None checks when you access owner_group and owner_person, which you could wrap up like this if you wanted some of the polymorphic behaviour:

@property
def owner(self):
    if self.owner_group_id is not None:
        return self.owner_group
    if self.owner_person_id is not None:
        return self.owner_person
    raise AssertionError("Neither 'owner_group' nor 'owner_person' is set")

Similarly you'll also need to ensure that one and only one of the two fields is set when saving.

This has the disadvantage that at the schema level, unless you add a check constraint, there is the possibility of a Task pointing to both a Person and a Group, or to neither, which doesn't make sense. But this is a much smaller problem than the issues you have with GenericForeignKey.
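For reference, a check constraint of that kind can be quite short. Below is a sketch in raw SQL, run through Python's sqlite3 module, on a simplified task table. The table name, the try_insert helper and the omission of the actual foreign key clauses are all simplifications for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_group_id INTEGER,   -- REFERENCES clauses omitted for brevity
        owner_person_id INTEGER,
        -- Exactly one of the two owner columns must be set.
        CHECK ((owner_group_id IS NULL) <> (owner_person_id IS NULL))
    );
""")


def try_insert(conn, group_id, person_id):
    """Return True if the database accepts this owner combination."""
    try:
        conn.execute(
            "INSERT INTO task (description, owner_group_id, owner_person_id) "
            "VALUES ('demo', ?, ?)",
            (group_id, person_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


group_only = try_insert(conn, 1, None)
person_only = try_insert(conn, None, 2)
both_set = try_insert(conn, 1, 2)
neither_set = try_insert(conn, None, None)
```

With the constraint in place, the "both" and "neither" cases are rejected by the database, so the application-level check becomes a convenience rather than the only line of defence.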

Alternative 2 - intermediate table with nullable fields

Here, we move the nullable FKs out to a new table, where they turn into one-to-one fields, and create a non-nullable FK on the first table. It looks like this:

class Owner(models.Model):
    group = models.OneToOneField(Group, null=True, blank=True,
                                 on_delete=models.CASCADE)
    person = models.OneToOneField(Person, null=True, blank=True,
                                  on_delete=models.CASCADE)


class Task(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.PROTECT)

This has some nice advantages – we now have an Owner abstraction. If you want to use Task.owner polymorphically, you have a place to put the logic that understands how to treat Person and Group differently, without having to put it on Person or Group, which is especially useful if you don't own those models, or want the logic to be kept separate. We've also got one place that documents all the things that can be ‘owners’.

Further, if you come to need other things that use the same definition of Owner, you will have a very easy implementation – just another FK to Owner, which is much nicer than for alternative 1.

It still has the disadvantages of nullable fields, but having a dedicated Owner model to deal with that issue feels much cleaner.

It also has a few other disadvantages compared to the previous solution:

  • We have an extra table, increasing the number of joins required to get everything if we need it all at once.

  • We will need to ensure that an Owner record exists for each group/person that you want to link to. This could mean creating one when we create a group/person, or later. Also, setting the Task.owner field correctly is going to take more work than in alternative 1 – this affects both code and things like default admin interface.
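To give a feel for the extra join mentioned in the first point, here is a sketch of the SQL needed to list tasks with their owner's name under this layout, run through Python's sqlite3 module. The app_ table names and the sample data are hypothetical, chosen only to mirror the models above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE app_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- One-to-one links are nullable foreign keys with a UNIQUE constraint.
    CREATE TABLE app_owner (
        id INTEGER PRIMARY KEY,
        group_id INTEGER UNIQUE REFERENCES app_group (id),
        person_id INTEGER UNIQUE REFERENCES app_person (id));
    CREATE TABLE app_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL REFERENCES app_owner (id));
    INSERT INTO app_person VALUES (1, 'Alice');
    INSERT INTO app_group VALUES (1, 'Admins');
    INSERT INTO app_owner VALUES (10, NULL, 1), (11, 1, NULL);
    INSERT INTO app_task VALUES (1, 'Write report', 10), (2, 'Rotate keys', 11);
""")

# One hop to app_owner, then a LEFT JOIN to each possible destination table.
tasks_with_owners = conn.execute(
    """SELECT t.description,
              CASE WHEN o.person_id IS NOT NULL THEN 'person' ELSE 'group' END,
              COALESCE(p.name, g.name)
       FROM app_task AS t
       JOIN app_owner AS o ON o.id = t.owner_id
       LEFT JOIN app_person AS p ON p.id = o.person_id
       LEFT JOIN app_group AS g ON g.id = o.group_id
       ORDER BY t.id"""
).fetchall()
```

That is three joins instead of the two that alternative 1 would need, but every join condition is an ordinary foreign key, and the kind of owner can be read straight off the app_owner row.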

Alternative 3 - intermediate table with OneToOneFields on destination models

This starts with alternative 2, but moves the OneToOneFields to the other table, i.e. to the destination models. By doing so, they no longer need to be nullable.

class Owner(models.Model):
    pass


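class Person(models.Model):
    # max_length is required here; 100 is an arbitrary choice
    name = models.CharField(max_length=100)
    owner = models.OneToOneField(Owner, on_delete=models.CASCADE)


class Group(models.Model):
    name = models.CharField(max_length=100)
    owner = models.OneToOneField(Owner, on_delete=models.CASCADE)
    # on_delete is required since Django 2.0; CASCADE is one reasonable choice
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)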


class Task(models.Model):
    description = models.CharField(max_length=200)
    owner = models.ForeignKey(Owner, on_delete=models.PROTECT)

Some notes, compared to alternative 2:

  • We no longer have any NULL foreign keys to worry about.

  • However, we are required to create rows in Owner when creating Person or Group objects. In addition, those rows might never be used, e.g. a group might never be used as an Owner.

  • This pattern requires modifying Person and Group.

  • For some access patterns this requires more queries (e.g. if you start with a Task and want to know which type of Owner you have, this will require more queries than alternative 2).
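The last point can be seen in a small sketch. With the links stored on the destination tables, the owner row itself says nothing about what kind of thing it is, so starting from a task you have to probe each destination table in turn. This uses Python's sqlite3 module with hypothetical app_ table names and a made-up find_owner helper; a single query with LEFT JOINs would be another way to do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_owner (id INTEGER PRIMARY KEY);
    CREATE TABLE app_person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        owner_id INTEGER NOT NULL UNIQUE REFERENCES app_owner (id));
    CREATE TABLE app_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        owner_id INTEGER NOT NULL UNIQUE REFERENCES app_owner (id));
    CREATE TABLE app_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL REFERENCES app_owner (id));
    INSERT INTO app_owner VALUES (10), (11);
    INSERT INTO app_person VALUES (1, 'Alice', 10);
    INSERT INTO app_group VALUES (1, 'Admins', 11);
    INSERT INTO app_task VALUES (1, 'Write report', 10), (2, 'Rotate keys', 11);
""")

# A fixed, known list of destination tables (not user input).
DESTINATION_TABLES = ("app_person", "app_group")


def find_owner(conn, task_id):
    """Return (table_name, owner_name) for a task, probing each table."""
    (owner_id,) = conn.execute(
        "SELECT owner_id FROM app_task WHERE id = ?", (task_id,)
    ).fetchone()
    for table in DESTINATION_TABLES:
        row = conn.execute(
            f"SELECT name FROM {table} WHERE owner_id = ?", (owner_id,)
        ).fetchone()
        if row is not None:
            return table, row[0]
    return None  # an Owner row that nothing links to


person_owner = find_owner(conn, 1)
group_owner = find_owner(conn, 2)
```

For the group-owned task this takes three queries: one for the task, one miss on app_person, and one hit on app_group.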

Alternative 4 - multi-table inheritance

If you are aware of Django's multi-table inheritance, you might recognise that alternative 3 above can be created in Django with less code. Instead of explicit OneToOneFields to Owner, we can make Person and Group inherit from Owner.

This will actually create a database schema very similar to the one above: Django adds the OneToOneField links for you. Apart from column name differences, the one additional schema difference is that the owner column will also be used as a primary key (which could also be done manually for alternative 3 if you wanted, although I wouldn't recommend it).

At the code level, it is very similar to alternative 3 as well, and in fact simplifies some things significantly e.g. you don't need to manually create the Owner objects. In addition, you now get polymorphism for free (ish) – since Person is-a Owner, it inherits its behaviour.

Personally I avoid using multi-table inheritance. One reason is that I worry about the complexity of the inheritance mechanism Django uses. Secondly, there are performance concerns – having the OneToOneFields explicit makes it easier for me to be aware of joins and performance issues. Thirdly, inheriting from more than one concrete model is awkward in Django (for example, the parents cannot share the default id primary key field), so in practice you can only use it once. In other words, you are taking one “is-a” or “has-a” relationship (a Group is-a Owner and a Person is-a Owner) and giving it special status and implementation (concrete model inheritance), while all other similar relationships have to be dealt with using other mechanisms. By contrast, alternatives 2 and 3 can be used as many times as you want. My experience with OOP, real world business objects, and the ever constant reality of ever changing requirements, is that you are better off ‘demoting’ all the relationships and implementing them all using composition rather than inheritance.

For completeness, however, I've added this alternative, with the code outlined below:

class Owner(models.Model):
    pass


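class Person(Owner):
    # max_length is required here; 100 is an arbitrary choice
    name = models.CharField(max_length=100)


class Group(Owner):
    name = models.CharField(max_length=100)
    # on_delete is required since Django 2.0; CASCADE is one reasonable choice
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)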


class Task(models.Model):
    description = models.CharField(max_length=200)
    owner = models.ForeignKey(Owner, on_delete=models.PROTECT)

Note that this is concrete model inheritance – you can't use abstract = True for the Owner table (thanks Airith).

Alternative 5 - multiple linked models

This solution will apply if you don't actually need the 'linked' model (Task in our example) to be a single model/table. For some use cases, it might be perfectly acceptable (or even desirable) to have Person with a related PersonTask model and Group with a related GroupTask model.

Now, there are a few problems that might come up if your models and tables are now completely distinct with no joining table.

First, there may be some instances where you need to show a list that combines instances from the different models, possibly including sorting, filtering and paging. That might seem to require you to have a single table. However, SQL has UNION queries, and Django has support for them via QuerySet.union. Further, Simon Willison has a nice article showing how you can use this to get lists of objects from different tables, while being able to do sorting in the database, with a relatively small performance hit compared to having them in one table.
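To show the underlying idea without any Django setup, here is a sketch of such a combined, sorted and paged list in raw SQL, run through Python's sqlite3 module. The table names, the sample rows and the combined_page helper are made up for illustration; QuerySet.union is the ORM route to a similar query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_persontask (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL);
    CREATE TABLE app_grouptask (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL);
    INSERT INTO app_persontask VALUES (1, 'Write report', 1), (2, 'Book travel', 1);
    INSERT INTO app_grouptask VALUES (1, 'Archive logs', 1), (2, 'Rotate keys', 1);
""")


def combined_page(conn, limit, offset=0):
    """One page of tasks from both tables, sorted in the database."""
    return conn.execute(
        # The literal 'kind' column records which table each row came from,
        # since ids alone are not unique across the two tables.
        """SELECT 'person' AS kind, id, description FROM app_persontask
           UNION ALL
           SELECT 'group' AS kind, id, description FROM app_grouptask
           ORDER BY description
           LIMIT ? OFFSET ?""",
        (limit, offset),
    ).fetchall()


first_page = combined_page(conn, limit=3)
second_page = combined_page(conn, limit=3, offset=3)
```

The sorting and paging happen in the database across both tables, and the kind column tells the calling code which model each row belongs to.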

Secondly, you might have a lot of duplicated functionality between PersonTask and GroupTask. In Django this is easy to deal with. For a start, simply pull out the common stuff into an abstract Task model:

# Person and Group as in our initial code

class Task(models.Model):
    description = models.CharField(max_length=200)

    class Meta:
        abstract = True


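class PersonTask(Task):
    # on_delete is required since Django 2.0; CASCADE is one reasonable choice
    owner = models.ForeignKey(Person, related_name="tasks",
                              on_delete=models.CASCADE)


class GroupTask(Task):
    owner = models.ForeignKey(Group, related_name="tasks",
                              on_delete=models.CASCADE)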

Now you can put any common fields and functionality into Task. On a schema level your two types of Task are now completely separate; the inheritance exists only at the Python level.

You may have other code (utilities, views, templates etc.) that needs to manipulate both PersonTask and GroupTask instances. Due to duck typing, in Python this shouldn't be any problem, if those routines are truly generic, and only use the things that are true for all Task instances. You can always do isinstance checks to see what kind you have, if necessary.

Remember also that Python has first class classes, so you can define functions that take classes as arguments, where the class might be the model. For example:

def get_happy_tasks(model):
    return model.objects.filter(description__contains="☺")

happy_person_tasks = get_happy_tasks(PersonTask)

Similar patterns can be used to reduce a lot of the duplication that you might otherwise be scared about given you have more models with this technique.

You could further enhance this pattern by making Person and Group subclasses of an abstract Owner model. You then have a reference point for any generic code that needs to handle the owner field of both PersonTask and GroupTask instances – it simply needs to be careful to only use the things defined on Owner.

Swappable models

Finally, there is the case of needing to link to a single but unknown model (for example in a generic third party app) for which a GenericForeignKey is a tempting solution.

For this case, there are two approaches I know of:

  1. Make your model abstract, and require users to inherit from it, adding the ForeignKey field themselves. This can be a helpful pattern for other reasons, but can also get a bit unwieldy in some cases.

  2. Use swappable models. Django actually has support for this, but at the time of writing it is officially for internal use only (i.e. for swapping out the django.contrib.auth.models.User model). However, Swapper is an unofficial attempt to create a public API for it, which looks to be well maintained. This looks like a better option than a GFK to me.

Example code

For all the above examples, I've created a repo: https://github.com/spookylukey/djangoadmintips/tree/master/generic_foreign_key_tests

Notes:

  • All the examples are different apps within the same project.

  • It is bare bones – just for purposes of illustration. Not all things mentioned above are implemented.

  • In each case, the admin changelist for Task illustrates the typical N+1 (or worse) situation. In each case I've implemented ModelAdmin.get_queryset and used select_related and prefetch_related as well as possible. Using the Django debug toolbar you can see how successful that is – for the GFK case, not very.

  • You will also notice that the admin interfaces vary between the different alternatives. There will be ways to make all of them better, but they illustrate what you will get without much work.

Corrections or additions

If there are other strategies or corrections, please let me know – I intend to keep this page up to date as a reference.

Updates

2018-10-19 - Added Alternative 5
