lukeplant.me.uk

Avoid Django’s GenericForeignKey

Posted in: Django, Python

In Django, a GenericForeignKey is a feature that allows a model to be related to any other model in the system, as opposed to a ForeignKey which is related to a specific one.

This post is about why GenericForeignKey is usually something you should stay away from. I haven’t seen any other articles describing why that is, or what the alternatives are, so this is my attempt at “GenericForeignKey considered harmful”.

Before I get going, I think that there are some legitimate cases, where most of the problems I’ll highlight below aren’t an issue. In particular, the following spring to mind:

  • generic auditing, where changes to DB rows are tracked in a separate table; for this case, some of the disadvantages below are not so important, and might even be advantages (e.g. being able to refer to deleted rows),
  • generic tagging apps,
  • other generic applications where you have no real alternative, because you really don’t know what models, or even how many different models, you might want to refer to.

However, I think there are many situations that don’t fit the above, but where people are tempted to use GenericForeignKey:

  • You have a case where each object of a given model needs to be connected to one, and only one, of a known set of other models.
  • You are developing a generic app in which a model is designed to relate to one other model, but you don’t know which model yet.

Most of this post is focussed on the first of these situations, but I’ll also address the second briefly. First, to make things easier to talk about, I’ll introduce an example application.

Our example application handles “tasks”. Tasks can be “owned” by either an individual or a group — but not both. You might be tempted to use a GenericForeignKey for this, such as below:

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models


class Person(models.Model):
    name = models.CharField(max_length=100)


class Group(models.Model):
    name = models.CharField(max_length=100)
    # creator is here for a later example
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)


class Task(models.Model):
    description = models.CharField(max_length=200)

    # owner_id and owner_type are combined into the GenericForeignKey
    owner_id = models.PositiveIntegerField()
    owner_type = models.ForeignKey(ContentType, on_delete=models.PROTECT)

    # owner will be either a Person or a Group (or perhaps
    # another model we will add later):
    owner = GenericForeignKey('owner_type', 'owner_id')

In this case there are just two options for owner, for simplicity, but most of what follows will apply just as well if there are more than two.

Please be clear — the pattern above is what I’m NOT recommending! And here’s why:

Why it’s bad

Database design

The database schema resulting from use of GenericForeignKey is not great. I’ve heard it said, “data matures like wine, application code matures like fish”. Your database will likely outlast the application in its current incarnation, so it would be nice if it makes sense on its own, without needing the application code to understand what it is talking about.

(If this doesn’t sound very convincing, you might still want to read this section — the things explained here are important for the rest of this post).

In general, helpfully named tables and columns (which Django produces), and foreign key constraints (which Django also produces), make databases largely self-explanatory. GenericForeignKey breaks that.

For the above example, this is what your database looks like (using SQLite syntax, because that’s what I’m using for the demo app for this post):

CREATE TABLE "gfks_task" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "description" varchar(200) NOT NULL,
    "owner_id" integer unsigned NOT NULL,
    "owner_type_id" integer NOT NULL REFERENCES "django_content_type" ("id")
);
CREATE INDEX "gfks_task_618598c8"
    ON "gfks_task" ("owner_type_id");

So, owner_id is just an integer — any integer — with no obvious way to work out what it refers to. owner_type_id is better — we get another table to look at. This is what it looks like:

CREATE TABLE "django_content_type" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "app_label" varchar(100) NOT NULL,
    "model" varchar(100) NOT NULL
);
CREATE UNIQUE INDEX "django_content_type_app_label_76bd3d3b_uniq"
    ON "django_content_type" ("app_label", "model");

Taking a look at the contents of this table for my demo app:

id  app_label     model
1   admin         logentry
2   auth          group
3   auth          user
4   auth          permission
5   contenttypes  contenttype
6   sessions      session
7   gfks          group
8   gfks          person
9   gfks          task

With some good guesses, someone in the future looking at the data might be able to guess how this works, which is as follows:

  • gfks_task.owner_type_id refers us to a row in django_content_type (this is clear from the constraints).

  • By putting together the app_label and model from this row, we can work out the table name by adding underscores e.g. if gfks_task.owner_type_id == 8, we need to look at the gfks_person table;

    (In fact this is incorrect. To do it correctly, we actually need to look at the model i.e. we need to import gfks.models.Person, and look at its ._meta.db_table attribute. This is a rather nasty little gotcha which will catch you out if the Meta.db_table attribute was set explicitly for a model, and means that we have a rather ugly dependence on being able to import our Python application in order to make sense of the database).

  • We now have a table name, in which we can look up the record whose PK matches the owner_id value.
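
The steps above can be sketched outside Django using Python's sqlite3 module. This is only an illustration under some assumptions: the tables are cut-down versions of the demo schema, the content type ids 7 and 8 are taken from the listing above, resolve_owner is a hypothetical helper of my own, and the table name is produced by the naive "add an underscore" guess, which is wrong for models that set Meta.db_table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE django_content_type (
        id INTEGER PRIMARY KEY, app_label TEXT NOT NULL, model TEXT NOT NULL);
    CREATE TABLE gfks_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL REFERENCES django_content_type (id));
    INSERT INTO django_content_type VALUES (7, 'gfks', 'group'), (8, 'gfks', 'person');
    INSERT INTO gfks_person VALUES (1, 'Alice');
    INSERT INTO gfks_group VALUES (1, 'Admins');
    INSERT INTO gfks_task VALUES (1, 'Write report', 1, 8), (2, 'Review report', 1, 7);
""")


def resolve_owner(conn, task_id):
    """Follow a generic foreign key by hand; returns (table_name, owner_row)."""
    # Read the two halves of the generic key from the task row.
    owner_id, owner_type_id = conn.execute(
        "SELECT owner_id, owner_type_id FROM gfks_task WHERE id = ?", (task_id,)
    ).fetchone()
    # Follow the one real foreign key to the content type row.
    app_label, model = conn.execute(
        "SELECT app_label, model FROM django_content_type WHERE id = ?",
        (owner_type_id,),
    ).fetchone()
    # Naive guess at the table name (breaks if Meta.db_table was set).
    table = f"{app_label}_{model}"
    # The table name is data, so it cannot be a bound parameter and has to
    # be spliced into the SQL text.
    row = conn.execute(
        f'SELECT id, name FROM "{table}" WHERE id = ?', (owner_id,)
    ).fetchone()
    return table, row


print(resolve_owner(conn, 1))
print(resolve_owner(conn, 2))
```

Note that both tasks have owner_id 1, and only the content type tells you which table that 1 belongs to.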

There are some obvious things to comment on:

  • This is clearly much more complex than just doing a foreign key lookup to a table.
  • The above mechanism makes writing custom SQL to query this data much harder — the join condition has become very nasty because the table name itself has become a value that has to be calculated.

But the main issue at this level is that the database schema no longer describes your data very well.

Referential integrity

Even more important than the above is the problem of referential integrity: namely, you have none.

This is perhaps the biggest and most important problem. The consistency and integrity of data in a database is of first importance, and with GenericForeignKey you lose out massively compared to database foreign keys.

Because owner_id is just an integer, it can have junk in there which means it doesn’t refer to any real data. This could happen if the field is manually edited, or if the row it referred to is deleted, or if various other things happen — things that your database will protect you from if you use a normal foreign key.
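
To make the difference concrete, here is a small sqlite3 sketch comparing a real foreign key column with a generic (type id, object id) pair. The table names and the attempt helper are made up for this illustration, and it assumes an SQLite build with foreign key support, where enforcement has to be switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE demo_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- A task table using a real foreign key ...
    CREATE TABLE fk_task (
        id INTEGER PRIMARY KEY,
        owner_person_id INTEGER NOT NULL REFERENCES demo_person (id));
    -- ... and one using the generic (type id, object id) pair.
    CREATE TABLE generic_task (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL);
    INSERT INTO demo_person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO fk_task VALUES (1, 1);          -- owned by Alice
    INSERT INTO generic_task VALUES (1, 2, 8);  -- "owned by" Bob
""")


def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = {
    "real FK, junk id": attempt("INSERT INTO fk_task VALUES (2, 999)"),
    "generic, junk id": attempt("INSERT INTO generic_task VALUES (2, 999, 8)"),
    "real FK, delete target": attempt("DELETE FROM demo_person WHERE id = 1"),
    "generic, delete target": attempt("DELETE FROM demo_person WHERE id = 2"),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

The database refuses both operations that would break the real foreign key, and accepts both that leave the generic pair pointing at nothing.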

Performance

A major issue with GenericForeignKey is performance.

To get an object with its generic related object, we have to do multiple lookups:

  1. Get the main object (e.g. a Task above).
  2. Get the ContentType object that is pointed at by Task.owner_type (this table is usually cached by Django).
  3. From the ContentType object we can find the model and therefore the table name.
  4. Knowing the table name from part 3, and the object ID from part 1, we can get the related object.

This is a more complex and expensive process than a normal foreign key, and it also resists optimisation, especially when you are getting a batch of objects.

For a start, you cannot use select_related, because that would require knowing what table to join on. For prefetch_related there is some limited support. For example, you can do:

Task.objects.all().prefetch_related('owner')

Django tries to be smart about this case and reduces the number of queries as much as it can. However, if, for example, you wanted to do:

Task.objects.all().prefetch_related('owner__creator')

then you will get an exception, because only Group has the attribute creator, and not Person.
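
To give a rough idea of what a batched lookup for the first case involves, here is a sqlite3 sketch that groups the wanted ids by target table and issues one query per table. This is my own illustration of the general strategy, not Django's implementation; TABLE_FOR_TYPE, run and tasks_with_owner_names are hypothetical names, and the content type ids are the ones from the demo listing earlier.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gfks_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_task (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL);
    INSERT INTO gfks_person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO gfks_group VALUES (1, 'Admins');
    INSERT INTO gfks_task VALUES (1, 1, 8), (2, 2, 8), (3, 1, 7), (4, 1, 8);
""")

# Stand-in for the content type -> table lookup.
TABLE_FOR_TYPE = {7: "gfks_group", 8: "gfks_person"}
query_log = []


def run(sql, params=()):
    """Execute a query, remembering it so the queries can be counted."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


def tasks_with_owner_names():
    tasks = run("SELECT id, owner_id, owner_type_id FROM gfks_task ORDER BY id")
    # Group the wanted ids by target table ...
    ids_by_type = defaultdict(set)
    for _task_id, owner_id, type_id in tasks:
        ids_by_type[type_id].add(owner_id)
    # ... then issue one query per target table, since no single join works.
    names = {}
    for type_id, ids in ids_by_type.items():
        placeholders = ", ".join("?" for _ in ids)
        table = TABLE_FOR_TYPE[type_id]
        rows = run(
            f'SELECT id, name FROM "{table}" WHERE id IN ({placeholders})',
            sorted(ids),
        )
        for owner_id, name in rows:
            names[type_id, owner_id] = name
    return [
        (task_id, names.get((type_id, owner_id)))
        for task_id, owner_id, type_id in tasks
    ]


result = tasks_with_owner_names()
print(result)
print(len(query_log), "queries")
```

The number of queries grows with the number of distinct target tables. A further hop such as creator only makes sense for one of those tables, which is why the nested lookup above does not fit this scheme.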

Django code

In addition, in my experience, usage of GFKs will generally make your Django code worse, not better. It can be tempting to think that having a single Task.owner attribute which behaves polymorphically is an attractive option, but it soon breaks down.

First, filtering using the Django ORM works badly — the ORM cannot create joins to the right table, pushing the burden of doing DB level filtering onto you.

For example, if you want to get only tasks owned by groups, and filter them further by an attribute of the group, you can’t do:

Task.objects.filter(owner__creator=foo)

Instead, you have to do:

group_ct = ContentType.objects.get_for_model(Group)
groups = Group.objects.filter(creator=foo)
tasks = Task.objects.filter(owner_type=group_ct,
                            owner_id__in=groups.values_list('id'))

There are other more efficient options, but you need to be willing to get your hands dirty creating SQL joins manually.
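
As an illustration of what such a manual join might look like, here is a sqlite3 sketch. The tables are simplified versions of the demo schema, GROUP_TYPE_ID stands in for the content type id of Group (7 in the listing earlier), and tasks_for_groups_created_by is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gfks_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gfks_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        creator_id INTEGER NOT NULL REFERENCES gfks_person (id));
    CREATE TABLE gfks_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL,
        owner_type_id INTEGER NOT NULL);
    INSERT INTO gfks_person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO gfks_group VALUES (1, 'Admins', 1), (2, 'Editors', 2);
    INSERT INTO gfks_task VALUES
        (1, 'Owned by group 1', 1, 7),
        (2, 'Owned by group 2', 2, 7),
        (3, 'Owned by person 1', 1, 8);
""")

GROUP_TYPE_ID = 7  # assumed content type id for the Group model


def tasks_for_groups_created_by(conn, creator_id):
    """Tasks owned by a group whose creator is the given person."""
    sql = """
        SELECT t.id, t.description
        FROM gfks_task AS t
        JOIN gfks_group AS g
          ON g.id = t.owner_id AND t.owner_type_id = ?
        WHERE g.creator_id = ?
        ORDER BY t.id
    """
    return conn.execute(sql, (GROUP_TYPE_ID, creator_id)).fetchall()


print(tasks_for_groups_created_by(conn, 1))
```

The owner_type_id condition is essential: task 3 also has owner_id 1, but it refers to a person, and without the type check it would be matched against group 1 by accident.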

Second, a polymorphic object rarely works out as nicely as it sounds. In my experience, you will very often have to branch on type:

if isinstance(task.owner, Group):
    ...  # do group things
else:
    ...  # do person things

…either in Python code, or in your templates, at which point it doesn’t seem so neat any more. This is especially the case when the models you are pointing to aren’t under your control, so it’s harder to make them all have the same interface.

As a necessary consequence of their design, GFKs are just more awkward to deal with, and this is also reflected in the level of support that they have from other Django features:

Deleting

By default, if you delete a Group or Person (the target object), for example in the admin interface, or from code, the object that refers to it won’t be updated/deleted. The admin interface won’t trace through GenericForeignKeys that might refer to that object. You will simply be left with corrupt data.

You can, however, add a GenericRelation to the Group and Person models, which will fix the ORM and admin to do the deleting. But note that this is not the default, and is attempting to ensure at the application level something that would be ensured at the database level for a normal foreign key.

Admin interface

For a GenericForeignKey field, the admin will show you only what you would expect for owner_id and owner_type_id: an integer field and a content type drop-down, which is not very helpful. And yes, you can change the integer value to anything, resulting in dangling references, i.e. corrupt data. There are some 3rd party attempts to get a better interface, e.g. see http://stackoverflow.com/questions/13907211/genericforeignkey-and-admin-in-django

And as mentioned above, objects referred to via GFKs don’t (by default) get included in the “collect and display objects for deletion” logic of the Django admin delete page.

There are various other gotchas: for example, GFKs work badly with the admin’s list filters (you’ll have to write extra code to support them), and they don’t work nicely with ModelForms. You’ll have to patch up a lot of stuff at the interface level yourself.

Alternatives

Having hopefully persuaded you to find another solution, let’s look at some of the options available.

Alternative 1 - nullable fields on source table

It looks like this:

class Task(models.Model):
    owner_group = models.ForeignKey(Group, null=True, blank=True,
                                    on_delete=models.CASCADE)
    owner_person = models.ForeignKey(Person, null=True, blank=True,
                                     on_delete=models.CASCADE)

This is perhaps the simplest solution. You will need to do None checks when you access owner_group and owner_person, which you could wrap up like this if you wanted some of the polymorphic behaviour:

@property
def owner(self):
    if self.owner_group_id is not None:
        return self.owner_group
    if self.owner_person_id is not None:
        return self.owner_person
    raise AssertionError("Neither 'owner_group' nor 'owner_person' is set")

Similarly you’ll also need to ensure that one and only one of the two fields is set when saving.

This has the disadvantage that at the schema level, unless you add a check constraint, there is the possibility of a Task pointing to both a Person and a Group (or to neither), which doesn’t make sense. But this is a much smaller problem than the issues you have with GenericForeignKey.
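
For reference, the check constraint mentioned above might look like the following at the SQL level. This is a sqlite3 sketch with made-up table names and a hypothetical try_insert helper; in a Django project you would normally declare the equivalent with a CheckConstraint in the model's Meta.constraints rather than writing the SQL by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo_person (id INTEGER PRIMARY KEY);
    CREATE TABLE demo_group (id INTEGER PRIMARY KEY);
    CREATE TABLE demo_task (
        id INTEGER PRIMARY KEY,
        owner_group_id INTEGER REFERENCES demo_group (id),
        owner_person_id INTEGER REFERENCES demo_person (id),
        -- exactly one of the two owner columns must be filled in
        CONSTRAINT demo_task_exactly_one_owner CHECK (
            (owner_group_id IS NULL) <> (owner_person_id IS NULL)
        )
    );
    INSERT INTO demo_person VALUES (1);
    INSERT INTO demo_group VALUES (1);
""")


def try_insert(owner_group_id, owner_person_id):
    """Return True if the database accepts a task with these owners."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO demo_task (owner_group_id, owner_person_id)"
                " VALUES (?, ?)",
                (owner_group_id, owner_person_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print("group only: ", try_insert(1, None))
print("person only:", try_insert(None, 1))
print("both:       ", try_insert(1, 1))
print("neither:    ", try_insert(None, None))
```

With this in place the "exactly one owner" rule is enforced by the database itself, not only by application code.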

Alternative 2 - intermediate table with nullable fields

Here, we move the nullable FKs out to a new table, where they turn into one-to-one fields, and create a non-nullable FK on the first table. It looks like this:

class Owner(models.Model):
    group = models.OneToOneField(Group, null=True, blank=True,
                                 on_delete=models.CASCADE)
    person = models.OneToOneField(Person, null=True, blank=True,
                                  on_delete=models.CASCADE)


class Task(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

This has some nice advantages: we now have an Owner abstraction. If you want to use Task.owner polymorphically, you have a place to put the logic that understands how to treat Person and Group differently, without having to put it on Person or Group, which is especially useful if you don’t own those models, or want the logic to be kept separate. We’ve also got one place that documents all the things that can be ‘owners’.

Further, if you come to need other things that use the same definition of Owner, you will have a very easy implementation — just another FK to Owner, which is much nicer than for alternative 1.

It still has the disadvantages of nullable fields, but having a dedicated Owner model to deal with that issue feels much cleaner.

It also has a few other disadvantages compared to the previous solution:

  • We have an extra table, increasing the number of joins required to get everything if we need it all at once.
  • We will need to ensure that an Owner record exists for each group/person that we want to link to. This could mean creating one when we create a group/person, or later. Also, setting the Task.owner field correctly is going to take more work than in alternative 1, which affects both code and things like the default admin interface.
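
The "create the Owner record later" option can be sketched like this, again using sqlite3 with made-up table names. owner_id_for_person is a hypothetical helper; in Django code the same idea would typically be written with get_or_create on the Owner model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo_person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE demo_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- UNIQUE makes each link one-to-one, like a OneToOneField.
    CREATE TABLE demo_owner (
        id INTEGER PRIMARY KEY,
        group_id INTEGER UNIQUE REFERENCES demo_group (id),
        person_id INTEGER UNIQUE REFERENCES demo_person (id));
    CREATE TABLE demo_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL REFERENCES demo_owner (id));
    INSERT INTO demo_person VALUES (1, 'Alice');
""")


def owner_id_for_person(conn, person_id):
    """Return the Owner row id for a person, creating the row on first use."""
    with conn:
        # The UNIQUE constraint turns a repeated insert into a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO demo_owner (person_id) VALUES (?)",
            (person_id,),
        )
        (owner_id,) = conn.execute(
            "SELECT id FROM demo_owner WHERE person_id = ?", (person_id,)
        ).fetchone()
    return owner_id


first = owner_id_for_person(conn, 1)
second = owner_id_for_person(conn, 1)  # reuses the existing Owner row
with conn:
    conn.execute(
        "INSERT INTO demo_task (description, owner_id) VALUES (?, ?)",
        ("Write report", first),
    )
print(first == second)
```

This is the extra step that alternative 1 does not need: every place that assigns an owner to a task has to go through something like this helper.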

Alternative 3 - intermediate table with OneToOneFields on destination models

This starts with alternative 2, but moves the OneToOneFields to the other table, i.e. to the destination models. By doing so, they no longer need to be nullable.

class Owner(models.Model):
    pass


class Person(models.Model):
    name = models.CharField(max_length=100)
    owner = models.OneToOneField(Owner, on_delete=models.CASCADE)


class Group(models.Model):
    name = models.CharField(max_length=100)
    owner = models.OneToOneField(Owner, on_delete=models.CASCADE)
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)


class Task(models.Model):
    description = models.CharField(max_length=200)
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

Some notes, compared to alternative 2:

  • We no longer have any NULL foreign keys to worry about.
  • However, we are required to create rows in Owner when creating Person or Group objects. In addition, those rows might never be used, e.g. a group might never be used as an Owner.
  • This pattern requires modifying Person and Group.
  • For some access patterns this requires more queries (e.g. if you start with a Task and want to know which type of Owner you have, this will require more queries than alternative 2).
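
To illustrate the last point: starting from a task, the owner's type is not stored anywhere directly, so it has to be discovered by probing each destination table. In raw SQL this can still be folded into one statement, at the cost of one LEFT JOIN per possible owner type. The following sqlite3 sketch uses made-up table names and a hypothetical describe_task_owners helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo_owner (id INTEGER PRIMARY KEY);
    CREATE TABLE demo_person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        owner_id INTEGER NOT NULL UNIQUE REFERENCES demo_owner (id));
    CREATE TABLE demo_group (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        owner_id INTEGER NOT NULL UNIQUE REFERENCES demo_owner (id));
    CREATE TABLE demo_task (
        id INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        owner_id INTEGER NOT NULL REFERENCES demo_owner (id));
    INSERT INTO demo_owner VALUES (1), (2), (3);
    INSERT INTO demo_person VALUES (1, 'Alice', 1);
    INSERT INTO demo_group VALUES (1, 'Admins', 2);
    INSERT INTO demo_task VALUES
        (1, 'Write report', 1),
        (2, 'Review report', 2),
        (3, 'Unclaimed', 3);
""")


def describe_task_owners(conn):
    """Return (task id, owner kind, owner name) for every task."""
    sql = """
        SELECT t.id,
               CASE WHEN p.id IS NOT NULL THEN 'person'
                    WHEN g.id IS NOT NULL THEN 'group'
               END AS owner_kind,
               COALESCE(p.name, g.name) AS owner_name
        FROM demo_task AS t
        LEFT JOIN demo_person AS p ON p.owner_id = t.owner_id
        LEFT JOIN demo_group AS g ON g.owner_id = t.owner_id
        ORDER BY t.id
    """
    return conn.execute(sql).fetchall()


for row in describe_task_owners(conn):
    print(row)
```

The third task shows a gap in this schema: nothing at the database level stops an Owner row from existing with neither a person nor a group attached to it.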

Alternative 4 - multi-table inheritance

If you are aware of Django’s multi-table inheritance, you might recognise that alternative 3 above can be created in Django with less code. Instead of explicit OneToOneFields to Owner, we can make Person and Group inherit from Owner.

This will actually create a database schema very similar to the one above - Django adds the OneToOneField links for you. Apart from column name differences, the one additional schema difference is that the owner column will also be used as a primary key (which could also be done manually for alternative 3 if you wanted, although I wouldn’t recommend it).

At the code level, it is very similar to alternative 3 as well, and in fact simplifies some things significantly e.g. you don’t need to manually create the Owner objects. In addition, you now get polymorphism for free (ish) — since Person is-a Owner, it inherits its behaviour.

Personally I avoid using multi-table inheritance. One reason is that I worry about the complexity of the inheritance mechanism Django uses. Secondly, there are performance concerns: having the OneToOneFields explicit makes it easier for me to be aware of joins and performance issues. Thirdly, inheriting from more than one concrete model is awkward in Django, so in practice you can only use it once. In other words, you are taking one “is-a” or “has-a” relationship (a Group is-a Owner and a Person is-a Owner) and giving it special status and implementation (concrete model inheritance), while all other similar relationships have to be dealt with by other mechanisms. In contrast, alternatives 2 and 3 can be used as many times as you want. My experience with OOP, real world business objects, and the constant reality of ever-changing requirements, is that you are better off ‘demoting’ all the relationships and implementing them all using composition rather than inheritance.

For completeness, however, I’ve added this alternative, with the code outlined below:

class Owner(models.Model):
    pass


class Person(Owner):
    name = models.CharField(max_length=100)


class Group(Owner):
    name = models.CharField(max_length=100)
    creator = models.ForeignKey(Person, on_delete=models.CASCADE)


class Task(models.Model):
    description = models.CharField(max_length=200)
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

Note that this is concrete model inheritance — you can’t use abstract = True for the Owner table (thanks Airith).

Swappable models

Finally, there is the case of needing to link to a single but unknown model (for example in a generic 3rd party app) for which a GenericForeignKey is a tempting solution.

For this case, there are two approaches I know of:

  1. Make your model abstract, and require users to inherit from it, adding the ForeignKey field themselves. This can be a helpful pattern for other reasons, but can also get a bit unwieldy in some cases.
  2. Use swappable models. Django actually has support for this, but at the time of writing it is officially for internal use only (i.e. for swapping out the django.contrib.auth.models.User model). However, Swapper is an unofficial attempt to create a public API for it, which looks to be well maintained. This looks like a better option than a GFK to me.

Example code

For all the above examples, I’ve created a repo: https://bitbucket.org/spookylukey/djangoadmintips/src/default/generic_foreign_key_tests/

Notes:

  • All the examples are different apps within the same project.
  • It is bare bones — just for purposes of illustration. Not all things mentioned above are implemented.
  • In each case, the admin changelist for Task illustrates the typical N+1 (or worse) situation. In each case I’ve implemented ModelAdmin.get_queryset and used select_related and prefetch_related as well as possible. Using the Django debug toolbar you can see how successful that is — for the GFK case, not very.
  • You will also notice that the admin interfaces vary between the different alternatives. There will be ways to make all of them better, but they illustrate what you will get without much work.

Corrections or additions

If there are other strategies or corrections, please let me know — I intend to keep this page up to date as a reference.
