Bundling dependencies

This post is about maintenance programming and the issue of Open Source dependencies that may need customising. It compiles some of my current thoughts, but I'm also eager to find out what other people do.

3 approaches to dependencies

  1. Pure dependency

    The source code of the dependency does not become a part of your project in any way. For a web project with Python and virtualenv/pip, you would just list the project name and version in requirements.txt, and it will be installed when you deploy your project.

    This is by far the easiest approach to dependencies.

  2. Forked dependency

    You create a fork of the library (usually hosted publicly, but not necessarily) and add to it the changes you need. You then use this fork from your main project.

    This is done either in the hope that bug fixes and feature additions that you make will be merged into the original, so that you won't have to maintain your fork forever, or with the aim of keeping your changes small enough that it will always be easy to merge in fixes from upstream.

  3. Bundled dependency

    You take a copy of the library and include it directly in your own source tree, so that you can make whatever modifications you need. From then on, the code is a part of your source code forever.

This post is about number 3 – the bundled dependency.

(There are, of course, variants and mixtures of these – for example, Django has often bundled dependencies, but this was purely because of the confusing state of packaging, and the code was never modified for use in Django. These libraries have been or will be un-bundled as soon as possible.)

Avoid it if you possibly can

The first thing to say about bundling dependencies is that you should avoid doing so if at all possible:

  • It can result in large increases in code base.

  • You won't get critical fixes from upstream, and it can be hard to merge them in.

Bundling a dependency can be a drastic decision – you are taking on all the technical debt and maintenance burden of the code you are adding. Some developers look at Open Source libraries and think “wow, all this free source code I can just add to my project”. Your attitude needs to be exactly the opposite: “Wow, look at all that code I'm going to have to maintain and debug”.

An external dependency is often much worse from a maintenance point of view than code you have written yourself:

  • You may not understand the code very well at all, and you may not have access to the original reasons for the way it is.

  • When you add it to your project, you typically lose its history, making it harder to track down reasons for its current state.

  • Library code can often be over-generalised and complex. It copes with all kinds of situations that you don't need, but you will have to understand and maintain that complexity.

  • The code will not ‘fit’ into your project well – there may be all kinds of conventions and decisions that make it alien to your project, but now it is part of your project and needs to fit.

Alternatives

To avoid bundling a dependency, you can go for the ‘forked dependency’ approach above. For the missing features you need, try to add extension points that give you the required flexibility, rather than hard-coding something very specific to your project that will never get merged upstream.

Another alternative is to build what you need yourself, or very selectively add parts of the dependency into your own source. This may seem more work, but could be easier to maintain long term.

Finally, you could consider a monkey patch. But be very careful, and make sure you know all the places where you are doing that kind of thing, so that you can assess at what point you should be switching strategy.
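One way to "know all the places" is to route every monkey patch through a single helper that records it. The following is a minimal sketch of that idea, not a recommendation of any particular library: `monkey_patch`, `APPLIED_PATCHES`, `undo_all_patches` and the `ThirdPartyCart` class are all hypothetical names invented for the illustration.

```python
# Hypothetical registry of every patch applied:
# (target object, attribute name, reason, original attribute)
APPLIED_PATCHES = []


def monkey_patch(target, name, reason):
    """Decorator that replaces target.<name> and records that it did so."""
    def decorator(replacement):
        original = getattr(target, name)
        APPLIED_PATCHES.append((target, name, reason, original))
        setattr(target, name, replacement)
        return replacement
    return decorator


def undo_all_patches():
    """Restore the originals, e.g. once the dependency no longer needs patching."""
    while APPLIED_PATCHES:
        target, name, _reason, original = APPLIED_PATCHES.pop()
        setattr(target, name, original)


class ThirdPartyCart:
    """Stand-in for a class that lives in a dependency (not a real library)."""
    def total(self):
        return 100


# Keep a reference to the original so the replacement can delegate to it.
_original_total = ThirdPartyCart.total


@monkey_patch(ThirdPartyCart, "total", reason="flat discount the library cannot express")
def total_with_discount(self):
    return _original_total(self) - 10
```

With something like this in place, the length of `APPLIED_PATCHES` and the recorded reasons give you a concrete list to review when deciding whether it is time to switch to a fork or a bundled copy.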

When you should consider it

However, there are times when you should consider bundling the dependency:

  • When the changes you want to make are more than bug fixes.

  • When the changes can't be easily made by adding extension points to the original.

  • When the number/size of extensions is going to severely inhibit a developer's ability to understand the code.

I recently took on a project that had bundled a copy of Satchmo. It was a bit of a shock, because requirements.txt also listed Satchmo as a dependency, making me think I was in situation 1, when actually I was in situation 3, which is much worse.
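When a project both lists a library in requirements.txt and carries its own copy, it helps to ask Python which copy actually gets imported. Here is a rough sketch, assuming the bundled copy lives somewhere under the project root and that installed packages live in a `site-packages` directory; `import_origin` is a made-up helper name, not part of any tool.

```python
import importlib.util
from pathlib import Path


def import_origin(module_name, project_root):
    """Classify where a top-level module would be imported from.

    Returns "missing", "no-file" (built-in or namespace package),
    "bundled" (file is under project_root) or "installed".
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return "missing"
    if not spec.has_location:
        return "no-file"
    origin = Path(spec.origin).resolve()
    # A virtualenv inside the project root must not count as bundled.
    if "site-packages" in origin.parts:
        return "installed"
    try:
        origin.relative_to(Path(project_root).resolve())
    except ValueError:
        return "installed"
    return "bundled"
```

Calling `import_origin("some_library", "/path/to/project")` from the project's own environment tells you whether you are really in situation 1 or situation 3, whatever requirements.txt claims.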

Sometimes, however, bundling is unavoidable, e.g. you need to add multiple fields to different DB models, or you need to make invasive changes in other ways. As I looked at the number of modifications made to the bundled Satchmo, I realised there was no way that strategies 1 or 2 would be any good. Strategy 3 had already been chosen, it was impossible to turn back the clock, and with hindsight it was probably the right decision.

But implementation of that decision was lacking in lots of ways.

So how do you cope when you are forced to bundle? Here are my hints so far.

  • Recognise that you have done a really bad thing, and you need to take equally drastic action to cope with it. The bigger the dependency you've bundled, the more likely it is that you have seriously damaged your ability to maintain the project long term.

  • Make sure you include the tests of the original dependency, and integrate them as part of your test suite.

    Sounds obvious, but in the project that inspired this post, the opposite had been done – they had copied all the source code, with the exception of every file called 'tests.py' or directory called 'tests'. I do not know what possessed them to do this, but this decision was an extremely expensive one for their client, and has caused massive damage to the project.

  • Maintain the test suite properly.

    Again, sounds obvious, but tests are extremely valuable to a project, and in this situation it is vital that you keep them maintained.

    It is acceptable to delete tests if they are checking requirements that you no longer have. But you should be deleting the code that supports those tests as well.

  • Take complete ownership of the code.

    Having made the decision to bundle, don't treat the code like an external dependency. It is your code now, and only you can fix it. Don't pretend you are going to merge in upstream changes.

    The code should live at the same ‘level’ as the rest of your code – for example, it should be in the same directory, not off in some 'libs' directory that makes it harder to find. You need to embrace the fact that it is part of your maintenance burden.

    On the other hand, it is your code now, so you can do what you want with it. Don't be afraid of making changes. A tentative approach will leave you with the worst of both worlds – a library that doesn't really do what you want, but that you have to maintain. Make it do what you want.

    Obviously, there can be some value in maintaining a separation between "your stuff" and the "framework stuff" or "library stuff", but this is just good coding practice – you wouldn't hard code something very specific into a function that is supposed to be generic.

  • Delete, delete, delete.

    If there is code that you don't need, just delete it. The more code you can remove, the better. There can be a case for keeping some code around if:

    1. It is causing very little nuisance to maintenance efforts.

    2. It is fairly likely to be needed in the near future.

    3. It is not causing runtime weaknesses (e.g. security problems), because there is no entry point to the code.

    But note that just the existence of code is a maintenance problem. If, for example, you need to change the signature of a function, you will do a search for sites that call it. Every hit you get is something you have to investigate, which takes time. If, in the process of this kind of investigation, you find some code that might be unused, find out if it is, and delete aggressively where appropriate.

    And code that might be needed one day is better deleted. By the time you come to need it, it might be horribly broken, or broken in subtle ways that will take you longer to debug than to write, or too complex or badly performing for the context of your evolved application.

    This applies to all kinds of code, including templates etc.

  • Clean aggressively.

    If you delete unused code, you'll find that you may well end up with code that has essentially unused generality, or various other things that no longer make sense for your specific project.

    This is my golden rule for maintenance:

    Leave the code looking as if it had always been designed that way.

    This is a general maintenance principle, but it is especially important for the situation where you are trying to go from a larger code base to a smaller one.

    Ideally, there should never be artefacts that can only be explained by talking about the history of the project. This applies to every detail, including:

    • names of models

    • names of fields

    • names of variables and functions

    Altering models is not hard if you have a good database migration tool, e.g. South for Django.

    This principle may seem like it adds to the load of the maintenance programming, but long term it reduces the load, and reduces the likelihood that a project will collapse under its own weight. Even with this principle, projects tend to become unmaintainable – the natural tendency of a project is towards chaos, and you have to be very proactive about reversing that.

    Example 1: after deleting some classes, you end up with a class hierarchy where each base class is only used once. This adds a lot of overhead when reading the code. You should clean aggressively – fold the classes together (unless keeping them separate increases the clarity of the code).

    Example 2: The code I'm maintaining uses livesettings (and uses it far too much in my opinion, for things that ought to be in settings.py). It includes some options that are unlikely to change for a given project, or that can easily end up being ignored. For example, there is an "Only authenticated users can check out" setting. In a project with an overridden login form or login view (which can easily happen), it's very easy for this switch to become (at least partly) broken. When you are working on some code that branches on the value of this switch, there is no point fixing both branches – you won't have decent tests to ensure that the unused branch is really working.

    Instead, find out what the current value is, and just delete the other branch. Then find all instances of the setting being used, and clean up similarly. Finally, delete the code that defines the switch in the first place. Remove every trace – you always have the history if you really need to see how something was done before.

  • Lather, rinse, repeat.

    Aggressive deleting and cleaning tends to expose more code that can be deleted or cleaned, and you should follow this up. You may not have the time to do it right now, but you should be doing it as you go – whenever some coding has turned up something that can be cleaned/deleted, first do the necessary commit for whatever you were working on. Then do a round of cleaning/deleting, finding all the code paths that are now dead or can be simplified, commit the change, and repeat.
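To make Example 2 above a little more concrete, here is roughly what collapsing a switch looks like. This is only a sketch: the function names and the `only_authenticated_can_check_out` flag are hypothetical stand-ins for the real setting, and it deliberately does not use the livesettings API.

```python
# Before: the code branches on a switch whose value never changes
# in this project. The second branch is effectively untested.
def can_check_out_before(user_is_authenticated, only_authenticated_can_check_out):
    if only_authenticated_can_check_out:
        return user_is_authenticated
    # Anonymous checkout: unused here, so nobody knows if it still works.
    return True


# After: the current value was confirmed to be True, so the other
# branch, the parameter, and the definition of the switch are all gone.
def can_check_out(user_is_authenticated):
    return user_is_authenticated
```

The "after" version behaves identically for the value the project actually uses, and there is nothing left for a later reader to wonder about.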

These things have to go together. Aggressive deleting and cleaning can be made a lot easier if you have a good test suite. Of course, when deleting code, you will do a search for sites that might call it. But it ought to be possible to check if you can delete code simply by running the test suite with it absent.
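The "search for sites that might call it" step can be done with grep, but for Python code a small script over the syntax tree gives fewer false hits from comments and strings. A minimal sketch using the standard `ast` module follows; `find_call_sites` and the `EXAMPLE` source are invented for the illustration.

```python
import ast


def find_call_sites(source, function_name, filename="<string>"):
    """Return sorted (filename, line number) pairs for calls to function_name.

    Matches plain calls, f(...), and attribute calls, obj.f(...).
    It cannot see dynamic uses such as getattr(obj, "f"), or references
    from templates, so an empty result is a hint, not a proof.
    """
    sites = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id
        else:
            # ast.Attribute has .attr; other callee expressions have no name.
            name = getattr(func, "attr", None)
        if name == function_name:
            sites.append((filename, node.lineno))
    return sorted(sites)


EXAMPLE = '''
def legacy_discount(order):
    return 0

def checkout(order):
    total = order.total() - legacy_discount(order)
    return total
'''

call_sites = find_call_sites(EXAMPLE, "legacy_discount")
```

In a real project you would run this over every `.py` file. If it reports no call sites, delete the function and let the test suite have the final word.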

What other approaches or hints do you have for dealing with this situation?
