Technical debt

I was chatting with my wife a while back and I flippantly dropped technical debt into our conversation. She asked me to describe it to her. I couldn’t. Well, not in a way that made any sense whatsoever! Now that I’ve had some time to think, I’m going to give it a bash.

In the traditional sense, technical debt describes decisions that speed us up now but slow us down later. You take a deliberate decision to do something a bit hacky, a bit dirty, but you get that feature out the door. If you later need to change that feature, though, the cost of change is going to be high. Here you understand you’re trading flexibility and quality for speed. However, in my experience engineers use technical debt in a much broader sense. It might be deployment pain, where you have to babysit a deployment to production. It might be flaky acceptance tests that you have to rerun a few times before they pass. It might be that the code you’re working with is structured in a way that you can’t extend. Whatever it is, it wastes your time. Most of these problems in your platform won’t have been deliberate trade-off decisions, so they aren’t technical debt in the strictest sense, but they have the same effect: they still waste your time.

Now that we have a grasp of what technical debt is (anything that wastes your time), I want you to imagine a world without it. Conjure up a picture of sitting down to write a feature and the entire experience being utterly and completely frictionless. Everything you need to know is easy to find and easy to understand, and you get some really meaningful work done in what feels like no time at all. How amazing would that be?

Let’s switch gears and start thinking about how to get a little closer to that idyllic scenario. To start tackling technical debt, I need some mental models to help me think about its causes. I find it useful to categorise engineering work into four types.

Fire fighting, where you drop everything and react to the problem at hand.
Change, where you have to make changes to existing code, tests and production hosts. This is like the tax you have to pay to get work done.
Innovation, the fun stuff, where you build something that will drive new business and customer value.
Improvement, where you make improvements to your system of work to reduce the amount of change and fire fighting work you have to do.

How does thinking in these terms help? I believe tech debt is very closely linked to change and fire fighting work. If innovation is what drives new customer value, then fire fighting and change work cost you the opportunity to innovate by wasting your time. But why do we find ourselves spending so much time fire fighting and doing change work? Why do problems build up in our codebases? A relentless drive to deliver features can leave little time to go back and fix problems with the platform. A lack of ownership of a codebase can make it difficult to improve, because no-one believes it’s their responsibility to go in and fix a problem. A lack of support for continuous improvement can mean you don’t feel safe spending a day fixing a problem. A lack of agreed standards can make you think that what you’ve done is good enough, when in fact it might not be.

All of these examples can lead to an increase in change and fire fighting work. There is an opportunity cost here: the time you spend fire fighting and doing change work is time you can’t spend innovating. That time has gone. However, investigating where your time is being wasted normally uncovers rich veins of technical debt in your platform. Fundamentally, I believe you need a strong habit of continuous improvement, along with the organisational support and dedicated space to do the improvement work. Without these you’ll always be amassing more technical debt.

The hardest thing, and I believe the first step, is for the organisation to admit the problem. This can be extremely challenging. Serious truth-telling is hard-hitting and really difficult for a lot of people to accept. It means admitting that mistakes have been made, and if the organisational culture isn’t strong enough to cope with truth-telling, the conversations that need to happen won’t happen. It’s a difficult thing to have to slow down, to put aside some of your important short term goals and focus on long term sustainability. To support this conversation, I believe it’s important to measure your delivery and the impact that tech debt is having on it.

How do we measure tech debt and then make it visible? The State of DevOps report outlines some KPIs that you might find useful. The lead time from a code commit to that code running in production, deployment frequency, and lead time for user stories help measure your throughput. Change failure rate and mean time to recovery help measure quality of service. Together these measures give you a picture of the health of the platform you’re working in. Here at Findmypast our lead time for code in some legacy applications is close to 7 days, whereas in our newer applications it can be as little as 20 minutes. This shows us that there are problems getting work done in a legacy codebase, and with numbers like these it’s easier to have a conversation about whether reducing the lead time for code in legacy should be a priority for the business or not. We could set a target of 3 days lead time and iterate towards it. Then the conversation moves away from technical debt and towards developing a capability.
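To make those measures a bit more concrete, here is a minimal sketch of how they could be calculated from a list of deployment records. This is an illustration only, not the tooling we use: the `Deployment` record, its field names and the `delivery_metrics` function are all hypothetical, and I’ve assumed you can get hold of a commit time, a deploy time, and (for failed deployments) a time when service was restored. I’ve used the median for lead time so that a single slow change doesn’t skew the picture, but that is a choice rather than a rule.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment (field names are assumptions)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise throughput and quality over a reporting period of period_days."""
    if not deployments:
        raise ValueError("need at least one deployment to measure")

    # Throughput: commit-to-production lead time and how often we deploy.
    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deployments]

    # Quality: how many deployments failed and how long recovery took.
    failures = [d for d in deployments if d.failed]
    recoveries = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "median_lead_time_hours": median(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "change_failure_rate": len(failures) / len(deployments),
        # None when nothing failed (or nothing has been restored yet).
        "mean_time_to_recovery_hours": mean(recoveries) if recoveries else None,
    }
```

Running something like this separately for a legacy codebase and a newer one gives you two sets of numbers to put side by side, which is usually enough to start the conversation about where improvement work should go.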

Taking a step back and assuming the business decides to make improvement and the development of engineering capability a priority, here are some things we’ve tried here at Findmypast. We have two platform teams. One focusses mostly on operational goals, making it reliable to ship code to production. The second is focussed on tooling that makes it easier, faster and safer to deliver application code on our primary codebase. These teams have paid massive dividends in throughput and quality. However, we had to be very careful not to create siloed teams, so we made sure the whole engineering team felt involved in the decisions. This is still a constant challenge that we work on every day, but with the right approach it can remove a lot of the wasted work developers face. In addition, we’re now creating dedicated time and space for product delivery teams by giving them 20% of their iteration backlog to plan their own work towards a technology platform goal. This work is prioritised at the same level of importance as product work, and it gives the teams space to make improvements and address the causes of their fire fighting and change work.

The final piece is to celebrate the improvements you’re making as a whole. We have all hands meetings on a regular basis, and it’s important to share progress and celebrate any successes you’ve had as a team. It keeps technical debt and continuous improvement at the forefront of people’s minds. It keeps us balanced when considering long term sustainability alongside launching great new functionality for our users. Business is an infinite game that never ends, and as such we need to give equal consideration to our long and short term goals.

If you like the sound of what we’re doing here, have a look at our jobs and drop us a line.