Slack time, 100% utilization, and the Southwest meltdown

What happens when a system that’s running “hot” runs into adversity?

Jan 02, 2023

Last week’s Southwest Airlines meltdown has me a) salivating that someday there might be a postmortem we can read that’s even better than what’s on Reddit, and b) thinking a lot about how to apply the lessons to software development.

Lots of people are pointing at the “obvious” lesson: in-house code from the 1990s is probably not the best way to run a major airline. They may not want to learn about financial institutions or the power grid.

The more interesting lesson to me, though, is about slack (the lowercase kind). I didn’t know before last week that Southwest runs hot: something like 90% of their fleet is in use at any given time. Southwest is also one of the few U.S. airlines that still runs point-to-point-to-point routes, vs. the hub-and-spoke routes that are typical of major carriers today. If a plane headed from point A to point D gets stuck at point C, there may not be another plane available to move those passengers to point D any time soon.

(I am summarizing here; for a full, jaw-dropping explanation, check out this Mastodon thread. This Reddit thread is also pretty good.)

This arrangement had been sort of working well enough most of the time. And then a winter storm hit a whole bunch of point Bs and point Cs, and things suddenly went very badly. Did I mention that Southwest has no automated system for communicating with their pilots or flight attendants, and that the system’s knowledge of a pilot’s location was based on what was supposed to happen, not what actually did?

Seriously, that Mastodon thread is wild.

But we write software, we don’t fly airplanes

Yes, but so — and bear with me here — let’s say you’re a software engineer that’s like an airplane, but your job is to move a product feature from point A (“hey we should do this thing”) to point P (production, where it’s delivering value to users). Just like airplanes can suddenly and catastrophically stop moving people across the country in an orderly and predictable way when an overstretched airline encounters unexpected adversity, the same can happen to an engineering org trying to deliver value.

The naive workflow for moving a thing from A → P is:

someone tells an engineer to do a thing (A)
the engineer sits down and does it (…)
the thing is done and in production (P)

Of course, it doesn’t work this way at all. The value delivery work stream starts way before an engineer is asked to work on a thing. Once the engineer is involved, they need to spend time understanding what the thing is, and ideally to understand why they’re doing it at all. They’re going to have to research unfamiliar tools, perhaps, or touch another team’s code. The new thing requires a new common library to be written, so that’s going to need a thorough interface design review, which means the engineer needs to write up the interface design and maybe go to a meeting about it. It’s also going to require standing up a new service, and for reasons, the Security team wants to know about that any time it happens.

[Image: cockpit of a Boeing 737-800, ATA Airlines. Source: Wikimedia Commons]

In this situation, DORA metrics are kind of like the instrument panel in an airplane: really fun to look at, but all they’re going to tell you is that the plane is safely on the ground. The problem isn’t the plane — it’s the system in which it is very much not operating at the moment.

Slacking at work

Just as Southwest only has so many airplanes, your team, organization, and company only have so many engineers. For that matter, you only have so many people, and people without a software engineer title tend to have a lot to do with the actual process for getting a thing into the world, whether they work in security, legal, sales, a dependent team, a platform team, or myriad other roles.

In the worst case, there’s a hop for every letter between A and P. Maybe you solve for this by farming out some of the grunt work to an underpaid program manager so the engineer can work on “engineering things,” but the point remains: the thing is not in production, the thing is not delivering any value, until it reaches point P. How fast it gets there depends on a few things, at least:

The number of hops from A → P.
Whether human-to-human communication is required to complete the hop.
Whether there is a platform that provides core capabilities and functionality (e.g. logging, request throttling, authentication and authorization, load balancing, input sanitization, etc.) so the engineer doesn’t have to develop them on their own.

The speed at which you can make the whole trip from A → P — because remember, these hops happen in serial in this scenario — is a function of whether there is a person available to complete each hop when it is ready to be completed. In the case of human-to-human communication, you’ll need at least two people available. For hops that don’t depend on an underlying platform, the speed will also be a function of how quickly the engineer can figure out how to DIY. In all cases, the people involved will have to be at least nominally qualified for the task, which limits the actual pool of people you can draw from.
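To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is a toy model, not a measurement of any real org: it assumes each hop behaves like a textbook single-server (M/M/1) queue, that every hop takes one day of actual work, and that the worst-case trip from A to P is 15 hops. The function names `hop_lead_time` and `trip_lead_time` are made up for this illustration.

```python
def hop_lead_time(service_days: float, utilization: float) -> float:
    """Expected time for one hop (waiting plus doing the work).

    Assumption: the person handling the hop is modeled as an M/M/1 queue,
    where expected time in the system is service_time / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be at least 0 and below 1")
    return service_days / (1.0 - utilization)


def trip_lead_time(hops: int, service_days: float, utilization: float) -> float:
    """Expected A -> P time when hops happen strictly one after another."""
    return hops * hop_lead_time(service_days, utilization)


if __name__ == "__main__":
    # Hypothetical inputs: 15 hops, one day of real work per hop.
    for utilization in (0.5, 0.8, 0.9, 0.95):
        days = trip_lead_time(hops=15, service_days=1.0, utilization=utilization)
        print(f"{utilization:.0%} busy: about {days:.0f} days from A to P")
```

Under these assumptions, the same 15 days of real work takes roughly 30 days end to end when the people involved are 50% busy, and roughly 150 days when they are 90% busy. Real orgs are messier than a queueing formula, but the shape of the curve is the point: the last few percent of utilization are the expensive ones.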

In the long term, there are technical solves here to reduce hop time: build and support a robust platform, emphasize self-serve solutions, beef up your documentation, re-evaluate whether the hop even needs to exist at all, maybe look into that whole microservices thing before you decide nah, we’re good.

The only short-term solve, though, is to make sure there is a person available when one is needed. If you think you can just hire out of this, no — what’s important here is that you have people with the appropriate context and skills who genuinely don’t have anything to do right now.

It’s uncomfortable, right? There are a variety of blog posts out there to assure you that people will use this time for things that benefit the company, and that may be true, but to me that’s beside the point. People in the position to reliably unblock others should almost always prioritize doing that (and perhaps then prioritize making their unblocking unnecessary). They can pursue thinking, prototyping, writing, learning, etc. when they’re idle, but unblocking shortens the time to value delivery on a thing that the company has (ideally) already decided has value.

In sum

Software development organizations aren’t airlines, but a lack of slack in the system can have a sudden negative impact all the same.
If you’re in a no-slack situation, DORA metrics may alert you to a problem, but you’re going to have to dig deeper for solutions. Over-indexing on these metrics can make other important factors in the value delivery process seem less influential than they are.
In the absence of intervention, people and process, not technology, will dominate the full stream of delivering value once a company reaches a certain size (let’s say ~250, but that’s really a hand-wavy guess). It requires intention, effort, and investment to arrive at a different outcome, and it gets harder as an organization grows.

rmurphey's newsletter
