Resilience Engineering and Strange Loops

My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.

This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.

Screenshot of the STELLA report from the SNAFUcatchers workshop on coping with complexity.

Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.

Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.

Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.

What this paper indicates is that my coworker was right to laugh—it’s not useful to hold my own mental model of the entire system. I should, however, strive to learn from every opportunity to update the working knowledge I do have at any given time.

Note: “Resilient performance” sounds like a fancy word for “uptime.”

Much of my team’s work at Automattic is in the area of software quality: preventing errors by blocking deploys when automated tests fail, and building developer confidence by creating smarter, faster testing infrastructure. There is so much more we could do there in the future.

Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with Release Engineering teams, they build safeguards such as deploying to a small percentage of production servers for each merge, or starting with a small share of read-only HTTP requests. If no errors occur, the deploy continues.
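
Roughly, I imagine that kind of staged rollout gate working like the sketch below. Everything here is hypothetical (the stage sizes, the monitoring query, the soak time); it just captures the shape of “deploy a little, watch for errors, continue or halt.”

```python
# Hypothetical sketch of a staged rollout gate: deploy to a small slice of
# production first, watch for errors, and only continue if things stay quiet.
# Stage sizes, thresholds, and helper functions are illustrative stand-ins.
import time

STAGES = [0.05, 0.25, 1.00]  # fraction of production servers per stage


def deploy_to_fraction(build_id: str, fraction: float) -> None:
    """Pretend to push `build_id` to `fraction` of the fleet."""
    print(f"deploying {build_id} to {fraction:.0%} of servers")


def error_rate_since(start: float) -> float:
    """Placeholder: ask monitoring for the error rate observed since `start`."""
    return 0.0  # assume healthy for the sketch


def staged_deploy(build_id: str, max_error_rate: float = 0.01) -> bool:
    for fraction in STAGES:
        started = time.time()
        deploy_to_fraction(build_id, fraction)
        time.sleep(1)  # in reality: a soak period of minutes
        if error_rate_since(started) > max_error_rate:
            print("errors detected, halting rollout")
            return False
    return True


if __name__ == "__main__":
    staged_deploy("build-1234")
```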


At a software quality conference last year, I learned from Renato Martins how Groupon approaches this. They use “Canary” tests like those we run on WordPress.com—small, critical tests. Once these pass, they push code into a blue/green deployment system, which means that if any error occurs, the deploy system immediately switches all traffic back to a previously known safe version (blue) while reverting the broken one (green). There are always two systems in rotation: one known safe version, one new version.
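
As a mental model, the blue/green switch might look something like this sketch. The environment names, health check, and release function are stand-ins I made up, not Groupon’s actual tooling.

```python
# Rough sketch of the blue/green idea: two environments exist at all times,
# traffic points at the known-good one (blue), the new build goes to green,
# and a failed check flips traffic straight back. All names are illustrative.
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    build_id: str


def healthy(env: Environment) -> bool:
    """Placeholder health check (e.g. smoke/canary tests against `env`)."""
    return True


def blue_green_release(blue: Environment, new_build: str) -> Environment:
    """Deploy `new_build` to green; return whichever env should take traffic."""
    green = Environment(name="green", build_id=new_build)
    if healthy(green):
        print(f"switching traffic to {green.name} ({green.build_id})")
        return green  # green becomes the new known-good version
    print(f"green failed checks, keeping traffic on {blue.name} ({blue.build_id})")
    return blue  # instant rollback: traffic never left blue


live = blue_green_release(Environment("blue", "build-1233"), "build-1234")
```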

Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that, they have a Dark Canary: a separate server infrastructure that receives live production HTTP traffic but doesn’t actually reply to end users’ requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable, for example looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
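
A toy version of that kind of check, under my own assumptions: tally the status codes from the dark canary’s mirrored traffic and fail the build if its non-200 rate is meaningfully worse than production’s. The threshold and sample data are made up.

```python
# Toy version of the dark-canary check described above: count HTTP status
# codes from mirrored traffic and flag the build if its non-200 rate is
# meaningfully worse than the live fleet's. Thresholds and data are made up.
from collections import Counter


def non_200_rate(status_codes: list[int]) -> float:
    counts = Counter(status_codes)
    total = sum(counts.values())
    bad = total - counts.get(200, 0)
    return bad / total if total else 0.0


def build_looks_reliable(canary_codes: list[int],
                         production_codes: list[int],
                         max_regression: float = 0.005) -> bool:
    """Pass only if the dark canary's error rate isn't meaningfully worse."""
    return non_200_rate(canary_codes) <= non_200_rate(production_codes) + max_regression


# Made-up samples: the canary shows a burst of 500s the live fleet doesn't.
production = [200] * 990 + [404] * 10
canary = [200] * 950 + [500] * 50
print(build_looks_reliable(canary, production))  # False -> fail the build
```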

The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.


Back to the STELLA report.

Proactive learning without waiting for failures to occur.

Experts are typically much better at solving problems than at describing accurately how problems are solved.

Eliciting expertise usually depends on tracing how experts solve problems.

The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below the line with ease. In this case it contrasts mental models of a system (above the line) with the actual system itself (below the line). Another way of stating it: below the line are the details of what matters; above the line is the deeper understanding of why what matters matters.

A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.

So true. I remember seeing an internal post that mapped out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”

Two challenges I’ve come across in practice:

  1. To keep an accurate representation yourself in order to get work done.
  2. To hold a good enough understanding of how others represent it in order to work in a team.

I love the SNAFU stories in this paper. Feeling the pain reading it—for times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.

Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.

More patterns:

  • Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
  • Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
  • Evolving understanding: start from a fragmented view, expand as you learn how it really works.
  • Tracing: sweep across the environment looking for clues.
  • Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
  • Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
  • Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
  • Conflict between a quick fix and gaining a clear understanding of what/why it happened.
  • Managing risk: pressure is high for a quick fix, but potential for other effects is also high.

Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).

Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.

Anomalies are indications of the places where the understanding is both weak and important.

This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”

The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.

During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself so you can stay focused on the problem.

Good insight for technical people from software developers to QA to DevOps:

To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.

It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.

“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.

Strange loops are interdependencies in the failure that cause even more issues. For example, you can’t log errors because logging itself stopped working due to a kernel TCP/IP freeze, or because the failure overloaded the log or filled up storage.

This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because automation is readily available. But when that automation itself fails—like a hung deploy command—it becomes an existential issue. Now we can’t afford to break the site, because our mechanism to quickly recover is gone.

A good summary of the balance between taking time to avoid or pay down technical debt and the pressure to quickly ship visible product changes for customers.

There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.

Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.

Technical debt in general is easy to spot just by looking at the code, before it ever runs, and is solved by refactoring. Dark debt is not recognized, or even recognizable, until an anomaly occurs in the form of a complex system failure.

In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.

A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” That question is often missing from internal discussions. We fix the point problem and move on; fighting fires instead of building a fire suppression system.

The STELLA report shows there is value in participating in open discussions with other companies about these issues and in sharing common patterns, which is also a big benefit of open source software: you can follow not only the fix but the discussion around it.


More SRE (site reliability engineering) references:

Problem Number Two

“Once you eliminate your number one problem, number two gets a promotion.” (Jerry Weinberg)