My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.
This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.
Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.
Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.
Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.
What this paper indicates is that my coworker was right to laugh—it’s not useful to hold my own mental model of the entire system. I should however strive to learn from every opportunity to update the working knowledge I do have at any given time.
Note: “Resilient performance” sounds like a fancy word for “uptime.”
Much of my team’s work at Automattic is in the area of software quality: error prevention by blocking deploys when automated tests fail, building developer confidence by creating smarter, faster testing infrastructure. So much more we could do there in the future.
Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with Release Engineering teams they build safeguards such as deploying to a small percent of production servers for each merge, or starting with a small number of read-only HTTP requests. When no errors occur, the deploy continues.
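To make the "small percent first, continue when no errors occur" idea concrete, here is a minimal sketch. The stage percentages and the `error_count_at` probe are hypothetical stand-ins for real monitoring, not any company's actual tooling.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    error_count_at: Callable[[int], int],
) -> int:
    """Widen a deploy stage by stage and stop at the first stage with errors.

    stages: percentages of production servers to cover, in increasing order.
    error_count_at: hypothetical health probe returning the number of errors
        seen once the deploy has reached the given percentage.
    Returns the last error-free percentage (0 if the first stage failed).
    """
    last_good = 0
    for percent in stages:
        if error_count_at(percent) > 0:
            break  # halt: servers beyond last_good stay on the old version
        last_good = percent
    return last_good


def healthy(percent: int) -> int:
    """Simulated probe: the new code never produces errors."""
    return 0


def breaks_at_quarter(percent: int) -> int:
    """Simulated probe: errors appear once 25% of servers run the new code."""
    return 3 if percent >= 25 else 0


print(staged_rollout([1, 5, 25, 100], healthy))            # 100
print(staged_rollout([1, 5, 25, 100], breaks_at_quarter))  # 5
```

The point of the design is that a bad merge only ever reaches the stage where it first shows errors.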
At a software quality conference last year I learned how Groupon approaches this via Renato Martins. They use “Canary” tests like those we run on WordPress.com: small, critical tests. Once these pass, they push code into a blue/green deployment system, which means that if any error occurs the deploy system immediately switches all traffic to a previously known safe version (blue) while reverting the broken one (green). It is a continuous sequence of systems: one known safe version, one new version.
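The blue/green switch described above can be sketched in a few lines. This is only an illustration of the idea as I understood it from the talk. The class and method names are my own invention, not Groupon's system.

```python
from typing import Optional


class BlueGreen:
    """Tracks one known-safe version (blue) and one new version (green)."""

    def __init__(self, safe_version: str) -> None:
        self.blue: str = safe_version      # previously known safe version
        self.green: Optional[str] = None   # new version under evaluation
        self.live: str = "blue"            # which side receives traffic

    def deploy(self, new_version: str) -> None:
        # Send traffic to the new version while keeping blue as a fallback.
        self.green = new_version
        self.live = "green"

    def on_error(self) -> None:
        # Any error: switch all traffic back to blue and revert green.
        self.live = "blue"
        self.green = None

    def promote(self) -> None:
        # Green proved itself, so it becomes the next known-safe version.
        if self.green is not None:
            self.blue = self.green
            self.green = None
            self.live = "blue"

    def serving(self) -> Optional[str]:
        return self.blue if self.live == "blue" else self.green


system = BlueGreen("v1")
system.deploy("v2")
print(system.serving())  # v2
system.on_error()
print(system.serving())  # v1
```

Calling `promote()` after a healthy deploy is what produces the continuous sequence: the new version becomes the safe one, ready for the next deploy.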
Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that they have a Dark Canary, which is a separate server infrastructure that receives the live production HTTP traffic but doesn’t actually reply to the end user’s requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable or not. For example, looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
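As a rough illustration of the non-200 check, here is a sketch that compares the share of non-200 responses from a dark canary against a baseline. The function names, the sample data, and the 1% tolerance are all assumptions of mine. As noted above, the real analysis is more sophisticated than this.

```python
from typing import Sequence


def non_200_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses whose HTTP status code is not 200."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code != 200) / len(status_codes)


def build_is_reliable(
    baseline_codes: Sequence[int],
    canary_codes: Sequence[int],
    tolerance: float = 0.01,  # assumed allowance, not a real-world value
) -> bool:
    """Accept the build if the canary's non-200 rate stays near the baseline's."""
    return non_200_rate(canary_codes) <= non_200_rate(baseline_codes) + tolerance


# Made-up samples standing in for mirrored production traffic.
baseline = [200] * 98 + [404, 500]     # 2% non-200
good_canary = [200] * 98 + [404, 503]  # 2% non-200
bad_canary = [200] * 80 + [500] * 20   # 20% non-200

print(build_is_reliable(baseline, good_canary))  # True
print(build_is_reliable(baseline, bad_canary))   # False
```

Because the dark canary never replies to real users, a failing result here costs nothing beyond a rejected build.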
The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.
Back to the STELLA report.
Proactive learning without waiting for failures to occur.
Experts are typically much better at solving problems than at describing accurately how problems are solved.
Eliciting expertise usually depends on tracing how experts solve problems.
The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below with ease. In this case it deals with mental models of a system (above) and the actual system (below). Another way of stating it: below the line are details around “why what matters.” Above the line is the deeper understanding around “why what matters matters.”
A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.
So true. I remember seeing an internal post that mapped out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”
Two challenges I’ve come across in practice:
To keep an accurate representation yourself in order to get work done.
To hold a good enough understanding of how others represent it in order to work in a team.
I love the SNAFU stories in this paper. I feel the pain reading them, remembering times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.
Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.
Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
Evolving understanding: start from a fragmented view, expand as you learn how it really works.
Tracing: sweep across the environment looking for clues.
Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
Conflict between a quick fix and gaining a clear understanding of what/why it happened.
Managing risk: pressure is high for a quick fix, but potential for other effects is also high.
Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).
Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.
Anomalies are indications of the places where the understanding is both weak and important.
This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”
The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.
During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself to stay focused on the problem.
Good insight for technical people from software developers to QA to DevOps:
To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.
It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.
“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.
Strange loops are interdependencies in the failure that cause even more issues. For example, when you can’t log errors because the log file stopped working due to a kernel TCP/IP freeze, and the failures themselves caused an overloaded log or full storage.
This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because of automation that’s readily available. But when that automation itself fails, like a hung deploy command, it becomes an existential issue. Now we can’t break the site because our mechanism to quickly recover is gone.
A good summary of the balance between taking time to avoid or pay down technical debt and the pressure to quickly ship visible product changes for customers.
There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.
Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.
Technical debt in general is easy to spot before writing code, by looking at code, and is solved by refactoring. Dark debt is not recognized or recognizable until an anomaly occurs: complex system failures.
In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.
A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” That question is often missing from internal discussions. We fix the point problem and move on, fighting fires instead of making a fire suppression system.
This STELLA report shows that value exists in participating in open discussions with other companies around these issues. Sharing common patterns is a big benefit of open source software, where you can follow not only the fix but the discussion around it.
More SRE (site reliability engineering) references:
When I come to a conversation without technique and provide the space to listen, I do so because I’ve failed at this a thousand times. I’ve planned and schemed and got lost in my own mind — missing the conversation, missing the moment, missing the person on the other side.
This time I’m going to do it differently.
I’m going to pause, give enough time and space to see the other person first. Listen deeply so I can adjust my effort to the situation. If it’s the right moment, share what has worked for me. Later, I can ask how I’m doing to measure success.
If this is something that comes up for you — I highly recommend John Gardner’s “Personal Renewal” essay (via John Maeda). Powerful and resonant piece; one of the best I’ve ever read. Though written in 1990, it resonates with me today as if the words were spoken in my ear this morning.
Radical renewal is personal renewal — it means you’re ready for impactful changes.
Posting this as a personal bookmark because it comes up often in conversations with new leads. When I talk to people new to management I highlight the mindset change from “just you” to “the team.” The context of an outward mindset is important: you don’t own your time when you manage more than your own time. Keeping track of everything changes drastically when you start paying attention to more than just your own time and tasks.
This explains the frustration when a workday gets cut short, which can happen if something comes up unexpectedly or you’re continually interrupted. The resulting “short period” of time for making or creating is essentially lost. The big project, like the essay or talk you need to start on, doesn’t get attention because you don’t have the time for deep work.
Another clue for discovering the maker-vs-manager mindset is how you view your calendar. By month — and not by week or day — means you could be in maker mode. If you care more about every hour or 15-minute interval, you’re likely in manager mode.
A visual note to illustrate this concept:
Meetings can be disruptive to makers, says @phil_wade on Twitter. This ties into the concept of “flow state” made famous by Mihaly Csikszentmihalyi and others. If you’re curious to learn more, search that name (hard to spell!) for his talks and books — and read my thoughts on the flow fallacy.
A mental model that keeps coming up for me is “the unscripted dance.” This captures the idea of going into a situation knowing you can rely on your skills to adapt to the other party. Even without knowing ahead. Even without preparing for each move, each step, or each word you’ll use.
In a work setting, this could be a 1-1 chat with a direct report or a quarterly check-in with your boss.
When you’re dancing with an accomplished partner, you may allow the moment to unfold because you trust that a script is not necessary. If you’re dancing with an unaccomplished partner, you may use a script to start with because it helps guide the dance until once again, it becomes unnecessary.
Conversations at work can be like a dance when you are there “in the moment” — so attentive that you are aware of yourself and your partner at the same time — moving in and out of sync. My mind says, “When I don’t have to mold the conversation, it leads to nice possibilities.”
My leadership coach, Akshay Kapur, calls this “Listening” with a capital L. It can be quite fun, but also scary, especially if you’re used to always having things planned out ahead of time. The “Listening” also means not allowing other thoughts to take over my mind; those next questions or points that need to come up in the conversation. When that happens, I’m no longer listening — I’m just following my original plan. That’s when I miss out on insights and understanding.
The unscripted dance helps to improve my communication. To be more open and aware. Especially in established relationships with long-time colleagues where we can naturally move across topics.
I used to try to move the conversation in a certain direction, or get something out of it — my agenda for the conversation. Now I try my best to let the other person drive it. If they don’t have anything to share or ask about, I’m ready with a short list of topics or questions, just in case.
As you ride the currents of your day-to-day work, moving in and out of conversations with your team and with customers, or with your family and friends as you navigate your way through the world, what’s the “surfboard” made of that you ride from wave to wave, through the ups and downs?
What drives you?
For me, the surfboard is a perfect metaphor for describing the core value or the key ability that grounds me. It helps me stay consistent, open, and aware as I navigate my day, and it underlies my conversations and my relationships.
Another way of phrasing this is, “Coming from a place of _____ (fill in the blank) and then listening for the rough and smooth spots.”
Starting from that place, I’m open. Open to continue finding out what grounds me, drives me, and is the one thing that I fall back on as I navigate change.