People are almost always confronting what computer science regards as the hard cases. Up against such hard cases, algorithms make assumptions, show bias toward simpler solutions, trade off the costs of error against the costs of delay, and take chances.
These aren’t the concessions we make when we can’t be rational. They’re what being rational means.
The 2018 Design in Tech Report from John Maeda is alive and kicking. (I’m late to sharing this, as it debuted at SxSW in March.) This year’s deck places a strong focus on inclusive design and artificial intelligence.
Computers aren’t good at inclusion. They’re good at exclusion, because they’re only based on past data. The business opportunity for the future-thinking designer is in inclusion. — Fast Company
My book review of Dawn of the New Everything: Encounters With Reality and Virtual Reality by Jaron Lanier (2017). The book is enjoyable and readable, though I did skip around over the more esoteric bits.
The book alternates between a deep autobiographical dive into Lanier’s life and a straightforward account of the history of technology, with an emphasis on virtual reality (VR) and augmented reality (AR). I enjoyed learning about early technologists, from Ivan Sutherland, whose 1963 SketchPad presentation was the “greatest demo of all time,” to Doug Engelbart, whose 1968 productivity software demo reads like a modern tech stack: file versioning, collaborative editing, and video conferencing.
More eye-opening for me than the history is the honest and thoughtful description of VR from one of the industry’s true pioneers.
A lot of joy in VR remains in just thinking about it.
VR trains us to perceive better… we learn to sense what makes reality real. [Because] human cognition is in motion and will generally outpace progress in VR.
A sense of cognitive momentum, of moment-to-moment anticipation, becomes palpable in VR. Like the chi in tai-chi.
The investigation has no end, since people change under investigation.
The technology of noticing experience itself.
Lanier describes VR as feeling your consciousness in its pure form. “It proves you are real.” The exact opposite of what I’d previously thought of when considering VR—I perceived a “fake” or “out of body” experience. Instead, Lanier emphasizes that it’s meant to be temporary. It’s meant to make you think, not just escape. It’s intended to produce the enjoyment of coming back to your true senses, reborn.
Reading notes: I read the hardcover edition from my local library after seeing a mention in both Wired and The Economist. See on Goodreads.
My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.
This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.
Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.
Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.
Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.
What this paper indicates is that my coworker was right to laugh: it’s not useful to hold my own mental model of the entire system. I should, however, strive to learn from every opportunity to update the working knowledge I do have at any given time.
Note: “Resilient performance” sounds like a fancy word for “uptime.”
Much of my team’s work at Automattic is in the area of software quality: error prevention by blocking deploys when automated tests fail, building developer confidence by creating smarter, faster testing infrastructure. So much more we could do there in the future.
Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with Release Engineering teams, they build safeguards such as deploying to a small percentage of production servers for each merge, or starting with a small number of read-only HTTP requests. When no errors occur, the deploy continues.
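The staged deploy described above can be sketched in a few lines. This is only an illustration under my own assumptions: the stage percentages and the `errors_at` callback (which stands in for whatever monitoring reports errors at a given stage) are hypothetical names, not anything from a real SRE toolchain.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[int],
    errors_at: Callable[[int], int],
) -> Tuple[bool, int]:
    """Walk through rollout stages (percent of servers), halting on the first error.

    Returns (completed, last_healthy_percent).
    """
    last_healthy = 0
    for percent in stages:
        if errors_at(percent) > 0:
            # Stop here; the caller is expected to roll back to last_healthy.
            return False, last_healthy
        last_healthy = percent
    # No stage reported errors, so the deploy reached every stage.
    return True, last_healthy
```

Calling `staged_rollout([1, 10, 50, 100], errors_at)` only continues to the next percentage when the current one reports zero errors, which is the "when no errors occur, the deploy continues" rule.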
At a software quality conference last year, I learned from Renato Martins how Groupon approaches this. They use “Canary” tests like those we run on WordPress.com: small, critical tests. Once these pass, they push code into a blue/green deployment system. If any error occurs, the deploy system immediately switches all traffic to a previously known safe version (blue) while reverting the broken one (green). It’s a continuous sequence of systems: one known safe version, one new version.
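To make the blue/green switch concrete, here is a toy model of it. It is a sketch of the general idea as I understood it from the talk, not Groupon’s actual system; the class and method names are hypothetical.

```python
from typing import Optional


class BlueGreen:
    """Toy model: blue is the known safe version, green is the new one."""

    def __init__(self, safe_version: str) -> None:
        self.blue: str = safe_version        # known safe version
        self.green: Optional[str] = None     # candidate version, if any
        self.live: str = "blue"              # which slot receives traffic

    def release(self, version: str) -> None:
        """Put a new version in the green slot and send traffic to it."""
        self.green = version
        self.live = "green"

    def report_error(self) -> None:
        """Any error on the new version sends all traffic back to blue."""
        if self.live == "green":
            self.live = "blue"
            self.green = None  # revert the broken release

    def promote(self) -> None:
        """The new version becomes the known safe one for the next cycle."""
        if self.live == "green" and self.green is not None:
            self.blue, self.green = self.green, None
            self.live = "blue"

    def serving(self) -> Optional[str]:
        """Return the version currently receiving traffic."""
        return self.blue if self.live == "blue" else self.green
```

The point of keeping both slots around is that the rollback is just a traffic switch, so recovery does not depend on rebuilding or redeploying anything.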
Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that they have a Dark Canary, which is a separate server infrastructure that receives the live production HTTP traffic but doesn’t actually reply to the end user’s requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable or not. For example, looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
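The simplest version of that check, counting non-200 responses, could look like the sketch below. As noted, the real analysis is more sophisticated; the function names and the one percentage point `tolerance` are assumptions of mine, chosen only for illustration.

```python
from typing import Iterable


def non_200_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses whose HTTP status code is not 200."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(1 for code in codes if code != 200) / len(codes)


def build_is_reliable(
    baseline: Iterable[int],
    canary: Iterable[int],
    tolerance: float = 0.01,
) -> bool:
    """Accept the build if the canary's non-200 rate stays within
    `tolerance` of the rate seen on the current production build."""
    return non_200_rate(canary) <= non_200_rate(baseline) + tolerance
```

Comparing against a baseline rather than a fixed number matters because production traffic always includes some non-200 responses (redirects, missing pages) that have nothing to do with the new build.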
The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.
Back to the STELLA report.
Proactive learning without waiting for failures to occur.
Experts are typically much better at solving problems than at describing accurately how problems are solved.
Eliciting expertise usually depends on tracing how experts solve problems.
The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below with ease. In this case it deals with mental models of a system (above) versus the actual system (below). Another way of stating it: below the line are details around “why what matters.” Above the line is the deeper understanding around “why what matters matters.”
A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.
So true. I remember seeing an internal post that mapped out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”
Two challenges I’ve come across in practice:
To keep an accurate representation yourself in order to get work done.
To hold a good enough understanding of how others represent it in order to work in a team.
I love the SNAFU stories in this paper. I felt the pain reading them, remembering times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.
Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.
Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
Evolving understanding: start from a fragmented view, expand as you learn how it really works.
Tracing: sweep across the environment looking for clues.
Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
Conflict between a quick fix and gaining a clear understanding of what/why it happened.
Managing risk: pressure is high for a quick fix, but potential for other effects is also high.
Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).
Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.
Anomalies are indications of the places where the understanding is both weak and important.
This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”
The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.
During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself to stay focused on the problem.
Good insight for technical people from software developers to QA to DevOps:
To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.
It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.
“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.
Strange loops are interdependencies in the failure that cause even more issues. For example, when you can’t log errors because the log file stopped working due to a kernel TCP/IP freeze, and the failures themselves caused an overloaded log or full storage.
This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because of automation that’s readily available. But when that automation itself fails, like a hung deploy command, it becomes an existential issue. Now we can’t break the site because our mechanism to quickly recover is gone.
A good summary of the balance between taking time to avoid or pay down technical debt and the pressure to quickly ship visible product changes for customers.
There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.
Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.
Technical debt in general is easy to spot before writing code, by looking at code, and is solved by refactoring. Dark debt is not recognized or recognizable until an anomaly occurs: complex system failures.
In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.
A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” That question is often lacking during internal discussions. We fix the point problem and move on; fighting fires instead of making a fire suppression system.
This STELLA report shows that value exists in participating in open discussions with other companies around these issues. Sharing common patterns is a big benefit of open source software, where you can follow not only the fix but the discussion around it.
More SRE (site reliability engineering) references:
If you manage technical teams and are looking to grow, learn, and broaden your network, you might enjoy connecting with this community of peers from all around the world: Engineering Manager Slack.
I’ve enjoyed participating in the discussions around books, conferences, remote companies, and more. It’s useful both for getting a new perspective once in a while, as I’m exposed to fresh ideas outside my own company’s culture and norms, and for getting a zeitgeist feel of my industry, my “people.”
A computing tip from my friend and WordPress web engineer extraordinaire Chris Marslender.
When pushing code to a Heroku app, make the last step an action to reboot the app, with something like Hubot. That way, any time code changes, the server restarts. If the server isn’t running while its owner is offline, you can push a change to get it working again without pinging them.
This is a book review of The Way of the Web Tester by Jonathan Rasmusson. Hat tip: Alister Scott.
The goal for test automation, according to the author, is to have more time to do the fun things like developing new features, and less time on boring things like fixing bugs. We can’t test everything, yet “with the right 20%, we can sure test a lot.” Agreed. In broad strokes, this book debunks many common misconceptions of automated testing.
Don’t try to automate everything. Instead, automate just enough.
I love the dual audience of testers and developers, and how each chapter addresses the goals for each to learn in the coming text. The chapter ending summaries are handy. The text flows and the examples are easy to follow. Though a quick read, the book ends up covering important topics such as organization, naming, coupling, reusable code, and avoiding flaky tests by making them deterministic.
I love the concept of a Developer Productivity team at a software company. Rasmusson describes a squad at Spotify that went around killing and fixing flaky tests. Making things run better, making everyone happier. I think of Excellence Wranglers at Automattic as having a similar goal in our work as quality advocates.
The Way of the Web Tester does a great job introducing important concepts and covers the basics of automated testing, and I’d recommend it to everyone, even seasoned developers and testers.
With these lines in your SSH config file (usually in the .ssh directory in your user home directory), you’ll enjoy a more reliable remote shell session.
# Do not kill connection if route is down temporarily.
# Allow ten minutes down time before giving up the connection.
# Conserve bandwidth. (Compression is off by default.)
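The directives that go with these comments are not shown above. A plausible reconstruction, assuming standard OpenSSH client options and a catch-all `Host *` block, might look like the following. The option names are real OpenSSH settings, but the exact values are an assumption: 60 seconds between probes times 10 unanswered probes gives the ten minutes mentioned in the comment.

```
Host *
    # Do not kill connection if route is down temporarily.
    TCPKeepAlive no
    # Allow ten minutes down time before giving up the connection.
    # (Assumed values: 60-second probes, give up after 10 misses.)
    ServerAliveInterval 60
    ServerAliveCountMax 10
    # Conserve bandwidth. (Compression is off by default.)
    Compression yes
```

Turning off `TCPKeepAlive` while relying on `ServerAliveInterval` means the liveness check happens inside the encrypted SSH channel instead of at the TCP level, so a brief routing hiccup does not tear down the session.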