Resilience Engineering and Strange Loops

My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.

This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.

Screenshot of the STELLA report from the SNAFUcatchers workshop on coping with complexity.

Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.

Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.

Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.

What this paper indicates is that my coworker was right to laugh—it’s not useful to hold my own mental model of the entire system. I should however strive to learn from every opportunity to update the working knowledge I do have at any given time.

Note: “Resilient performance” sounds like a fancy word for “uptime.”

Much of my team’s work at Automattic is in the area of software quality: preventing errors by blocking deploys when automated tests fail, and building developer confidence by creating smarter, faster testing infrastructure. There’s so much more we could do there in the future.

Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with release engineering teams, they build safeguards such as deploying each merge to a small percentage of production servers, or starting with a small share of read-only HTTP requests. When no errors occur, the deploy continues.


At a software quality conference last year I learned how Groupon approaches this, via Renato Martins. They use “Canary” tests like those we run on WordPress.com: small, critical tests. Once these pass, they push code into a blue/green deployment system, which means that if any error occurs, the deploy system immediately switches all traffic to the previously known safe version (blue) while reverting the broken one (green). There are always two systems in play: one known safe version, one new version.
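
As a rough illustration of the idea (this is a sketch, not Groupon’s actual tooling; the /srv/app paths and the run-canary-tests command are made up), a deploy script might flip which copy of the site serves traffic only after the canaries pass:

# Two complete copies of the app live side by side; a symlink decides
# which one the web server actually serves.
NEW=green     # the freshly deployed build
SAFE=blue     # the previously known safe build

if run-canary-tests "$NEW"; then                  # hypothetical canary test runner
    ln -sfn "/srv/app/$NEW" /srv/app/current      # point traffic at the new build
else
    ln -sfn "/srv/app/$SAFE" /srv/app/current     # stay on the known safe build
    echo "Canary tests failed; keeping $SAFE live" >&2
fi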

Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that they have a Dark Canary, which is a separate server infrastructure that receives the live production HTTP traffic but doesn’t actually reply to the end user’s requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable or not. For example, looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
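
The simplest version of that kind of check is easy to sketch. Assuming an access log in the common combined format (where the status code is the ninth field), something like this reports the non-200 rate for the mirrored traffic; dark-canary-access.log is a placeholder name, and Groupon’s real analysis is far more sophisticated:

# Count non-200 responses as a fraction of all mirrored requests
awk '{ total++; if ($9 != 200) errors++ }
     END { if (total) printf "%.2f%% non-200 (%d of %d requests)\n", 100 * errors / total, errors, total }' dark-canary-access.log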

The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.


Back to the STELLA report.

Proactive learning without waiting for failures to occur.

Experts are typically much better at solving problems than at describing accurately how problems are solved.

Eliciting expertise usually depends on tracing how experts solve problems.

The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below with ease. In this case it contrasts mental models of a system (above the line) with the actual system (below the line). Another way of stating it: below the line are the details of what matters; above the line is the deeper understanding of why what matters matters.

A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.

So true. I remember seeing an internal post mapping out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”

Two challenges I’ve come across in practice:

  1. To keep an accurate representation yourself in order to get work done.
  2. To hold a good enough understanding of how others represent it in order to work in a team.

I love the SNAFU stories in this paper. Feeling the pain reading them, for the times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.

Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.

More patterns:

  • Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
  • Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
  • Evolving understanding: start from a fragmented view, expand as you learn how it really works.
  • Tracing: sweep across the environment looking for clues.
  • Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
  • Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
  • Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
  • Conflict between a quick fix and gaining a clear understanding of what/why it happened.
  • Managing risk: pressure is high for a quick fix, but potential for other effects is also high.

Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).

Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.

Anomalies are indications of the places where the understanding is both weak and important.

This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”

The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.

During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself so you can stay focused on the problem.

Good insight for technical people from software developers to QA to DevOps:

To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.

It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.

“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.

Strange loops are interdependencies in a failure that cause even more issues. For example, you can’t log errors because logging itself stopped working (say, a kernel TCP/IP freeze), while the failures have already overloaded the log or filled storage.

This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because automation is readily available. But when that automation itself fails (like a hung deploy command), it becomes an existential issue: now we can’t afford to break the site, because our mechanism for quick recovery is gone.

A good summary of the balance between taking time to avoid or pay down technical debt and the pressure to quickly ship visible product changes for customers.

There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.

Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.

Technical debt in general is easy to spot ahead of time by looking at the code, and is solved by refactoring. Dark debt is not recognized, or recognizable, until an anomaly occurs: a complex system failure.

In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.

A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” That question is often missing from internal discussions; we fix the point problem and move on, fighting fires instead of building a fire suppression system.

This STELLA report shows the value of participating in open discussions with other companies around these issues and sharing common patterns, which is also a big benefit of open source software, where you can follow not only the fix but the discussion around it.


More SRE (site reliability engineering) references:

How Canaries Help Us Merge Good Pull Requests

A technical update from my colleague Alister on how WordPress.com uses automated tests for build confidence, now running on GitHub pull requests instead of after deployment to production. The tests and the webhook “bridge” infrastructure are open source, just like the Calypso source code itself.

Developer Resources

At WordPress.com we strive to provide a consistent and reliable user experience as we merge and release hundreds of code changes each week.

We run automated unit and component tests for our Calypso user interface on every commit against every pull request (PR).

We also have 32 automated end-to-end (e2e) test scenarios that, until recently, we would only automatically run across our platform after merging and deploying to production. While these e2e scenarios have found regressions fairly quickly after deploying (the 32 scenarios execute in parallel in just 10 minutes), they don’t prevent us from merging and releasing regressions to our customer experience.

Introducing our Canaries

Earlier this year we decided to identify three of our 32 automated end-to-end test scenarios that would act as our “canaries”: a minimal subset of automated tests to quickly tell us if our most important flows are broken. These tests execute after a pull…


SSH Config for Slow Connections

Via Andy Skelton in 2010, proving once again that great advice is timeless.

With these lines in your SSH config file (usually ~/.ssh/config in your home directory), you’ll enjoy a more reliable remote shell session.

# Do not kill connection if route is down temporarily.
TCPKeepAlive no

# Allow ten minutes down time before giving up the connection.
ServerAliveCountMax 30
ServerAliveInterval 20

# Conserve bandwidth. (Compression is off by default.)
Compression yes

Why Isn’t Open Source A Gateway For Coders Of Color?

Food for thought: Why Isn’t Open Source A Gateway For Coders Of Color? (Via my wife, Erin.)

The Investigative Mindset

Knowing where to look for answers is more important than memorizing a set of requirements or rules.

I have a confession: I often have no idea what I’m doing.

I remember clearly what it felt like my first day at my job: I was new, overwhelmed, and maybe even scared. But the work was exciting, mind-filling, and fun. Now, several years into the role, I still feel this push-and-pull. I’ve learned to juggle these opposing feelings and be both productive and successful at my job.

The key to this—and I believe one of the most important traits for my success—is an investigative mindset. Knowing where to look for answers is more important than memorizing a set of requirements or rules.

Why? Rules and requirements change, and the context I work in is constantly changing. I’m more productive in my work by making good, informed decisions—not by the book. I can work smarter, gaining a new awareness of how everything works.

How? To develop this mindset, I exercise the following:

  1. Take initiative on my own first: do the legwork to find the answer. Be tenacious and know where to look.
  2. Ask questions: know when to stop looking and ask for help. Don’t be afraid to be ignorant or wrong.
  3. Share my ideas for a better solution.
  4. Look to my teammates as a critical force—we learn together.

Often, the investigation takes me out of bounds—out of my “area”—that’s OK, and natural. I talk to other people outside my team, and I learn a bit more about how it all works together. I fill in the gaps in my knowledge. I’ve raised my awareness.

It wasn’t always this clear for me … only 3 or 4 months into being at Automattic I had a revelation that changed my mindset, or rather, put it into words. One day, I ran into a quote on one of our internal P2 sites, expressed as a formula or pseudocode.

( intuition + investigation ) > memorization

I said “Yes, Yes, YES!” I was in the privacy of my home office, so no one heard me, of course. It really made sense, though. And it alleviated part of the struggle I was having to completely internalize all the things I was supposed to know and do. All the stats and bots and checklists and dos and don’ts.

This was later echoed by something UX guru Jared Spool said at An Event Apart:

The mindset of investigation is about informed decisions, not going “by the book.” Dogmatic, rule-based methodologies exist only to enforce things; they avoid critical thinking and good decision making.

I realized if I made good, informed decisions I could solve problems in both normal and edge cases. Instead of a one-time answer, I could build a framework to answer any question. A mentality. The outcome of finding the answer, solving the problem, and sharing the solution rewards this mindset. A loop. Doing it over and over.

This feedback loop is hugely powerful. It gives me confidence to continue to strive for an investigative mind.


Notes:

1. A video my colleague Justin Shreve posted echoes this investigative mindset, specifically as it applies to software development: Being a developer is being a problem solver.

2. Thinking about investigation reminded me of my time in the Future Problem Solving (FPS) club in high school. We found creative solutions to mock issues like world hunger or renewable energy. It was fun and challenging, but the best part was the process itself. Investigate, organize, present, debate. Learn. (Random trivia: according to Wikipedia a later team from my school won a state FPS competition. Rock!)

3. One more quote: “Never memorize what you can look up in books.” —Einstein (unsourced)

Review: Forge, a Tool for Bootstrapping a WordPress Theme

Forge is a tool for quickly developing a WordPress theme built by the fine folks at The Theme Foundry.

Forge is a free command-line toolkit for bootstrapping and developing WordPress themes in a tidy environment using front-end languages like Sass, LESS, and CoffeeScript.

During the early development process of this year’s default theme for WordPress, the Twenty Twelve team—Drew Strojny and myself—used GitHub and Forge to build the theme (view the archived source).

I would like to share my thoughts on using Forge during this process now that the theme is back in the core WordPress environment: Subversion and Trac.

In summary: Forge is too restrictive for general theme development.


Change GNU Screen Keyboard Command

I changed the command character for screen from Control-a to Control-b recently, after switching to a wireless Mac keyboard. On this small, portable keyboard—which is the same layout as most Mac laptops—there’s only one Control key, and it’s on the left side of the keyboard.

The weird angle to hit Control-a was hurting my hand. Gone now is the left-side contortion I was forced to make to strike with my pinky and ring finger.

It’s pretty easy to change the command key mapping: just add escape ^Bb to your screen config file (usually ~/.screenrc in your home directory). Here’s what my .screenrc looks like:

# Make the shell in every window a login shell
shell -$SHELL

# Instead of Control-a, make the escape/command character be Control-b
escape ^Bb

# Autodetach session on hangup instead of terminating screen completely
autodetach on

# Turn off the splash screen
startup_message off

# Use a 100000-line scrollback buffer
defscrollback 100000

Automatic WordPress Updates with SVN

Want to keep your WordPress install up to date automatically? Follow these steps to add a cron job to update your WordPress install every 6 hours.

Set up the install

The WordPress install must be a Subversion checkout. You can grab the bleeding edge source with a command like this:

svn co http://core.svn.wordpress.org/trunk/ .

If you aren’t familiar with Subversion, start here:

Schedule the updates

Add the cron job from the command line.

  1. Edit the cron job list.
    crontab -e
  2. Add the cron job (edit the path to your WordPress install).
    MAILTO=""
    # Update WordPress install every six hours
    0 */6 * * * svn up -q ~/path/to/your/wp-install
  3. Save and close.

To learn more about editing cron jobs from the command line, see the man cron and man crontab manual pages.

You can also use a GUI tool like CronniX on Mac OS X to manage the cron jobs.

Notes

  • The -q parameter tells the svn update command to run quietly so that you don’t have to worry about normal output from the cron job. Add the MAILTO="" definition if you want to completely silence cron’s email output.
  • Some systems don’t recognize the */6 syntax in the hour field. If you get an error when trying to save the cron job, you might have to change it to comma-separated hours instead: 0,6,12,18.

Craftsmanship and Coding Standards


Inconsistencies [in your coding style] are jarring and require more time to read and comprehend. Consistency is such a valued quality that developers often abide by a coding standard even if they dislike the coding standard itself. —Chris Shiflett and Sean Coates

I consider following a coding standard a sure sign of craftsmanship. A craftsman knows his tools and languages well, and is consistent in his coding style. Ego and personal preference give way to consistency and best practices.

If you develop on your own, create a coding style and stick with it. There are many practical reasons to abide by code standards when developing on your own. It enforces good habits and helps avoid simple syntax bugs. It speeds up your development process by giving you structure and taking the guesswork out of naming and spacing tasks. And, possibly most importantly, it ensures your code is readable to your future self.

This is all fine and dandy when you’re the only one touching the code. When you share and collaborate with other developers, however, following a coding standard is not a choice—it is mandatory. If you develop for a platform, use the coding standards of the platform. Even if you don’t agree with the standard, you should follow it anyway so that your code is understandable and usable for other developers. Chances are the standards were put in place for very good reasons, both practical and philosophical.

To learn the coding standards for a project you work with, start by looking at other people’s code. Put the standards into practice in your own code and don’t be afraid to ask for a review from someone else. Check your code against similar structures in your software’s codebase, or popular modules, themes, and plugins for the software.

A big part of my job as Theme Wrangler at Automattic is reviewing themes. Lots of themes from lots of designers and developers. Everyone seems to have their own coding style, which makes it difficult to review code and find errors quickly. As a result I spend a lot of time cleaning up themes to match WordPress coding standards before I can even begin the actual work of reviewing and updating the theme.

For that reason code style consistency is especially important to me. It enforces best practices, produces consistent code, and speeds up development between collaborators.

Examples

For the languages I code in often (PHP, HTML, CSS, JavaScript), I use the same general set of standards.

  • Use standard syntax.
  • Remove extra whitespace and line endings.
  • Use consistent spacing and indentation.
  • Use human-readable labels and names.

Following are examples of specific coding standards that I follow in my everyday development.

For a top-notch example of coding standards within a company, see the Fellowship Technologies Code Standards. That, friends, is a well-crafted coding standard that we can all learn from.

Discouraging Image Theft

Recently I received a question from a colleague regarding image theft and how to prevent it. The sad truth is that you can’t. There are techniques to discourage downloading and reuse of your preciously-crafted images, but they generally aren’t 100% effective or user-friendly for your normal site visitors.

The reality of unwanted image downloads is a bit depressing: there is no guaranteed way to protect your images from being taken—the most you can do is discourage it.

First, make sure your copyright notice is clearly posted on each page indicating that downloading or using images without permission is not allowed. In doing so you are legally holding your site visitors accountable if they steal and reuse your content.

If you suspect that image search engines or IE-only users are the culprits, using client- and server-side techniques might help alleviate the problem. But if you are looking for a universal solution, editing the images is your best bet since all these techniques can be circumvented by taking screenshots, using screen scraping tools, or simply viewing the locally cached images.

Client-side techniques

  • Disable the right-click menu
  • Disable the shortcut menu in Internet Explorer
  • Use a transparent image overlay

There are JavaScript solutions for disabling the right-click contextual menu; unfortunately, they will only deter a few people. Anyone who is intent on stealing images can still do so. In older versions of Internet Explorer you can disable the image toolbar that appears when you hover over an image; again, if that is your target browser (or if you suspect that’s where the downloads are coming from), then implementing IE-specific techniques might work.

Most right-click disabling scripts rely on browser-specific functionality or on JavaScript being enabled, so they are unreliable across browsers. The downside is that you risk annoying your real customers by removing expected menus and links.

Another technique—which Flickr employs—is the transparent image overlay. This involves layering a transparent GIF over the top of the image you wish to protect. When the image is right-clicked and saved, the person assumes they are downloading the JPG but instead gets the transparent GIF.

From Flickr’s download prevention help text:

Preventing people from downloading something also means that a transparent image will be positioned over the image on the main photo page, which is intended to discourage people from right-clicking to save, or dragging the image on to their desktop. By “discourage” we do mean simply “discourage”. Please understand that if a photo can be viewed in a web browser, it can be downloaded. The transparent image overlaid on the photo will not keep your images safe from theft, and is intended only as a slight hindrance to downloading.

Using Flash to display images is another method to discourage image theft (since Flash right-click menus can be customized), but it isn’t foolproof. Just like these other techniques, people can simply take a screenshot to capture the image.

Server-side techniques

  • Block image search engines
  • Disable image hotlinking

Image search is a popular way to access images. If you notice a lot of traffic from image search engines, try blocking them with a rule in your robots.txt file. See Remove an image from Google Image Search for more details.
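
For example, a robots.txt record like this asks Google’s image crawler to stay away (only well-behaved crawlers honor robots.txt, so treat this as a deterrent, not protection):

# Keep Google Images from indexing anything on this site
User-agent: Googlebot-Image
Disallow: /

You can narrow the Disallow path to a single images directory if you don’t want to block the whole site.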

I also recommend disabling hotlinking by adding rules to your site’s .htaccess file. Doing so will not only potentially save you bandwidth costs, it will also stop other sites from embedding your images directly without your permission.
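
Here is a minimal sketch of the kind of rules I mean, assuming Apache with mod_rewrite enabled (replace example.com with your own domain):

# Allow requests with an empty referer (direct visits, some proxies and feed readers)
# and requests referred from your own site; refuse all other requests for images.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC,L]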

Image content editing

  • Add watermarks
  • Use very low quality images

Although altering the image affects how it looks and works on your site, it is quite a bit more effective than simply trying to disable downloading or saving. Again, this is only a means to discourage theft — skilled graphic artists can remove a watermark and still have a usable image.

Using low quality images could also help, but finding a good balance between impressing your customers and deterring theft can be difficult.
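
As a concrete example of the watermark approach, ImageMagick’s composite command can overlay a semi-transparent logo from the command line (the filenames here are placeholders, and ImageMagick must be installed):

# Overlay watermark.png onto photo.jpg at 30% opacity, anchored bottom-right
composite -dissolve 30 -gravity southeast watermark.png photo.jpg photo-watermarked.jpg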

Bottom line

If someone really wants the image, they will get it. Using the techniques described above will discourage most people from downloading your images, but remember that posting your images online means you run the risk of anyone downloading and reusing them.

Further reading