In The big secret of small improvements Tal Bereznitskey explains how to improve “quick fix days,” where software teams take time to make small improvements. Those small changes can together mean a big win for customers and the business.
At Automattic we’ve experimented with everything from 1-day bug scrubs in one team all the way up to a full “hack week,” so Tal’s principles strike a chord with me.
Framing the problem is halfway to solving it — I love how he suggests rewording the subject line of a software change to fix a bug as something actionable, not just a description of the problem.
6. Well defined. Only work on tasks that are defined properly. Prefer “Make content scrollable” over “Bug: can’t see content when scrolling”.
Create positive feedback loops — I remember during my days answering WordPress.com Themes bug reports and how rewarding it was to hear directly from the people I helped with a bug fix.
7. Thank you. There’s nothing like hearing a customer say “Thank you!”. When a quick-fix was suggested by a customer, let the developer email him and tell him the good news.
This is the work: customer kindness — Our latest iteration at Automattic speaks to this customer focus as the goal of the maintenance work — it isn’t just polish or cleanup, this is the product work. We even have a fun acronym for it now! H.A.C.K. — Helping Acts of Customer Kindness.
My notes and takeaways from a long read on anomalies and system complexity called STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.
This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.
Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.
Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.
Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.
What this paper indicates is that my coworker was right to laugh—it’s not useful to hold my own mental model of the entire system. I should however strive to learn from every opportunity to update the working knowledge I do have at any given time.
Note: “Resilient performance” sounds like a fancy word for “uptime.”
Much of my team’s work at Automattic is in the area of software quality: error prevention by blocking deploys when automated tests fail, building developer confidence by creating smarter, faster testing infrastructure. So much more we could do there in the future.
Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with Release Engineering teams they build safeguards such as deploying to a small percent of production servers for each merge, or starting with a small amount of read-only HTTP requests. When no errors occur, the deploy continues.
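To make the staged idea concrete, here is a minimal sketch in Python. It assumes a deploy can be widened in steps and that some error counter can be read after each step. The names `staged_rollout`, `deploy_to`, `error_count`, and `rollback` are hypothetical stand-ins for illustration, not any real deploy tool's API.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy_to: Callable[[int], None],
    error_count: Callable[[], int],
    rollback: Callable[[], None],
) -> bool:
    """Widen a deploy stage by stage; stop and roll back on the first error."""
    for percent in stages:
        deploy_to(percent)       # expose the new build to `percent` of servers
        if error_count() > 0:    # any error at this stage halts the rollout
            rollback()
            return False
    return True                  # every stage stayed clean


# Usage with in-memory stand-ins (hypothetical; no real servers involved).
log = []
ok = staged_rollout(
    stages=[1, 5, 25, 100],
    deploy_to=lambda p: log.append(f"deploy {p}%"),
    error_count=lambda: 0,
    rollback=lambda: log.append("rollback"),
)
```

A real system would likely wait and watch each stage for a while before widening it; this sketch checks once per stage to keep the control flow visible.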
At a software quality conference last year I learned how Groupon approaches this via Renato Martins. They use “Canary” tests like those we run on WordPress.com: small, critical tests. Once these pass, they push code into a blue/green deployment system. If any error occurs, the deploy system immediately switches all traffic to a previously known safe version (blue) while reverting the broken one (green). It’s a continuous sequence of systems: one known safe version, one new version.
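The blue/green switch can be sketched as a tiny state machine. This is my own simplified illustration, not Groupon's implementation: the `BlueGreen` class and its method names are hypothetical, and canary checks are reduced to a single pass/fail function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreen:
    blue: str            # version known to be safe, kept ready as the fallback
    green: str = ""      # newest version; empty until a deploy happens
    live: str = "blue"   # which slot currently receives traffic

    def deploy(self, version: str, canary_passes: Callable[[str], bool]) -> bool:
        """Put `version` in the green slot and route traffic to it if canaries pass."""
        if not canary_passes(version):
            return False          # never reaches users; blue keeps serving
        self.green = version
        self.live = "green"
        return True

    def on_error(self) -> None:
        """On a production error, switch all traffic back to blue and drop green."""
        self.live = "blue"
        self.green = ""

    def promote(self) -> None:
        """Green has proven itself, so it becomes the new known-safe version."""
        if self.live == "green":
            self.blue, self.green, self.live = self.green, "", "blue"

    def serving(self) -> str:
        return self.blue if self.live == "blue" else self.green


site = BlueGreen(blue="v1")
site.deploy("v2", canary_passes=lambda v: True)
during = site.serving()   # traffic is on the new version
site.on_error()
after = site.serving()    # traffic is back on the safe version
```

The point of keeping two slots is that recovery is a routing change rather than a fresh deploy, which is what makes the rollback immediate.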
Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that they have a Dark Canary, which is a separate server infrastructure that receives the live production HTTP traffic but doesn’t actually reply to the end user’s requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable or not. For example, looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
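I don't know which statistics Groupon actually runs, so the following is only a sketch of the simplest version of the idea: compare the share of non-200 responses from the dark canary against the current production build, and fail the build only when the canary is significantly worse. The function names and the choice of a one-sided two-proportion z-test are my assumptions.

```python
import math
from typing import Sequence


def non_200_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses whose HTTP status is anything other than 200."""
    return sum(1 for code in status_codes if code != 200) / len(status_codes)


def canary_is_reliable(
    baseline: Sequence[int],
    canary: Sequence[int],
    z_limit: float = 2.33,  # roughly a 1% one-sided significance level
) -> bool:
    """Return False only if the canary's non-200 rate is significantly higher."""
    n1, n2 = len(baseline), len(canary)
    p1, p2 = non_200_rate(baseline), non_200_rate(canary)
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        return True  # both samples have identical rates (all 200 or all non-200)
    return (p2 - p1) / std_err <= z_limit


# Hypothetical samples: 1% errors in production, then two candidate builds.
production = [200] * 990 + [500] * 10
steady_build = [200] * 988 + [500] * 12
broken_build = [200] * 900 + [500] * 100

steady_ok = canary_is_reliable(production, steady_build)
broken_ok = canary_is_reliable(production, broken_build)
```

Here the steady build passes because a 1.2% error rate is within noise of 1%, while the broken build's 10% rate fails the check. Because the dark canary's replies are discarded, a failed check costs end users nothing.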
The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.
Back to the STELLA report.
Proactive learning without waiting for failures to occur.
Experts are typically much better at solving problems than at describing accurately how problems are solved.
Eliciting expertise usually depends on tracing how experts solve problems.
The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below with ease. In this case it deals with mental models of a system (above) with the actual system (below). Another way of stating it: below the line are details around “why what matters.” Above the line is the deeper understanding around “why what matters matters.”
A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.
So true. I remember seeing an internal post that mapped out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”
Two challenges I’ve come across in practice:
To keep an accurate representation yourself in order to get work done.
To hold a good enough understanding of how others represent it in order to work in a team.
I love the SNAFU stories in this paper. I felt the pain reading them, recalling the times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.
Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.
Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
Evolving understanding: start from a fragmented view, expand as you learn how it really works.
Tracing: sweep across the environment looking for clues.
Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
Conflict between a quick fix and gaining a clear understanding of what/why it happened.
Managing risk: pressure is high for a quick fix, but potential for other effects is also high.
Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).
Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.
Anomalies are indications of the places where the understanding is both weak and important.
This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”
The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.
During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself to stay focused on the problem.
Good insight for technical people from software developers to QA to DevOps:
To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.
It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.
“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.
Strange loops are interdependencies in the failure that cause even more issues. For example, when you can’t log errors because the log file stopped working due to a kernel TCP/IP freeze, and the failures themselves caused an overloaded log or full storage.
This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because of automation that’s readily available. But when that automation itself fails, like a hung deploy command, it becomes an existential issue. Now we can’t afford to break the site, because our mechanism to quickly recover is gone.
A good summary of the balance between taking time to avoid or pay technical debt with the pressure to quickly ship visible product changes for customers.
There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.
Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.
Technical debt in general is easy to spot before writing code, by looking at code, and is solved by refactoring. Dark debt is not recognized or recognizable until an anomaly occurs: complex system failures.
In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.
A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” Often lacking during internal discussions. We fix the point problem and move on; fighting fires instead of making a fire suppression system.
This STELLA report shows that value exists in participating in open discussions with other companies around these issues. Sharing common patterns, which is a big benefit of open source software, where you can follow not only the fix but the discussion around it.
More SRE (site reliability engineering) references:
About this Inclusive Design series —Today (February 16, 2018) I’m giving a talk on inclusive design at WordCamp Phoenix 2018. Leading up to the conference I’ve been publishing notes on voices, stories, products, and other resources: everything I’m learning about this emerging practice. Read more about the series.
Inclusive and diverse teams make better, stronger teams — and these teams make better decisions. Because our work and thought patterns are influenced by our background and biases, working with a diverse group means not only fresh, new ideas, but we also counterbalance the tendency to design for people just like ourselves. A higher standard.
And that is why representation matters, not just to those who are represented, but to all of us. Because it expands our sense of what’s possible, and what we have reason to expect. —Cate Huston
For maximum learning and a broader perspective, don’t limit yourself to your immediate team or company; seek out a wide variety of inputs from mentors, coaches, and other advisors.
If your team is limited and you don’t have the ability to expand, actively seek out people with other perspectives to consult or act as project advisors, and give special consideration to their feedback.
As a company that wants to unleash the potential in every team, depicting people is especially important. How we represent the people who make up teams should be just as important. We’ve always known that the best teams are balanced; made of a diverse group of people with different backgrounds and perspectives, but our illustrations haven’t always reflected that.
The authors found that even though their team aspired to be more inclusive, how they represented themselves visually wasn’t keeping pace with the true diversity of the team.
Promoting diversity and inclusion within our brand is a persistent and multi-faceted effort. And it’s a challenge to depict diversity without it feeling merely perfunctory or symbolic until the reality of our industry truly represents the customers we serve and the world at large. More needs to be done outside of the brand to promote an inclusive workplace, but we’ve found that the results of constant vigilance and open conversation are worth the time and energy.
To truly represent our customers is something Automattic is improving — we still have a long way to go. If you missed the story about updating the WordPress.com brand illustrations to be more diverse, see Inclusive Design, Day 5/15: To See Yourself in Imagery — with illustrator Alice Lee and my designer colleague Joan Rho.
For a thorough treatment of this topic, I highly recommend reading and bookmarking “On Improving Diversity in Hiring” from my Automattic colleague Cate Huston. In this in-depth article, she shares her hiring expertise to build diverse teams, everything from onboarding and recruiting to specific tips and tricks during interviews.
This rule of thumb about stopping the behavior before someone is hired hit home with me as this is something I need to improve on personally. An off-color joke here, a comment there; I’m learning to speak up more when I notice these things.
A good rule for inclusion pre-work to diversity is to stop doing things you would have to change if the demographics of your team better reflected the demographics of the world. —Cate Huston
One practical tip shared by Cate that I’ve put to good use is Textio, a service to help make job descriptions more inclusive. I used it in 2016 to update the Excellence Wrangler job posting, replacing phrases like triage ruthlessly with triage efficiently.
Cate’s influence in the last year or so has helped me improve my hiring to be more inclusive, both in mindset and in practice. She’s inspired me to read more broadly, and think more openly.
For day 15 of 15 of inclusive design, the last day, I’ll share a recap of all the inclusive design learnings I’ve shared in this series so far.
About this Inclusive Design series —Tomorrow I’ll give a talk on inclusive design at WordCamp Phoenix 2018. Leading up to the conference I’ve been publishing notes on voices, stories, products, and other resources: everything I’m learning about this emerging practice. This is day 14 of 15. Read more about the series.
Speed and connectivity should be considered a major factor in exclusion. Just ride the BART in San Francisco. 😀
Joking aside, much of the world does not enjoy the wonders of high-speed bandwidth yet. Like William Gibson famously said, “The future is here, but it’s not evenly distributed.” Kansas City Google Fiber gigabit on one end, on the other Tegucigalpa less-than-Edge with wires hanging off a string.
As evidence of the disparity, consider the “lite”* apps built by tech giants for markets where they want to drive adoption. The need for a reduced-weight experience in places with low-speed wired broadband and tenuous mobile broadband highlights the case of exclusion: large populations are left out of the “modern web” due to connectivity limitations, cost of entry, archaic device types, and many more reasons both cultural and political. (*Side note: what the heck is with that spelling?)
One story I noticed recently that mentioned speed as a leading tech market indicator involved WhatsApp’s growth in India even as Facebook lags behind them, via The Economist, January 27, 2018. Sluggish web app performance is a factor in Facebook’s lack of adoption in India. People who pay by the megabyte or gigabyte prefer to use a service that is leaner, faster, less bloated. They’re voting with their app choices.
In more ways than one, WhatsApp is the opposite of Facebook… whereas Facebook requires a fast connection, WhatsApp is not very data-hungry.
As a result [of this and other reasons], WhatsApp has become a social network to rival Facebook in many places, particularly in poorer countries. Of the service’s more than 1.3bn monthly users, 120m live in Brazil and 200m in India.
Extend the Benefits to Everyone
On the plus side, designing for speed brings about broad improvements to everyone else in the world. People should love the simpler interface with fewer settings and menus, alongside the bandwidth savings and reduced footprint for the app’s data storage needs.
Back to the trend of tech giants creating lighter versions of their apps. When I take a closer look at the apps like “Twitter Lite” and “Facebook Lite” — at first they appear to be primarily designed for speed on slow connections. Yet the changes bring a new and different experience to many people who are mobile-first or non-technical.
The design enhancements that result in a simpler and more intuitive app extend the benefit to a wide variety of people. For example, better readability from larger text size and the usability win from simpler navigation and clearer labels. That sounds like something the AARP crowd would all buy or click on or subscribe to.
If you’re curious about the “lites” — here’s further reading.
With Facebook Lite, our goal is to provide the best possible Facebook experience to everyone, no matter their device or connection. And we hope that by sharing how we built the app, we can encourage more people to build for the next billion coming online. — via How we built Facebook Lite for every Android phone and network
The next billion coming online! Ambitious.
Is Calypso Fast Yet?
Goal: Calypso is the WordPress Lite.
Calypso designers also pay attention to the user interface, of course — recently we’ve made the text size larger and improved the color contrast for readability. My team at WordPress.com is now digging into label changes and interactions needed for a refreshed, simpler navigation for managing WordPress websites.
For those curious, we track speed improvements in Calypso on this data-rich website: iscalypsofastyet.com. And, we’re hoping to improve both the mobile web performance and the usability of the app even more in 2018.
In a blog post Speed is a key design attribute John Maeda highlights two strong voices in recent web history — speaking out on the value of speed and performance: Marissa Mayer and Lara Hogan. They’ve both been preaching this same topic for years. I’m sure today no one argues the pivotal role of speed in Google’s early success and how it led them to market dominance.
I’ve felt this slowness most times I travel, even in the US — in airports, hotels, taxis, trains. Most definitely when in other countries, because I’m limited by my data plan’s built-in speed limitation. Or, as when I visited a WordCamp in Nicaragua, the slow mobile “broadband” is the reality for everyone living there.
Keep in mind that much of the world now sees the web only through a mobile device, which brings us to a message from Wapuu: Mind the mobile!
For day 14 of 15 of inclusive design, I’ll share behind-the-scenes details of the work Automattic designers put into our inclusive guide and checklist.
About this Inclusive Design series —In 2 days I’ll give a talk on inclusive design at WordCamp Phoenix 2018. Leading up to the conference I’m publishing notes on voices, stories, products, and other resources: everything I’m learning about this emerging practice. This is day 13 of 15. Read more about the series.
Tide, a project started here at XWP and supported by Google, Automattic, and WP Engine, aims to equip WordPress users and developers to make better decisions about the plugins and themes they install and build.
Tide is a service, consisting of an API, Audit Server, and Sync Server, working in tandem to run a series of automated tests against the WordPress.org plugin and theme directories. Through the Tide plugin, the results of these tests are delivered as an aggregated score in the WordPress admin that represents the overall code quality of the plugin or theme. A comprehensive report is generated, equipping developers to better understand how they can increase the quality of their code.
Once up and running these automated tests would update the plugin and theme description with a status and score so everyone knows whether they pass the tests or not, from PHP version compatibility to the quality of the “front-end output.”
The Tide project is now officially moved over to the WordPress project. See the related story on WP Tavern for a longer history. And, if you’re curious like me about the tech “innards” — take a look at the source code on GitHub.
I love the genesis of the name:
“…inspired by the proverb ‘A rising tide lifts all boats,’ thinking that if a tool like this could lower the barrier of entry to good quality code for enough developers, it could lift the quality of code across the whole WordPress ecosystem.” Rob Stinson
One key to success: Tide makes it super easy for developers to identify weaknesses in their code and learn how to fix them. It’s not just about getting a high score or ranking better against a minimum requirement. It’ll teach us all to improve. I love that.
Because the goal of commercial software development isn’t to create code you love—it’s to create products your customers will love.
Recent efforts with my team at Automattic to improve the WordPress.com experience — and understand our customers better through “exposure hours” — reminded me of this classic software development essay from 2013 (via Andrew).