Question: “How are things going, both in perception and reality?”
This topic comes up a lot for me lately. As I dig into a reply I find myself grappling with a significant gap. I know there’s bound to be a distance between perception and reality, yet often I don’t know how something is perceived because I’m not listening well. Or I don’t know the truth well enough for my answer to point to something real.
Answer: I have work to do on both ends in order to answer first for myself, then provide the feedback to the original asker.
We are rewarded for the answer. Not another question. It’s beaten out of us from kids, and later in work it can be hazardous for your career. —Warren Berger
Via the Farnam Street podcast I loved this cultural insight. An honest assertion that our business culture rewards quick-hit answers instead of rewarding the act of slowing down to find the right question.
Why do I avoid the backlog and overflowing todo list? Why do I shove one more tool into a drawer already full of bits and bobs? Why do I squeeze yet another outfit into an overflowing closet? Because confronting this mess is hard work. It means making tough choices. Most of the time, I’d rather not decide.
To make sense of my environment, my work, my life—I need to confront the mess. Once the clutter is gone I know I’m left with just the essentials. Once the dust is clear, I can get to work.
In The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing Japanese organizing consultant Marie Kondo explains that while the process of decluttering and cleaning your home is important to your physical wellbeing, the true outcome is happiness and clarity in your mind. The habit gives you the freedom to take responsibility for important decisions.
I learned so much from this book, from awareness and mindfulness to practical tips on folding and hanging clothes. The habit of tidiness is now a mindset for me rather than just a chore to be completed.
The process starts by discarding the inessential items. Tidying up defines what is valuable: learning what I can do without; learning which books, clothes, keepsakes, or kitchen tools give me the most joy.
In applying her principles, my books were the hardest. I had hundreds, many in the category of “I’ll read this someday.” I trimmed the collection down to the 80-90 best of the best, including this one! Hah. I kept the sentimental ones, the must-read-again ones, and the books I reference often. The rest I gave as gifts to a new home or donated.
Life becomes far easier once you know that things will still work out even if you are lacking something.
A clean home is a perfect metaphor for a clear and organized mind. If my room and desk are clear and tidy I can face the reality of what’s in front of me. “It is by putting one’s own house in order that one’s mindset is changed. When your room is clean and uncluttered, you have no choice but to examine your inner state.” Am I scared of what I’ll find?
Because you have continued to identify and dispense with things that you don’t need, you no longer abdicate responsibility for decision making to other people.
Decisions are now easier as I see more clearly the work in front of me. And I enjoy even more the treasures, clothes, and tools I chose to keep.
Why a Flat Organizational Structure will Fail as You Grow is an insightful and thought-provoking study from Lighthouse, a software tool for managers. It’s worth keeping in mind, when considering any decision, that someone else, somewhere before, solved the same issue. That holds from my personal workflow, to team processes and habits, all the way up to key decisions on company structure.
There are a few advantages and many disadvantages to a flat organizational structure as you grow. We share how growth breaks a flat organizational structure.
…if you think it’s a good use of your time to try to innovate in employee on-boarding, performance feedback, quarterly reviews, promotions or weekly all hands meetings, you are mistaken at best and destroying your company at worst.
Call ten friends who work at great companies and crowd source the best practices. These best practices are widely understood and broadly implemented, and the differences are minimal or arguably irrelevant.
What makes a company or product unique? What makes it exceptional? Even though we should continually seek to improve, a strong legacy most likely won’t come from rethinking the 1-1 check-in chat, how we process payroll, or even our technical toolkit.
By modeling organizational excellence on what is already known to work everywhere else we can focus our creativity and innovation on improving the product experiences that help our customers succeed.
Note: My colleague Cate points out that the origins of most technology company practices are outlined in Andy Grove’s classic book High Output Management (1983), describing how to build and run a company.
“I have a hiring heuristic called ABCDEF, which stands for: agility, brains, communication, drive, empathy and fit. For gatekeepers, I’ve found agility is the most important attribute. To test it, I ask them: ‘Tell me a best practice from your way of working.’ Then I ask: ‘Tell me a situation where that best practice would be inappropriate.’ Only agile thinkers can demonstrate that a best practice isn’t always best,” says Ries. “For an attorney, that might be probing for a situation where you shouldn’t run everything by a lawyer. Hopefully they don’t say ‘criminal conspiracy,’ but you want someone to say something like: ‘You know what? If you’re a two person team, and you’re just doing an MVP, and six people are involved, you don’t need a lawyer.’ It requires some common sense and mental flexibility.”
Keep learning. You’ve only touched the edge of the issue. Develop your judgement, which is essentially decision making under uncertainty. Pattern matching: keep growing your pattern matching database, and be very conscious about it.
Heard on the a16z podcast for March 26, 2018 with Andy Rachleff, Wealthfront founder and CEO.
My book review of Dawn of the New Everything, Encounters With Reality and Virtual Reality by Jaron Lanier (2017). The book is enjoyable and readable, though I did skip around over the more esoteric bits.
The book alternates between a deep autobiographical dive into Lanier’s life and a straightforward account of the history of technology, with an emphasis on virtual reality (VR) and augmented reality (AR). I enjoyed learning about early technologists, from Ivan Sutherland, whose 1963 SketchPad presentation was the “greatest demo of all time,” to Doug Engelbart, whose 1968 productivity software demo reads like a modern tech stack: file versioning, collaborative editing, and video conferencing.
More eye-opening for me than the history is the honest and thoughtful description of VR from one of the industry’s true pioneers.
A lot of joy in VR remains in just thinking about it.
VR trains us to perceive better… we learn to sense what makes reality real. [Because] human cognition is in motion and will generally outpace progress in VR.
A sense of cognitive momentum, of moment-to-moment anticipation, becomes palpable in VR. Like the chi in tai-chi.
The investigation has no end, since people change under investigation.
The technology of noticing experience itself.
Lanier describes VR as feeling your consciousness in its pure form. “It proves you are real.” The exact opposite of what I’d previously thought of when considering VR—I perceived a “fake” or “out of body” experience. Instead, Lanier emphasizes that it’s meant to be temporary. It’s meant to make you think, not just escape. It’s intended to produce the enjoyment of coming back to your true senses, reborn.
Reading notes: I read the hardcover edition from my local library after seeing a mention in both Wired and The Economist. See on Goodreads.
My notes and takeaways from a long read on anomalies and system complexity called the STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity, 2017. Via Matt.
This paper is one of the best I’ve read in a while. Many lessons here match my experiences developing—and breaking—software for WordPress. I gained new insight into adaptive mental models, how best to coordinate teams during an outage, and how much I both love, and depend on, debugging and troubleshooting.
Building and keeping current a useful representation takes effort. As the world changes representations may become stale. In a fast changing world, the effort needed to keep up to date can be daunting.
Several years ago I told a colleague in passing that my professional goal as a software developer was to build a mental model of everything in our codebase. To know where each piece lives and how it works. They just laughed and wished me luck. I was serious.
Though my approach may have seemed naïve, or maybe unnecessary in my job, I saw it as essential for survival in a bug-hunting role. A step toward mastery and adding more value to the company. What I didn’t know at the time was that we were past the point where one person could keep the entire codebase organized in their head.
What this paper indicates is that my coworker was right to laugh—it’s not useful to hold my own mental model of the entire system. I should however strive to learn from every opportunity to update the working knowledge I do have at any given time.
Note: “Resilient performance” sounds like a fancy word for “uptime.”
Much of my team’s work at Automattic is in the area of software quality: error prevention by blocking deploys when automated tests fail, building developer confidence by creating smarter, faster testing infrastructure. So much more we could do there in the future.
Many big tech companies have a specific role around this called Site Reliability Engineer (SRE). Combined with Release Engineering teams they build safeguards such as deploying to a small percent of production servers for each merge, or starting with a small amount of read-only HTTP requests. When no errors occur, the deploy continues.
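The staged rollout described above can be sketched in a few lines. This is only an illustration under my own assumptions: the names `staged_rollout`, `deploy_to`, and `error_count` are hypothetical, and a real release pipeline would watch many more signals than a single error counter.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy_to: Callable[[int], None],
    error_count: Callable[[], int],
) -> bool:
    """Deploy in growing percentage steps and stop at the first sign of errors.

    All three parameters are hypothetical stand-ins:
    - stages: percentages of production servers, e.g. [5, 25, 100]
    - deploy_to: pushes the new build to that percentage of servers
    - error_count: errors observed since the last deploy step
    """
    for percent in stages:
        deploy_to(percent)
        if error_count() > 0:
            # Errors appeared at this stage, so the rollout halts here.
            return False
    # No stage reported errors, so the deploy ran to completion.
    return True
```

Passing the deploy and monitoring steps in as functions keeps the sketch independent of any particular infrastructure.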
At a software quality conference last year I learned from Renato Martins how Groupon approaches this. They use “Canary” tests like those we run on WordPress.com: small, critical tests. Once these pass, they push code into a blue/green deployment system. If any error occurs, the deploy system immediately switches all traffic to a previously known safe version (blue) while reverting the broken one (green). A continuous sequence of systems: one known safe version, one new version.
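To make the blue/green idea concrete for myself, here is a minimal sketch of the decision logic as I understood it from the talk. It is not Groupon’s implementation; `blue_green_deploy`, `canary_passes`, and `is_healthy` are hypothetical names I made up for illustration.

```python
from typing import Callable


def blue_green_deploy(
    blue: str,
    candidate: str,
    canary_passes: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the version that ends up serving traffic.

    blue is the last known safe version. The candidate becomes green only
    after the small, critical canary tests pass (hypothetical callbacks).
    """
    if not canary_passes(candidate):
        # Canary tests failed: the candidate never reaches production.
        return blue
    green = candidate
    if not is_healthy(green):
        # An error occurred on green: all traffic switches back to blue.
        return blue
    # Green held up, so it becomes the known safe version for the next deploy.
    return green
```

For example, `blue_green_deploy("v1", "v2", lambda v: True, lambda v: False)` keeps traffic on `"v1"`, because the new version passed its canary tests but failed once live.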
Groupon deploys the blue/green changes to a small subset of the public-facing servers, say 5% of all traffic. On top of that they have a Dark Canary, which is a separate server infrastructure that receives the live production HTTP traffic but doesn’t actually reply to the end user’s requests. They run statistical analysis on the results of this traffic to determine whether the build is reliable or not. For example, looking at HTTP response codes to see how many are non-200. (It’s more sophisticated than that, but basically it’s risk-free testing on a tiny portion of traffic.)
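A rough sketch of the simplest version of that analysis is to compare the share of non-200 responses between the current build and the dark canary. The function names and the `tolerance` value are my own assumptions, and the real analysis is more sophisticated than a single ratio.

```python
from typing import Sequence


def non_200_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses whose HTTP status code is not 200."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code != 200) / len(status_codes)


def dark_canary_ok(
    baseline: Sequence[int],
    candidate: Sequence[int],
    tolerance: float = 0.01,  # assumed threshold, not a figure from the talk
) -> bool:
    """Accept the build if its non-200 rate stays within tolerance of baseline."""
    return non_200_rate(candidate) <= non_200_rate(baseline) + tolerance
```

Because the dark canary never replies to the end user, a failed check here costs nothing except a rejected build.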
The most interesting piece mentioned is that when Groupon first developed this system, they were failing the build once every two weeks or so. But over time that number dropped to almost zero because the developers became conscious of it, and didn’t want to be the one to induce a failure. So it changed their culture, too.
Back to the STELLA report.
Proactive learning without waiting for failures to occur.
Experts are typically much better at solving problems than at describing accurately how problems are solved.
Eliciting expertise usually depends on tracing how experts solve problems.
The concept of “above-the-line/below-the-line” appeared in Ray Dalio’s Principles book as well. Great leaders are able to navigate above and below with ease. In this case it contrasts mental models of a system (above) with the actual system (below). Another way of stating it: below the line are details around “why what matters.” Above the line is the deeper understanding around “why what matters matters.”
A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System. What lies below the line is never directly seen or touched but only accessed via representations.
So true. I remember seeing an internal post that mapped out how a new product worked, with reactions from people saying, “Wow, I had no idea it was this complex.” And, “Thank you, now I see and understand it clearly.” I often think to myself when considering a software system, “This is probably only fully represented in one developer’s mind.”
Two challenges I’ve come across in practice:
To keep an accurate representation yourself in order to get work done.
To hold a good enough understanding of how others represent it in order to work in a team.
I love the SNAFU stories in this paper. I felt the pain reading them, recalling times I’ve caused an outage on WordPress.com or committed bad code to a default WordPress theme.
Pattern: a cascading “pile on” effect—I’ve seen this with user sessions on WordPress.com accumulating into the hundreds of thousands, until our UI tests started failing. We finally saw enough slowdowns that a deeper analysis was warranted to uncover the cause.
Surprise: where my mental model doesn’t match reality (both situational and fundamental shifts).
Uncertainty: failure to distinguish signal from noise can be wasteful. “It is unanticipated problems that tend to be the most vexing and difficult to manage.”
Evolving understanding: start from a fragmented view, expand as you learn how it really works.
Tracing: sweep across the environment looking for clues.
Tools: command line is closest and most common: “in virtually all cases, those struggling to cope with complex failures searched through the logs and analyzed prior system behaviors using them directly via a terminal window.”
Human coordination is interesting and also complex: “This coordination effort is among the most interesting and potentially important aspects of the anomaly response.” (Coworkers and I have noted “watching the systems channel for the entertainment and thrill of the hunt.”)
Communication: chat logs help with the postmortem (I saw this often in themes and WordPress.com outages).
Conflict between a quick fix and gaining a clear understanding of what/why it happened.
Managing risk: pressure is high for a quick fix, but potential for other effects is also high.
Tagging “postmortems”—which at Automattic we do on internal “P2” sites. The paper made me laugh here by calling the archive of these recaps a “morgue” (also used in the journalism/newspaper industry).
Anomalies are unambiguous but highly encoded messages about how systems really work. Postmortems represent an attempt to decode the messages and share them.
Anomalies are indications of the places where the understanding is both weak and important.
This is a key point: learning from outages helps us gain a more accurate understanding of our system. Back to my point about trying to hold it all in my head: “Collectively, our skill isn’t in having a good model of how the system works, our skill is in being able to update our model efficiently and appropriately.”
The authors seem to treat postmortems as a deeply social activity for the teams involved, valuable beyond the dry technical review. At Automattic we could benefit from more intentional structure and synchronous sharing around this activity.
During the anomaly, coordinating the work can be difficult: assigning well-bounded tasks out to individuals to speed up recovery and bringing onlookers and potential helpers up to speed, versus doing it yourself to focus on the problem.
Good insight for technical people from software developers to QA to DevOps:
To be immediately productive in anomaly response, experts may need to be regularly in touch with the underlying processes so that they have sufficient context to be effective quickly.
It’s much harder to work across many codebases and products and be effective in helping resolve an outage. There is high value in “shared experience working in teams” so that communication about the underlying issues is unneeded during a crisis; communications are short and pointed. You already know if your coworker is capable of something, so you don’t even have to ask.
“Sense making” is what I feel I often do in my daily investigative work, and a valuable skill—pattern matching and synthesis.
Strange loops are interdependencies in the failure that cause even more issues. For example, you can’t log errors because the log file stopped working due to a kernel TCP/IP freeze, and the failures themselves caused an overloaded log or full storage.
This bit applies to WordPress.com: continuous deployment can change the culture around site outages, making them “ordinary” and quickly resolved as brief emergencies because of automation that’s readily available. But when that automation itself fails, like a hung deploy command, it becomes an existential issue. Now we can’t afford to break the site because our mechanism to quickly recover is gone.
A good summary of the balance between taking time to avoid or pay technical debt with the pressure to quickly ship visible product changes for customers.
There is an expectation that technical debt will be managed locally, with individuals and teams devoting just enough effort to keep the debt low while still keeping the velocity of development high.
Reminds me of how software development teams expect framework and platform changes to continue during normal product cycles—most teams I’ve worked with struggle with balancing the need to do both.
Technical debt in general is easy to spot before writing code, by looking at code, and is solved by refactoring. Dark debt is not recognized or recognizable until an anomaly occurs: complex system failures.
In a complex, uncertain world where no individual can have an accurate model of the system, it is adaptive capacity that distinguishes the successful.
A key insight: adding new people to the team or bringing in experts for analysis can help answer the question, “Why are things done the way they are?” Often lacking during internal discussions. We fix the point problem and move on; fighting fires instead of making a fire suppression system.
This STELLA report shows that value exists in participating in open discussions with other companies around these issues. Sharing common patterns, which is a big benefit of open source software, where you can follow not only the fix but the discussion around it.
More SRE (site reliability engineering) references: