Six Engineers Share Their Tips for Running Blameless Retros
Over the past few years, blamelessness has become an increasingly central guiding principle for engineering teams. A blameless philosophy assumes that individuals have fundamentally good intentions and are doing the best with the information they have at any given time. It encourages teams to look beyond the potential missteps of individuals to instead identify larger, system-level root causes when things go wrong. (See “Further Reading” for some of our favorite writings on the subject.)
There are lots of worthwhile reasons for building a blameless engineering culture, but it can be hard to implement in practice.
We sat down with six engineering leaders to discuss their perspectives on how to bring the benefits of blamelessness to sprint retrospectives.
Eric Weinstein (Senior Director of Engineering, ZipRecruiter)
TLDR: Embrace complexity and specificity
Make sure everyone understands the ground rules at the start (ideally laid out by whatever party is running the retrospective; this person's job is to keep the meeting on track and ask clarifying questions). Given that this is a blameless exercise, I advocate focusing on:
- Contributing causes. Identify contributing causes, rather than trying to identify a single root cause. Modern software systems are complex—frequently displaying emergent properties—and are constantly changing. While some incidents have a single clear cause, most have multiple contributing causes. For example, breakdowns in communication among teams often amplify production incidents that occur at the boundaries and integration points of those teams' services and applications.
- Dependencies. Highlight dependencies among systems or teams, especially dependencies that were discovered by the incident being discussed. All of software engineering is about managing complexity, and a big part of that is identifying and managing dependencies between organizational and technical components.
- Concrete action items. As the discussion progresses, you'll start to get a sense of potential changes to process, best practices, or the codebase that could prevent or mitigate future incidents. Make sure the action items you develop are concrete. For example, "pay down tech debt" is vague, while "refactor the payments controller to exponentially back off the payment gateway to avoid generating excessive load" is much better. Assign each item to a specific person or team, who should be in the room (if the right people were invited to the retro). If they're not in the room, assign an action item to someone in the room to follow up.
Martin Gordon (Tech Lead, Amperity)
TLDR: Go hard on process, soft on people
I really like the notion of going hard on process and soft on people. It assumes good intent from all parties and steers folks toward problem solving rather than focusing too much on who did something wrong or why they did it.
One exercise that I’ve found valuable is the “Five Whys.” No big reveal here – it’s literally exactly what it sounds like. :) Whenever we’re looking at an outcome (why we missed a deadline, for example, or why a critical bug was introduced into production), we force ourselves to ask “why” five times.
It can sometimes feel awkward or silly. But the answer at the end is never “because Martin messed up.” It’s “because we don’t have adequate safeguards in place” or “because we haven’t built in more redundancy.”
Jake Shorty (Sr. Software Engineer, GitHub)
TLDR: Acknowledge and plan for failure
One approach I’ve seen in blameless retros is to try to strip away any sense of personal agency. For example, it’s common to see after-action reviews that encourage language like “a bug was pushed to production.” This is well-intentioned. It tries to shift the focus from assigning blame to getting to the root of what’s broken – which is a systems-level challenge.
But I think the opposite approach is actually better. Whenever I push a bug to production, I try to explicitly acknowledge the role that human judgment (or error) played. The best leaders I’ve worked with do the same.
The goal here isn’t to “fall on the sword.” It’s actually to defang and plan for failure. We all make mistakes, and that’s critical to acknowledge. The focus should be on how we build systems and processes that are resilient to human error.
Andrew Lim (Data Scientist, Amperity)
TLDR: Reward vulnerability. Structure for engagement
The goal of "blamelessness," at least as I understand it, is to induce honesty and impartiality in devising and assessing solutions to problems.
Everyone should know that. I think it's good practice to give a brief verbal reminder at the start that the retro is blameless.
The senior people organizing the retro must have credibility on this front, or there's no point. And if someone voluntarily shares something they did that didn't show them at their best, a little nudge of encouragement for their honesty is nice.
People need to be engaged – not just one or two people, but the whole team. To that end, structure matters a lot. There should be someone responsible for the timer, ideally a publicly visible one, and in my opinion they should be actively expected to enforce a hard, potentially mid-sentence stop on all speakers. The schedule should also include buffer/"parking lot" time at the end (I'd say at least 10 min if the whole thing is an hour) to return to things you missed.
Sam Dallas (Software Engineer)
TLDR: Build trust. Everything else will follow
The key to running a blameless retrospective is creating an environment of trust. If trust isn’t there, you can introduce whatever processes you want – but team members will still worry about repercussions and look for ways to cover their asses.
Creating trust is easier said than done. But modeling real leadership goes a long way.
In my first engineering job, my manager sat me down and told me: “At some point you’re going to bring down the site.” He wasn’t trying to discourage me. On the contrary, he was trying to empower me to raise my hand when I was in trouble and seek help. For team leads, it’s critical to build a relationship with team members that lets them know that you recognize they’re human – and that’s okay.
It’s also critical to get the big moments right. How does the team lead react when a system failure first surfaces? Your reaction is so important – because when something melts down, you don’t want team members to hide it. Team leads should monitor and critique their own reactions to crisis, including even conducting a mini-retro around the response itself.
Will Cheng (Engineering Lead, T. Rowe Price)
TLDR: It’s all about the follow-up
I spend a lot of time thinking about the fact that blameless retros almost always end up surfacing system issues: processes that need to be reworked, technical debt that needs to be paid down, and the like. Too often, we identify those issues – but then don’t fight for the resources to address them in the next sprint.
As a result, we spend a lot of energy root causing, but then never actually address the root cause because we have to race ahead to the next feature. It’s dispiriting. And eventually the engineering team loses faith in this whole “blameless” thing, since it never seems to actually deliver any improvements or changes.
So my recommendation would be: as an engineering lead, learn how to advocate persuasively for investment in fixing the problems that come out of a blameless retro. Become bilingual in the “language of the business” so that you can get time allocated for servicing technical debt and building more resilient systems.
Further Reading
- Google’s incredible piece on learning from failure, from the SRE handbook
- Jason Smale, SVP of Engineering at Zendesk, shares how Zendesk embraces a blameless culture
- A step-by-step guide to a blameless postmortem from Atlassian
- PagerDuty’s great article on building a culture of blamelessness