Blameless postmortems, or blameless RCAs, are supposed to be the new normal in DevOps organisations, but all too often we first look for the team, and sometimes the person, to blame, and then tell them to ‘fix it’.
To a large extent this is a remnant of the siloed organisation and the culture that stems from it. It is also a matter of asking the wrong questions. This is, I would say, about 80% of the reason why we are unable to prevent incidents from recurring.
The cool part is in the ‘asking the wrong questions’ thing.
Now, before going into it, let me emphasise that postmortems and RCAs are about preventing incidents from occurring again. You want to know the root cause because you want to prevent anything like the incident from ever happening again. Stop reading if you disagree.
‘What?’ is wrong
All too often, when we are dealing with the aftermath of an incident, we wonder ‘What caused this incident?’. It is a valid question, but not a very valuable one. The issue I have with this approach is that once we have an answer to this question, we think we have found the root cause of the incident. We haven’t.
The answer to the question ‘What caused the incident?’ is typically something technical. There was not enough memory in the server. There was a bandwidth problem. There was a bug in the software that resulted in the incident.
There is something very satisfying in the answer to the question ‘What caused the incident?’: the gratification you get from knowing where the weakness in the product is. It’s in the software, in the hardware configuration, in the network infrastructure. Because when we know where the weakness was, we know who is responsible for the weak component. It’s the development team, the automation team, the network team. And when we know who is responsible for that component, we know who to blame and who to tell to go fix it.
In nearly all the RCAs I’ve been involved in, putting the blame on somebody was not about punishing that person; it was about identifying who should fix the problem.
The problem is in ‘responsibility’, because the person being held responsible is not necessarily the person who is accountable for the incident. Often, especially in a siloed organisation, there is no one accountable.
Although it is important to understand what went wrong and what caused the impact, we need to realise that this is not the same as understanding what caused the incident. We’re not at the root cause just yet. But we want to be, because this investigation is painful. Colleagues are to blame, and the responsible persons must be called to justice. They must be told that we can never ever feel that impact again. And so we make sure that next time the impact will be bearable. We increase the memory in the server, increase the bandwidth in our network, fix the bug in our software. All holes are plugged. Ready to go.
If only we had addressed the root cause, it would all be hunky-dory.
‘Why?’ is right
The question that should be asked is not so much what caused the incident, but why the incident could occur in the first place.
That’s a tough question to answer. Why was there a bug in the software? Why was there not enough bandwidth? Why was there not enough memory in the server?
And that’s only the first ‘Y’.
A very common, tried-and-tested and effective way of identifying the ‘real’ root cause of an incident is the 5-Y method, also known as the ‘5 Whys’. In this approach you ask ‘Why could the previous answer happen?’ five times. Experience has taught us that going five levels deep usually gets you to the root cause of the problem; sometimes fewer levels are enough, and you hardly ever need more than five.
Let’s assume that the incident was due to insufficient bandwidth and let’s start asking ‘Why?’:
- Why was there not enough bandwidth? Too many customers accessed the newly released API.
- Why did too many customers access the new API? Because we announced it prematurely in our global newsletter.
- Why was it announced in our global newsletter? Because the marketing manager wasn’t aware that the API was to be released following the ‘soft-launch protocol’.
- Why was the marketing manager not aware of the fact that the release was to follow the soft-launch protocol? Because she was not in the meeting in which it was decided to follow the soft-launch protocol.
- Why wasn’t she in the meeting in which it was decided that the API was going to follow the soft-launch protocol? Because she was on vacation and didn’t appoint a delegate.
Now we know why the incident could occur. Not what caused it, but why it could be caused. Making sure that the marketing manager or a delegate attends the meetings in which product launch strategies are decided will prevent this incident from occurring in the future.
Of course, the above is only an example, but it shows that asking ‘What?’ leads to a costly technical solution (more bandwidth), whereas asking ‘Why?’ leads to better meeting attendance.
Another important conclusion you might’ve drawn is that asking ‘What?’ only involves technical people, which leads the search for a solution further down the costly technical path. The ‘Why?’ question, on the other hand, requires all parties involved in the delivery of the product (the API) to attend the postmortem. Getting to the bottom of the incident’s cause requires a multi-disciplinary team, just like delivering a product requires many disciplines.
It makes no sense to think that creating a success requires many disciplines, while preventing a failure requires only one. From a product delivery perspective, there is no difference between delivering something that works and something that doesn’t.
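For those who like to see it written down, the chain of ‘Whys’ from the example can be captured as plain data. The snippet below is only a minimal sketch: the names `Why`, `CHAIN`, `render` and `root_cause` are made up for this illustration and are not part of any tool, and it assumes that the answer to the last ‘Why?’ is the root cause, which is exactly the assumption the 5-Y method makes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Why:
    """One level of the chain: a 'Why?' question and the answer given."""
    question: str
    answer: str


# The bandwidth example from the text, one entry per 'Y' (hypothetical names).
CHAIN = [
    Why("Why was there not enough bandwidth?",
        "Too many customers accessed the newly released API."),
    Why("Why did too many customers access the new API?",
        "We announced it prematurely in our global newsletter."),
    Why("Why was it announced in our global newsletter?",
        "The marketing manager wasn't aware of the soft-launch protocol."),
    Why("Why was the marketing manager not aware of the soft-launch protocol?",
        "She was not in the meeting in which it was decided."),
    Why("Why wasn't she in that meeting?",
        "She was on vacation and didn't appoint a delegate."),
]


def render(chain: Sequence[Why]) -> List[str]:
    """Format the chain as numbered lines, one per level."""
    return [f"Y{level}: {why.question} {why.answer}"
            for level, why in enumerate(chain, start=1)]


def root_cause(chain: Sequence[Why]) -> str:
    """Return the deepest answer, assumed here to be the root cause."""
    if not chain:
        raise ValueError("ask at least one 'Why?' first")
    return chain[-1].answer


if __name__ == "__main__":
    for line in render(CHAIN):
        print(line)
    print("Root cause:", root_cause(CHAIN))
```

Note that the first answer in the chain is the only technical one; every level below it is about people and process, which is the whole point of asking ‘Why?’ instead of ‘What?’.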
Product Owner or Problem Owner?
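The question is of course: who would go through all this trouble and assemble all the people who are involved in delivering a product into the hands of our customers? It’s the one who is held accountable for the incident. More importantly, it’s the person who is held accountable for making sure the incident doesn’t occur again.
The Product Owner would be my preferred role: ownership of the product implies ownership of the success of the product and all the challenges that come with it.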
There is a follow-up story that you can find here.
Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your Whatsapp friends and everybody in your contact-list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.
Arc-E-Tect
The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.