
March 28, 2019

Blame prevention through devops

This post is a follow-up to my previous post, The question that takes away all blame.
Blameless postmortems, or blameless RCAs, are supposed to be the new normal in devops organisations, but all too often we see that first the team, and sometimes the person, to blame is sought, and then we tell them to ‘fix it’.
You might’ve noticed that I wrote devops in all lowercase in this post's title. I did that on purpose.

Devops sans capital 'D'


Even though you typically see DevOps instead of devops, I think that DevOps implies that it’s about Development and Operations engineers working together. In my opinion, devops is about combining the responsibility of the development team and the responsibility of the operations team, turning them into the responsibility of a single team. This would make devops a matter of responsibility and less of an organisational concern.

The organisational aspect would then be in the form of ‘Product Teams’, responsible for a product, with a Product Owner who is accountable for that product.
Something for another day, though. This article is about blameless post-mortems and root cause analysis, viewed through devops-tinted glasses.
Blameless post-mortems are something that comes more naturally in environments of shared responsibility. Environments where the same people are responsible for both the quality of the product and its usage. Environments where devops is considered the combined responsibility of development and operations within a single team.

Silos vs Accountability

I have observed in a number of organisations that one of the main reasons for considering the move towards devops is the concept of shared responsibility, together with the idea that silos prevent this sharing of responsibility. That is a misconception, though. Silos don't prevent shared responsibility, although culturally they'll probably inhibit the sharing of responsibilities. The real problem is the lack of accountability in a siloed organisation. Or quite the opposite: too many people are accountable for different, conflicting objectives.

In a siloed organisation, each silo is primarily responsible, and even accountable, for its own output: its immediate contribution to the process of product delivery, but not the full process of delivery itself. This means that when the outcome of the process is of the unwanted kind (it caused an incident), either one of the silos’ outputs caused the problem (who is to blame?) or, when there is no single silo to be blamed, nobody can be held accountable.

This can go as far as a sales team being responsible for selling a product, where a signed contract is considered a success. The development team is responsible for changing the product, and the release of the change into production is considered a success. The operations team is responsible for 'running' the product and is considered successful when there are no incidents. This would be the case for SaaS vendors. For more traditional software companies, i.e. those that require implementations of a product at the customer site, the operations team is part of the customer's organisation, or at least the operations accountability typically lies with the customer. It'll be more complicated, because there will likely be an implementation team that is successful when the product is implemented according to the contract that was sold.
Success of the product is defined as selling/changing/operating/implementing. With different people accountable for each of these successes, you can see that conflicts are inevitable. So when a problem happens anywhere in the delivery, each of the accountable persons will explain that they're not to blame, because they were successful. Actually, sales and development were successful, and operations and implementation were handed something that prevented them from being successful. The contract was signed based on the promise that missing features would be available by the time the implementation project reached completion. Future releases are feature complete and the functionality fully tested. But the product is unmanageable, not performant and definitely not secure. And it can't be implemented, because integrations with other systems won't be available until the end of the project.

Siloed organisations are structured around tasks, competencies and expertise. By centralising capabilities, they can be shared across products. Siloed organisations are built on shared service centres. The reason behind these structures is cost reduction through utilisation optimisation. I'm not a fan; see: Perish or Survive, or being Efficient vs being Effective.

Output vs Outcome

It’s the difference between output and outcome that often drives ‘blaming’ in a post-mortem.

In siloed organisations, each silo’s focus is on output, its own output. The silos are in many cases the result of centralising the responsibility for specific aspects of the delivery process, with a lack of accountability for the full process. Specialists are responsible for doing their 'thing' as efficiently as possible.

Often responsibility is mistaken for accountability, so these task-optimised teams, teams of experts, are held accountable for what they deliver, which is not the outcome of the process but the output of their effort. Because they are the experts, they perform their task for different products, i.e. they participate in various delivery processes, and they are held accountable for the total number of tasks they complete.

The problem, therefore, is that the silos operate truly independently of each other. Each silo services several product delivery processes, because servicing only one (product delivery) process would mean that the silo would be idle a significant amount of the time. Since the silos are there to optimise resource utilisation, idle time is undesired. Idle time is considered wasted time by many non-Lean'ers. In Lean, wasted time is time spent on something that is not immediately needed for the delivery of a product.

Every silo will work hard to meet its numbers, to meet its targets. And when the target is the number of tasks performed instead of the number of products delivered, we're doing a lot and contributing nothing.

Within the context of blameless post-mortems, a context where blaming should be prevented, we need to make sure that responsibility is shared for the (product delivery) process outcome, and that accountability is set to manage that outcome. This means that development and operational responsibilities are both defined to contribute to the outcome of the process. Something devops shines at.



Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your WhatsApp friends and everybody in your contact list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.

Arc-E-Tect


The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.

March 21, 2019

The question that takes away all blame


Blameless postmortems, or blameless RCAs, are supposed to be the new normal in devops organisations, but all too often we see that first the team, and sometimes the person, to blame is sought, and then we tell them to ‘fix it’.
It’s in large part a remnant of the siloed organisation and the culture that stems from it. And it is a matter of asking the wrong questions. This is, I would say, about 80% of the reason why we are unable to prevent incidents from recurring.

The cool part is in the ‘asking the wrong questions’ thing.
Now, before going into it, let me emphasise that postmortems and RCAs are about preventing incidents from occurring again. You want to know the root cause, because you want to prevent anything like the incident from ever happening again. Stop reading if you disagree.

‘What?’ is wrong

All too often, when we are dealing with the aftermath of an incident, we wonder ‘what caused this incident?’. Which is a valid question, but not one that is very valuable. The issue I have with this approach is that when we have an answer to this question, we think we've found the root cause of the incident. Which we haven’t.

The answer to the question ‘What caused the incident?’ is typically something technical. There was not enough memory in the server. There was a bandwidth problem. There was a bug in the software that resulted in the incident.

There is something very satisfying in the answer to the question ‘What caused the incident?’; it is in the gratification you get from knowing where the weakness in the product is. It’s in the software, in the hardware configuration, in the network infrastructure. Because when we know where the weakness is, we know who is responsible for the weak component. It’s the development team, the automation team, the network team. And when we know who is responsible for that component, we know who to blame and who to tell to go fix it.

In almost all cases where I’ve been involved in RCAs, putting the blame on somebody was not about punishing that person; it was about identifying who should fix the problem.

The problem is in ‘responsibility’, because the person being held responsible is not necessarily the person that is accountable for the incident. Often, especially in a siloed organisation, there is no one accountable at all.

Although it is important to understand what went wrong, and what caused the impact, we need to realise that this is not the same as understanding what caused the incident. We’re not at the root cause just yet. But we want to be, because this investigation is painful. Colleagues are to blame, and the responsible persons must be called to justice. They must be told that we can never ever feel that impact again. And so, we make sure that next time the impact will be bearable. We increase the memory in the server, increase the bandwidth in our network, fix the bug in our software. All holes are plugged. Ready to go.

If only we had addressed the root cause, it all would be hunky-dory.

‘Why?’ is right

The question that should be asked is not so much about what caused the incident, it’s about why the incident could occur in the first place.

That’s a tough question to answer. Why was there a bug in the software? Why was there not enough bandwidth? Why was there not enough memory in the server?

And that’s only the first ‘Y’.

A very common, tried-and-tested, effective way of identifying the ‘real’ root cause of an incident is by applying the 5-Y method. In this approach you ask five times ‘Why could the previous answer happen?’. Experience has taught us that going five levels deep will get you to the root cause of the problem; sometimes fewer levels suffice, and hardly ever do you need more than five.

Let’s assume that the incident was due to insufficient bandwidth and let’s start asking ‘Why?’ (a small code sketch of the resulting chain follows the list):

  1. Why was there not enough bandwidth? Too many customers accessed the newly released API.
  2. Why did too many customers access the new API? Because we announced it prematurely in our global newsletter.
  3. Why was it announced in our global newsletter? Because the marketing manager wasn’t aware that the API was to be released following the ‘soft-launch protocol’.
  4. Why was the marketing manager not aware of the fact that the release was to follow the soft-launch protocol? Because she was not in the meeting in which it was decided to follow the soft-launch protocol.
  5. Why wasn’t she in the meeting in which it was decided that the API was going to follow the soft-launch protocol? Because she was on vacation and didn’t appoint a delegate.
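
Purely as an illustration (the post itself prescribes no tooling), here is a minimal Python sketch of how such a chain could be captured as data in a post-mortem report. The FiveWhys class and its method names are hypothetical, invented for this sketch, not an existing library:

    from dataclasses import dataclass, field

    @dataclass
    class FiveWhys:
        """A 5-Y chain: an incident plus up to five question/answer pairs."""
        incident: str
        chain: list[tuple[str, str]] = field(default_factory=list)

        def ask(self, why: str, because: str) -> "FiveWhys":
            # Experience says five levels is usually enough; guard against more.
            if len(self.chain) >= 5:
                raise ValueError("Five levels reached; the root cause should be in sight.")
            self.chain.append((why, because))
            return self  # returning self allows chaining ask() calls

        @property
        def root_cause(self) -> str:
            # The last answer in the chain is the candidate root cause.
            if not self.chain:
                raise ValueError("No 'Why?' has been asked yet.")
            return self.chain[-1][1]

    # The bandwidth example from the post, encoded as a chain.
    rca = (
        FiveWhys("API unavailable: insufficient bandwidth")
        .ask("Why was there not enough bandwidth?",
             "Too many customers accessed the newly released API.")
        .ask("Why did too many customers access the new API?",
             "It was announced prematurely in the global newsletter.")
        .ask("Why was it announced in the global newsletter?",
             "The marketing manager wasn't aware of the soft-launch protocol.")
        .ask("Why wasn't she aware of the soft-launch protocol?",
             "She was not in the meeting where the protocol was decided.")
        .ask("Why wasn't she in that meeting?",
             "She was on vacation and didn't appoint a delegate.")
    )

    print(rca.root_cause)  # -> She was on vacation and didn't appoint a delegate.

The data structure makes the same point as the method: the fix addresses the last answer in the chain, not the first.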

Now we know why the incident could occur. Not what caused it, but why it could be caused. Making sure that the marketing manager or a delegate attends meetings in which product launch strategies are decided will prevent this incident from recurring in the future.

Of course, the above is only an example, but it shows that by asking ‘What?’ we end up with a costly technical solution, while by asking ‘Why?’ the solution is better meeting attendance.

Another important conclusion you might’ve drawn is that asking ‘What?’ only involves technical people, further leading the solution down the costly technical path. Whereas the ‘Why?’ question requires all parties involved in the delivery of the product (the API) to attend the postmortem. Getting to the bottom of the incident’s cause requires a multi-disciplinary team. Just like delivering a product requires many disciplines.

It makes no sense to think that creating a success requires many disciplines, but that preventing a failure requires only a single discipline. There is no difference between delivering something that works and something that doesn’t. Not from a product delivery perspective.

Product Owner or Problem Owner?

The question is of course: who would go through all this trouble and assemble all these people who are involved in delivering a product into the hands of our customers? It’s the one that is held accountable for the incident. More importantly, it’s the person that is held accountable for making sure that the incident doesn’t occur again.

The Product Owner would be my preferred role; ownership of the product implies ownership of the success of the product and of all the challenges that come with it.

There is a follow-up story that you can find here.


Thanks once again for reading my blog. Please don't be reluctant to Tweet about it, put a link on Facebook or recommend this blog to your network on LinkedIn. Heck, send the link of my blog to all your WhatsApp friends and everybody in your contact list. But if you really want to show your appreciation, drop a comment with your opinion on the topic, your experiences or anything else that is relevant.

Arc-E-Tect


The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.