December 5, 2013

A Bottom-Up approach to Maintenance and Support doesn't make any sense

First of all, please note that my view on IT is that it services the user and, when used in an enterprise, it services the enterprise. Typically we refer to the part of the enterprise that IT is servicing as 'The Business'. IT has no raison d'être other than to service.
With that being said, it should come as no surprise when I say that I consider IT to be a tool. It's a tool that is supposed to be utilized.

With that out of the way, let's move on to the topic of Maintenance and Support and why it makes no sense to address this bottom up.

First of all, what is bottom-up in this context? For that we define the stack that an IT system typically consists of. At the very bottom of the stack we have the communications infrastructure. This is the Network. It is somewhat of an odd-ball in this discussion, as the Network is not limited to a few systems. But for argument's sake, we start with the Network, the communications layer.
Next we have the hardware layer, which can be physical or virtual. When we're talking virtual, it does make sense to separate this layer into two sub-layers, but for the purpose of this post that division is not relevant.
On top of the hardware we have an operating system, and with that the rest of the system infrastructure. Think about anti-virus, system firewalls, etc.
Next we have the middleware, or rather we have software engines. Think of application servers, database servers, webservers, messaging servers etc. These are not all considered middleware, but they are engines: generic pieces of software that provide specific services to application-specific software.
Which brings me to the next layer, the application software. This is in fact the software that provides the services that the business benefits from.
Finally, there are the business processes in which the applications play a part. As with the communications layer, this is sort of an odd-ball, but within this post it is in fact relevant.

Now looking back at the start of this post, IT services the business. And the business is in fact defined by its processes. Without processes there is no business. Mind you, the processes do not necessarily need to be formalized or even repeatable. I'm just saying that the business is defined by how the various actors interact; these interactions are processes, the business processes.
Consequently, when the business can't execute its processes, it doesn't function. Nothing gets done. When it is impossible to execute the business critical processes, the business ceases to exist.
Thus, from an architect's perspective it becomes critical to understand these processes, or at least their criticality, and the demands on being able to execute them. Hence, and now we're getting close to where I want to be, the architect needs to understand which steps, which actions in the various processes are automated. This is where IT kicks in!

Now you understand that an architect should worry about the availability of the capability to execute a business process, and an IT architect should worry about the automated parts. These are the business services, or in fact the applications. (Yes, I know that I'm simplifying this a bit, but within the context of this post, that is fine.)
So, (parts of) applications need to be available and the data used in these applications needs to be secured. I'm saying parts of applications because more and more we see applications composed of reasonably disparate parts. What is 'available'? It means that a business service, an application's functionality, is accessible and, once started, concludes within an acceptable timeframe. This is important, because when the system does execute but not fast enough, it should be considered unavailable.
What does it mean to secure data? Well, it means that the data has to be available, can't be tampered with and can only be accessed by those who are supposed to have access.
We typically define these by using KPIs, typically RTO (Recovery Time Objective, i.e. how long until a service is available again), RPO (Recovery Point Objective, i.e. how much committed data can be lost) and an up-time, defined as the amount of time the service may be unavailable over a given period.
These are all business requirements, all to be defined by those that can understand what it means when a process can't be executed.
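To make these KPIs a bit more tangible, here is a minimal sketch in Python of how they could be written down per business service. All names in it (ServiceObjectives, Incident, is_available, meets_objectives) are made up for this illustration, and the numbers in the usage note are examples, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceObjectives:
    """Hypothetical KPI set for ONE business service (not for a server or a network)."""
    rto: timedelta          # how long until the service must be available again
    rpo: timedelta          # how much committed data (expressed in time) may be lost
    uptime_target: float    # fraction of the period the service must be available
    period: timedelta       # period over which the up-time is measured
    max_duration: timedelta # acceptable time for one execution of the service

    def downtime_budget(self) -> timedelta:
        """Total time the service may be unavailable within one period."""
        return self.period * (1.0 - self.uptime_target)


@dataclass(frozen=True)
class Incident:
    outage: timedelta       # how long the service was unavailable
    data_loss: timedelta    # how far back the recovered data lags behind


def is_available(succeeded: bool, duration: timedelta, obj: ServiceObjectives) -> bool:
    """An execution that works but is too slow counts as unavailable."""
    return succeeded and duration <= obj.max_duration


def meets_objectives(obj: ServiceObjectives, incidents: list[Incident]) -> bool:
    """Check every incident against RTO and RPO, and the sum against the budget."""
    total_outage = sum((i.outage for i in incidents), timedelta())
    each_ok = all(i.outage <= obj.rto and i.data_loss <= obj.rpo for i in incidents)
    return each_ok and total_outage <= obj.downtime_budget()
```

As an example, an up-time target of 0.999 over a period of 30 days leaves a downtime budget of about 43 minutes. The point of the sketch is that these numbers hang off the business service, which is where the business has to define them.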

We're getting close to where we need to be, or rather where I want to be with this post. The topic of Maintenance and Support. Once a process is being used, it needs to be supported and maintained. For IT systems, it means the same. These systems need to be supported and maintained as well.
The roles that need to maintain and support can be divided into three areas: Functional, Application and Technical. Functional is about the people that understand what the system should do and which services it should provide. Application is about the people that understand how the application works and how it can be kept working. Technical is about the people that understand how it all actually runs on the systems.

Let's bring in the animal kingdom, shall we? There are different ways to skin a cat. By this I mean that there are different ways to keep a process executing. And that's the whole point! The process needs to be kept executing. This is what maintenance and support is about. Full Stop.
By realizing this, it becomes clear that from a functional perspective it must be defined what needs to be done, and in which circumstances, to keep the process executing. And in the cases where IT is needed, it must be defined what IT needs to do.
But that's not the point. The point is that the KPIs that are defined, the RTO, RPO, etc., concern the business service. Explicitly not, and I emphasize this, the application or the infrastructure or the network. The business service is what needs to be managed. So in order to meet these requirements, they must be implemented at the top of the stack. And then you go down the stack realizing these requirements. Just like any other business requirement.

Talking contractual issues, the SLA and OLA, it is the exact same issue. The SLA is defined at the business process layer; the SLA defines to what extent the process can execute and then trickles down. Down the stack, to ensure that it can be done. And the OLAs are there to ensure that everybody is on the same page and commits to help meeting the SLA with everything at their disposal.
It should be clear that it is rather pointless to define and agree on an SLA at a lower level in the stack when that doesn't support the requirements higher up in the stack. It would be a waste of money, as the enterprise will still falter when poop hits the fan.
Again, it's the same as with all other requirements: you don't build something that doesn't help doing business.
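A bit of arithmetic shows why an agreement lower in the stack says little about the business service. The sketch below is in Python and rests on a strong simplification that is mine, not a given: the business service needs every layer of the stack, and the layers fail independently of each other. The function names and the figure of 0.999 are made up for the illustration.

```python
import math

# The technical layers of the stack as described in this post, bottom to top.
LAYERS = ["network", "hardware", "operating system", "engines", "application"]


def composite_availability(layer_availabilities: list[float]) -> float:
    """Bottom-up: availability of a service that needs ALL layers to be up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    return math.prod(layer_availabilities)


def per_layer_target(service_target: float, layer_count: int) -> float:
    """Top-down: the availability each layer needs to meet the service target.

    Assumes the target is spread equally over the layers, which is only one
    of many possible ways to divide it.
    """
    return service_target ** (1.0 / layer_count)


# Bottom-up: every layer has its own agreement of 0.999.
bottom_up = composite_availability([0.999] * len(LAYERS))

# Top-down: the business service needs 0.999, so what does each layer need?
top_down = per_layer_target(0.999, len(LAYERS))
```

Under these assumptions, five layers that each deliver 0.999 add up to roughly 0.995 for the business service, which is less than any of the individual agreements promise. Starting from the top, a service target of 0.999 requires more than 0.999 from each layer. Only the second calculation starts from what the business actually asked for.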

Agreed, my post title should've been "A Top-Down approach to Maintenance and Support is the right way" as this is what I'm discussing here. But I chose to be a bit controversial here. Why? Because too often I see that the bottom-up approach is taken. Enterprises consistently fix availability at the infrastructure level. And genuinely believe that this is cutting it. Not realizing that it doesn't. Furthermore, they invest extensively in expensive IT solutions that are typically complex to maintain and support. And because of this, rigid standardization is enforced to keep costs down.
Consistently, enterprises use technology, preferably hardware, virtualized, to solve business problems that are not perceived as business requirements. We actually call them non-functionals most of the time.
This is counter-intuitive because any other business requirement is actually addressed top-down.

One of the reasons for this behavior is that 'the business' doesn't think about how important a business process is, how important the automated activities are, how valuable the data is. When asked for the RPO and RTO, it is typically perceived to be a technical issue. Typically the answers to RPO and RTO are that they need to be '0', i.e. no downtime and no data loss. Of course this is incorrect in all circumstances, because the RPO and RTO are to be defined at the business service level, and in pretty much all cases it takes a significant amount of analysis to come up with the real numbers.

So, yes, the title is correct. Why? Because we keep on doing it the wrong way. We keep on messing up. We keep on spending money where it doesn't need to be spent. And we keep on not delivering what is actually required. Why? Because thinking about it and getting to the real answer is actually hard, and just throwing more kit at the problem seems like it will fix it.
