Root cause analysis – do you really want to know?
Root Cause Analysis: it’s not about finger-pointing (or at least, it shouldn’t be).
Sometimes it’s just easier if you can blame something (or someone). We are all more comfortable when a handy scapegoat is available. If you really want to solve problems, however, you have to dig. If you are lucky the digging will be brief, but usually it will be a relatively deep process.
When you have a significant failure (i.e., one that you don’t want to experience again), how can you:
- find the root cause (RC)?
- make changes to mitigate or remove the problem?
And when you have processes that can’t be allowed to fail, how do you achieve 100% availability and performance?
Some possible steps in root cause analysis (RCA):
- identify the variables (hardware, software, networking, people, etc.)
- identify the process relationships (automated? real-time? etc.)
- determine which (if any) of the above are outside of your control (a vendor-side problem?)
If a vendor is identified, hand off the problem and require resolution. If the ball is in your court, then armed with the above, proceed by:
- reviewing the components for any recent changes (any hardware, network, OS, application updates/changes?)
- locating/reviewing low-hanging fruit (sometimes the RC is really simple, e.g. the power loss was the result of the CIO testing the emergency power button in the data center. Now we all know that the red button really works, and no additional ‘tests’ are planned for this quarter.)
- isolating the problem areas/devices/processes and handing off to appropriate groups for further research
- attempting to reproduce the problem (this is actually good news if you succeed since it reduces the variables!)
- reviewing, at a detailed level, the hardware/OS/software configurations, processes, and code (eliminate the network, hardware, OS and OS services, and you are down to the application)
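As a rough illustration of the "review the components for recent changes" step in the list above, here is a minimal Python sketch. It assumes you keep some kind of change log; the `Change` record, the `recent_changes` helper, the seven-day window and the sample entries are all hypothetical, made up for this example rather than taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """One entry in a (hypothetical) change log."""
    component: str              # e.g. "network", "OS", "application"
    description: str
    applied_at: datetime
    vendor_owned: bool = False  # True if the component is outside your control


def recent_changes(changes, incident_at, window=timedelta(days=7)):
    """Return changes applied within `window` before the incident, newest first."""
    start = incident_at - window
    hits = [c for c in changes if start <= c.applied_at <= incident_at]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


# Made-up incident time and change log, for illustration only.
incident = datetime(2024, 3, 10, 14, 0)
log = [
    Change("OS", "kernel patch", datetime(2024, 3, 9, 2, 0)),
    Change("network", "firewall rule update", datetime(2024, 2, 1, 9, 0)),
    Change("application", "release 4.2", datetime(2024, 3, 10, 13, 30)),
    Change("hardware", "vendor firmware push", datetime(2024, 3, 8, 22, 0),
           vendor_owned=True),
]

suspects = recent_changes(log, incident)
for c in suspects:
    owner = "vendor" if c.vendor_owned else "internal"
    print(f"{c.applied_at:%Y-%m-%d %H:%M}  {c.component:<12} {owner:<8} {c.description}")
```

Sorting newest first puts the change closest to the failure at the top of the list, which is usually where the low-hanging fruit is. The `vendor_owned` flag is one simple way to mark which suspects would be handed off rather than investigated in-house.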
What about visits from Murphy? (i.e. we can’t find an RC…)
- sometimes, stuff just happens – do what you can to avoid it, but
- always be ready to adapt (for any given mission/process do you have a Plan n^x? or at least, Plan n+1?)
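One way to picture "Plan n+1" in software terms is an ordered list of fallbacks, tried one after another until something works. The sketch below is only an illustration under that assumption; `run_with_fallbacks`, `primary` and `secondary` are hypothetical names invented for the example.

```python
def run_with_fallbacks(plans):
    """Try each (name, action) pair in order; return the first success."""
    errors = []
    for name, action in plans:
        try:
            return name, action()
        except Exception as exc:  # record the failure and move to the next plan
            errors.append((name, exc))
    raise RuntimeError(f"all plans failed: {errors}")


# Hypothetical plans, for illustration only.
def primary():
    raise ConnectionError("primary data center unreachable")


def secondary():
    return "served from standby site"


used, outcome = run_with_fallbacks([("Plan A", primary), ("Plan B", secondary)])
print(used, "->", outcome)
```

If every plan fails, the helper raises with the collected errors, which at least leaves a trail to start the next round of digging.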