“She cannae take any more, Cap’n…”
Are you ready for a *Scotty Event? Ready for the next level?
So what are the limits of your system? Of your solution?
Several years ago my employer had a great product with a major problem (which, of course, surfaced as a Scotty Event). Part of the systems team’s job was to make sure that the OS and hardware were up to the task – in this case it was pretty much a no-win scenario. The application would go into a death spiral. The programming side of the house assured us that it was NOT a programming problem – it had to be a ‘system issue’. Hmm.
The Unix system engineers would check and re-check all possible system-level solutions (OS configuration changes, kernel tweaking, hardware tweaking, etc.). Note that the hardware environment was Intel-based and the OS was venerable but did not really scale beyond a 4-way CPU configuration. It turned out that the problem:
- was not the OS (the OS/hardware just made it difficult to locate the issue)
- was not a simple hardware issue (i.e. inadequate compute, memory or other resources)
- was not an OS configuration issue (i.e. we were tweaked to the max…)
Throwing more CPU, RAM and disk at the problem (still Intel-based) did not resolve the issue – note that the OS was retained while the initial hardware changes were made. We did finally get past this particular issue (or at least we reached a scenario that allowed it to eventually be resolved).
The solution
Moving to new hardware and a new OS led to a resolution. The new combination allowed everyone to clearly see that this was not a system issue – it was a coding/application (file locking) problem. The reason it could not be seen with the previous OS/hardware was that the system would take a hard crash with little clue as to what was causing the problem (other than a heavy load scenario…). One moment all was well, and the next every phone in the building was ringing as the system became more and more unresponsive. Of course, users would begin trying to start new application instances, since they needed to complete their tasks (increasing the load further), and down the system went.
As we worked on the transition (i.e. the port to the new OS/hardware), one of our system team members brought in a flashing red light (which, of course, he would switch on when the Scotty Event surfaced…). It became a battle-like scenario as we struggled to retain control of the system (using somewhat drastic approaches that should be avoided). We would have several system engineers ‘working the box’ until either we stabilized the system or the box took a hard crash…
The problem did not go away with the new OS/hardware – but the box no longer crashed; the application would simply ‘hang’ (all users would become part of an extended queue.) During these ‘hangs’ we were able to locate the file(s) where the problem seemed to concentrate and this led to locating the portion of the application that needed adjustment.
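To make the idea of ‘seeing’ lock contention concrete, here is a minimal sketch (not our actual application code, which is not shown here). It assumes advisory locks taken with `flock` on a Unix-like system, which may not match the locking mechanism the application really used; the names `try_lock` and `demo` and the file `shared.dat` are made up for illustration. The point is simply that a lock attempt with a timeout reports that it is stuck, and on which file, instead of quietly joining the queue.

```python
import fcntl
import os
import tempfile
import time


def try_lock(fd, timeout=1.0, poll=0.05):
    """Try to take an exclusive advisory lock on fd.

    Returns the seconds waited on success, or None if the lock
    could not be obtained within `timeout` seconds.
    (Hypothetical helper, for illustration only.)
    """
    start = time.monotonic()
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return time.monotonic() - start
        except BlockingIOError:
            if time.monotonic() - start >= timeout:
                return None  # contention: the caller can log the file name
            time.sleep(poll)


def demo():
    """Two independent opens of one file stand in for two app instances."""
    path = os.path.join(tempfile.mkdtemp(), "shared.dat")
    holder = os.open(path, os.O_CREAT | os.O_RDWR)
    waiter = os.open(path, os.O_RDWR)
    try:
        first = try_lock(holder)                  # first instance gets the lock
        blocked = try_lock(waiter, timeout=0.2)   # second instance times out
        fcntl.flock(holder, fcntl.LOCK_UN)        # first instance releases
        after_release = try_lock(waiter, timeout=0.2)
        return first is not None, blocked is None, after_release is not None
    finally:
        os.close(holder)
        os.close(waiter)


if __name__ == "__main__":
    print(demo())
```

Logging the file name whenever `try_lock` returns `None` would point at the hot file(s) directly, which is roughly what the ‘hangs’ on the new box let us do by hand.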
Over time all of our large clients were moved to the new OS/hardware, AND, the code was ‘fixed’.
Soooo – will new OS/hardware fix all problems? (No.) How about moving your operations to Virtual Machines or perhaps Cloud Computing solutions? (No, again.)
Moving to a new solution can, perhaps, lead to resolving the real problem(s) – as well as providing new opportunities and maybe, just maybe, reducing costs. But are the newest technologies going to resolve your Scotty Events?
* Star Trek – reference to the chief engineer, ‘Scotty’.