Savings Plan for System Admin – It’s about Time, not Money.
There are several ways a UNIX administrator can save time in the regular job: writing effective automation scripts for daily routine tasks, installing handy desktop tools that give quick and easy access to the entire UNIX environment, and maintaining proper monitoring dashboards that support proactive system administration. But all of these techniques save time only during normal daily work, when nothing disastrous is happening to our environment.
Which situations do we consider the worst in a UNIX admin's job? A few examples:
- Natural disasters
- Complete data center power outages
- Server failures due to power shortage
- Hardware component failures on critical servers
- User and system database corruption for a mission-critical application
- Complete application failures
- Complete network failures due to network component failure
- Web server or other essential server failures due to security vulnerabilities
The bad thing is that these worst moments often come without prior notice, and we can't keep skilled people sitting on the floor all the time just waiting for them. The regular support administrators (with beginner and intermediate skills) who work on routine system admin tasks tend to stay in that routine mindset for long periods, so it takes them a while to understand the problem and take the action needed to resolve it quickly. We can't simply blame them for not reacting promptly, because it is not always a skill problem; it is a matter of how the human mind works.
If that worst moment does happen, do you know what saves our time and gives us breathing space to work with all the UNIX resources available on the floor, leveraging their skills irrespective of their experience and knowledge? If you guessed some kind of information document, you are nearly right: it is the RUNBOOK.
A runbook gives the team the freedom to work at the required level of expertise during a disaster, even when the skill levels actually available are a little lower than required. A runbook also improves problem response and resolution times, so the team can deliver support within the SLA (Service Level Agreement). In the current competitive market, it is highly important to deliver service within the agreed SLAs.
What does a runbook contain?
A runbook lists all the instructions that a UNIX administrator needs for day-to-day operations, and it also contains the information needed to respond to emergency situations. It should contain everything necessary to enable a staff member to perform any process, from performing a backup to failing over to a remote site.
A runbook that holds day-to-day operational instructions is considered a UNIX operational runbook. A runbook that holds instructions for critical disaster recovery and failover operations is considered a BCM (Business Continuity Management) runbook. Depending on the size of the organization and its IT infrastructure, system administrators maintain one or more runbooks for each application and environment.
Normally, a runbook should include the following information:
- Resource information about the data center and its hardware and software
- Contact information for each resource involved in the runbook instructions, or the tools needed to find that contact information
- Process information, including step-by-step procedures for operational and emergency processes
- A version number and a revision history, with a proper comment for every update made
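The items above can be sketched as a small data structure. This is only an illustrative sketch: the class names, the field names, and the `missing_sections` check are assumptions made for this example, not a standard runbook format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contact:
    role: str   # hypothetical example: "Storage on-call"
    name: str
    phone: str


@dataclass
class Procedure:
    title: str
    steps: List[str]          # ordered, step-by-step instructions
    emergency: bool = False   # True for disaster recovery / failover procedures


@dataclass
class Revision:
    version: str
    date: str      # plain date string, kept simple for the sketch
    comment: str   # what changed in this update


@dataclass
class Runbook:
    application: str
    environment: str
    resources: List[str] = field(default_factory=list)        # data center, hardware, software
    contacts: List[Contact] = field(default_factory=list)
    procedures: List[Procedure] = field(default_factory=list)
    revisions: List[Revision] = field(default_factory=list)

    @property
    def version(self) -> str:
        """The current version is the latest entry in the revision history."""
        return self.revisions[-1].version if self.revisions else "0.0"

    def missing_sections(self) -> List[str]:
        """Return the names of the required sections that are still empty."""
        required = {
            "resources": self.resources,
            "contacts": self.contacts,
            "procedures": self.procedures,
            "revisions": self.revisions,
        }
        return [name for name, items in required.items() if not items]
```

A check like `missing_sections()` could be run whenever the runbook is reviewed, so that an empty contact list or a missing revision entry is noticed before the emergency rather than during it.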
When do we first prepare a runbook, and how do we validate the instructions?
Normally, a runbook is first prepared when the applications/servers are placed in the production environment. It should then be updated at least once a year by performing proper disaster recovery (DR) tests. For financial organisations in particular, performing regular DR tests and updating runbooks regularly is a mandatory requirement.
Is a runbook the same as the documents we see on unixadminschool.com?
Absolutely not. A runbook is a document with a highly customized set of instructions that works for a specific application, environment, or organization. Runbooks are not generic technical documents; they are confidential and cannot be shared outside the organization.
Some aggressive sysadmins ask: why do you need a runbook? Why can't you script or automate everything and scrap the runbook?
I would recommend that you first prepare the runbook with manual instructions and then automate each task in it, but I don't advise automating the entire runbook as a single automation task, because during a disaster we want a controlled recovery. In fact, when a disaster happens, scripts are often the first thing to break, so we can't simply rely on automation in those situations.
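One way to picture "automate each task, but keep the recovery controlled" is a runner that executes the runbook steps one at a time and asks the operator before each one. This is a minimal sketch under assumptions: `run_controlled`, the `Step` type, and the log wording are made up for this example, and each step's action is assumed to be a function that returns `True` on success.

```python
from typing import Callable, List, Tuple

# (description, action); the action returns True on success
Step = Tuple[str, Callable[[], bool]]


def run_controlled(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run steps in order, asking the operator before each one.

    Stops at the first declined or failed step, so a human can take over
    with the manual instructions, and returns a log of what happened.
    """
    log: List[str] = []
    for number, (description, action) in enumerate(steps, start=1):
        if not confirm(f"Step {number}: {description}. Proceed?"):
            log.append(f"step {number} skipped by operator, stopping")
            break
        try:
            ok = action()
        except Exception as exc:  # a broken script must not hide the failure
            log.append(f"step {number} raised {exc!r}, continue manually")
            break
        if not ok:
            log.append(f"step {number} failed, continue manually")
            break
        log.append(f"step {number} done")
    return log


def ask_operator(prompt: str) -> bool:
    """Interactive confirmation; anything other than 'y' counts as no."""
    return input(prompt + " [y/N] ").strip().lower() == "y"
```

Calling `run_controlled(steps, confirm=ask_operator)` would walk an operator through the steps. Because the runner stops at the first failure instead of pressing on, the manual runbook instructions remain the fallback when a script breaks.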
Here is the conclusion: invest your time in preparing a runbook for the project you work on or deliver, and get good returns in the long term.
If you work in a team with a few expert-level resources and more intermediate/beginner-level resources, and you work on projects while also spending a lot of time resolving issues escalated by other system administrators, then this is the right action for you.
What are your thoughts about runbooks, and what do you suggest to others?