Are you ready to handle a complete datacenter power failure?
The role of a system administrator involves three different responsibilities:
1. Be ready to prevent the issues that are expected to arise.
2. Be ready to resolve the issues that were expected but could not be prevented.
3. Be ready to resolve the issues that are unexpected in nature.
We are all well trained to fulfill the first two responsibilities in the most efficient manner. But very few of us are actually ready to deal with the third, and the reason is not technical capability; it is a lack of visibility into how the servers we manage are connected with the other IT infrastructure components (networks, storage, power supply). In this post I am going to shed some light on this third responsibility of the system administrator.
The most unexpected and major outage any IT infrastructure can suffer is a complete power supply failure in a data center caused by a technical or human error. When the power supply fails, it is not only the servers that go down; the other major components, such as network devices and storage devices, go down as well. The figure below shows how the components in a data center are interconnected.
Recovering from such a power failure is a more difficult task than almost any other system administrator responsibility, and the difficulty comes not from technical complexity but from procedural complexity, which demands efficient communication and coordination on one side and the delivery of technical excellence while racing against time on the other.
Most of the time such incidents happen when they are absolutely unexpected (for example, you are in the middle of a weekend beach party and you get a call from your boss saying, "We have a situation here, please come back to the office"). At that moment you don't even get time to understand what is happening or what needs to be done; you just have to act to recover the situation. If you are alone and have never encountered such a situation before, the pressure can be high enough to make all your senses stop working.
So, what does the overall recovery procedure look like for such a disaster?
For quick understanding I have prepared a checklist of 10 recovery steps for a DC power failure incident, and I will explain each of them in detail in the next sections.
Before going through this checklist, the most important thing to do is escalate the incident to your boss the moment you realize it is major in nature. That helps your manager update senior management, who can quickly engage the right people to assess the business risk caused by the technical failure. This comes before any other technical work to resolve the situation.
Step 1: Collect the Impacted Server List from the Server Configuration Database (or Asset Management Tool)
In most enterprise environments, the complete list of servers is available from a centralized configuration database that has enough redundancy to tolerate single-location disasters. The database's query tool provides search facilities to pull the server list by location, network segment, region, etc. As the first step of the recovery procedure, generate such a list for the specific data center that had the issue.
Once you have collected the server list, gather the root password for each server along with its remote console information. The reason is that during the power-up operations many servers may enter maintenance mode and won't be reachable over the network; we need to connect to them through the remote console and log in as root to fix the issues.
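As an illustration, the filtering itself can be as simple as a one-liner over the tool's CSV export. The file name, column layout, and datacenter name below are all assumptions; adjust them to whatever your asset-management tool actually exports.

```shell
# Sketch: filter the asset-management export (assumed CSV columns:
# hostname,datacenter,role,console_ip) down to the impacted datacenter.
# The sample data below stands in for the real export.
cat > inventory.csv <<'EOF'
web01,DC-EAST,app,10.0.1.11
db01,DC-EAST,db,10.0.1.21
web09,DC-WEST,app,10.0.2.11
EOF

# Keep the hostname and remote-console IP for every server in the failed DC.
awk -F, -v dc="DC-EAST" '$2 == dc {print $1, $4}' inventory.csv > impacted_servers.txt
cat impacted_servers.txt
```

The console IP column matters as much as the hostname: it is the only way in when a server is sitting in maintenance mode.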
Step 2: Identify the Power-Up Sequence for the Hosts Captured in Step 1
We all know that not every server in the organization serves the same purpose: some act as infrastructure servers (DNS, NIS, LDAP, DHCP), some are database servers, and the rest are application-specific servers. To bring up the entire server environment with minimum errors, we should power on these servers in a sequence that lets each server meet its dependency requirements on the others.
Normally, major organisations conduct yearly "DC power down" exercises to prepare for this kind of unexpected disaster, and during these exercises each team prepares its own power-up and power-down sequence document for the components it supports (for example, the network team for switches and routers, the DB team for their databases, and the sysadmins for all the servers in the environment).
Step 3: Power on the Network Components
This task is carried out by the network team, and it is the first task performed during power restoration. Bringing up any other component before the network is ready will cause undesired results, and fixing those issues will take more time than expected.
Step 4: Power on the Storage Devices
In enterprise environments, most servers rely on external storage and expect it to be available at boot time. If the storage is not available during boot, some servers will drop into maintenance mode. To recover hosts from such a situation, we have to connect to them through the remote console and fix them manually.
The sample diagram below shows the sequence numbers for powering on each device in the data center.
Step 5: Power on the Infrastructure Servers and Start All Their Services
Infrastructure servers are the servers that provide critical network services to the other servers. For example:
- DNS servers are needed for hostname resolution
- NIS/LDAP servers are needed for server logins
- DHCP servers are needed for IP address allocation
Once these servers are powered on, perform the server health checks described in Step 8 before starting the services on these hosts.
Both during and after boot, every other server on the network tries to establish connections with these infrastructure servers, and if those connections fail, the server will either hang or enter maintenance mode.
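Before declaring this step complete, a quick probe of each service's well-known TCP port can confirm it is actually answering. This is a sketch using bash's `/dev/tcp`; the IP addresses are placeholders, and DHCP (being UDP-based) is better verified by renewing a lease from a client:

```shell
#!/bin/bash
# Sketch: verify core infrastructure services answer on their TCP ports
# before booting the servers that depend on them. IPs are placeholders.
check_port() {  # usage: check_port <name> <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$2/$3" 2>/dev/null; then
    echo "$1 on $2:$3 UP"
  else
    echo "$1 on $2:$3 DOWN"
  fi
}

check_port DNS  10.0.1.53 53    # hostname resolution
check_port LDAP 10.0.1.54 389   # server logins
```

A service whose port answers may still be unhealthy, so this probe supplements, rather than replaces, the Step 8 health checks.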
Step 6: Power on the Database Servers
Once all the infrastructure servers are available, the next servers to power on are the database servers, for two reasons:
1. Most business applications depend on the databases.
2. The database team will need some time to recover any failed databases, and handing the servers over to the DBA team early avoids extra wait time for those recoveries.
But remember: at this point we are only powering on the DB servers; we should not ask the DBA team to start their databases until we finish the server health checks described in Step 8.
Step 7: Power on the Application Servers.
At this step we just power on the application servers and make sure they are reachable on the network, but we shouldn't give the go-ahead to start the applications until we complete the server health checks in Step 8 and all the databases have started.
Step 8: Server Health Checks
Now we have reached the stage where we can demonstrate our technical competency by fixing the issues below.
8.a. Booting Problems
Due to the unexpected power outage, it is possible that some disks or other hardware components in a server have failed. To identify and resolve such issues we have to connect to the console of these servers manually and fix them.
Sometimes there is no real hardware issue at all, and a simple boot-time parameter, such as auto-boot?=false on SPARC machines or the boot device pointing at the wrong device, can be changed to fix the problem.
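On a Solaris/SPARC host, for example, checking and fixing those parameters from the running OS looks roughly like this (the `disk` alias is a placeholder for your real boot device):

```text
# Inspect the boot-related OpenBoot parameters:
eeprom 'auto-boot?'        # shows auto-boot?=false if the box stops at the OBP prompt
eeprom boot-device         # shows the currently configured boot device

# Fix them so the server boots unattended from the right disk:
eeprom 'auto-boot?=true'
eeprom 'boot-device=disk'  # 'disk' is a placeholder alias; use your actual boot disk
```

The same values can be set from the OBP `ok` prompt with `setenv` if the host never gets far enough to run `eeprom`.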
8.b. L0gin Problems
Most of the time, login problems appear when servers are brought up before the infrastructure servers from Step 5 are available on the network. These issues can usually be fixed by logging into the machine as root from the console and restarting the network client services; if that doesn't fix the problem, simply reboot the machine.
8.c. Checking Network Connectivity Status
It is possible that some servers will have network connectivity problems, either because of a server NIC issue or because of switch port issues caused by the power failure. Investigate them closely in the following areas:
— Did the interface come up with the right IP address?
— Did the interface come up with the right speed and duplex?
— Can the interface reach the server's default gateway?
— Are all static routes on the machine available?
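On a Linux host those four checks map to a handful of iproute2 commands (a Solaris admin would reach for `ifconfig`/`dladm` instead). The interface name and gateway address below are placeholders for the host being verified:

```shell
# Sketch of per-host network verification on Linux; values are placeholders.
IFACE=lo                  # e.g. eth0 / bge0 on a real server
GATEWAY=127.0.0.1         # the server's real default gateway in practice

ip -o -4 addr show dev "$IFACE"   # 1. did the interface get the right IP?
ip route show                     # 2. are the default route and static routes back?
ping -c 2 -W 2 "$GATEWAY" >/dev/null \
  && echo "gateway $GATEWAY reachable" \
  || echo "gateway $GATEWAY NOT reachable"
# 3. speed/duplex (physical NICs only):
#    ethtool "$IFACE" | grep -E 'Speed|Duplex'
```

A wrong speed/duplex after a power event often points at the switch port rather than the NIC, which is why these checks are worth doing jointly with the network team.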
8.d. Checking Filesystem Status
At this level we check whether all the filesystems that are supposed to be mounted at boot time are actually available (i.e., the mounts configured in /etc/vfstab on Solaris and /etc/fstab on Linux).
— If any local filesystem is not mounted, run a filesystem check (fsck) and mount it manually.
— If any remote network filesystem is not available, check the NFS service from both the server and the client.
— If any external storage filesystem is unavailable, check it with the storage team and fix any issues related to the volume manager.
— If you have clustered servers, make sure the storage volumes are mounted on the right servers as defined in the cluster configuration.
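The local-mount check is easy to automate on Linux by comparing expected mount points against the kernel's mount table (on Solaris you would compare /etc/vfstab with the output of `mount -p`). The single `/` entry below stands in for the full list taken from /etc/fstab:

```shell
# Sketch: confirm every boot-time mount point is actually mounted, using
# the kernel mount table /proc/mounts (Linux).
check_mounted() {  # usage: check_mounted <mountpoint>
  awk -v m="$1" '$2 == m { found = 1 } END { exit !found }' /proc/mounts
}

for mp in /; do   # in practice: every mount point listed in /etc/fstab
  if check_mounted "$mp"; then
    echo "$mp mounted"
  else
    echo "$mp MISSING - run fsck and mount manually"
  fi
done
```

Running this on every impacted server quickly narrows down which hosts need manual fsck work versus which only need the storage team's attention.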
Step 9: Go-Ahead for the Database Team
Once we complete all the health checks on the database servers, ask the DBA team to start their databases and confirm the application go-ahead.
Step 10: Go-Ahead for the Application Teams
Once the DBA team confirms that all the databases are up, give the go-ahead to the application teams to start their applications.
This concludes the power-on procedure... but wait, we are not done yet.
Many times the application team cannot confirm the application health checks right away, and some issues may surface a little later. So be vigilant and ready to fix them whenever they reach you.
What is your DC power failure experience? Please share it with us; we appreciate your comments and feedback on this post.