Are you ready to handle a complete data center power failure?

The role of a system administrator involves three different responsibilities.

1. Be ready to prevent the issues that are expected to arise.

2. Be ready to resolve the issues that were expected but could not be prevented.

3. Be ready to resolve the issues that are unexpected in nature.

We are all well trained to fulfill the first two responsibilities efficiently. But very few of us are actually ready to deal with the third, and the reason is not technical capability; it is the lack of visibility into how the servers we manage are connected to the other IT infrastructure components (networks, storage, power supply). In this post I am going to shed some light on this third responsibility of the system administrator.

One of the most unexpected and severe outages any IT infrastructure can face is a complete power supply failure in a data center caused by a technical or human error. When the power supply fails, it is not only the servers that go down; the other major components, such as network devices and storage devices, go down as well. The figure below shows how the components in a data center are interconnected.

[Figure: how the data center components are interconnected]

Recovering from such a power failure is a more difficult task than most other system administrator responsibilities, and the difficulty is not technical complexity but procedural complexity: it requires efficient communication and coordination on one side, and the delivery of technical excellence while racing against time on the other.

Most of the time, such incidents happen exactly when they are absolutely unexpected (for example, you are in the middle of a weekend beach party and you get a call from your boss saying, “We have a situation here, please come back to the office”). At that moment you hardly get time to understand what is happening or what needs to be done; you just have to do something to recover the situation. If you are alone and have never faced such a situation before, the pressure can be high enough to make all your senses stop working.

 

Well, what does the overall recovery procedure look like for such disasters?

 

For quick understanding, I have prepared a checklist with 10 recovery steps for a DC power failure incident, and I will explain each of them in detail in the next sections.

 

Before going through this checklist, the most important thing you have to do is escalate the incident to your boss the moment you realize it is major in nature. That helps your manager update senior management, who can quickly engage the right people to assess the business risk caused by the technical failure. This is the most important task before attempting any other technical fix.

 

Step 1: Collect the impacted server list from the Server Configuration Database (or Asset Management Tool).

In most enterprise environments, the complete list of servers is available from a centralized configuration database that has enough redundancy to tolerate a single-location disaster. The configuration database query tool provides search facilities to filter the server list by location, network segment, region, and so on. As the first step of the recovery procedure, you should generate such a list for the specific data center that had the issue.

Once you have collected the server list, gather the root password for each server along with its remote console information. The reason is that, during the power-up operations, many servers may enter maintenance mode and will not be reachable over the network. We will need to connect to them through the remote console and then log in as root to fix the issues. A small reachability check is sketched below.
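As a minimal sketch, assuming the configuration database can export the impacted hosts and their console addresses into a simple text file (the file name and its hostname,console_ip format are hypothetical), a quick loop can tell you which remote consoles are already reachable:

    #!/bin/sh
    # Sketch: check remote console reachability for the impacted servers.
    # Assumes a hypothetical export file "impacted_servers.csv" with lines
    # of the form: hostname,console_ip  (ping flags shown are Linux-style)
    while IFS=, read -r host console; do
        if ping -c 1 -W 2 "$console" >/dev/null 2>&1; then
            echo "$host: console $console reachable"
        else
            echo "$host: console $console NOT reachable - check with DC ops"
        fi
    done < impacted_servers.csv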

 

Step 2: Identify the power-up sequence for the hosts captured in the above list

We all know that not every server in the organization serves the same purpose: some act as infrastructure servers (DNS, NIS, LDAP, DHCP), some are database servers, and the rest are application-specific servers. To bring up the entire server environment with minimum errors, we should power on these servers in a sequence that lets each server meet its dependency requirements on the others.

Normally, in major organisations, support teams conduct yearly “DC power down” exercises to prepare for this kind of unexpected disaster, and during those exercises each team prepares its own power-up and power-down sequence document for the components it supports (for example, the network team prepares the list for switches and routers, the DB team prepares the list of databases, and the sysadmins prepare the list for all the servers in the environment). A minimal sketch of how such a sequence list can drive the power-on is shown below.
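As a sketch only, assuming each server has an IPMI-capable management processor and a hypothetical dependency-ordered file power_up_sequence.txt (infrastructure first, then database, then application; the user name and pacing are placeholders), the power-on could be driven like this:

    #!/bin/sh
    # Sketch: power hosts on in dependency order via their BMC/ILOM consoles.
    # power_up_sequence.txt (hypothetical) has one entry per line:
    #   group,hostname,console_ip     e.g.  infra,dns01,10.0.0.11
    while IFS=, read -r group host console; do
        echo "[$group] powering on $host via $console"
        ipmitool -H "$console" -U admin -P "$IPMI_PASS" chassis power on
        sleep 30    # crude pacing; real pacing comes from your dependency checks
    done < power_up_sequence.txt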

Step 3: Power on the Network Components

This task is carried out by the network team, and it is the first task to be performed during power restoration. Bringing up any other component before the network is ready will cause undesired results, and fixing those issues will take more time than expected.

Step 4: Power on the Storage Devices

In an enterprise environment, most servers rely on external storage and expect that storage to be available at boot time. If the storage is not available during boot, some servers will go into maintenance mode. To recover hosts from such a situation, we have to connect to them through the remote console and fix them manually. A few common visibility checks are shown below.
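Once the arrays are back, a handful of standard commands help confirm that a host can actually see its external LUNs again (which commands apply depends on the OS and the multipathing stack in use; these are only common examples):

    # Linux: list SCSI devices and multipath state (lsscsi may need installing)
    lsscsi
    multipath -ll

    # Solaris: probe for attached LUNs and list visible disks
    luxadm probe
    echo | format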

The sample diagram below shows the sequence numbers for powering on each device in the data center.

[Figure: power-on sequence numbers for the data center devices]

Step 5: Power on the Infrastructure Servers and Start All the Services.

Infrastructure servers are the servers that provide critical network services to the other servers. For example:

  • DNS servers are important for hostname resolution
  • NIS/LDAP servers are important for server logins
  • DHCP servers are important for IP allocation.

Once these servers are powered on, we should perform the server health checks described in Step 8 before starting the services on these hosts.

Both during and after boot, every other server in the network tries to establish connections with these infrastructure servers, and if those connections fail, the server will either hang or enter maintenance mode. A few quick sanity checks are sketched below.
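Before declaring these services healthy, a few quick client-side checks can confirm they are answering (the hostnames and domain below are placeholders):

    # DNS: can the name server resolve a known host?
    dig @dns01.example.com app01.example.com +short

    # NIS: is this client bound to a NIS server?
    ypwhich

    # LDAP: can we reach the directory? (anonymous base query as a smoke test)
    ldapsearch -x -H ldap://ldap01.example.com -b "dc=example,dc=com" -s base

    # DHCP is easiest to verify from a freshly booted DHCP client.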

Step 6: Power on the Database Servers

Once all the infrastructure servers are available, the next servers to be powered on are the database servers, for two reasons:

1. Most of the business applications depend on the databases.

2. The database team will need some time to recover the failed databases, and an early handover of the servers to the DBA team avoids extra waiting time for the database recoveries.

But remember that we are only powering on the DB servers; we should not ask the DBA team to start their databases until we finish the server health checks mentioned in Step 8. A quick pre-handover check is sketched below.
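As a minimal sketch, assuming the database data lives on dedicated mount points (the /oradata path below is only a placeholder), a pre-handover check on the DB server might look like this:

    # Sketch: run on the DB server before notifying the DBA team.
    who -r        # confirm the host reached its normal run level
    uptime        # confirm it has stayed up since power-on
    mount | grep /oradata || echo "WARNING: /oradata not mounted - hold the handover"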

Step 7: Power on the Application Servers.

At this step we just power on the application servers and make sure they are reachable on the network, but we should not give the go-ahead for application start-up until we complete the server health checks mentioned in Step 8 and all the databases have been started.

 

Step 8: Server Health Checks

Now we have reached the stage where we can demonstrate our technical competency by fixing the issues mentioned below.

8.a. Booting Problems

Due to the unexpected power outage, it is possible that some of the disks or other hardware components in a server have failed. To identify and resolve such issues we should connect to the console of these servers manually and fix them.

Sometimes there is no hardware issue at all, and simply correcting a boot-time parameter, such as “auto-boot? left at false on SPARC machines” or “boot device set to the wrong device”, fixes the problem. A short example is shown below.
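For instance, on a SPARC machine sitting at the OpenBoot ok prompt, the usual checks and corrections look roughly like this (the disk0 alias is only an example; use whatever alias is valid on that host):

    ok printenv auto-boot?          \ check whether automatic boot is enabled
    ok setenv auto-boot? true       \ re-enable it if it was left at false
    ok printenv boot-device         \ confirm which device the machine boots from
    ok setenv boot-device disk0     \ example alias only - use the correct one
    ok boot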

8.b. Login Problems

Most of the time, login problems appear when servers are brought up before the infrastructure servers from Step 5 are available on the network. These issues can usually be fixed by logging into the machine as root from the console and restarting the naming-service client; if that does not fix the problem, simply reboot the machine.
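The exact service names vary by OS and release; as common examples, restarting the NIS client and the name-service cache might look like this:

    # Solaris 10 (SMF)
    svcadm restart svc:/network/nis/client:default
    svcadm restart svc:/system/name-service-cache:default

    # Older Linux (SysV init)
    service ypbind restart     # NIS client
    service nscd restart       # name-service cache daemon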

8.c. Checking Network Connectivity Status

It is possible that some of the servers will lose network connectivity, either because of a server NIC issue or because of switch port issues caused by the power failure. You have to closely investigate them in the areas below (a few quick commands are sketched after this list):

— Is the interface up with the right IP address?

— Is the interface running at the right speed and duplex?

— Can the interface reach the server’s default gateway?

— Are all the static routes on the machine available?
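A few standard commands cover these checks (the interface names and gateway address are placeholders; Linux and Solaris variants are noted):

    # Addresses and link state (both OSes; "ip addr" on newer Linux)
    ifconfig -a

    # Speed and duplex
    ethtool eth0                 # Linux
    dladm show-dev e1000g0       # Solaris 10

    # Default gateway reachability and routing table
    ping 192.168.1.1             # placeholder gateway address
    netstat -rn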

8.d. Checking Filesystem Status

At this level we check whether all the filesystems that are supposed to be mounted at boot time are actually available (i.e. the mounts configured in /etc/vfstab on Solaris and /etc/fstab on Linux). A few example commands follow the list below.

— If any local filesystem is not mounted, run a filesystem check (fsck) and mount it manually.

— If any remote network filesystem is not available, check the NFS service on both the server and the client.

— If any external storage filesystem is unavailable, check it with the storage team and fix the related volume manager issues.

— If you have clustered servers, make sure the storage volumes are mounted on the right servers as defined in the cluster configuration.
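As a rough sketch (the device name, NFS server and mount point are placeholders), the typical sequence for a missing local or NFS mount looks like this:

    # Compare what is mounted against what should be
    df -h
    cat /etc/vfstab               # Solaris (use /etc/fstab on Linux)

    # Missing local filesystem: check it, then mount everything listed
    fsck -y /dev/rdsk/c0t0d0s5    # placeholder raw device (Solaris style)
    mount -a

    # Missing NFS mount: confirm the server is exporting, then remount
    showmount -e nfs01.example.com
    mount /export/data            # placeholder mount point from vfstab/fstab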

 

Step 9: Go-Ahead for the Database Team

Once we complete all the health checks for the database servers, ask the DBA team to start their databases and confirm the go-ahead for the applications.

 

Step 10: Go-Ahead for the Application Teams

Once the DBA team confirms that all the databases are up, give the application teams the go-ahead for application start-up.

 

This concludes the power-on procedure… but wait, we are not yet done.

Many times the application team cannot confirm the application health checks immediately, and some issues may surface a little later. So be vigilant and ready to fix them whenever they reach you.

What is your DC power failure experience? Please share it with us. We appreciate your comments and feedback on this post.

Ramdev

I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog, and later realized that my learnings might be helpful to other Unix admins if I managed my knowledge base in a more user-friendly format. The result is today's unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/

14 Responses

  1. Ramesh says:

    Great explanation. I once participated in a data center refresh, and we followed almost the same procedure.
    The main thing from the Unix end is that we have to note down the dependency order while powering off and powering on. I faced a problem: when I was bringing down VMware Linux instances
    from the command prompt, the message said the server was going down for shutdown, but the VMware engineer still saw some instances online (power-on status); he later brought those down from his end.
    I am not good at VMware, but my suspicion is that the VMware status was not updated, so some instances showed as online even after we brought them down from the command line. Comments please?

  2. Ramdev says:

    @Ramesh – Thanks for sharing your experience. Yes, that kind of situation happens not only to VM boxes but also to physical Solaris/Linux servers, and the reason is that these servers often wait for a graceful shutdown of all the applications currently running on them.

    Since the shutdown process has already been initiated, we may lose the remote SSH connections to those servers while the server stays online for some more time (and sometimes it hangs in the middle). In such cases we have no option other than halting the machine.

  3. Yogesh Raheja says:

    @Ram, a very nice thought, presenting this situation to all the UNIX champs. It reminds me of a situation where the power supply (generator) blew up on a Solaris 8 server (V240) with IBM storage. It took approximately 12+ hours, with myself, Saurabh and Rajesh working together, to bring the server up and get it to recognise the IBM storage. These kinds of situations really create panic, but we need to keep trying to find the solution, and the best method is to start with the basics, as presented in this post by Ram. Nice piece of information, Ram.

  4. Ramdev says:

    @Yogesh, thanks for the comment. The issues that you and Ramesh (in the first comment) mentioned are good examples of what we would face in Step 8.

    Actually, I forgot to mention one important point here: the whole procedure described in this post cannot be followed by a single person, because it requires the person to wear two different thinking caps (mindsets), one at a time.

    Steps 1 to 10, except the detailed execution of Step 8, require a mindset with good technical expertise, communication and time-bargaining skills. That person acts as the single point of technical contact for the team, interacts with the end users / business people, and provides frequent updates on the ETA for the entire recovery.

    Step 8 requires a purely technical mindset to resolve the specific issues assigned to each individual engineer.

    While we wear the first cap, we have to ask ourselves some questions:
    — What are the technical challenges we have right now?
    — Do we have the right people on board to resolve the issue?
    — Can we buy more time from the users?
    — Can we offer an alternative solution if a specific server takes longer than expected (for example, advising users to bring up the database on the BCP node and start the applications there, while we make the prod node ready)?

    And while we wear the second cap, we focus only on the specific technical issue and get it resolved as quickly as possible; it really doesn't matter whether it takes 18 or 24 hours to get things right.

  5. Feroz Ahmed says:

    Ramdev and Yogesh, this is very rare information for system administrators like me who have never been to a data center. You are absolutely right, we go out of our minds when these kinds of unexpected situations arise. Definitely very informative and much appreciated.

  6. Ramdev says:

    @Feroz, that is the whole purpose of the post: to give remote admins an idea of data center related incidents. I believe the post has met that purpose. Thanks for dropping your comments.

  7. Yogesh Raheja says:

    @Ram, indeed a nice thought again in your comments. :-)
    @Feroz, thanks for the comments, and yes, this post is really well explained and is very useful when an SA moves into L3 support and has to handle end-to-end tasks with all the coordination.

  8. Karthik says:

    Awesome @Ramdev !!!
    From the article I understood the purpose – how to handle the situation effectively.
    So that one shouldn’t have to think, “Are we not prepared for a complete power failure of the datacenter (DC)?”

    In such a scenario:
    As we all know, in real environments organizations keep servers in different co-locations (say Texas and Ashburn – just data centers in different locations) to handle a power failure.

    Colocation allows you to place your server machine in someone else’s rack and share their bandwidth as your own.

    They have a VIP (virtual IP) configured to serve this purpose, and if one co-location/datacenter goes down (due to a catastrophic failure) the VIP will point to another co-location where the web servers/app servers are running fine without any power issues.

    This can work like DNS round robin (DNS RR), and it can be seen at the end-user level.

    E.g. try nslookup yahoo.com or google.com, or host google.com, or nslookup mail.yahoo.com, and run the same command again after some time; you will see different IPs, or in general multiple IPs (which may indicate that they come from different data centers, co-locations, etc.).

    The other way, they may route the traffic at the network level, taken care of by the network team.

    Or it can be done by shifting the Akamai traffic.
    E.g. for web server traffic:

    Akamai generally checks the status of the web server, and if and only if it gets a response from the server (say, response 200 STATUS OK) will it send traffic to that web server; otherwise it won’t send the traffic.

    E.g.

    telnet webservername 80

    GET /status.html (you can use any desired file here; we are just checking whether the web server responds to that request)

    If the entire location is down, you won’t get a response for GET /status.html and no traffic will be routed to those web servers.

    In another sense this also holds good for load balancing, i.e. evenly distributing the traffic among the different data center locations.

    F5 or BigIP – sorry, I am not an expert on those.

    A power failure issue can be viewed at different levels:

    The application team can plan their own application server clustering across different DCs; like the database team with RAC (Real Application Clusters), a WebLogic admin can have WebLogic server clustering, and so on.

    This way, even if a data center (DC) is fully powered down, no worries: we run on another DC while the affected DC is looked at.

    They call it BCP (Business Continuity Planning) and also run mock drills to ensure they can operate successfully even without one DC.

    Technologies like DR (Disaster Recovery) come into the picture here.

    They can even have DR/BCP for human resources:
    due to catastrophic failures, even if an entire call center is down, they can have a similar setup at a different geographical location, and the resources there will pick up the calls seamlessly.

    Sorry if I have diverted folks from the topic; I just wanted to share these things.

    • Ramdev says:

      @Karthik, I understand your point. I believe you are pointing out that the “preparedness we have discussed in this post” shouldn’t conflict with “preparedness through disaster recovery procedures”. That is actually a very valid point.

      >>> In fact I touched on this in the first section of the post: the second responsibility of a sysadmin, “2. Be ready to resolve the issues that were expected but could not be prevented”, is really meant to cover the DR procedures, but I didn’t mention it explicitly.

      >>> And the whole article is focused only on the third responsibility, “3. Be ready to resolve the issues that are unexpected in nature”.
      —————————————————————————–
      But for sure your point has given me a good lead for some of my next posts about DR. Thanks a lot for that. And other readers who are new to the concept of prod/BCP can read one of my old posts about enterprise network architecture: http://gurkulindia.com/main/2011/05/enterprise-network-environment-and-support-functions/

  9. Yogesh Raheja says:

    @Karthik, you are absolutely right that “clustering” is the alternative that keeps your services running concurrently at other locations without any impact (i.e. it avoids a single point of failure for the services), but it comes with the additional cost of vendor support, licenses, software, etc. Now think about the same situation from a business point of view. For example, suppose I have 100 servers, of which 30 are prod, 30 are DR and 40 are test/UAT/SIT/DIT/EBF/dev servers used for testing or development. As an organization I will consider the cost and set up DR for the 30 prod servers only, i.e. 30 in Texas and 30 in Ashburn, but there is no point in having DR for the testing or dev servers as they do not impact the business or the users. Setting up clusters on those servers is just a waste of money, which no architect would suggest or in fact set up. In that case, if a power failure occurs, the above post really helps. I would also like to share my experience: I have worked for some well-established clients in the UK, US and Australia that did not have any cluster setup, and everything ran on standalone servers. SAs working for such clients will really find this post a weapon for overcoming these critical situations. I think you will find these steps of great use in such scenarios.

  10. sonu says:

    Good one Ramdev…

  11. hemant says:

    Great article, sir, very helpful for new administrators like me…
    Thanks… and keep posting.

  12. Manisha says:

    Great one, sir. The blog explained the perfect sequence for handling a panicky situation like this. I have not been in one so far, but I am sure if it ever happens to me, recalling this blog will be the first thing I do.

