Are you ready to encounter a complete Datacenter power failure?

Ramdev

I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog, and later realized that my learnings might be helpful to other Unix admins if I managed my knowledge base in a more user-friendly format. The result is today's unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/

14 Responses

  1. Ramesh says:

    Great explanation. I once took part in a datacenter refresh, and we followed almost the same procedure.
    The main thing from the Unix end is to note down the dependency order for powering off and powering on. I faced one problem: when I was bringing down VMware Linux instances
    from the command prompt, the servers reported that they were going down for shutdown, but the VMware engineer still saw some instances online (powered-on status), and he later brought those down from his end.
    I am not good with VMware, but my suspicion is that the VMware status was not being updated, so some instances showed as online after we brought them down from the command line. Comments, please.
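
The dependency order mentioned in this comment can be written down as data and turned into a power-on and power-off sequence. This is only a minimal sketch: the host names and the dependency map are hypothetical, and it assumes Python 3.9+ for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each host maps to the hosts it depends on,
# i.e. the hosts that must already be up before it is powered on.
DEPENDS_ON = {
    "storage01": [],
    "db01": ["storage01"],
    "app01": ["db01"],
    "web01": ["app01"],
}


def power_on_order(depends_on):
    """Return hosts so that every host comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def power_off_order(depends_on):
    """Power-off is the reverse: dependents go down before what they use."""
    return list(reversed(power_on_order(depends_on)))
```

Keeping the map in one place means the same record serves both directions, and a circular dependency surfaces as a `graphlib.CycleError` instead of being discovered during the outage.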

  2. Ramdev says:

    @Ramesh, thanks for sharing your experience. Yes, that kind of situation happens not only to VMs but also to physical Solaris/Linux servers, and the reason is that these servers often wait for a graceful shutdown of all the applications currently running on them.

    Since they have already initiated the shutdown process, we sometimes lose the remote SSH connections to those servers while the server stays online for some more time (and sometimes it hangs midway). In such cases we have no option other than halting the machine.
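
The "wait for a graceful shutdown, then halt" decision described above can be sketched as a polling loop with a deadline. This is an illustration under assumptions, not a prescribed procedure: `is_up` is a hypothetical callable that reports the real power state (for example from the console or the hypervisor), because a lost SSH session alone does not prove the machine is down.

```python
import time


def wait_until_down(is_up, timeout_s, interval_s=5.0, sleep=time.sleep):
    """Poll is_up() until it returns False or timeout_s elapses.

    Returns True if the host went down in time, and False if it is
    still up, at which point an operator decides whether to halt it.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not is_up():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The function only reports the outcome; the halt itself is left to the engineer, since forcing a machine down is a judgment call.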

  3. Yogesh Raheja says:

    @Ram, a very nice thought to present this situation to all UNIX champs. It reminds me of one situation where the power supply (generator) blew on a Solaris 8 server (V240) with IBM storage. It took 12+ hours, with Saurabh, Rajesh and me (the three of us) working together, to bring the server up so that it recognised the IBM storage. These kinds of situations really create panic, but we need to keep trying to find the solution, and the best method is to start with the basics as presented in the post by Ram. Nice piece of information, Ram.

  4. Ramdev says:

    @Yogesh, thanks for the comment. The issues that you and Ramesh (in the first comment) mentioned are good examples of the issues we would face in step 8.

    Actually, I forgot to mention one important point: the whole procedure in this post cannot be followed by a single person, because it requires wearing two different thinking caps (mindsets), one at a time.

    Steps 1 to 10, except the detailed execution of step 8, require a mindset with good technical expertise, communication and time-bargaining skills. This person acts as the single point of technical contact for the team that interacts with the end users / business people, and provides frequent updates on the ETA for the entire recovery.

    Step 8 requires a completely technical mindset to resolve the specific issues assigned to each individual engineer.

    While we wear the first cap, we have to ask ourselves some questions:
    - What are the technical challenges we have now?
    - Do we have the right people on board to resolve the issue?
    - Can we buy more time from the user?
    - Can we offer an alternative solution if a specific server takes longer than expected? (For example, advising users to bring up the database on the BCP node and start the applications while we make the prod node ready.)

    And while we wear the second cap, we focus only on the specific technical issue and get it resolved as quickly as possible; it really doesn't matter whether it takes 18 or 24 hours to get things right.

  5. Feroz Ahmed says:

    Ramdev and Yogesh, this is very rare information for system administrators like me who have never been to a data center. You are absolutely right: we go out of our minds when this kind of unexpected situation arises. Definitely very informative and much appreciated.

  6. Ramdev says:

    @Feroz, that is the whole purpose of the post: to give remote admins an idea of data center related incidents. I believe this post has served that purpose. Thanks for dropping your comments.

  7. Yogesh Raheja says:

    @Ram, indeed a nice thought again in your comments. :-)
    @Feroz, thanks for the comments. Yes, this post is very well explained and is very useful when an SA moves into L3 support and has to handle end-to-end tasks with all the coordination.

  8. Karthik says:

    Awesome @Ramdev !!!
    From the article I understood the purpose: how to handle the situation effectively.
    Still, one shouldn't be left thinking, "Are we not prepared for a complete power failure of a datacenter (DC)?"

    In such a scenario,
    as everyone knows, real deployments may have servers in different co-locations (say Texas and Ashburn, just datacenters in different locations) to handle a power failure.

    Colocation allows you to place your server machine in someone else’s rack and share their bandwidth as your own.

    They have a VIP (Virtual IP) configured for this purpose, and if one co-location/datacenter is down (due to a catastrophic failure), the VIP will point to the other co-location, where the web servers/app servers are running fine without any power issues.

    This can be like DNS Round Robin (DNS RR), and it can be seen at the end-user level.

    For example, run nslookup yahoo.com, nslookup google.com, host google.com or nslookup mail.yahoo.com, then run the same command again after some time: we will see different IPs, or more generally multiple IPs (which may indicate that they are from different datacenters, co-locations, etc.).
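
The same observation can be made from a script. The sketch below uses only Python's standard `socket` module; the helper names `unique_addresses` and `resolve_all` are made up for this illustration.

```python
import socket


def unique_addresses(addrinfo):
    """Collect the distinct IP addresses from getaddrinfo() results, in order."""
    seen = []
    for _family, _type, _proto, _canonname, sockaddr in addrinfo:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def resolve_all(hostname, port=80):
    """Return every address the resolver currently hands out for hostname."""
    return unique_addresses(
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    )
```

Calling `resolve_all("google.com")` a few minutes apart may return different or reordered addresses. As with nslookup, multiple addresses only hint at multiple sites; they do not prove the servers are in different datacenters.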

    Alternatively, they may route the traffic at the network level, which is taken care of by the network team.

    Or it can be done by shifting the Akamai traffic,
    e.g. for web server traffic:

    Akamai generally checks the status of the web server, and it sends traffic to that web server if and only if it gets a response from the server (e.g. a 200 OK status); otherwise it won't send the traffic.

    E.g.

    telnet webservername 80

    GET /status.html (you can use any file you like here; this just checks whether the web server responds to that request)

    If the entire location is down, you won't get a response to GET /status.html, and no traffic will be routed to those web servers.
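
The manual telnet check above can be scripted along the following lines. This is a rough sketch of the general health-check idea, not how Akamai or any particular load balancer implements it; the function names are hypothetical, and `/status.html` is the example path from the comment.

```python
import http.client


def status_is_ok(status):
    """Treat only HTTP 200 as healthy, as in the comment above."""
    return status == 200


def is_healthy(host, port=80, path="/status.html", timeout=3.0):
    """Return True only if GET path on host:port answers with 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return status_is_ok(conn.getresponse().status)
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure or a broken reply:
        # the server (or the whole site) is treated as down.
        return False
    finally:
        conn.close()


def routable(servers, check=is_healthy):
    """Keep only the servers that pass the health check."""
    return [s for s in servers if check(s)]
```

A server in a powered-down location never answers, so it drops out of the `routable` list and receives no traffic, which is the behaviour the comment describes.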

    In another sense this also holds for load balancing, i.e. evenly distributing the traffic among different datacenter locations.

    F5 or BIG-IP: sorry, I am not an expert on those.

    A power failure can be viewed at different levels:

    The application teams can plan clustering across different DCs in their own way: the database team can have RAC (Real Application Clusters), a WebLogic admin can have WebLogic server clustering, etc.

    This way, even if a datacenter (DC) is fully powered down, no worries: we will run on another DC while the affected DC is looked at.

    They call this BCP (Business Continuity Planning), and they also run mock drills to ensure they can operate successfully even without one DC.

    Practices like DR (Disaster Recovery) come into the picture here.

    They can even have DR/BCP for human resources:
    if an entire call center is down due to a catastrophic failure, they can have a similar setup at a different geographical location, and staff there will pick up the calls seamlessly.

    Sorry if I have taken folks off topic; I just wanted to share these things.

    • Ramdev says:

      @Karthik, I understand your point. I believe you are pointing out that the “preparedness that we have discussed in this post” shouldn’t conflict with “the preparedness with Disaster Recovery Procedures”. That is actually a very valid point.

      >>> In fact, I mentioned this in the first section of the post: by the sysadmin's second responsibility, “2. Be ready to Resolve the issues which are already Expected but couldn’t prevent from occurring.”, I actually meant the DR procedures, but I didn’t mention them explicitly.

      >>> And the whole article is focused only on the third responsibility, “3. Be Ready to Resolve the issues which are unexpected in nature”.
      —————————————————————————–
      But your point has certainly given me a good lead for some of my next posts about DR. Thanks a lot for that. Other readers who are new to the concept of prod / BCP can read one of my old posts about enterprise network architecture: http://gurkulindia.com/main/2011/05/enterprise-network-environment-and-support-functions/

  9. Yogesh Raheja says:

    @Karthik, you are absolutely right that clustering is the alternative for keeping your services running concurrently at other locations without any impact (i.e. to avoid a single point of failure for services), but it comes with the additional cost of vendor support, licenses, software, etc. Now think about the same situation from a business point of view. For example, suppose I have 100 servers, of which 30 are prod, 30 are DR and 40 are test/UAT/SIT/DIT/EBF/dev servers used for testing or development. As an organization I will think about the cost and set up DR for the 30 prod servers only, i.e. 30 in Texas and 30 in Ashburn. There is no point in having DR for the testing or dev servers, as they do not impact the business or users, so setting up clusters on these servers is just a waste of money, which no architect would suggest or in fact set up. In this case, if a power failure occurs, the above post will really help. I would also like to share my experience: I have worked for some well-established clients in the UK, US and Australia that don't have any cluster setup, and everything runs on standalone servers. SAs who work for such clients will find this post a real weapon for overcoming these critical situations. I think you will find these steps of great use in such scenarios.

  10. sonu says:

    Good one Ramdev…

  11. hemant says:

    Great article, sir, very helpful for new administrators like me…
    Thanks… and keep posting

  12. Manisha says:

    Great one, sir. The blog explains the perfect sequence for handling a panicky situation like this. I have not been in one so far, but I am sure that if it ever happens to me, recalling this blog will be the first thing I do.
