A Day in an SA's Life: Working Remotely on a Hardware Issue

Below is a sample production environment:

1. The server infrastructure is located in two different data centers (DCs). One DC hosts the production servers and the other hosts the Disaster Recovery (DR) servers.

2. System administrators and application developers are located in different countries and work remotely.

3. All of the servers have remote console connectivity, and both DCs have a 24×7 support team for DC operations.

Problem Scenario:

One fine morning, you receive a mail or call from the server monitoring (L1) team saying that alerts are appearing on the server "Prod-Server", and that they have opened a support ticket (incident ticket) and assigned it to your team.

SA Response Procedure for the Incident Ticket:

Step 1. Gathering Server Information 

Using the system name, the SA gathers the following information for further diagnosis and troubleshooting:

  • System IP address, in case the server name is not in DNS
  • System location: DC name and rack location
  • Contact person in the application/business team
  • Server criticality: whether the server is Prod or DR, and whether any applications are currently using it
  • Server serial number, in case a call has to be raised with the hardware vendor for a replacement
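The checklist above can be kept as a small record so that nothing is forgotten before a vendor call. This is only a minimal sketch: the class name, field names, and example values are hypothetical and are not taken from any inventory tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerRecord:
    """Details the SA collects in Step 1 (all names here are illustrative)."""
    hostname: str
    ip_address: str      # needed when the host name is not in DNS
    datacenter: str
    rack_location: str
    app_contact: str
    role: str            # "prod" or "dr"
    in_use: bool         # is any application currently using the server?
    serial_number: str   # needed for hardware vendor calls

    def missing_fields(self) -> List[str]:
        """Return the names of text fields that are still empty."""
        return [name for name, value in vars(self).items()
                if isinstance(value, str) and not value.strip()]


# Hypothetical example values; the rack and serial number are not yet known.
record = ServerRecord(
    hostname="Prod-Server", ip_address="192.0.2.10", datacenter="DC1",
    rack_location="", app_contact="app-team@example.com", role="prod",
    in_use=True, serial_number="",
)
print(record.missing_fields())
```

Here the record reports that the rack location and serial number still need to be looked up before moving on.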

Step 2. Confirming the Issue 

Connect to the server using its IP address or host name and investigate the issue. If you cannot connect, or the server is not responding, connect through the remote console instead. Once you are connected, check the system logs and hardware status:

  • Check /var/adm/messages
  • Run the dmesg command
  • For disk errors, use the format and iostat -En commands to confirm the failed device
  • For other hardware errors, use the prtdiag and prtconf commands
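When many disks are attached, the error counters printed by iostat -En can be scanned with a short script. The sketch below assumes the usual Solaris summary line of the form "device Soft Errors: N Hard Errors: N Transport Errors: N". The sample text is made up, and treating any non-zero hard or transport count as suspect is an assumption, not a vendor rule.

```python
import re

# Matches the per-device summary line of `iostat -En` output.
ERROR_LINE = re.compile(
    r"^(?P<device>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def suspect_devices(iostat_output):
    """Return {device: counters} for devices with hard or transport errors."""
    suspects = {}
    for line in iostat_output.splitlines():
        match = ERROR_LINE.match(line)
        if not match:
            continue  # vendor, size and media-error lines are skipped
        hard = int(match["hard"])
        transport = int(match["transport"])
        if hard or transport:
            suspects[match["device"]] = {"hard": hard, "transport": transport}
    return suspects


# Illustrative sample only; vendor and product values are placeholders.
SAMPLE = """\
c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXAMPLE  Product: DISK-A  Revision: 0001 Serial No: 000000
c0t1d0           Soft Errors: 2 Hard Errors: 17 Transport Errors: 4
Vendor: EXAMPLE  Product: DISK-A  Revision: 0001 Serial No: 000001
"""
print(suspect_devices(SAMPLE))
```

In practice the output of iostat -En would be saved to a file on the server and passed to suspect_devices; the result only points at a likely failed device and should still be confirmed with format and the system logs.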

Step 3. Collecting Detailed Diagnostic Information to Raise a Hardware Vendor Request

Gather all the information the vendor needs to perform the hardware replacement or maintenance, using the vendor-specific tools. Some example tools:

  • For Sun Solaris / Sun hardware diagnosis: run the 'explorer' utility
  • For Red Hat Linux issues: run the 'sosreport' utility
  • For Fujitsu hardware issues: run the 'fjsnap' utility
  • For EMC storage issues: run the 'emcgrab' utility
  • For HP hardware issues: collect the iLO logs, or run the 'hpacucli' or 'hpasmcli' utilities
  • For Veritas issues: run the 'VRTSexplorer' utility
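The list above is essentially a lookup from platform to collection tool, which can be written down once and reused. A minimal sketch follows; the dictionary keys and the function name are hypothetical, and the code only names the tools rather than running them.

```python
# Platform keys are illustrative; the tool names follow the list above.
DIAG_TOOLS = {
    "sun": ["explorer"],
    "redhat": ["sosreport"],
    "fujitsu": ["fjsnap"],
    "emc": ["emcgrab"],
    "hp": ["hpacucli", "hpasmcli"],
    "veritas": ["VRTSexplorer"],
}


def tools_for(platforms):
    """Return the collectors to run for the given platform keys, in order."""
    selected = []
    for platform in platforms:
        key = platform.strip().lower()
        if key not in DIAG_TOOLS:
            raise KeyError(f"no diagnostic tool recorded for {platform!r}")
        for tool in DIAG_TOOLS[key]:
            if tool not in selected:  # avoid listing a tool twice
                selected.append(tool)
    return selected


# A Sun server with Veritas volumes needs both collectors.
print(tools_for(["Sun", "Veritas"]))
```

An unknown platform raises an error on purpose, so a missing entry is noticed before the vendor call rather than during it.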

Step 4. Vendor Coordination for Further Diagnosis

If the problem is with the hardware and the hardware vendor needs to be involved in troubleshooting, follow a procedure like this one:

  • Call the vendor's global customer care number
  • Give them the serial number so they can check the contract and warranty
  • Give them the name of the contact person on your team for this call
  • Ask for the case number and send them the log files you collected
  • Ask them to investigate and advise, by mail or call, whether a replacement is needed

Step 5. Once the investigation is complete, and if a device needs to be replaced, ask the vendor whether the component is hot-swappable or whether the replacement needs server downtime.

Step 6. If the maintenance requires downtime, inform the application team about the situation and ask for a suitable maintenance window.

Step 7. Once you receive the scheduled downtime from the application team, call the vendor back and give them the agreed time for the maintenance.

a. Sometimes the vendor simply couriers the component with instructions and asks your local support people to perform the maintenance, e.g. for an external power supply or HDD.
b. In some cases the vendor sends an expert field engineer to perform critical hardware maintenance, e.g. memory, motherboard, or system board replacement.

 

Step 8. Once the maintenance schedule is confirmed with both the application team and the vendor, send a mail to the data center support team with the information below and the internal incident (IM) number:

  • Vendor engineer details / component courier details
  • Server name, serial number, and location
  • The action to perform: whether to escort the vendor engineer or to perform the replacement on the server themselves
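Since this mail always carries the same few fields, it can be generated from a template so that none of them is left out. The sketch below is an assumption about how that might look: the function name, the message layout, and every example value (including the incident number) are hypothetical.

```python
def build_dc_notice(server, serial, location, incident_id,
                    vendor_detail, action):
    """Return a plain-text mail body for the data center support team."""
    fields = {
        "server": server, "serial": serial, "location": location,
        "incident_id": incident_id, "vendor_detail": vendor_detail,
        "action": action,
    }
    # Refuse to build an incomplete notice.
    empty = [name for name, value in fields.items() if not value.strip()]
    if empty:
        raise ValueError("missing details: " + ", ".join(empty))
    return "\n".join([
        f"Incident: {incident_id}",
        f"Server: {server} (S/N {serial}), {location}",
        f"Vendor engineer / courier details: {vendor_detail}",
        f"Action requested: {action}",
    ])


# Hypothetical example values.
notice = build_dc_notice(
    server="Prod-Server", serial="SN-EXAMPLE", location="DC1, rack R12",
    incident_id="IM-0000", vendor_detail="Field engineer, arriving 10:00",
    action="Escort the vendor engineer to the rack",
)
print(notice)
```

The body can then be pasted into the mail, or handed to whatever mail tooling the team already uses.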

Step 9. Most of the time, the SA has to perform some pre-maintenance tasks before the actual maintenance work starts, e.g.

  • Sending an information mail to the application and monitoring teams, so that they do not panic over error messages during the maintenance
  • Detaching failed disks from Veritas / SVM / SDS
  • Stopping services
  • Shutting down the machine, in case downtime is required

Step 10. Once the maintenance is complete, the SA performs the post-maintenance tasks, e.g.

  • Starting the server and starting services
  • Attaching disks back to the mirror
  • Informing the application team about the server status and asking them to confirm that the application is running
  • Asking the monitoring team to resume monitoring

Step 11. Finally, and most importantly, close the tickets assigned to your team, recording the resolution details: the error and the troubleshooting procedure followed.

Ramdev

I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog. Some time later I realized that my learnings might be helpful to other Unix admins if I managed my knowledge base in a more user-friendly format, and the result is today's unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/

9 Responses

  1. pavan says:

    very good real time scenario

  2. Deepak says:

    Good, Nice Post. Thank you

  3. Michael Michael says:

    @Ram you missed out the deadly on-call support :) :)

  4. dani says:

    great, very good post for starters.

  5. ramakrishna says:

    very very good scenario
    REALLY AWESOME

  6. Ramdev Ramdev says:

    Ramakrishna – welcome to unixadminschool.com

  7. Mahesh babu.R says:

    Hi sir,
    The above explanation is very good. I have a doubt: if the server is running several service groups or applications, is it possible to shift them to some other server for the time being, as in Veritas?
    I would request you to post more explanations like this on administration topics, since I am a learner.

  8. Rahul says:

    Great job you're doing
