A Day in an SA's Life: Working Remotely on a Hardware Issue
Below is a sample production environment:
1. The server infrastructure is located in two different data centers (DCs): one DC for production servers and another for disaster recovery (DR) servers.
2. System administrators and application developers sit in different countries and work remotely.
3. All servers have remote console connectivity, and both DCs have a 24×7 support team for DC operations.
One fine morning, you receive a mail/call from the server monitoring (L1) team saying that alerts are appearing on the server “Prod-Server”, and that they have opened a support ticket (incident ticket) and assigned it to your team.
SA response procedure for the incident ticket:
Step 1. Gathering Server Information
Using the system name, the SA gathers the following information for further diagnosis and troubleshooting:
- System IP address, in case the server name is not registered in DNS
- System Location – DC name and Rack Location
- Application/Business Team’s Contact Person
- Server criticality – whether the server is Prod or DR, and whether it is currently in use by any applications
- Server serial number – in case the SA has to raise a hardware vendor call for a hardware replacement
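The checklist above can be kept as a small record so that nothing is forgotten before troubleshooting starts. The following is only a minimal sketch: the class name, field names and the sample value "DC1" are hypothetical and chosen for illustration, not taken from any real inventory tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ServerRecord:
    """Details the SA collects in Step 1 (all field names are hypothetical)."""
    name: str
    ip_address: Optional[str] = None     # needed when the name is not in DNS
    dc_name: Optional[str] = None
    rack_location: Optional[str] = None
    app_contact: Optional[str] = None    # application/business contact person
    criticality: Optional[str] = None    # "Prod" or "DR"
    in_use: Optional[bool] = None        # currently used by any application?
    serial_number: Optional[str] = None  # needed for a vendor call


def missing_details(record: ServerRecord) -> List[str]:
    """Return the names of the fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values, for illustration only.
record = ServerRecord(name="Prod-Server", dc_name="DC1", criticality="Prod")
print(missing_details(record))
```

Anything the function still reports should be collected before moving on to Step 2.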
Step 2. Confirming the Issue
Connect to the server using its IP address or host name and investigate the issue. If you are unable to connect, or the server is not responding, connect to the server's console using the remote console connection. Once you are able to connect, check the system logs and hardware status using the following:
- Check /var/adm/messages
- Use the dmesg command
- For disk errors, use the format and iostat -En commands to confirm the failed device
- For other hardware errors, use the prtdiag and prtconf commands
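When a server has many disks, the iostat -En output can be long, and picking out the suspect device by eye is error-prone. The sketch below filters the per-device summary lines for non-zero hard or transport error counts. It assumes the usual Solaris layout of that line ("Soft Errors: N Hard Errors: N Transport Errors: N"); the function name and the sample text are made up for illustration, and the sample is not output from a real server.

```python
import re
from typing import Dict, List

# Matches the per-device summary line of `iostat -En` output.
HEADER = re.compile(
    r"^(?P<device>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def devices_with_errors(iostat_output: str) -> List[Dict[str, object]]:
    """Return devices whose hard or transport error count is non-zero."""
    suspects = []
    for line in iostat_output.splitlines():
        match = HEADER.match(line.strip())
        if not match:
            continue  # vendor/size/detail lines are ignored
        counts = {k: int(match.group(k)) for k in ("soft", "hard", "transport")}
        if counts["hard"] or counts["transport"]:
            suspects.append({"device": match.group("device"), **counts})
    return suspects


# Illustrative sample only, not captured from a real system.
SAMPLE = """\
c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXAMPLE  Product: DISK-A  Revision: 0001 Serial No: XXXX
c0t1d0           Soft Errors: 2 Hard Errors: 14 Transport Errors: 3
Vendor: EXAMPLE  Product: DISK-A  Revision: 0001 Serial No: YYYY
"""

print(devices_with_errors(SAMPLE))
```

A device flagged here is only a candidate; the entries in /var/adm/messages and the vendor's own diagnosis should confirm the failure before a replacement is requested.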
Step 3. Collecting detailed diagnostic information to raise a vendor (hardware) request for hardware maintenance
Gather all the information the vendor requests for the hardware replacement or maintenance, using vendor-specific tools. Some example tools:
- For Sun Solaris / Sun hardware diagnosis – run the ‘explorer’ utility
- For Red Hat Linux issues – run the ‘sosreport’ utility
- For Fujitsu hardware issues – run the ‘fjvsnap’ utility
- For EMC-related storage issues – run the ‘emcgrab’ utility
- For HP hardware-related issues – collect the iLO logs, or run the ‘hpacucli’ or ‘hpasmcli’ utilities
- For Veritas issues – run the ‘VRTSexplorer’ utility
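Since one server often combines several of these products (for example, Sun hardware running Veritas), it can help to keep the list as a lookup table. This is a minimal sketch: the tool names are copied from the list above, while the platform keys and the function name are hypothetical.

```python
from typing import Dict, Iterable, List

# Tool names as given in the list above; the dictionary keys are hypothetical.
DIAGNOSTIC_TOOLS: Dict[str, List[str]] = {
    "sun": ["explorer"],
    "redhat": ["sosreport"],
    "fujitsu": ["fjvsnap"],
    "emc": ["emcgrab"],
    "hp": ["hpacucli", "hpasmcli"],
    "veritas": ["VRTSexplorer"],
}


def tools_for(platforms: Iterable[str]) -> List[str]:
    """Return the diagnostic tools to run for the given platforms, in order."""
    tools: List[str] = []
    for platform in platforms:
        key = platform.lower()
        if key not in DIAGNOSTIC_TOOLS:
            raise ValueError(f"no diagnostic tool recorded for {platform!r}")
        for tool in DIAGNOSTIC_TOOLS[key]:
            if tool not in tools:  # avoid listing a tool twice
                tools.append(tool)
    return tools


print(tools_for(["Sun", "Veritas"]))
```

Raising an error for an unknown platform is a deliberate choice here: it is better to notice a gap in the table than to silently collect nothing.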
Step 4. Vendor Coordination for Further Diagnosis
If the problem is with the hardware and the hardware vendor needs to be involved in the troubleshooting, follow this sample procedure:
- Call the Global Customer Care number
- Give them the serial number so that they can check the contract/warranty
- Give them the name of the contact person from your team for the call
- Ask for the case number and send them the log files you collected
- Ask them to investigate and to advise on the replacement by mail or call
Step 5. Once the investigation is complete, and if the device has to be replaced, ask the vendor whether the component is hot-swappable or whether the replacement needs server downtime.
Step 6. If the maintenance requires downtime, inform the application team about the situation and ask for a suitable maintenance window.
Step 7. Once you receive the scheduled downtime from the application team, call the vendor back and inform them of the agreed maintenance time.
a. Sometimes the vendor simply couriers the component with instructions and asks our local support staff to perform the maintenance, e.g. external power supply, HDD.
b. In some cases the vendor sends an expert field engineer to perform critical hardware maintenance, e.g. memory replacement, motherboard or system board replacement.
Step 8. Once the maintenance schedule is confirmed with both the application team and the vendor, send a mail to the data center support team with the following information and the internal IM number:
- Vendor Engineer Details / Component courier details
- Server name, serial number and location
- The action to perform – whether to escort the vendor engineer or to perform the replacement on the server themselves
Step 9. Most of the time the SA has to perform some pre-maintenance tasks before the actual maintenance work starts, e.g.
- Sending an information mail to the application and monitoring teams, so that they do not panic over error messages during the maintenance
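Because the same mail is sent for every maintenance, a small template helps make sure none of the items above is left out. The following is a rough sketch only; the function name, parameter names and the sample values are hypothetical placeholders.

```python
from typing import Optional


def build_dc_notification(
    server_name: str,
    serial_number: str,
    location: str,
    action: str,
    im_number: str,
    engineer: Optional[str] = None,
    courier: Optional[str] = None,
) -> str:
    """Build the body of the mail sent to the data center support team."""
    if not engineer and not courier:
        raise ValueError("give vendor engineer details or courier details")
    lines = [
        f"Internal IM number: {im_number}",
        f"Server name: {server_name}",
        f"Serial number: {serial_number}",
        f"Location: {location}",
    ]
    if engineer:
        lines.append(f"Vendor engineer: {engineer}")
    if courier:
        lines.append(f"Component courier details: {courier}")
    lines.append(f"Action requested: {action}")
    return "\n".join(lines)


# Placeholder values, for illustration only.
print(build_dc_notification(
    server_name="Prod-Server",
    serial_number="SERIAL-PLACEHOLDER",
    location="Production DC, rack placeholder",
    action="Escort the vendor engineer to the server",
    im_number="IM-PLACEHOLDER",
    engineer="Name and phone number of the field engineer",
))
```

The function refuses to build the mail when neither the engineer nor the courier details are given, since the DC team cannot act without one of them.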
- Detaching failed disks from Veritas / SVM / SDS
- Shutting down the machine in case downtime is required
- Stopping services
Step 10. Once the maintenance is complete, the SA performs the post-maintenance tasks, e.g.
- Attaching the disks back to the mirror
- Starting the server and starting the services
- Informing the application team about the server status and asking them to confirm the application's running status
- Asking the monitoring team to resume monitoring
Step 11. Finally, the most important task: close the tickets assigned to your team with appropriate resolution information about the error and the troubleshooting procedure.