Solaris Troubleshooting x86: Finding the Cause of an Unexpected System Power-Off
This article offers guidelines on how to proceed when a Sun X64 server has unexpectedly powered off. The examples may be specific to a particular command or firmware version; they are provided to illustrate a troubleshooting concept and may not apply to all Sun X64 servers. Always refer to the support documentation for your particular server product to determine the correct equivalent command or procedure.
Various conditions can trigger a system shutdown, including:
- Temperature of a component or ambient air is too high.
- Multiple cooling fan failures.
- A voltage fluctuation beyond the acceptable threshold.
- Multiple power supplies have failed or have been removed, causing loss of power redundancy.
- External (computer room) AC or DC power fails, or falls outside the range required by the server power supplies to safely continue to run the system.
- A component hot-swap circuit has faulted.
First, note that if the chassis has no power, the Service Processor (SP) will not function, because it operates from standby (housekeeping) voltage. In that case a physical examination of the server is required, as outlined in the section “Verifying cause of NO chassis power” below.
If the SP is accessible, this means external power is being delivered to at least one of the server power supplies, which in turn are supplying standby voltage to the chassis.
Gathering possible reasons for the outage using ipmitool
The ipmitool command can be used to collect information about the possible reasons for the platform state, such as voltage and temperature sensor readings, fault LEDs and indicators, and the platform System Event Log (SEL).
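As a rough illustration, the sketch below builds ipmitool command lines for querying an SP over the network. It assumes the SP accepts the IPMI 2.0 `lanplus` interface (older firmware may need `-I lan`) and that the password is supplied through the `IPMI_PASSWORD` environment variable, which ipmitool reads when given `-E`. The function name, host name, and user name are placeholders, not values from any particular server.

```python
import shlex


def ipmitool_commands(sp_host, username):
    """Return ipmitool argument lists for collecting power-off evidence.

    sp_host and username are caller-supplied placeholders. The -E option
    tells ipmitool to read the password from IPMI_PASSWORD, so it never
    appears on the command line.
    """
    base = ["ipmitool", "-I", "lanplus", "-H", sp_host, "-U", username, "-E"]
    subcommands = [
        ["sel", "elist"],       # System Event Log, extended listing
        ["sdr", "elist"],       # sensor data records with current status
        ["chassis", "status"],  # current chassis power state
    ]
    return [base + sub for sub in subcommands]


if __name__ == "__main__":
    # Hypothetical SP host and user, shown only to print the command lines.
    for cmd in ipmitool_commands("sp.example.com", "root"):
        print(shlex.join(cmd))
```

The script only prints the command lines; running them, and confirming which subcommands a given SP firmware supports, is left to the operator.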
Example – SEL entries showing high temperature events that resulted in automatic system power-off:
281 | 05/12/2009 | 22:23:20 | Temperature dbp.t_amb | Upper Critical going high | Reading 34 > Threshold 33 degrees C
282 | 05/12/2009 | 22:42:07 | Temperature dbp.t_amb | Upper Non-recoverable going high | Reading 44 > Threshold 43 degrees C
283 | 05/12/2009 | 22:42:46 | System ACPI Power State sys.acpi | S5/G2: soft-off | Asserted
Example – SEL entry showing Chassis Intrusion switch was triggered when the chassis cover was removed:
200 | 06/24/2008 | 10:35:36 | Physical Security sys.intsw | General Chassis intrusion | Asserted
Example – SEL entries showing the power button was used to power-off the system:
109 | 11/17/2008 | 19:01:26 | Button | Power Button pressed | Asserted
10a | 11/17/2008 | 19:01:29 | System ACPI Power State ACPI | S5/G2: soft-off | Asserted
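When a SEL is long, it can help to pull out the events that immediately precede each power-off record. The following is a minimal sketch that parses pipe-delimited lines in the layout shown in the examples above. It assumes dates are in MM/DD/YYYY form and that a power-off is identified by the text "soft-off" in the event field. The function names and the 30-minute default window are illustrative choices, not part of any Sun tool.

```python
from datetime import datetime


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict; return None if malformed."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6:
        return None
    return {
        "id": fields[0],
        "time": datetime.strptime(fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S"),
        "sensor": fields[3],
        "event": fields[4],
        "detail": fields[5],
    }


def events_before_power_off(lines, window_minutes=30):
    """Pair each soft-off record with the earlier records inside the time window."""
    records = [r for r in map(parse_sel_line, lines) if r is not None]
    results = []
    for i, rec in enumerate(records):
        if "soft-off" not in rec["event"]:
            continue
        earlier = [
            r for r in records[:i]
            if 0 <= (rec["time"] - r["time"]).total_seconds() <= window_minutes * 60
        ]
        results.append((rec, earlier))
    return results


if __name__ == "__main__":
    # The high-temperature example entries from this article.
    sample = [
        "281 | 05/12/2009 | 22:23:20 | Temperature dbp.t_amb | Upper Critical going high | Reading 34 > Threshold 33 degrees C",
        "282 | 05/12/2009 | 22:42:07 | Temperature dbp.t_amb | Upper Non-recoverable going high | Reading 44 > Threshold 43 degrees C",
        "283 | 05/12/2009 | 22:42:46 | System ACPI Power State sys.acpi | S5/G2: soft-off | Asserted",
    ]
    for power_off, earlier in events_before_power_off(sample):
        print("Power-off record", power_off["id"], "at", power_off["time"])
        for r in earlier:
            print("  preceded by", r["id"], r["sensor"], "-", r["event"])
```

Events that precede a power-off are only candidates for the cause; the sensor readings and fault indicators still need to be checked to confirm it.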
Gathering possible reasons for the outage using Service Processor web GUI
Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolvable DNS hostname, and enter your login credentials when prompted.
Once logged in, click the System Monitoring tab, which reveals access to additional tabs. Click to drill down further:
- Sensor readings.
- Event logs.
- Fault and other Indicator LED states.
- Power Management & utilization.
Note: tab names may differ slightly between ILOM and ELOM versions.
Gathering possible reasons for the outage using the Service Processor Command Line Interface (CLI)
Log in to the Service Processor using ssh (requires the SP IP address or a resolvable DNS hostname):
# ssh -l <username> <SP host name or IP>
Display System Event Logs, and sensor & fault indicator information:
# show -d properties -level all /SYS
# show -o table -level all /SP/faultmgmt
(The /SP/faultmgmt target is not available in all ILOM versions.)
# show -d properties -level all /SP/SystemInfo
# show -d properties -level all /
V20z & V40z:
# sp get events -v
# sensor get --verbose
# inventory get all -v
# sp get tdulog -f stdout
Gathering possible reasons for the outage from the Operating System
If the system can be powered on and the OS boots successfully after an unexpected shutdown, check:
- OS messages and event logs: Was the shutdown graceful? Is there any indication of the power button being pressed, temperature or other event recorded?
- OS fault manager records (such as Solaris FMA): were any faults logged at or near the time of the shutdown?
- Console log: was anything relevant displayed on the system console at or near the time of the shutdown?
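The log checks above can be partly automated with a simple keyword scan. The sketch below is one possible approach, not a Solaris tool. The keyword list is an assumption chosen to match the questions in the list, and the function name is hypothetical. Which file to scan depends on the OS (on Solaris the system messages file is normally /var/adm/messages).

```python
DEFAULT_KEYWORDS = ("power button", "shutdown", "halt", "temperature", "thermal", "panic")


def find_relevant_lines(lines, keywords=DEFAULT_KEYWORDS):
    """Return (line_number, text) pairs whose text mentions any keyword.

    Matching is case-insensitive. `lines` can be any iterable of strings,
    such as an open log file.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, text.rstrip("\n")))
    return hits
```

It could be called on an open file, for example `find_relevant_lines(open("/var/adm/messages", errors="replace"))`, and the hits then compared against the SEL timestamps gathered from the Service Processor.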
Verifying cause of NO chassis power
- Visually inspect each power supply for the status of the AC Present, Power OK, and Fault LEDs. If the Fault LED is illuminated on any of the PSUs then further troubleshooting will be required.
- If AC Present is NOT illuminated, ensure the AC power cords are securely plugged into the server and connected to working AC power outlet(s). Test using known-good power cables and a known-good power source. If the cause is still unclear, engage a qualified electrician to test the voltage on the power cords.