Solaris Troubleshooting: Forcing a core dump on a hung x86/x64 Solaris system
- System appears to be hung
- System is not pingable
- Cannot log in
- Cannot execute commands
- Cannot mount shares
- Cannot start/stop services
- System is not responding
A hung system does not respond to any command or user interaction; it is no longer usable.
Here are some of the situations which give the appearance of a system hang:
- Operating system isn’t booted or is rebooting in a loop
- System running low on memory or is overloaded
- Network share is lost due to network errors
- Other network errors
- Video or console output frozen
Here’s how to rule out the issues above, which only give the appearance of a system hang:
1 Verify that the system is powered on and that the OS has booted and is not rebooting in a loop. You can check for a boot loop on the system console.
2 Check the status LED on the system. To verify the power status, use for example ipmitool -H <ip_of_ilom> -U root power status, or platform get power state (for V20z/V40z).
3 Verify on the console or through your Service Processor console that your operating system is booted. The system is not hung if you see any activity on the console.
4 Wait for a while: systems that are low on memory (possibly because of a heavy load) swap intensively, and the system may become available again after some time. Of course, further investigation of what is causing this will be required. On Linux or Solaris operating systems, try to provoke activity on the console by pressing <ctrl> <c>.
5 For Windows 2003 SAC you can try to issue the help command in order to check system availability.
6 Verify that the network infrastructure is healthy and correctly configured. Use the “ping” command to ping the default gateway in the network segment; ping any naming system servers.
If other systems in the same network segment appear to be hung, the network is a good place to start your investigation.
- For Solaris and Linux, search for “NIS server not responding for domain <domainname>” on the console or in the messages file. Check the availability of your name services (e.g. NIS, DNS, LDAP).
- For Solaris and Linux, search for “NFS server not responding” on the console or in the messages file. Check the availability of your NFS server.
- Ask your network administrator for any known issues in the network infrastructure.
7 Verify that all users of the system have the same issue and see a system hang.
On a multiuser system, ask the other users whether they see the same issue or have noticed anything else.
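As a rough sketch of steps 2 and 6 above, the shell functions below interpret the output of the ipmitool power query and scan a messages file for the name-service and NFS timeouts mentioned there. The function names are hypothetical. The sketch assumes that ipmitool reports a line containing "Power is on" and that a server name may appear between "NFS server" and "not responding" in the log; adjust both if your output differs.

```shell
# Triage helpers for a suspected hang (hypothetical names, illustrative only).

# power_is_on "TEXT"
# TEXT is the output of: ipmitool -H <ip_of_ilom> -U root power status
# Succeeds (status 0) if the text reports that power is on.
power_is_on() {
    case "$1" in
        *"Power is on"*) return 0 ;;
        *)               return 1 ;;
    esac
}

# scan_messages FILE
# Print every NIS or NFS "not responding" line found in FILE.
# Succeeds (status 0) if at least one such line was found.
scan_messages() {
    found=1
    # Assumption: a server name may sit between "NFS server" and "not responding".
    for pattern in 'NIS server not responding' 'NFS server.*not responding'; do
        if grep "$pattern" "$1"; then
            found=0
        fi
    done
    return $found
}

# Example usage (the ILOM address is a placeholder to fill in):
#   power_is_on "$(ipmitool -H <ip_of_ilom> -U root power status)" \
#       || echo "chassis is not powered on"
#   scan_messages /var/adm/messages \
#       && echo "name service or NFS timeouts logged: check the network first"
```

On Solaris the messages file is /var/adm/messages; on many Linux systems it is /var/log/messages. Any output from scan_messages suggests the apparent hang may be a network or name-service problem rather than a true system hang.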
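If the above steps all check out, chances are you do in fact have a hung system. To capture a core dump from such a hang, the system must already be running with the kernel debugger (kadb or kmdb) loaded, so the steps below prepare for the next occurrence.
Note: Under Solaris 10 OS, use kmdb instead of kadb.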
IMPORTANT NOTE: kadb and kmdb work in console mode ONLY. If you drop into the debugger from the keyboard using a monitor attached to the system, there is no way to see the debugger prompt, and it will appear as though the system has frozen. This occurs because, as part of its normal functioning, the debugger suspends the system, including any GUI applications. To work around this problem, connect to the console or, if necessary, disable the GUI.
From the GRUB command line, edit the kernel line so that the debugger is loaded at boot:
kernel /platform/i86pc/multiboot kadb
- The next time the system hangs, send a break. If the break is accepted, the system drops to the kadb/kmdb prompt:
From a directly attached keyboard or serial port connection, type:
F1-A - press the "F1" and "A" keys simultaneously. The control-alt-d key sequence also works.
On X4200/X4100 Servers, connect to the SP console (start /SP/console from the ILOM prompt), then press Esc followed by shift+b to send a break.
On V65x Servers, send a break to the console, and then press the key corresponding to the sysrq-command to send.
On V20z and V40z, once connected via the platform console, press ^Ecl0 to send the break. That is, press CONTROL-E, then the letter ‘c’, then the letter ‘l’, then the number ‘0’.
On Blades (B100x, B200x), to send a break from the SC console, type break s’N’, where ‘N’ is the slot number, followed by ‘y’ (yes) when prompted.
NOTE: If using the Serial-over-LAN functionality for the Sun AMD Opteron platform, use the alternate break sequence explained in Technical Instruction 1012587.1.
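At the debugger prompt, force a panic (for example, with $<systemdump in kmdb). This will generate a core file when the system reboots, which can be retrieved from:
/var/crash/`uname -n` (or wherever the local system stores core files)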
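To find the files to send, a minimal sketch along these lines may help. It assumes the usual savecore naming, where each dump is stored as a unix.N and vmcore.N pair, and it needs a POSIX shell such as ksh or bash (the legacy Solaris /bin/sh lacks the ${var##pattern} expansion used here). The function name latest_dump is hypothetical.

```shell
# latest_dump DIR
# Print the path of the highest-numbered vmcore.N file in DIR.
# Returns non-zero and prints nothing if DIR holds no vmcore.N file.
latest_dump() {
    best=-1
    for f in "$1"/vmcore.*; do
        # Skip the unexpanded glob (no matches) and anything that is not a file.
        [ -f "$f" ] || continue
        n=${f##*.}                      # numeric suffix after the last dot
        case "$n" in
            ''|*[!0-9]*) continue ;;    # ignore non-numeric suffixes
        esac
        if [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    if [ "$best" -ge 0 ]; then
        echo "$1/vmcore.$best"
        return 0
    fi
    return 1
}

# Example usage:
#   dumpadm                               # shows the configured savecore directory
#   latest_dump "/var/crash/$(uname -n)"  # newest dump in the default location
```

The matching unix.N file in the same directory belongs with the vmcore.N file; both are normally needed for analysis.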
Send the core to Sun for analysis.