Redhat Enterprise Linux – Troubleshooting Kernel Panic issues – Part 2

 

 

This is a continuation of our earlier kernel panic reference post ( Redhat Enterprise Linux 6 – Kernel Panic and System Crash – Troubleshooting Quick Reference ), where we discussed several types of kernel panic issues and their causes. In this post I will cover the procedures and some guidelines to diagnose and troubleshoot common kernel panic issues in Red Hat Linux.

Please note that these are just guidelines, shared purely for your knowledge; they do not guarantee a solution to your environment-specific issues. You need to take extreme care and precaution while troubleshooting issues that are specific to your hardware and software environment.

 

How do we analyse System Crash and Kernel Panic Issues?

Normally we expect that kdump has already been configured to capture the vmcore needed to analyse these kinds of issues. If we have a vmcore ready, we can use the crash utility to carry out the kind of analysis described below:

 

What is Kdump? 

Starting in Red Hat Enterprise Linux 5, kernel crash dumps are captured using the kdump mechanism.  Kexec is used to start another complete copy of the Linux kernel in a reserved area of memory. This secondary kernel takes over and copies the memory pages to the crash dump location.

 

How do we configure kdump?

In brief, it involves the following steps:
 

Step 1. Install kexec-tools
Step 2. Edit /etc/grub.conf and add "crashkernel=<reserved-memory-setting>" at the end of the kernel line

Example for RHEL 5:

title Red Hat Enterprise Linux Client (2.6.17-1.2519.4.21.el5)
root (hd0,0)
kernel /boot/vmlinuz-2.6.17-1.2519.4.21.el5 ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
initrd /boot/initrd-2.6.17-1.2519.4.21.el5.img
 

crashkernel=memory@offset

+------------+---------------------+---------------------+
| RAM size   | crashkernel memory  | crashkernel offset  |
+------------+---------------------+---------------------+
| 0 - 2G     | 128M                | 16M                 |
| 2G - 6G    | 256M                | 24M                 |
| 6G - 8G    | 512M                | 16M                 |
| 8G - 24G   | 768M                | 32M                 |
+------------+---------------------+---------------------+

Example for RHEL 6:

title Red Hat Enterprise Linux Server (2.6.32-71.7.1.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-71.7.1.el6.x86_64 ro root=/dev/mapper/vg_example-lv_root rd_LVM_LV=vg_example/lv_root rd_LVM_LV=vg_example/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet
initrd /initramfs-2.6.32-71.7.1.el6.x86_64.img

Guidelines for Crash Kernel Reserved Memory Settings:

crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M

RAM size        crashkernel memory
0 GB - 2 GB     128 MB
2 GB - 6 GB     256 MB
6 GB - 8 GB     512 MB
above 8 GB      768 MB

Step 3. Configure /etc/kdump.conf

3A) Specify the destination where the output of kexec, i.e. the vmcore, should be sent. The following destinations can be used:

– raw device : raw /dev/sda4
– local file system : ext3 /dev/sda3, which will dump the vmcore to /var/crash on /dev/sda3
– NFS share : net nfs.example.com:/export/vmcores
– another system via SSH : net kdump@crash.example.com
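
For illustration, a minimal /etc/kdump.conf that dumps to a local ext3 file system could look like the sketch below; the device name is only a placeholder and must be replaced with a real filesystem on your system:

ext3 /dev/sda3
path /var/crash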

3B) Configure the core collector to discard unnecessary memory pages and compress only the needed ones.

Option    Pages discarded
1         Zero pages
2         Cache pages
4         Cache private
8         User pages
16        Free pages

To discard all optional page types and compress the dump:

core_collector makedumpfile -d 31 -c

Step 4. Reboot the server so the crashkernel=<reserved-memory-setting> takes effect
Step 5. Start the kdump service
Step 6. Enable the kdump service to start automatically on boot
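
As a rough sketch of steps 4 to 6 on RHEL 5/6 (the service and chkconfig commands assume the standard SysV init tooling shipped with those releases), after the reboot run

# grep -i crashkernel /proc/cmdline
# grep -i "crash kernel" /proc/iomem

to confirm that the crashkernel= option reached the kernel and that the memory region was actually reserved, and then

# service kdump start
# chkconfig kdump on

to start the kdump service (Step 5) and enable it at boot (Step 6).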

Important Notes about Red Hat Linux and the HP ASR (Automatic Server Recovery) Feature

According to the Automatic Server Recovery definition on the HP website, “If the internal health LED on a server is amber, the server might shutdown or reboot as soon as the hp Insight Manager agents are installed. This feature is called Automatic Server Recovery, and is enabled to ensure that the server can be recovered even though there is a major hardware problem.”

The Automatic Server Recovery reset generally occurs during low server utilization for an extended period of time.

The Automatic Server Recovery feature is based on a 10-minute timer: if the OS stops communicating, the server reboots. It is implemented using a “heartbeat” timer that continually counts down, and the hpasm driver frequently reloads the counter to prevent it from reaching zero. If the timer does count down to zero, the operating system is assumed to be locked up and the system automatically attempts to reboot.

System lockups, Automatic Server Recovery events and hardware failures are logged in the IML (Integrated Management Log). You can check the IML through the iLO interface.

To check whether ASR is enabled in the BIOS or not, we can use the below command.

# hpasmcli -s "show asr"

We have to make sure that the ASR timeout does not interrupt the vmcore collection before it completes. So it is often required to either:

Disable ASR using the command:

# hpasmcli -s "DISABLE ASR"

Or set a longer timeout using the command:

# hpasmcli -s "SET ASR 30"

How do we analyse the VMCORE file collected using kdump?

Assuming we have configured /var/crash as the destination for the vmcore in /etc/kdump.conf, we will see the file as below:

# ls -l /var/crash/127.0.0.1-2013-09-21-19:45:17/vmcore
-rw-------. 1 root root 490958682 Sep 21 18:46 /var/crash/127.0.0.1-2013-09-21-19:45:17/vmcore
 
To analyse the vmcore we need the “crash” utility, and we need to install the following packages to get started with vmcore analysis:

# yum install crash
# yum install kernel-debuginfo-2.6.32-220.23.1.el6.x86_64

Purpose of the debuginfo package : debugging symbols are stripped out of the standard kernel for performance and size reasons. To analyse the vmcore, separate debugging information needs to be provided, and it must match the exact revision of the kernel that crashed. The kernel-debuginfo package supplies the vmlinux with these symbols so that crash can interpret the vmcore.
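
Once both packages are installed, crash is launched with the debuginfo vmlinux and the captured vmcore. The paths below follow the usual kernel-debuginfo layout and the vmcore location from the example above, and are meant as an illustration; adjust the kernel version to match the kernel that actually crashed:

# crash /usr/lib/debug/lib/modules/2.6.32-220.23.1.el6.x86_64/vmlinux /var/crash/127.0.0.1-2013-09-21-19:45:17/vmcore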

How to Use Crash Utility ?

At this level we will only discuss the basic usage of the crash utility from a Linux administration point of view.

 

  •  log – Display the kernel ring buffer log. On a running system, dmesg also displays the kernel ring buffer log.

Often this can capture log messages that were not written to disk because of the crash.

crash> log
— snip —
SysRq : Trigger a crash
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8130e126>] sysrq_handle_crash+0x16/0x20
PGD 7a602067 PUD 376ff067 PMD 0
Oops: 0002 [#1] SMP
 

  •  kmem -i – Show available memory at time of crash
  • ps – Show running processes at time of crash. Useful with grep
  • net – Show configured network interfaces at time of crash
  • bt – Backtraces are read upside-down, from bottom to top

crash> bt
PID: 6875 TASK: ffff88007a3aaa70 CPU: 0 COMMAND: “bash”
#0 [ffff88005f0f5de8] sysrq_handle_crash at ffffffff8130e126
#1 [ffff88005f0f5e20] __handle_sysrq at ffffffff8130e3e2
#2 [ffff88005f0f5e70] write_sysrq_trigger at ffffffff8130e49e
#3 [ffff88005f0f5ea0] proc_reg_write at ffffffff811cfdce
#4 [ffff88005f0f5ef0] vfs_write at ffffffff8116d2e8
#5 [ffff88005f0f5f30] sys_write at ffffffff8116dd21
#6 [ffff88005f0f5f80] system_call_fastpath at ffffffff81013172
RIP: 00000037702d4230 RSP: 00007fff85b95f40 RFLAGS: 00010206
RAX: 0000000000000001 RBX: ffffffff81013172 RCX: 0000000001066300
RDX: 0000000000000002 RSI: 00007f04ae8d2000 RDI: 0000000000000001
RBP: 00007f04ae8d2000 R8: 000000000000000a R9: 00007f04ae8c4700
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000002
R13: 0000003770579780 R14: 0000000000000002 R15: 0000003770579780
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
 

  •  sys – Displays system data – same information displayed when crash starts

crash> sys
DUMPFILE: /tmp/vmcore [PARTIAL DUMP]
CPUS: 2
DATE: Thu May 5 14:32:50 2011
UPTIME: 00:01:15
LOAD AVERAGE: 1.19, 0.34, 0.12
TASKS: 252
NODENAME: rhel6-desktop
RELEASE: 2.6.32-220.23.1.el6.x86_64
VERSION: #1 SMP Mon Oct 29 19:45:17 EDT 2012
MACHINE: x86_64 (3214 Mhz)
MEMORY: 2 GB
PANIC: “Oops: 0002 [#1] SMP ” (check log for details)
PID: 6875
COMMAND: “bash”
TASK: ffff88007a3aaa70 [THREAD_INFO: ffff88005f0f4000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

  •  dmesg – To check the kernel log from vmcore output

crash> dmesg
— snip —
CPU 0: Machine Check Exception: 0000000000000004
Kernel panic – not syncing: Unable to continue
Redirect Crash output to Regular Commands 

 
Example 1:                crash> log > log.txt
Example 2:                crash> ps | fgrep bash | wc -l
 

Sample Scenario 1 :

System hangs or kernel panics with MCE (Machine Check Exception) in /var/log/messages file.
System was not responding. Checked the messages in netdump server. Found the following messages …”Kernel panic – not syncing: Machine check”.
System crashes under load.
System crashed and rebooted.
Machine Check Exception panic

Observations:

Look for the phrase “Machine Check Exception” in the log just before the panic message. If this message occurs, the rest of the panic message is of no interest.

Analyze vmcore

$ crash /path/to/2.6.18-128.1.6.el5/vmlinux vmcore
 
KERNEL: ./usr/lib/debug/lib/modules/2.6.18-128.1.6.el5/vmlinux
DUMPFILE: 563523_vmcore [PARTIAL DUMP]
CPUS: 4
DATE: Thu Feb 21 00:32:46 2011
UPTIME: 14 days, 17:46:38
LOAD AVERAGE: 1.14, 1.20, 1.18
TASKS: 220
NODENAME: gurkulnode1
RELEASE: 2.6.18-128.1.6.el5
VERSION: #1 SMP Tue Mar 24 12:05:57 EDT 2009
MACHINE: x86_64 (2599 Mhz)
MEMORY: 7.7 GB
PANIC: “Kernel panic – not syncing: Uncorrected machine check”
PID: 0
COMMAND: “swapper”
TASK: ffffffff802eeae0 (1 of 4) [THREAD_INFO: ffffffff803dc000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
 
crash> log

CPU 0: Machine Check Exception:7 Bank 4: b40000000005001b
RIP 10:<ffffffff8006b2b0> {default_idle+0x29/0x50}
TSC bc34c6f78de8f ADDR 17fe30000
This is not a software problem!
Run through mcelog –ascii to decode and contact your hardware vendor
Kernel panic – not syncing: Uncorrected machine check

Process the error through mcelog --ascii; specify --k8 for events from AMD processors and --p4 for a Pentium 4 or Xeon. The resulting information might be helpful to your hardware vendor.

$ cat > mcelog.txt
CPU 0: Machine Check Exception:7 Bank 4: b40000000005001b
RIP 10:<ffffffff8006b2b0> {default_idle+0x29/0x50}
TSC bc34c6f78de8f ADDR 17fe30000
[ctrl]+[d]
 
$ mcelog --ascii --k8 < mcelog.txt
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC bc34c6f78de8f
RIP 10:ffffffff8006b2b0
Northbridge GART error
bit61 = error uncorrected
TLB error ‘generic transaction, level generic’
STATUS b40000000005001b MCGSTATUS 7
RIP: default_idle+0x29/0x50}

Observations and Recommended Solution : The information printed by the kernel (the line printed immediately before the panic message) comes from the hardware and should be provided to the hardware support person for analysis. This information should resemble the following:

CPU 10: Machine Check Exception: 4 Bank 0: b200000410000800

Machine Check Exception (MCE) is an error that occurs when a computer’s CPU detects a hardware problem. Typically, the impending hardware failure will cause the kernel to panic in order to protect against data corruption.

Normally at this level we do engage the hardware vendor for further troubleshooting and diagnosis; the message present in the logs just before the kernel panic should be given to hardware support.

How to run a memory test ?

Red Hat Enterprise Linux ships a memory test tool called memtest86+. It is a bootable utility that tests physical memory by writing various patterns to it and reading them back. Since memtest86+ runs directly off the hardware it does not require any operating system support for execution.

This tool is available as an RPM package from Red Hat Network (RHN) as well as a boot option from the Red Hat Enterprise Linux rescue disk.

To boot memtest86+ from the rescue disk, you will need to boot your system from CD 1 of the Red Hat Enterprise Linux installation media, and type the following at the boot prompt (before the Linux kernel is started):

boot: memtest86

If you would rather install memtest86+ on the system, here is an example of how to do it on a Red Hat Enterprise Linux 5 machine registered to RHN:

# yum install memtest86+

For Red Hat Enterprise Linux 4, run the following command to install memtest86+ (make sure the system has been registered to RHN):

# up2date -i memtest86+

Then you will have to configure it to run on next reboot:

# memtest-setup

After reboot, the GRUB menu will list memtest. Select this item and it will start testing the memory.

Please note that once memtest86+ is running it will never stop unless you interrupt it by pressing the Esc key. It is usually a good idea to let it run for a few hours so it has time to test each block of memory several times.

memtest86+ may not always find all memory problems. It is possible that the system memory can have a fault that memtest86+ does not detect.

Sample Scenario 2

Console Screen having the messages as below
 
Northbridge Error, node 1, core: -1
K8 ECC error.
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x101a793400
EDAC MC1: INTERNAL ERROR: row out of range (-22 >= 8)
EDAC MC1: CE – no information available: INTERNAL ERROR
EDAC MC1: CE – no information available: amd64_edacError Overflow

Observations :

Check for any other related kernel errors in /var/log/messages:

kernel: [Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.
kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
kernel: [Hardware Error]: CPU:2 MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc2c410000000a13
kernel: [Hardware Error]: MC4_ADDR: 0x0000000210b67d90

When the EDAC core monitoring module and the various supported chipset drivers are loaded, an error message is logged to the syslog message file whenever an error is detected. Based on the keywords used in the error message we can decode the severity of the error; the guidelines are below, and a quick way to inspect the EDAC counters follows the list.

– “Non-Fatal Error” – recoverable error
– “DRAM Controller” – error in the memory controller module
– MC0 – memory controller 0
– CE – correctable error
– Page – the affected memory page (e.g. 0xc6397)
– Offset – offset into that page (e.g. 0x0)
– Grain – accuracy of reporting
– Syndrome – error bits from the controller (chipset specific)
– Channel – which memory channel (often only channel 0 on machines)
– Row – which DIMM row. How that maps to a chip is vendor specific, but often as simple as row 0/1 -> DIMM0, row 2/3 -> DIMM1, etc.
– label “” – description of this DIMM (a NULL string if no label has been set)
– e752x – chip type (e.g. e7501, AMD76x)
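
When an EDAC driver is loaded, the correctable and uncorrectable error counters are also exposed through sysfs. The commands below are a quick, illustrative way to review them; the exact mc*/csrow* layout depends on the memory controller driver in use:

# grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count
# grep . /sys/devices/system/edac/mc/mc*/csrow*/ue_count

Non-zero ce_count values point to the csrow, and hence the DIMM pair, that is producing correctable errors.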

Memory error checking on a memory module used to be accomplished with a parity checking bit that was attached to each byte of memory. The parity bit was calculated when each byte of memory was written, and then verified when each byte of memory was read. If the stored parity bit didn’t match the calculated parity bit on a read, that byte of memory was known to have changed. Parity checking is known to be a reasonably effective method for detecting a single bit change in a byte of memory.

Recommended Solution

— EDAC messages, errors, and warnings are often indicative of a hardware problem such as a memory module failure or memory controller failure.
If EDAC errors are encountered, running a hardware diagnostic and contacting your hardware vendor are advised.

— In some cases EDAC errors can be thrown due to a bug in the EDAC kernel module, the kernel itself, or an incompatibility between the system’s chipset and the EDAC driver.

Sample Scenario 3

The following error message appearing in /var/log/messages
 
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
kernel: Uhhuh. NMI received for unknown reason 31 on CPU 0.

Observations :

The above message is typically output when system hardware has generated a non-maskable interrupt not recognized by the kernel. If there are any codes associated with the fault these may be trapped by hpasm software or equivalent HW Vendor monitoring software and logged there, but the fact that the NMI is not known to the kernel suggests the problem is a fundamental hardware issue. NMI 21 and 31 events typically indicate faulty RAM or perhaps CPU. However, hardware issues relating to the motherboard cannot be ruled out.

Schedule some downtime in order to run hardware diagnostics. A standard memtest86 could be run, but this is not guaranteed to find all possible memory issues. memtest86 can be initiated on a system by booting from a RHEL 5 DVD. Following this, any available manufacturer specific hardware diagnostics tools should be run.

Recommended Solution:

If the issue happens frequently on a production machine and causes a system crash or panic, first try a BIOS/firmware upgrade for the hardware. Should that not resolve the issue, have the support vendor replace the faulty hardware, such as memory, CPU or motherboard.

Sample Scenario 4 

Console Shows following Error Message

NMI: IOCK error (debug interrupt?)
CPU 0
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge mptctl mptbase bonding be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom hpilo bnx2 serio_raw shpchp pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc ata_piix libata cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-194.17.4.el5 #1
RIP: 0010:[<ffffffff8019d550>] [<ffffffff8019d550>] acpi_processor_idle_simple+0x14c/0x30e
RSP: 0018:ffffffff803fbf58 EFLAGS: 00000046
RAX: 0000000000d4d87e RBX: ffff81061e10a160 RCX: 0000000000000908
RDX: 0000000000000915 RSI: 0000000000000003 RDI: 0000000000000000
RBP: 0000000000d4d87e R08: ffffffff803fa000 R09: 0000000000000039
R10: ffff810001005710 R11: 0000000000000000 R12: 0000000000000000
R13: ffff81061e10a000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000009013954 CR3: 000000060799d000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803fa000, task ffffffff80308b60)
Stack: ffff81061e10a000 ffffffff8019d404 0000000000000000 ffffffff8019d404
0000000000090000 0000000000000000 0000000000000000 ffffffff8004923a
0000000000200800 ffffffff80405807 0000000000090000 0000000000000000
Call Trace:
[<ffffffff8019d404>] acpi_processor_idle_simple+0x0/0x30e
[<ffffffff8019d404>] acpi_processor_idle_simple+0x0/0x30e
[<ffffffff8004923a>] cpu_idle+0x95/0xb8
[<ffffffff80405807>] start_kernel+0x220/0x225
[<ffffffff8040522f>] _sinittext+0x22f/0x236
 
Code: 89 ca ed ed 41 89 c4 41 8a 45 1c 83 e0 30 3c 30 75 15 f0 ff
 
Observations

In these scenarios we normally check the dmesg and lspci output to figure out which device might be the culprit.

Parity and uncorrectable hardware errors are examples of why an IOCHK error could be raised.

Most hardware errors should however be reported through the MCE (Machine Check Exception) mechanism. An MCE indicates that the CPU detected an internal machine error or a bus error, or that an external agent detected a bus error. Normally the hardware manufacturer will be able to provide further details.

Recommended Solution

Use vendor hardware diagnostics software to analyse system health.
Contact the hardware manufacturer for further assistance.
Under RHEL6, the kernel.panic_on_io_nmi = 1 sysctl can be set to have the system panic when an I/O NMI is received.
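
A minimal sketch of setting that sysctl and making it persistent across reboots (the standard sysctl tooling on RHEL 6 is assumed):

# echo "kernel.panic_on_io_nmi = 1" >> /etc/sysctl.conf
# sysctl -p
# sysctl kernel.panic_on_io_nmi

With kdump configured as described earlier, the resulting panic will also produce a vmcore that can be analysed with crash.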
