Solaris Troubleshooting: Dealing with Memory Errors – Correctable and Uncorrectable

Memory errors are among the most common hardware-related errors in enterprise environments. This post discusses the two common types: correctable and uncorrectable errors.

Correctable Memory Errors

 

Symptoms:

Your system may have one or more of the following symptoms.

  • The system may have received CE, ECC errors, or recoverable memory errors.
  • The system may be described as having reported CPU or memory errors
  • Example error messages which may have been reported are shown below:


Correctable ECC error on a read from system memory

 

The following are types of main memory correctable ECC errors reported by the CPUs and also an example from a Schizo (I/O bridge chip):

Example #1: Main Memory Corrected ECC error detected by CPU3 from data read from the memory DIMM in Slot B J8000

 

SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU3 at TL=0, errID 0x...
AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40 Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Memory Error on Slot B: J8000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 68 was in error and corrected
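
An event line like the first one above can be split into its fields for scripting. The Python sketch below is only an illustration, not a Sun tool: the function name parse_aft_event and the regular expressions are assumptions derived from the examples in this post, and it assumes the event is logged as a single line (the example above is wrapped for display).

```python
import re

# Example #1 from this post, joined into one line as an assumption about
# how it appears in /var/adm/messages. The errID is elided in the original.
SAMPLE = (
    "SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) "
    "Event detected by CPU3 at TL=0, errID 0x... "
    "AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40 "
    "Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000"
)

HEX = r"0x[0-9a-fA-F]+"


def parse_aft_event(line):
    """Return the main fields of one [AFTn] event line, or None."""
    head = re.search(r"\[AFT(\d)\] (.+?) Event detected by CPU(\d+)", line)
    if head is None:
        return None
    afsr = re.search(r"AFSR " + HEX + r"<([^>]*)>", line)
    afar = re.search(r"AFAR (" + HEX + r"\.[0-9a-fA-F]+)", line)
    synd = re.search(r"[EM]synd (" + HEX + r")", line)
    slot = re.search(r"Slot ([A-Z]): (J\d+(?: J\d+)*)", line)
    return {
        "aft_level": int(head.group(1)),
        "event": head.group(2),
        "cpu": int(head.group(3)),
        # Flags inside <...> in the AFSR, e.g. ["CE"] or ["EDC", "EDU"].
        "flags": afsr.group(1).split(",") if afsr else [],
        "afar": afar.group(1) if afar else None,
        "syndrome": synd.group(1) if synd else None,
        "slot": slot.group(1) if slot else None,
        "dimms": slot.group(2).split() if slot else [],
    }


if __name__ == "__main__":
    print(parse_aft_event(SAMPLE))
```

The follow-up lines that share the same errID (for example "Data Bit 68 was in error and corrected") do not contain "Event detected by", so this function returns None for them.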

 

Example #2:  A Main Memory Corrected MTAG ECC error detected by CPU1 on data read from Slot A J3000

 

SUNW,UltraSPARC-III+: NOTICE: [AFT0] EMC Event detected by CPU1 at TL=0, errID 0x... AFSR 0x00010000<EMC>.000b0000
AFAR 0x000000a1.1b01b730 Fault_PC <0x10351860> Msynd 0x000b Slot A: J3000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Mtag Error on Slot A: J3000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... MTAG Data Bit 1 was in error and corrected

 

Example #3:  A Main memory corrected ECC error detected by Schizo id 8

 

pcisch: NOTICE: correctable error detected by pci0 (safari id 8) during DVMA read transaction
pcisch:    Transaction was a block operation.
pcisch:    dvma access, Memory safari command, address 000000b1.a8030170, owned_in not asserted.
pcisch:    AFSR=40000000.c800013c AFAR=000000b1.a8030170, quad word offset 00000000.00000003,
Memory Module <Slot B: J8000> port id 8.
pcisch: syndrome bits 13c
pcisch:    mtag 0, mtag ecc syndrome 0

 

CPU correctable ECC and parity errors

CPU Correctable ECC errors are detected and corrected by the CPU module containing the fault.

An example of a CPU L2SRAM Corrected ECC error detected by CPU1 from its own L2SRAM:

 

SUNW,UltraSPARC-III+: NOTICE: [AFT0] EDC Event detected by CPU1 at TL=0, errID 0x... AFSR 0x00000010<EDC>.00000141
AFAR 0x00000000.a745ad50 Fault_PC 0xfe0ba520 Esynd 0x0141
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 93 was in error and corrected

 

Additional Events

There are multiple other CPU Correctable events that can be reported and these include a number of recoverable parity errors:

DPE     D$ parity event
DDSPE   D$ data parity event
DTSPE   D$ physical tag parity event
IPE     I$ parity event
IDSPE   I$ data parity event
ITSPE   I$ physical tag parity event
TSCE    software correctable single-bit E$ tag ECC event
THCE    hardware corrected single-bit E$ tag ECC event
UCC     software correctable E$ ECC event
EDC     hardware corrected E$ ECC event
WDC     hardware corrected E$ ECC event for writeback (victimization)
CPC     hardware corrected E$ ECC event for copyout (snoop request)
L3_MECC   Both 16-byte data of L3 cache data access have ECC error (either correctable or uncorrectable ECC error).
L3_THCE   single bit ECC error on L3 cache tag access
L3_EDC    single bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCC    single bit ECC error on L3 cache data access for I-cache and D-cache request
L3_CPC    single bit ECC error on L3 cache data access for copyout
L3_WDC    single bit ECC error on L3 cache data access for writeback

  • When browsing messages files and console output, note that these messages include [AFT0]. The 0 marks the “Asynchronous Fault Trap” level used for correctable and recoverable errors. AFT1 is used for uncorrectable errors; AFT2 and AFT3 can be ignored in almost all cases.
  • The above error messaging may change slightly depending on your kernel update patch version.
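
As a rough illustration of the AFT levels described above, the Python sketch below labels a log line by its [AFTn] tag. The function name aft_severity and the "other" label for AFT2 and AFT3 are assumptions made for this example. Because the message wording can change between kernel patch levels, it matches only on the tag itself.

```python
import re

AFT_TAG = re.compile(r"\[AFT([0-3])\]")

# Meaning of each level as described in the notes above.
SEVERITY = {0: "correctable", 1: "uncorrectable"}


def aft_severity(line):
    """Classify a log line by its [AFTn] tag; None if it has no tag."""
    match = AFT_TAG.search(line)
    if match is None:
        return None
    # AFT2 and AFT3 are labelled "other" since they can usually be ignored.
    return SEVERITY.get(int(match.group(1)), "other")
```

Running every line of /var/adm/messages through this function gives a quick first split between AFT0 and AFT1 messages before looking at the details.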

Steps to Follow to Troubleshoot:
Please validate that each troubleshooting step below is true for your environment. The steps will provide instructions or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution.

Please do not skip a step.

1. Verify that more than one correctable error has been reported

A certain number of ECC correctable errors are expected to be reported by Sun systems. No single correctable error is enough on its own to require parts replacement.
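
One way to check this is to count correctable events per memory slot in the messages file. The Python sketch below is a hedged illustration: the names count_ce_by_slot and repeat_offenders are made up for this example, the sample lines are modelled on the examples in this post, and the threshold of 2 only encodes "more than one". It is not a replacement threshold.

```python
import re
from collections import Counter

# Only the event header line is counted. Follow-up lines with the same
# errID ("... is Persistent", "Data Bit ... corrected") are skipped.
CE_RE = re.compile(
    r"\[AFT0\].*Event detected by CPU\d+.*Slot ([A-Z]: J\d+(?: J\d+)*)"
)

CE_LINE = (
    "SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) "
    "Event detected by CPU3 at TL=0, errID 0x... "
    "AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40 "
    "Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000"
)
EMC_LINE = (
    "SUNW,UltraSPARC-III+: NOTICE: [AFT0] EMC Event detected by CPU1 "
    "at TL=0, errID 0x... AFSR 0x00010000<EMC>.000b0000 "
    "AFAR 0x000000a1.1b01b730 Fault_PC <0x10351860> Msynd 0x000b "
    "Slot A: J3000"
)
FOLLOW_UP = (
    "SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Memory Error "
    "on Slot B: J8000 is Persistent"
)

# Illustrative input: two CE events on Slot B, one EMC event on Slot A.
SAMPLE_LINES = [CE_LINE, FOLLOW_UP, CE_LINE, EMC_LINE]


def count_ce_by_slot(lines):
    """Count [AFT0] events per slot label, e.g. 'B: J8000'."""
    counts = Counter()
    for line in lines:
        match = CE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeat_offenders(counts, threshold=2):
    """Slots with at least `threshold` events (assumed default: 2)."""
    return sorted(slot for slot, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    print(repeat_offenders(count_ce_by_slot(SAMPLE_LINES)))
```

To run it against a real file, pass the open file object instead of SAMPLE_LINES, for example count_ce_by_slot(open("/var/adm/messages")).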


2. Verify if Solaris has disabled any CPUs

Many of the correctable errors reported by the CPUs will result in the CPU being disabled (where there is more than one CPU). There are a number of ways to check whether CPUs have been disabled. One method is as follows:

  • Run psrinfo and check for CPUs in a state other than on-line.
  • Then check the /var/adm/messages file to identify the errors which caused the fault.
    • On Solaris 8 and 9, a user-offlined CPU looks exactly the same as a system-offlined CPU.
    • With Solaris 10, a new faulted state is used for CPUs offlined by FMA or the system.
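
The psrinfo check can be scripted by reading the first two columns of its output. The sketch below is Python and works on saved psrinfo output rather than calling the command. The sample text is made up for illustration (CPU ids, states and timestamps are not from a real system), and the function name cpus_not_online is an assumption.

```python
# Illustrative psrinfo output; ids, states and timestamps are made up.
SAMPLE_PSRINFO = """\
0       on-line   since 09/18/2015 10:01:02
1       faulted   since 09/18/2015 11:15:40
2       off-line  since 09/18/2015 11:20:05
3       on-line   since 09/18/2015 10:01:02
"""


def cpus_not_online(psrinfo_text):
    """Return (cpu_id, state) pairs for every CPU not in the on-line state."""
    found = []
    for line in psrinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        cpu_id, state = int(fields[0]), fields[1]
        if state != "on-line":
            found.append((cpu_id, state))
    return found


if __name__ == "__main__":
    for cpu_id, state in cpus_not_online(SAMPLE_PSRINFO):
        print("CPU", cpu_id, "is", state)
```

Any CPU this reports still needs to be matched against /var/adm/messages, since on Solaris 8 and 9 the state alone does not say who took it offline.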

3. Collect Data to allow Sun Support to progress your call

Uncorrectable errors can generate very large amounts of error information in messages files. Diagnosing a fault from a small number of messages when a thousand have been reported greatly increases the chance of misdiagnosis. On midrange and high-end platforms, the System Controllers also capture extensive hardware-level failure data, which is important for diagnosis.

  • Collect at a minimum for diagnosis:
    • /var/adm/messages
    • uname -a
      • To confirm that you are not hitting known error reporting bugs
  • So that the correct FRU can be ordered if required:
    • prtdiag -v
      • Required to see what FRUs are installed.
      • Also contains the OBP revision, for the OBP you can also use prtconf -V
    • prtfru -x
      • FRU part and serial numbers required for some FCO checks and to confirm if a FRU is RoHS or not.
      • On the 3800-6900 class systems the prtfru -x output can only be collected using an explorer
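
The commands in the checklist above can be gathered in one pass. The Python sketch below is a minimal illustration and not a substitute for explorer: it assumes a Python 3 interpreter is available on the host, which is not a given on older Solaris releases, and the names COMMANDS, run_command and collect are made up for this example. On some releases prtdiag is not on the default PATH, so the command list may need a full path.

```python
import subprocess

# Commands named in the checklist above; prtconf -V is the optional
# OBP revision check.
COMMANDS = {
    "uname": ["uname", "-a"],
    "prtdiag": ["prtdiag", "-v"],
    "prtfru": ["prtfru", "-x"],
    "prtconf": ["prtconf", "-V"],
}


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return done.stdout
    except OSError as exc:
        # Command missing or not executable on this host.
        return "could not run {}: {}".format(" ".join(argv), exc)


def collect(runner=run_command):
    """Return {name: output} for each command, using the given runner."""
    return {name: runner(argv) for name, argv in COMMANDS.items()}


if __name__ == "__main__":
    for name, output in collect().items():
        print("=====", name, "=====")
        print(output)
```

/var/adm/messages is a plain file, so it can be copied alongside this output rather than collected through a command.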

==========================================================================================

Uncorrectable Memory Errors

Your system may have one or more of the following symptoms.

  • The system may have unexpectedly rebooted and the cause is unknown.
  • The system may have received UE, ECC errors, or recoverable memory errors.
  • The system may be described as crashed, gone down, paniced, panic’d, panic’ed, panicked, rebooted, or received CPU or memory errors
  • Example error messages which may have been reported are as follows:

 

A. Uncorrectable ECC error on a read from system memory

 

Main memory uncorrectable ECC error detected by CPU3 from the bank of DIMMs in Slot A: J8100 J8101 J8201 J8200

SUNW,UltraSPARC-IV: WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU3 in Privileged mode at TL=0, errID 0x… AFSR 0x00100004<PRIV,UE>.000000aa AFAR 0x000000a0.0c06f1e0  Fault_PC 0x1015725c Esynd 0x00aa Slot A: J8100 J8101 J8201 J8200
SUNW,UltraSPARC-IV: [AFT1] errID 0x… Two Bits were in error

Main memory uncorrectable ECC error for a prefetch or store queue fill read.

SUNW,UltraSPARC-IV: [ID 581396 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x… AFSR 0x00400000<DUE>.000000aa AFAR 0x000000a0.0c0ab1f0 Fault_PC 0xff1c1c80 Esynd 0x00aa Slot A: J8100 J8101 J8201 J8200
SUNW,UltraSPARC-IV: [ID 468316 kern.notice] [AFT1] errID 0x… Two Bits were in error

A Main memory uncorrectable ECC error detected by Schizo id 9

pcisch: WARNING: uncorrectable error detected by pci0 (safari id 00000000.00000009) during DVMA read transaction
pcisch:     Transaction was a block operation.
pcisch:     dvma access, Memory safari command, address 000000d0.cb1489a0, owned_in not asserted.
pcisch:     AFSR=40000000.89000063 AFAR=000000d0.cb1489a0, quad word offset 00000000.00000002, Memory Module Slot D: J3100 J3101 J3201 J3200 id 9.
pcisch:     mtag 0, mtag ecc syndrome 0

Uncorrectable Mtag ECC errors from main memory cause a fatal reset, domain pause or dstop depending on the platform.

 

B. CPU Uncorrectable ECC errors

 

SUNW,UltraSPARC-III+: WARNING: [AFT1] EDU Event detected by CPU1 at TL=0, errID 0x…. AFSR 0x00000018<EDC,EDU>.0000017c AFAR 0x000000a0.0c0ab1f0 Fault_PC 0x1000c19c Esynd 0x017c
SUNW,UltraSPARC-III+: [AFT1] errID 0x…. Four Bits were in error

UCU     uncorrectable E$ ECC event
EDU:ST  uncorrectable E$ ECC event for store merge
EDU:BLD uncorrectable E$ ECC event for block load
WDU     uncorrectable E$ ECC event for writeback (victimization)
CPU     uncorrectable E$ ECC event for copyout (snoop request)
L3_TUE_SH multiple-bit ECC error on L3 cache tag access due to copyback, or tag update from foreign Fireplane device, snoop request
L3_TUE    multiple-bit ECC error on L3 cache tag access due to core specific tag access
L3_EDU    multiple-bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCU    multiple-bit ECC error on L3 cache data access for I-cache and D-cache request
L3_CPU    multiple-bit ECC error on L3 cache data access for copyout
L3_WDU    multiple-bit ECC error on L3 cache data access for writeback
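
The event mnemonics listed in this post can be turned into a small lookup to grade the flags that appear between angle brackets in the AFSR, such as <EDC,EDU> in the example above. This Python sketch is an illustration only: the set contents are copied from the tables and example messages in this post, the function name worst_severity is an assumption, and L3_MECC is deliberately left out because it is described as either correctable or uncorrectable.

```python
# Mnemonics taken from the tables and example messages in this post.
CORRECTABLE = {
    "CE", "EMC", "DPE", "DDSPE", "DTSPE", "IPE", "IDSPE", "ITSPE",
    "TSCE", "THCE", "UCC", "EDC", "WDC", "CPC",
    "L3_THCE", "L3_EDC", "L3_UCC", "L3_CPC", "L3_WDC",
}
UNCORRECTABLE = {
    "UE", "DUE", "UCU", "EDU", "WDU", "CPU",
    "L3_TUE_SH", "L3_TUE", "L3_EDU", "L3_UCU", "L3_CPU", "L3_WDU",
}


def worst_severity(flags):
    """Return the most severe class among AFSR flags such as ["EDC", "EDU"]."""
    if any(flag in UNCORRECTABLE for flag in flags):
        return "uncorrectable"
    if any(flag in CORRECTABLE for flag in flags):
        return "correctable"
    # Nothing recognised, e.g. only a mode flag such as PRIV, or L3_MECC.
    return "unknown"


if __name__ == "__main__":
    print(worst_severity(["EDC", "EDU"]))
```

Flags that are not error types, such as PRIV in the first UE example, are ignored by the lookup.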

Error Messaging Notes

  • When browsing messages files and console output, note that these messages include [AFT1]. The 1 marks the “Asynchronous Fault Trap” level used for uncorrectable and unrecoverable errors. AFT0 is used for correctable errors; AFT2 and AFT3 can be ignored in almost all cases.
  • The above error messaging may change slightly depending on your kernel update patch version.
  • It is important to understand that uncorrectable ECC errors can be reported by multiple components.  At no point will the corrupted data actually be used.

 


Ramdev


