Memory Management in Solaris (how to repair temporary memory faults)
Here I present a simple method for recovering from temporary (transient) memory faults on Solaris. The steps are below:
A) Check for memory errors/faults through FMA (the Fault Management Architecture). List the faults with the fmadm command.
# fmadm faulty
yogesh-test# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 28 15:54:37 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D Minor
Fault class : fault.memory.page.ce
Affects     : mem:///unum=C0S01-SLOT%23D00,/physaddr=11fabe6e80
              faulted and taken out of service
FRU         : mem:///unum=C0S01-SLOT%23D00,
              faulty
Description : The number of single bit errors associated with this memory
              module continues to exceed acceptable levels. Refer to
              http://www.fujitsu.com/global/services/computing/server/unix/prmpwr_msg/SUN4US-8000-7D
              for more information.
Response    : An attempt will be made to remove this memory page from service.
Impact      : Total system memory capacity will be reduced as pages are
              retired.
Action      : Schedule a repair procedure to replace the affected memory
              module. Use fmdump -v -u to identify the module.
yogesh-test# fmdump
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
yogesh-test# fmdump -v -u 2a45e7b1-75df-e29f-cff8-fceae3a73d12
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
100% fault.memory.page.ce
Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -
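When there are many faults, picking the event IDs out of the fmdump listing by hand gets tedious. Below is a minimal sketch that pulls the unique UUIDs out of fmdump output. The function name extract_fault_uuids is my own (hypothetical), and I am assuming a POSIX shell and a POSIX awk (on older Solaris that would be nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk).

```shell
# Print the unique fault UUIDs found in `fmdump` output read from stdin.
# A field counts as a UUID purely by shape: 36 characters made of five
# hyphen-separated groups of lowercase hex digits.
extract_fault_uuids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 &&
                $i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ &&
                !seen[$i]++)
                print $i
    }'
}
```

It would be used as: fmdump | extract_fault_uuids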
B) For each fault listed by 'fmadm faulty', run the repair subcommand with that fault's event ID, so that FMA records the fault as repaired and the bad pages on the memory module can be recovered.
# fmadm repair <EVENT-ID>
yogesh-test# fmadm repair 2a45e7b1-75df-e29f-cff8-fceae3a73d12
fmadm: recorded repair to 2a45e7b1-75df-e29f-cff8-fceae3a73d12
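If several faults are listed, the repair can be looped. This is only a sketch under my own assumptions: repair_faults is a hypothetical helper name, and the FMADM variable is something I added so the loop can be dry-run with echo before touching a real system. It needs a POSIX shell (ksh or bash, not the old /bin/sh).

```shell
# Run `fmadm repair` for every UUID read from stdin, one per line.
# FMADM is a hypothetical override: FMADM=echo only prints the calls.
repair_faults() {
    while read -r uuid; do
        # Skip blank lines in the input.
        [ -n "$uuid" ] || continue
        # </dev/null keeps the command from eating the remaining UUIDs.
        "${FMADM:-fmadm}" repair "$uuid" </dev/null ||
            echo "repair failed for $uuid" >&2
    done
}
```

A dry run would look like: FMADM=echo; printf '%s\n' 2a45e7b1-75df-e29f-cff8-fceae3a73d12 | repair_faults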
C) Now check the status of the memory module:
yogesh-test# fmdump -v -u 2a45e7b1-75df-e29f-cff8-fceae3a73d12
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
100% fault.memory.page.ce
Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -
Aug 29 15:15:16.6917 2a45e7b1-75df-e29f-cff8-fceae3a73d12 FMD-8000-4M Repaired --> Module Repaired
100% fault.memory.page.ce
Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -
yogesh-test# fmdump
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
Aug 29 15:15:16.6917 2a45e7b1-75df-e29f-cff8-fceae3a73d12 FMD-8000-4M Repaired --> Module Repaired
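The check in step C comes down to looking for an FMD-8000-4M (Repaired) record carrying the same UUID as the fault. A small sketch of that check follows. has_repair_record is a hypothetical name of mine, and it assumes an awk that accepts -v (nawk or /usr/xpg4/bin/awk on older Solaris).

```shell
# Succeed (exit 0) if the fmdump output on stdin contains a
# FMD-8000-4M "Repaired" record for the UUID given as $1.
has_repair_record() {
    awk -v uuid="$1" '
        # index() does a literal substring match, so the UUID is not
        # interpreted as a regular expression.
        index($0, uuid) && /FMD-8000-4M/ { found = 1 }
        END { exit !found }
    '
}
```

For example: fmdump | has_repair_record 2a45e7b1-75df-e29f-cff8-fceae3a73d12 && echo "repair recorded"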
D) Now the real work starts: clear all of the FMA error logs and the resource cache.
# cd /var/fm/fmd
yogesh-test# cd /var/fm/fmd
yogesh-test# ls -lrt
total 21394
drwx------ 3 root sys      512 Dec 13 2007 ckpt
-rw-r--r-- 1 root root       0 Apr 14 2008 errlog.6
-rw-r--r-- 1 root root       0 Apr 14 2008 errlog.5
-rw-r--r-- 1 root root       0 Apr 14 2008 errlog.4
-rw-r--r-- 1 root root       0 Apr 14 2008 errlog.3
-rw-r--r-- 1 root root       0 Apr 14 2008 errlog.2
-rw-r--r-- 1 root root       0 Aug 14 2008 errlog.1
-rw-r--r-- 1 root root 2097684 Aug 16 01:12 errlog.0
-rw-r--r-- 1 root root       0 Aug 16 07:13 errlog-
-rw-r--r-- 1 root root  262240 Aug 28 16:02 errlog
-rw-r--r-- 1 root root  727304 Aug 29 15:15 fltlog
drwx------ 2 root sys  7814656 Aug 29 15:15 rsrc
# rm e* f* c*/eft/* r*/*
yogesh-test# rm e* f* c*/eft/* r*/*
yogesh-test# ls -lrt
--> EMPTY <--
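Since that rm is destructive, it can help to see first exactly what the glob would match. The sketch below only lists the paths and deletes nothing. preview_fmd_cleanup is a hypothetical helper of mine, and it assumes a POSIX shell.

```shell
# List what `rm e* f* c*/eft/* r*/*` would match inside the given
# directory (default /var/fm/fmd). Nothing is deleted.
preview_fmd_cleanup() {
    (
        cd "${1:-/var/fm/fmd}" || exit 1
        for f in e* f* c*/eft/* r*/*; do
            # An unmatched pattern stays literal, so keep only real paths.
            [ -e "$f" ] && echo "$f"
        done
        exit 0
    )
}
```

Run it as preview_fmd_cleanup before the rm. Note that the patterns target the files inside ckpt/eft and rsrc, not those directories themselves.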
E) Reset the fmd modules to refresh them.
yogesh-test# fmadm reset cpumem-diagnosis
yogesh-test# fmadm reset cpumem-retire
yogesh-test# fmadm reset eft
yogesh-test# fmadm reset io-retire
Error encountered while resetting the fmd modules:
yogesh-test# fmadm reset cpumem-diagnosis
fmadm: failed to reset module cpumem-diagnosis: specified module is not loaded in fault manager
The resolution usually given is:
# svcadm clear fmd --> (it won't work, because the fmd service itself is not in an error state)
After some investigation/research, I restarted fmd before running the reset commands:
# svcadm restart fmd --> service restarted successfully.
After that, the fmadm reset commands ran successfully.
yogesh-test# fmadm reset cpumem-diagnosis
fmadm: cpumem-diagnosis module has been reset
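The "module is not loaded" error can be spotted before running the resets by checking the module list. This sketch assumes that fmadm config prints one module per line with the module name in the first column. module_loaded is a hypothetical helper name, and an awk with -v support is assumed (nawk or /usr/xpg4/bin/awk on older Solaris).

```shell
# Succeed (exit 0) if the module named in $1 appears in the first column
# of the `fmadm config` output read from stdin.
module_loaded() {
    awk -v mod="$1" '
        $1 == mod { found = 1 }
        END { exit !found }
    '
}
```

For example: fmadm config | module_loaded cpumem-diagnosis || echo "cpumem-diagnosis not loaded, restart fmd first"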
F) The final step is to reboot the system so that the server's cache is cleared as well. :-)
Special Notes:
==========
1.) Ensure the latest firmware/patches are installed on the system.
2.) If the errors show up again in the fmadm/fmdump output, raise a case and replace the memory module (Oracle will suggest which one based on the explorer logs) :-)