Memory Management in Solaris (how to repair temporary memory faults)

Here I present a simple method for clearing temporary memory faults reported on Solaris. The steps are below:

A) Check for memory errors/faults via FMA (Fault Management Architecture), using the fmadm command.

# fmadm faulty

yogesh-test# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 28 15:54:37 2a45e7b1-75df-e29f-cff8-fceae3a73d12  SUN4US-8000-7D Minor

Fault class : fault.memory.page.ce
Affects : mem:///unum=C0S01-SLOT%23D00,/physaddr=11fabe6e80
faulted and taken out of service
FRU : mem:///unum=C0S01-SLOT%23D00,
faulty

Description : The number of single bit errors associated with this memory
module continues to exceed acceptable levels. Refer to
http://www.fujitsu.com/global/services/computing/server/unix/prmpwr_msg/SUN4US-8000-7D
for more information.

Response : An attempt will be made to remove this memory page from service.

Impact : Total system memory capacity will be reduced as pages are
retired.

Action : Schedule a repair procedure to replace the affected memory
module. Use fmdump -v -u to identify the module.

yogesh-test# fmdump
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D

yogesh-test# fmdump -v -u 2a45e7b1-75df-e29f-cff8-fceae3a73d12
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
100% fault.memory.page.ce

Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -
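
When several faults are listed, it can help to pull out just the UUIDs before moving on to the repair step. Below is a minimal sketch, assuming the UUID always appears as its own whitespace-separated field, as in the output above. The function name extract_fault_uuids is my own and is not part of FMA.

```shell
# Print each distinct fault UUID found on stdin, one per line.
# A UUID is recognised here as a 36-character field made of five
# hyphen-separated groups of lowercase hex digits (an assumption based
# on the sample output in this post).
extract_fault_uuids() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if (length($i) == 36 && $i ~ /^[0-9a-f-]+$/ &&
                split($i, parts, "-") == 5 && !seen[$i]++) {
                print $i
            }
        }
    }'
}
```

A typical use would be to pipe the output of fmadm faulty or fmdump into it, for example: fmdump | extract_fault_uuids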

B) For each fault listed by 'fmadm faulty', run the repair option with that fault's EVENT-ID (UUID). This tells FMA the fault has been repaired, so the retired memory pages can be recovered.

# fmadm repair <UUID>

yogesh-test# fmadm repair 2a45e7b1-75df-e29f-cff8-fceae3a73d12
fmadm: recorded repair to 2a45e7b1-75df-e29f-cff8-fceae3a73d12
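
If there are many faults, the repair command can be run in a loop. This is only a sketch: it reads one UUID per line from standard input and calls fmadm repair for each. The FMADM variable is a hypothetical override I added so the loop can be dry-run with echo; it is not something fmadm itself knows about.

```shell
# Run "fmadm repair <uuid>" for every UUID read from stdin.
# FMADM is a hypothetical override for dry runs; it defaults to fmadm.
# Returns non-zero if any single repair command failed.
repair_faults() {
    status=0
    while read -r uuid; do
        [ -n "$uuid" ] || continue          # skip blank lines
        # </dev/null keeps the command from consuming the UUID list.
        "${FMADM:-fmadm}" repair "$uuid" </dev/null || status=1
    done
    return "$status"
}
```

To preview what would be executed without touching the system, feed it a list of UUIDs with FMADM=echo set; each line printed is the command that would run.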

C) Now check the status of the memory module:

yogesh-test# fmdump -v -u 2a45e7b1-75df-e29f-cff8-fceae3a73d12
TIME UUID SUNW-MSG-ID
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
100% fault.memory.page.ce

Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -

Aug 29 15:15:16.6917 2a45e7b1-75df-e29f-cff8-fceae3a73d12 FMD-8000-4M Repaired   <-- module repaired
100% fault.memory.page.ce

Problem in: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
Affects: mem:///unum=C0S01-SLOT#D00,/physaddr=11fabe6e80
FRU: mem:///unum=C0S01-SLOT#D00,
Location: -

yogesh-test# fmdump
Aug 28 15:54:37.6053 2a45e7b1-75df-e29f-cff8-fceae3a73d12 SUN4US-8000-7D
Aug 29 15:15:16.6917 2a45e7b1-75df-e29f-cff8-fceae3a73d12 FMD-8000-4M Repaired   <-- module repaired
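
The check above can also be scripted. The sketch below assumes a repaired fault shows up in the fmdump output as a line containing both the UUID and the word "Repaired", as in this session. The function name is_repaired is my own.

```shell
# Succeed (exit status 0) if stdin contains a line that mentions both the
# given UUID and the word "Repaired"; fail (exit status 1) otherwise.
is_repaired() {
    awk -v uuid="$1" '
        index($0, uuid) && /Repaired/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

For example: fmdump | is_repaired 2a45e7b1-75df-e29f-cff8-fceae3a73d12 && echo "fault is marked repaired"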

D) Now the real work starts: clear all the error logs and the resource cache for FMA.

# cd /var/fm/fmd

yogesh-test# cd /var/fm/fmd

yogesh-test# ls -lrt
total 21394
drwx------ 3 root sys 512 Dec 13 2007 ckpt
-rw-r--r-- 1 root root 0 Apr 14 2008 errlog.6
-rw-r--r-- 1 root root 0 Apr 14 2008 errlog.5
-rw-r--r-- 1 root root 0 Apr 14 2008 errlog.4
-rw-r--r-- 1 root root 0 Apr 14 2008 errlog.3
-rw-r--r-- 1 root root 0 Apr 14 2008 errlog.2
-rw-r--r-- 1 root root 0 Aug 14 2008 errlog.1
-rw-r--r-- 1 root root 2097684 Aug 16 01:12 errlog.0
-rw-r--r-- 1 root root 0 Aug 16 07:13 errlog-
-rw-r--r-- 1 root root 262240 Aug 28 16:02 errlog
-rw-r--r-- 1 root root 727304 Aug 29 15:15 fltlog
drwx------ 2 root sys 7814656 Aug 29 15:15 rsrc

# rm e* f* c*/eft/* r*/*

yogesh-test# rm e* f* c*/eft/* r*/*

yogesh-test# ls -lrt
--> EMPTY <--
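
Because the rm patterns are relative, running them from the wrong directory would delete the wrong files. The sketch below wraps the same patterns in a function that changes into the target directory first and stops if that fails. The function name clear_fmd_state and the optional directory argument are my own additions. The -f flag is also an addition, so that a pattern with no matches does not raise an error.

```shell
# Remove the FMA error logs (e*), the fault log (f*), the eft checkpoint
# files (c*/eft/*) and the resource cache (r*/*) under the given directory
# (default /var/fm/fmd). The directories themselves are kept.
clear_fmd_state() {
    (
        # Run in a subshell so the caller's working directory is unchanged;
        # abort before rm if the directory cannot be entered.
        cd "${1:-/var/fm/fmd}" || exit 1
        rm -f e* f* c*/eft/* r*/*
    )
}
```

Passing a scratch directory as the argument is a harmless way to try the function before pointing it at the real FMA directory.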

E) Reset the fmd modules to refresh them.

yogesh-test# fmadm reset cpumem-diagnosis
yogesh-test# fmadm reset cpumem-retire
yogesh-test# fmadm reset eft
yogesh-test# fmadm reset io-retire

Error encountered while resetting the fmd modules:

yogesh-test# fmadm reset cpumem-diagnosis
fmadm: failed to reset module cpumem-diagnosis: specified module is not loaded in fault manager

The suggested resolution was:

# svcadm clear fmd   --> (it won't work, because the fmd service is not in a maintenance state, so there is nothing to clear)

After some investigation and research, I restarted fmd before running the reset commands:

# svcadm restart fmd   --> service restarted successfully

After that, the fmadm reset commands ran successfully.

yogesh-test# fmadm reset cpumem-diagnosis
fmadm: cpumem-diagnosis module has been reset
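
The reset-then-restart workaround can be folded into one loop. This is a sketch under a few assumptions: the module list is the one used in this post, one restart of fmd is enough, and a short pause after the restart gives the modules time to load. FMADM, SVCADM and RESTART_WAIT are hypothetical overrides I added for dry runs.

```shell
# Reset the fmd modules used in this post. If a reset fails (for example
# because the module is not loaded), restart fmd once, wait, and retry that
# module. Any failure after the single restart makes the function return 1.
# FMADM, SVCADM and RESTART_WAIT are hypothetical overrides for dry runs.
reset_fmd_modules() {
    restarted=0
    for module in cpumem-diagnosis cpumem-retire eft io-retire; do
        if "${FMADM:-fmadm}" reset "$module"; then
            continue
        fi
        if [ "$restarted" -eq 0 ]; then
            "${SVCADM:-svcadm}" restart fmd
            restarted=1
            # svcadm returns before the service is fully up, so pause.
            sleep "${RESTART_WAIT:-10}"
            "${FMADM:-fmadm}" reset "$module" && continue
        fi
        return 1
    done
    return 0
}
```

Running it with FMADM=echo SVCADM=echo RESTART_WAIT=0 prints the commands instead of executing them.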

F) The final step is to reboot the system to clear the server cache as well. :-)

Special Notes:
==========

1.) Ensure the latest firmware/patches are installed on the system.

2.) If the errors show up again in the fmadm/fmdump output, raise a case and replace the memory module (Oracle will suggest which one, based on the explorer logs). :-)

Yogesh Raheja

Yogesh works as a Consultant in Unix Engineering and has multiple years of experience in Solaris, Linux, AIX and Veritas administration. He is certified in SCSA9, SCSA10, SCNA10, VXVM, VCS and ITILv3, and is passionate about sharing his knowledge with others. Specialties: expertise in Unix/Solaris servers, Linux (RHEL), AIX, Veritas Volume Manager, ZFS, Live Upgrade, storage migrations, and cluster deployment (VCS and HACMP), administration and upgrades for banking, telecom, IT infrastructure and hosting services.