Solaris Troubleshooting NFS: ls -la hangs on root (/)

Running “ls -la /” hangs, yet running “ls -la” on other directories under root (e.g. ls -la /usr) does NOT hang. The system log (/var/adm/messages) shows NFS-related errors, even though this system is NOT a true NFS client.

 

Here are some sample errors that may appear in the /var/adm/messages file when “ls -la /” hangs:

Mar 28 09:23:19 moe nfs: [ID 333984 kern.notice] NFS server for volume management (/vol) not responding still trying
Mar 28 09:31:13 moe nfs: [ID 664466 kern.notice] NFS getattr failed for server for volume management (/vol): error 23 (RPC: Unitdata error)
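
To pull only these messages out of a busy log, a small filter along the following lines may help. This is a minimal sketch, assuming a POSIX shell (for example /usr/xpg4/bin/sh or ksh); the function name nfs_errors is hypothetical and not part of Solaris.

```shell
# Hypothetical helper: print kernel NFS client errors from a syslog-style file.
# Usage: nfs_errors [logfile]   (defaults to /var/adm/messages)
nfs_errors() {
    logfile=${1:-/var/adm/messages}
    # Keep lines tagged "nfs:" that report either a non-responding server
    # or a failed getattr, the two message types quoted in this article.
    awk '/ nfs: / && (/not responding/ || /getattr failed/)' "$logfile"
}
```

Running nfs_errors with no argument reads /var/adm/messages; passing a path lets you check a rotated copy such as /var/adm/messages.0.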

In this particular situation, the client had mounted a CD remotely from another system, and that system was shut down before it unshared the CD and before the client could unmount the remote CD mount. The tail end of a truss shows that ls was hanging on /vol as well (line numbers are shown, and the hang is at lines 256-257):

# cd /
# truss -fall -vall -wall -rall ls -la


.
.
.
249   4884/1:         lstat64("./xfn", 0xFFBEFAC0)                     = 0
250   4884/1:             d=0x04680002 i=7  m=0040555 l=1  u=0  g=0  sz=1
251   4884/1:                 at = Mar 27 14:15:01 EST 2002  [ 1017256501 ]
252   4884/1:                 mt = Mar 27 14:15:01 EST 2002  [ 1017256501 ]
253   4884/1:                 ct = Mar  8 20:24:51 EST 2002  [ 1015637091 ]
254   4884/1:             bsz=8192  blks=1  fs=autofs
255   4884/1:         acl("./xfn", GETACLCNT, 0, 0x00000000)           = 4
256   4884/1:         lstat64("./vol", 0xFFBEFAC0)    (sleeping...)
257   4884/1:         lstat64("./vol", 0xFFBEFAC0)                     Err#131 ECONNRESET
258   4884/1:             Received signal #2, SIGINT [default]
259   4884/1:         *** process killed ***
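
If the truss output is long, it can be saved to a file (truss accepts -o for this) and searched for the call that went to sleep. The sketch below assumes a POSIX shell and that the saved output uses plain double quotes around path arguments; the function name sleeping_paths and the file name in the usage note are hypothetical.

```shell
# Hypothetical helper: print the path argument of every system call that
# truss reported as blocked in a saved truss log.
# Usage: sleeping_paths trussfile
sleeping_paths() {
    # Select lines containing "(sleeping" and keep only the text between
    # the first pair of double quotes, i.e. the path passed to the call.
    sed -n '/(sleeping/s/^[^"]*"\([^"]*\)".*$/\1/p' "$1"
}
```

For example, after "truss -o /tmp/ls.truss ls -la" run from /, calling sleeping_paths /tmp/ls.truss would be expected to print the entry that ls blocked on.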


Err#131 ECONNRESET means ‘Connection reset by peer’: the connection was forcibly closed by the peer. This normally results from a loss of the connection on the remote host because of a timeout or a reboot.


Follow the instructions below to troubleshoot:


When “ls -la /” hangs, check the /etc/mnttab file for the PID associated with /vol, then run “ps -ef | grep vol-PID” for that PID. If ps does not come back with any process, the process recorded for the /vol mount is no longer running. Use the commands below to get the PID and check it:

 

# grep "/vol" /etc/mnttab

moe:vold(pid222)               /vol       nfs         ignore,dev=39c0001           1017179906

# ps -ef | grep 222 <== returns no output
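
The same check can be scripted. This is a minimal sketch, assuming a POSIX shell, an mnttab entry of the form shown above (host:vold(pidNNN) in the first field), and that it runs as root like the commands in this article; the function names vol_pid and pid_alive are hypothetical.

```shell
# Hypothetical helper: print the PID recorded in the /vol entry of a
# mnttab-style file. Usage: vol_pid [mnttab]   (defaults to /etc/mnttab)
vol_pid() {
    mnttab=${1:-/etc/mnttab}
    # Field 2 is the mount point; field 1 looks like "moe:vold(pid222)".
    awk '$2 == "/vol" { print $1 }' "$mnttab" |
        sed -n 's/.*(pid\([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: succeed if a process with the given PID exists.
# Signal 0 only tests for existence; it can report failure for another
# user's process unless run as root.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}
```

A possible use is: pid=`vol_pid`; pid_alive "$pid" || echo "vold PID $pid is gone, /vol looks stale".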

The real solution is to unmount /vol:

# umount /vol

You may have to force the unmount. The -f option (forced unmount) was introduced in the Solaris 8 Operating Environment and is not available in earlier releases.

# umount -f /vol

If you are running a release earlier than Solaris 8, you may have to reboot to clear the “ls -la” hang.
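
The two unmount attempts can be wrapped into one step. The following is only a sketch, assuming a POSIX shell and a release whose umount accepts -f; the function name unmount_stale and the UMOUNT override variable are hypothetical conveniences for dry runs, not Solaris conventions.

```shell
# Hypothetical wrapper: try a normal unmount first, then fall back to -f.
# Set UMOUNT to another command (for example UMOUNT=echo) for a dry run.
# Usage: unmount_stale /vol
unmount_stale() {
    mountpoint=$1
    umount_cmd=${UMOUNT:-umount}
    if $umount_cmd "$mountpoint"; then
        echo "unmounted $mountpoint"
    elif $umount_cmd -f "$mountpoint"; then
        echo "force-unmounted $mountpoint"
    else
        # Neither attempt worked; the article's last resort is a reboot.
        echo "could not unmount $mountpoint; a reboot may be needed" >&2
        return 1
    fi
}
```

The wrapper reports which attempt succeeded, so it is clear afterwards whether the forced unmount was needed.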

Note: Check the /var/statmon/sm and /var/statmon/sm.bak directories to see if there is a connection still open for this server. If there is, the system may try to remount the filesystem after a reboot. The system will not try to remount if the umount command is successful.
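
A quick way to look through both directories is sketched below, assuming a POSIX shell; the function name statmon_entries is hypothetical, and the optional argument exists only so the sketch can be pointed at a copy of the directory tree.

```shell
# Hypothetical helper: list every entry under the sm and sm.bak
# directories. Usage: statmon_entries [basedir]  (defaults to /var/statmon)
statmon_entries() {
    base=${1:-/var/statmon}
    for dir in "$base/sm" "$base/sm.bak"; do
        # Skip a directory that does not exist on this system.
        [ -d "$dir" ] || continue
        # Print one line per entry, prefixed with the directory it is in.
        ls "$dir" | sed "s|^|$dir/|"
    done
}
```

Piping the result through grep with the server name (moe in the example above) shows whether an entry for that server is still present.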


9 thoughts on “Solaris Troubleshooting NFS: ls -la hangs on root (/)”

  1. Hi Ram, could you please explain the switches in the command # truss -fall -vall -wall -rall ls -la?
    Can this be used every time for the above symptoms?

  2. Hi Ramdev,

    I am using Solaris 10 in VMware (Workstation 5 compatibility) and connecting to it from Xshell on Windows. Here I am facing a problem.

    After I connect to Solaris, the session disconnects after a few seconds. What might be the problem? Can you help me?

    Thanks in advance..
    Gowtham.

    • @Goutham – 1. Please check whether you can ping your VMware Solaris IP; if it does not ping, your Windows firewall may be blocking you. 2. If it pings, check whether Xshell is using telnet or ssh. In Solaris 10, telnet was disabled by default. 3. If you don't know whether it is telnet or ssh, just download PuTTY and connect to the Solaris IP.

  3. This post is helpful for troubleshooting stale mount issues in an environment where NFS is used a lot and stale mounts have been created because of network or power outages (like a large setup with NAS/filers or centralized home directory automounts).

    Good Post and thanks for sharing

  4. NFS getattr failed for server ccmdbprd01.tsacorp.com: error 5 (RPC: 1832-008 Timed out). I am getting this error on all clients that mount an NFS share from the Solaris server. I searched everywhere and found nothing strange.
    But users are getting connection timeouts and slowness with this error; please help track down the issue.
