Solaris Troubleshooting: Determine Link Status of Ethernet Interfaces

Ramdev

I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog, and later realized that my learnings might be helpful to other Unix admins if I managed my knowledge base in a more user-friendly format. The result is today's unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/


8 Responses

  1. santhosh says:

    Hi Ramdev
    NIC failure detected on qfe0
    in.mpathd has detected that NIC qfe0 is repaired and operational
    Successfully failed back to NIC qfe0 to NIC qfe1
    in.mpathd has restored network traffic back to NIC qfe0, which is now repaired and operational.

    These days we are getting NIC issues with messages like the ones above, on a Veritas cluster. When the NIC fails and fails back, the service groups fail over to other nodes in the cluster in the meantime, and the application folks want an RCA for it. Can you explain this in more detail so that I can give them an RCA? I have opened cases with Oracle and Symantec and am waiting for a response.

    with regards,
    santhosh

    • Gurkulindia says:

      @Santhosh, what are the versions of the OS and VCS?
      A few questions: 1. Did you ask the network team why the specific port connected to your primary interface is flapping continuously?
      2. What is the failover timeout value (FAILURE_DETECTION_TIME) set in /etc/default/mpathd?
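A minimal sketch for answering question 2, assuming the file uses plain `NAME=value` lines. The helper name `get_fdt` is made up for illustration; the value is in milliseconds.

```shell
# get_fdt [FILE]: print each FAILURE_DETECTION_TIME value (milliseconds) set
# in an mpathd config file. get_fdt is a hypothetical helper, not a Solaris
# command; the default path is the one named in the question above.
get_fdt() {
    conf="${1:-/etc/default/mpathd}"
    # Match only uncommented assignments and drop any trailing comment.
    sed -n 's/^FAILURE_DETECTION_TIME=\([0-9][0-9]*\).*/\1/p' "$conf"
}
```

Running `get_fdt` with no argument reads /etc/default/mpathd; pass another path to check a copy of the file.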

  2. santhosh says:

    Solaris 10 and VCS 4.1.

  3. santhosh says:

    Regarding the VCS service group failure due to NIC failover on a 5-node cluster:

    The problem has been identified: a link failure lasting 10 sec caused the NIC in the IPMP group to fail over, after which a VCS resource faulted and the service group failed over.

    As per the analysis from Oracle, the link failure currently lasts only 10 to 15 sec, so changing FAILURE_DETECTION_TIME for IPMP might resolve the issue from the OS end. The root cause is yet to be identified.

    1: I have changed the IPMP multipathing configuration on the nodes:
    /etc/default/mpathd

    FAILURE_DETECTION_TIME=30000 # previous value was 10000 (10 sec)

    2: Then make in.mpathd re-read its configuration by sending SIGHUP to its pid:

    kill -HUP <pid of in.mpathd>
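The two steps above could be scripted roughly as follows. This is only a sketch: `set_fdt` is a hypothetical helper, and it assumes the file already contains an uncommented FAILURE_DETECTION_TIME line.

```shell
# set_fdt FILE MILLISECONDS: rewrite FAILURE_DETECTION_TIME in FILE, keeping
# the previous contents in FILE.bak. set_fdt is a hypothetical helper name.
set_fdt() {
    conf="$1"
    ms="$2"
    # Refuse anything that is not a plain non-negative integer.
    case "$ms" in
        ''|*[!0-9]*) echo "set_fdt: bad value: $ms" >&2; return 1 ;;
    esac
    cp "$conf" "$conf.bak" || return 1
    # Avoid "sed -i" so the sketch does not depend on GNU sed.
    sed "s/^FAILURE_DETECTION_TIME=.*/FAILURE_DETECTION_TIME=$ms/" "$conf.bak" > "$conf"
}

# Possible use on a node, as root, followed by the SIGHUP from step 2:
#   set_fdt /etc/default/mpathd 30000
#   pkill -HUP in.mpathd
```

`pkill -HUP in.mpathd` sends the signal by process name, so the pid does not have to be looked up first.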

    3: To collect data for analysis (only on suomp45k)

    cd /tmp

    root@cannotgiveservername:/> ps -ef | grep snoop

    root@xxx:/> kstat -p -Td 1 > /tmp/kstat.out1 &

    root@xxx:/> truss -DEd -fael -wall -rall -vall -o truss.in.mpathd.out -p 9911 &
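Once /tmp/kstat.out1 has some data, link flaps can be picked out of it with something like the sketch below. It assumes the NIC driver exposes a `link_up` statistic and that `-Td` writes a date line before each sample; `link_transitions` is a made-up helper name. On Solaris 10, /usr/bin/awk is the old awk, so nawk or /usr/xpg4/bin/awk may be needed in its place.

```shell
# link_transitions [FILE]: report changes of any link_up statistic in the
# output of "kstat -p -Td 1". link_transitions is a hypothetical helper, and
# link_up is assumed to be the statistic name the NIC driver exposes.
link_transitions() {
    awk '
        # kstat -p data line: module:instance:name:link_up<TAB>value
        /^[^ \t:]+:[0-9]+:[^ \t:]+:link_up[ \t]/ {
            if (($1 in last) && last[$1] != $2)
                print ts ": " $1 " " last[$1] " -> " $2
            last[$1] = $2
            next
        }
        # Any other non-empty, non-kstat line is taken to be the -Td timestamp.
        !/^[^ \t:]+:[0-9]+:/ && NF > 0 { ts = $0 }
    ' "${1:-/tmp/kstat.out1}"
}
```

Each output line gives the sample timestamp, the statistic name, and the old and new value, which can then be lined up against the in.mpathd messages.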

    To do

    Need to implement the VCS IP resource timeout:

    1) Turn off the Critical setting so it can't trigger a failover
    2) Raise the RestartLimit on the NIC and/or IP resource to give VCS a chance to correct locally before a FAULT

    For this, the steps below have been formulated for the xxxxx cluster (one node in the cluster). They are not implemented, as we have not tested these settings in a test environment. We have not found any issues so far after changing the IPMP FAILURE_DETECTION_TIME to 30 sec. If the issue recurs, we may also have to try the settings below.

    haconf -makerw
    hares -override IP_xxx RestartLimit
    hares -modify IP_xxx RestartLimit 1
    .
    .
    .
    same for all IP services

    Modify Critical:

    hares -modify IP_xxx Critical 0
    .
    .
    .
    same for all IP services

    haconf -dump -makero
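Since these settings are untested, one cautious option is to generate the full command list for review before running anything. The sketch below only prints the commands from the steps above for each resource name it is given; `vcs_ip_plan` and the resource names in the usage line are hypothetical.

```shell
# vcs_ip_plan RES...: print (do not run) the haconf/hares commands from the
# steps above for each named IP resource. vcs_ip_plan is a hypothetical helper.
vcs_ip_plan() {
    echo "haconf -makerw"
    for res in "$@"; do
        echo "hares -override $res RestartLimit"
        echo "hares -modify $res RestartLimit 1"
        echo "hares -modify $res Critical 0"
    done
    echo "haconf -dump -makero"
}
```

For example, `vcs_ip_plan IP_app1 IP_app2 > /tmp/vcs_plan.sh` writes a script that can be read through, tried on a test cluster, and only then executed.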

  4. vamsi says:

    Hello, we are using Sun Netra X4200 servers running Solaris 5.10. I have noticed NDD Link Failure alarms on one of our servers, but we are not seeing any hardware-related or switch problems. Can you suggest how to find the root cause of these alarms?

    • Ramdev says:

      @Vamsi, you can check a few things in this case:

      1. Is IPMP configured? If so, what is the FAILURE_DETECTION_TIME?
      2. If these alerts repeat at a specific time of day, look for any new cron jobs related to ftp / rsync / backup introduced on the global zone (and on non-global zones, if any exist).
      3. Are there any network-related messages before or after the link failure messages in /var/adm/messages?
      4. Normally, when you ask the network team to help find issues on a switch port, they just look at the current status of the port. Ask them to check the port logs for the period in which you see the error messages on your server.

      Please let us know how the troubleshooting goes.
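To check points 2 and 3, it can help to count the link messages per hour of the day in the system log (on Solaris, /var/adm/messages). A rough sketch follows, assuming standard syslog lines with the time in the third field. The match patterns are guesses beyond "NIC failure", which is the in.mpathd wording quoted earlier in this thread, and `link_events_by_hour` is a made-up helper name.

```shell
# link_events_by_hour [FILE]: count NIC/link failure lines per hour of day.
# link_events_by_hour is a hypothetical helper; the patterns are assumptions
# and may need adjusting to the messages your NIC driver actually logs.
link_events_by_hour() {
    awk '
        /NIC failure|[Ll]ink [Dd]own|[Ll]ink [Ff]ailure/ {
            split($3, t, ":")      # $3 is HH:MM:SS in a syslog line
            count[t[1]]++
        }
        END { for (h in count) print h, count[h] }
    ' "${1:-/var/adm/messages}" | sort
}
```

If most of the hits land in the same hour every day, that is a hint to look at cron jobs or backups scheduled around that time.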

  5. edward says:

    Outstanding story there. What happened after? Take care!
