Solaris IPMP – Diagnosis and Troubleshooting

Symptoms:

*  in.mpathd error messages in /var/adm/messages:

  • No test address configured on interface <interface_name>; disabling probe-based failure detection on it
  • Test address <address> is not unique; disabling probe based failure detection on <interface_name>
  • The link has gone down on <interface_name>
  • Successfully failed over from NIC <interface_name1> to NIC <interface_name2>
  • NIC repair detected on <interface_name>
  • Successfully failed back to NIC <interface_name>
  • The link has come up on <interface_name>

*  interfaces configured for IPMP missing an UP and/or RUNNING flag in the ifconfig -a output
*  interfaces configured for IPMP showing as FAILED in ifconfig -a output
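
A quick way to look for the messages listed above is to filter the system log for in.mpathd entries. This is a minimal sketch: the function name mpathd_messages is made up for illustration, and it assumes the daemon tags its entries as "in.mpathd[pid]", as in the sample log later in this article.

```shell
# Print only in.mpathd entries from a syslog file.
# Defaults to /var/adm/messages when no file is given.
# (mpathd_messages is a hypothetical helper name, not a Solaris command.)
mpathd_messages() {
    logfile="${1:-/var/adm/messages}"
    grep 'in\.mpathd\[' "$logfile"
}

# Usage:
#   mpathd_messages                       # current log
#   mpathd_messages /var/adm/messages.0   # rotated log
```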

Diagnosis and Troubleshooting

Please validate that each troubleshooting step below is true for your environment. Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution, so please do not skip a step.

STEP 1. Check and validate the IPMP configuration.

For Solaris 10, link-based: Check Configuration

For Solaris 8, 9 and 10: Check Configuration

Ensure the eeprom variable “local-mac-address?” is set to “true”, so that each system interface uses its own unique MAC address.
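
A minimal sketch of that check, assuming a SPARC system where the setting in question is the OBP variable local-mac-address? and where eeprom prints it as "local-mac-address?=true". The helper name local_mac_enabled is made up for illustration.

```shell
# Return 0 if the given eeprom output line shows local-mac-address? as true.
# Argument: the line printed by  eeprom "local-mac-address?"
# (local_mac_enabled is a hypothetical helper name.)
local_mac_enabled() {
    case "$1" in
        'local-mac-address?=true') return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the Solaris host (as root):
#   if ! local_mac_enabled "$(eeprom 'local-mac-address?')"; then
#       eeprom 'local-mac-address?=true'   # assumption: needs a reboot to take effect
#   fi
```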

STEP 2. Check the status of the interfaces in the IPMP group.

The “ifconfig -a” output for the interfaces in the IPMP group MUST indicate “UP” *AND* “RUNNING”.
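
On a host with many interfaces, the flags can be checked with a small filter instead of by eye. This is a sketch only: check_ipmp_flags is a made-up name, it assumes the usual Solaris header line format ("hme0: flags=1000843<UP,BROADCAST,RUNNING,...> mtu ..."), and it reports on every interface in the input, not just IPMP group members.

```shell
# Read "ifconfig -a" output on stdin and report interfaces that are
# missing UP or RUNNING, or that carry the FAILED flag.
# (check_ipmp_flags is a hypothetical helper name.)
check_ipmp_flags() {
    awk '
    # Only interface header lines start in column 1 and contain flags=
    /^[^ \t].*flags=/ {
        ifname = $1; sub(/:$/, "", ifname)
        # Keep only the comma-separated flag list between < and >
        flags = $0
        sub(/^[^<]*</, "", flags); sub(/>.*$/, "", flags)
        flags = "," flags ","
        problems = ""
        if (index(flags, ",UP,") == 0)      problems = problems " missing-UP"
        if (index(flags, ",RUNNING,") == 0) problems = problems " missing-RUNNING"
        if (index(flags, ",FAILED,") > 0)   problems = problems " FAILED"
        if (problems != "") print ifname ":" problems
    }'
}

# Usage:
#   ifconfig -a | check_ipmp_flags
```

No output means every interface in the input shows both UP and RUNNING and none is flagged FAILED.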

If “UP” is missing from the output:

# ifconfig <interface in group> up

If “RUNNING” is missing:

Check the physical link between the interface and the switchport for faulty/disconnected cabling and/or faulty/uninitialized switch port. Eliminate any misconfigurations affecting communication by ensuring that auto-negotiation is enabled on the Sun interface (the default setting) and on the switch side (consult the switch documentation):

(use ndd for older devices, like hme):

# ndd -get /dev/<interface> adv_autoneg_cap

(use kstat for most devices):

# kstat -p |grep e1000g:0 |grep auto

(use dladm for GLDv3 devices like nxge, e1000g, bge):

# dladm show-dev

The proper setting for “adv_autoneg_cap” is 1, meaning that the Sun interface is advertising its autonegotiation capability to the link partner (switch).

If “adv_autoneg_cap” is set to “0”, correct it with ndd for an immediate change.

Note: ce and hme devices require the instance to be selected before any other ndd command. Other devices identify the instance in the /dev/ argument, e.g. to retrieve information on the first instance of bge: ndd -get /dev/bge0 adv_autoneg_cap.

# ndd -set /dev/ce instance <device instance>
# ndd -set /dev/ce adv_autoneg_cap 1

To check, for example on instance 0:

# ndd -set /dev/ce instance 0
# ndd -get /dev/ce adv_autoneg_cap
1

If the setting shows “1” after running the ndd command, but the link is not restored:

- Ensure the switchport is set to autonegotiate.
- Disconnect and reconnect the cable from the interface to the switch to allow the link partners to re-negotiate.

Use the OBP “watch-net-all” command to test Sun interfaces on SPARC hardware.

If you need further assistance to verify your network or switch connections, please consult your local network administrator.

STEP 3. Determine if the default router is properly answering ICMP probes.

For Solaris 8 or 9, or for Solaris 10 with probe-based failure detection (to determine, there must be an interface configured with “-failover”, which appears as the NOFAILOVER flag in the ifconfig -a output):

# pkill -USR1 mpathd

# tail -20 /var/adm/messages

Mar 5 15:06:23 solarishost27 in.mpathd[6338]: [ID 942985 daemon.error] Missed sending total of 0 probes spread over 0 occurrences
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: [ID 373034 daemon.error]
Mar 5 15:06:23 solarishost27 Probe stats on (inet aggr6)
Mar 5 15:06:23 solarishost27 Number of probes sent 419987
Mar 5 15:06:23 solarishost27 Number of probe acks received 419987
Mar 5 15:06:23 solarishost27 Number of probes/acks lost 0  <<----
Mar 5 15:06:23 solarishost27 Number of valid unacknowledged probes 0
Mar 5 15:06:23 solarishost27 Number of ambiguous probe acks received 0
Mar 5 15:06:23 solarishost27 Probe stats on (inet aggr1)
Mar 5 15:06:23 solarishost27 Number of probes sent 419923
Mar 5 15:06:23 solarishost27 Number of probe acks received 123490
Mar 5 15:06:23 solarishost27 Number of probes/acks lost 296324
Mar 5 15:06:23 solarishost27 Number of valid unacknowledged probes 0
Mar 5 15:06:23 solarishost27 Number of ambiguous probe acks received 0

The pkill command can be repeated for ongoing checks or when troubleshooting link failover/failback situations.
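
The dump does not print a loss percentage, so a small filter can help compare interfaces. This is a sketch that assumes the log layout shown above ("Probe stats on (inet <interface>)" followed by the counters); probe_loss is a made-up name. If the input contains several dumps, the last one wins.

```shell
# Read in.mpathd probe statistics (as dumped by SIGUSR1) on stdin and
# print sent/acked counts and the loss percentage per interface.
# (probe_loss is a hypothetical helper name.)
probe_loss() {
    awk '
    /Probe stats on/ {
        # Last field looks like "aggr1)"; strip the closing parenthesis
        ifname = $NF; sub(/\)$/, "", ifname)
    }
    /Number of probes sent/ {
        if (!(ifname in sent)) order[++n] = ifname
        sent[ifname] = $NF
    }
    /Number of probe acks received/ { acked[ifname] = $NF }
    END {
        for (i = 1; i <= n; i++) {
            name = order[i]
            if (sent[name] > 0)
                printf "%s sent=%d acked=%d loss=%.1f%%\n", name, sent[name], acked[name], 100 * (sent[name] - acked[name]) / sent[name]
        }
    }'
}

# Usage:
#   pkill -USR1 mpathd; sleep 1; tail -40 /var/adm/messages | probe_loss
```

In the sample above, aggr6 has had every probe acknowledged, while aggr1 has lost roughly 70% of its probes, which points at the probe target or the path to it for that interface.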

If the configuration is link-based (i.e. no interface marked as “-failover” in the “ifconfig -a” output), skip to step #6.

STEP 4. Are systems on the subnet able to respond to all-hosts multicast?

For Solaris, use netstat and check for the interfaces’ membership in 224.0.0.1 OR ALL-SYSTEMS.MCAST.NET:

solarishost# netstat -g | grep ALL-SYSTEMS.MCAST.NET
lo0 ALL-SYSTEMS.MCAST.NET 1
hme0 ALL-SYSTEMS.MCAST.NET 1

solarishost# netstat -gn | grep 224.0.0.1
lo0 224.0.0.1 1
hme0 224.0.0.1 1
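
To check several interfaces at once, the netstat output can be compared against a list of interface names. A minimal sketch, assuming the three-column "interface group refcnt" layout shown above; missing_allhosts is a made-up name.

```shell
# Read "netstat -gn" output on stdin and print each interface named on
# the command line that has no membership in the all-hosts group 224.0.0.1.
# (missing_allhosts is a hypothetical helper name.)
missing_allhosts() {
    awk -v want="$*" '
    $2 == "224.0.0.1" { member[$1] = 1 }
    END {
        n = split(want, names, " ")
        for (i = 1; i <= n; i++)
            if (!(names[i] in member)) print names[i]
    }'
}

# Usage:
#   netstat -gn | missing_allhosts hme0 qfe0
```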

If the netstat -gn output shows interfaces that cannot respond to ALL-SYSTEMS multicast, the configuration MUST be set up using “host routes”.
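
Host routes are added with the route command, one per system that in.mpathd should probe. The sketch below only prints the commands so they can be reviewed first. The helper name host_route_cmds and the example addresses are made up, and the "route add -host <target> <target> -static" form is taken from the Solaris IPMP documentation as I recall it, so verify it against your release.

```shell
# Print one "route add" command per probe target address.
# Nothing is executed; review the output, then run it as root.
# (host_route_cmds is a hypothetical helper name.)
host_route_cmds() {
    for target in "$@"; do
        printf 'route add -host %s %s -static\n' "$target" "$target"
    done
}

# Usage (example addresses are placeholders):
#   host_route_cmds 192.168.85.30 192.168.85.31
#   host_route_cmds 192.168.85.30 192.168.85.31 | sh
```

Routes added this way do not survive a reboot, so they are normally placed in a startup script as well.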

STEP 5. Is Veritas “Multi-NIC” in use along with IPMP?

To determine:

# ps -ef|grep -i multi
# grep -i LLT /var/adm/messages
# grep -i GAB /var/adm/messages

Identify and clear any errors for LLT and/or GAB.

Consult Symantec for information and assistance with MultiNIC.

STEP 6. Gather troubleshooting and configuration data specified below and contact Sun Support.

At this point, if you have validated that each troubleshooting step above is true for your environment, and the issue still exists, further troubleshooting is required:
I. Packet capture using the “snoop” command. Follow these steps:

a. snoop -d <first interface in the group> -o /tmp/<interface name or instance> -s 54 -q

b. snoop -d <second interface in the group> -o /tmp/<interface name or instance> -s 54 -q

c. Monitor for the error condition in messages, or otherwise reproduce the failure:

tail -f /var/adm/messages

d. Then stop the snoop commands with Control-C and provide the output files /tmp/<interface name or instance> for each network interface in the IPMP group.

Note: explorer should be run with the “-w localzones” option to collect information on any configured local zones.
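
For groups with more than two interfaces, the snoop command lines from steps a and b can be generated per interface. This sketch only prints the commands, using the same options as above and the interface name as the output file name; snoop_cmds is a made-up name.

```shell
# Print one snoop command line per interface in the IPMP group.
# Nothing is executed; run each printed command in its own terminal.
# (snoop_cmds is a hypothetical helper name.)
snoop_cmds() {
    for ifname in "$@"; do
        printf 'snoop -d %s -o /tmp/%s -s 54 -q\n' "$ifname" "$ifname"
    done
}

# Usage:
#   snoop_cmds hme0 qfe0
```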

II. Collect the following outputs to a file using these commands:

# dladm show-dev > show-dev.out
# dladm show-link > show-link.out
# dladm show-aggr -L > show-aggr.out

The following command outputs will be collected for machines up to Solaris 10 Update 4:
1. dladm_show-link.out
2. dladm_show-dev.out
3. dladm_show-aggr_-L.out

And the following command outputs will be collected for machines from Solaris 10 Update 4 onwards:
1. dladm_show-link.out
2. dladm_show-dev.out
3. dladm_show-aggr_-L.out
4. dladm_show-linkprop.out

 

Ramdev


I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog, and later realized that my learnings might be helpful to other Unix admins if I managed my knowledge base in a more user-friendly format. The result is today's unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/

4 Responses

  1. Vignesh says:

    Can you pls explain, what is link and probe based and why we assigning test address?

  2. Ramdev Ramdev says:

    Vignesh – this post might answer your questions. If not please let me know , i will explain.

    http://gurkulindia.com/main/2011/05/ipmp-link-based-only-failure-detection-with-solaris-10/

  3. Yogesh Raheja says:

    @Vignesh, the link given by Ram will explain link-based and probe-based multipathing techniques. As far as the test address is concerned, it is only used in probe-based IPMP to continuously monitor the interfaces which are in IPMP, and in.mpathd is the daemon responsible for this. All parameters can be checked in /etc/default/mpathd. Go through the link mentioned by Ram and let us know in case of any questions.

