Solaris Admin Reference – IPMP Diagnosis and Troubleshooting

Symptoms:

*  mpathd error messages in /var/adm/messages:
“No test address configured on interface <interface_name>; disabling probe-based failure detection on it”
“Test address <address> is not unique; disabling probe based failure detection on <interface_name>”
“The link has gone down on <interface_name>”
“Successfully failed over from NIC <interface_name1> to NIC <interface_name2>”
“NIC repair detected on <interface_name>”
“Successfully failed back to NIC <interface_name>”
“The link has come up on <interface_name>”

*  interfaces configured for IPMP missing an UP and/or RUNNING flag in the ifconfig -a output
*  interfaces configured for IPMP showing as FAILED in ifconfig -a output

Diagnosis and Troubleshooting

Please validate that each troubleshooting step below holds for your environment. Each step provides instructions, or a link to a document, for validating it and taking corrective action as necessary. The steps are ordered to isolate the issue and identify the proper resolution, so please do not skip a step.

STEP 1. Check and validate the IPMP configuration.

For Solaris 10, link-based: Check Configuration

For Solaris 8, 9 and 10: Check Configuration

Ensure the eeprom variable “local-mac-address?” is set to true, so that every system interface uses its own unique MAC address.
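
A quick way to confirm this on SPARC hardware is to read the OBP variable “local-mac-address?” with eeprom. The sketch below only interprets that output. check_local_mac is a hypothetical helper name, not a Solaris command, and it assumes the usual “name=value” output format of eeprom.

```shell
# Sketch only: interpret the output of `eeprom "local-mac-address?"`.
# check_local_mac is a hypothetical helper; it reads the eeprom output on stdin.
check_local_mac() {
    # In a basic regular expression the "?" is matched literally.
    if grep 'local-mac-address?=true' >/dev/null; then
        echo "OK: each interface uses its own MAC address"
    else
        echo "WARNING: local-mac-address? is not true; set it with: eeprom 'local-mac-address?=true'"
    fi
}

# Usage on a SPARC system (as root):
#   eeprom "local-mac-address?" | check_local_mac
```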

 

STEP 2. Check the status of the interfaces in the IPMP group.

The “ifconfig -a” output for the interfaces in the IPMP group MUST indicate “UP” *AND* “RUNNING”.
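
On a host with many interfaces, this check can be scripted. The following is a minimal sketch that reads “ifconfig -a” output and reports every interface that has a groupname line but is missing UP or RUNNING, or carries the FAILED flag. ipmp_flag_check is a hypothetical helper name, and the script assumes the Solaris ifconfig layout (a “name: flags=...<...>” header line followed by indented lines). On Solaris, nawk or /usr/xpg4/bin/awk may be needed in place of awk.

```shell
# Sketch only: report IPMP group members with suspicious flags.
# ipmp_flag_check is a hypothetical helper; it reads `ifconfig -a` output on stdin.
ipmp_flag_check() {
    awk '
    # Header line, e.g. "ce0: flags=1000843<UP,BROADCAST,RUNNING,...> mtu 1500 index 2"
    /^[^ \t].*flags=/ { name = $1; sub(/:$/, "", name); flags = $2 }
    # Only interfaces that belong to an IPMP group have a groupname line.
    $1 == "groupname" {
        problems = ""
        if (flags !~ /[<,]UP[,>]/)      problems = problems " missing-UP"
        if (flags !~ /[<,]RUNNING[,>]/) problems = problems " missing-RUNNING"
        if (flags ~ /[<,]FAILED[,>]/)   problems = problems " FAILED"
        if (problems != "") print name " (" $2 "):" problems
    }'
}

# Usage:
#   ifconfig -a | ipmp_flag_check
```

No output means that every interface in a group shows UP and RUNNING and none is marked FAILED.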

If “UP” is missing from the output:

# ifconfig <interface in group> up

If “RUNNING” is missing:

Check the physical link between the interface and the switch port for faulty or disconnected cabling and for a faulty or uninitialized switch port. Rule out misconfiguration by ensuring that auto-negotiation is enabled both on the Sun interface (the default setting) and on the switch side (consult the switch documentation):

(use ndd for older devices, like hme):

# ndd -get /dev/<interface> adv_autoneg_cap

(use kstat for most devices):

# kstat -p |grep e1000g:0 |grep auto

(use dladm for GLDv3 devices like nxge, e1000g, bge):

# dladm show-dev

The proper setting for “adv_autoneg_cap” is 1, meaning that the Sun interface is advertising its autonegotiation capability to the link partner (switch).

If “adv_autoneg_cap” is set to “0”, correct it with ndd for an immediate change.

Note: the ce and hme drivers require the instance to be selected before any other ndd command. Other devices identify the instance in the /dev/ argument, e.g. to retrieve information on the first instance of bge: ndd -get /dev/bge0 adv_autoneg_cap.

# ndd -set /dev/ce instance <device instance>
# ndd -set /dev/ce adv_autoneg_cap 1

To check:

# ndd -get /dev/ce adv_autoneg_cap

For example, for ce0:

# ndd -set /dev/ce instance 0
# ndd -get /dev/ce adv_autoneg_cap
1

If the setting shows “1” after running the ndd command, but the link is not restored:

- ensure the switch port is set to autonegotiate.
- disconnect and reconnect the cable from the interface to the switch to allow the link partners to re-negotiate.

Use the OBP “watch-net-all” command (run at the ok prompt) to test Sun interfaces on SPARC hardware.

If you need further assistance to verify your network or switch connections, please consult your local network administrator.

STEP 3. Determine if the default router is properly answering ICMP probes.

For Solaris 8 or 9, or Solaris 10 with probe-based detection (to determine, look for an interface configured with “-failover”, which appears as the NOFAILOVER flag in the ifconfig -a output):

# pkill -USR1 mpathd

# tail -20 /var/adm/messages

Mar 5 15:06:23 solarishost27 in.mpathd[6338]: [ID 942985 daemon.error] Missed sending total of 0 probes spread over 0 occurrences
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: [ID 373034 daemon.error]
Mar 5 15:06:23 solarishost27 Probe stats on (inet aggr6)
Mar 5 15:06:23 solarishost27 Number of probes sent 419987
Mar 5 15:06:23 solarishost27 Number of probe acks received 419987
Mar 5 15:06:23 solarishost27 Number of probes/acks lost 0  <<----
Mar 5 15:06:23 solarishost27 Number of valid unacknowledged probes 0
Mar 5 15:06:23 solarishost27 Number of ambiguous probe acks received 0
Mar 5 15:06:23 solarishost27 Probe stats on (inet aggr1)
Mar 5 15:06:23 solarishost27 Number of probes sent 419923
Mar 5 15:06:23 solarishost27 Number of probe acks received 123490
Mar 5 15:06:23 solarishost27 Number of probes/acks lost 296324
Mar 5 15:06:23 solarishost27 Number of valid unacknowledged probes 0
Mar 5 15:06:23 solarishost27 Number of ambiguous probe acks received 0

The pkill command can be repeated for ongoing checks or when troubleshooting link failover/failback situations.
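
When the check is repeated, it helps to reduce the probe statistics to one line per interface. The sketch below is an assumption-laden helper, not part of in.mpathd: probe_loss is a hypothetical name, and it relies on the message layout shown above (“Probe stats on (inet <interface>)”, then “Number of probes sent”, then “Number of probes/acks lost”). For the sample above it would flag aggr1 with a loss of roughly 70%, while aggr6 shows none. On Solaris, nawk or /usr/xpg4/bin/awk may be needed in place of awk.

```shell
# Sketch only: summarize in.mpathd probe statistics per interface.
# probe_loss is a hypothetical helper; it reads syslog lines on stdin.
probe_loss() {
    awk '
    # "... Probe stats on (inet aggr1)" -> remember the interface name.
    /Probe stats on/ { ifname = $NF; sub(/\)$/, "", ifname) }
    # The value is the field that follows the keyword "sent".
    /Number of probes sent/ {
        for (i = 1; i < NF; i++) if ($i == "sent") sent = $(i + 1)
    }
    # The value is the field that follows the keyword "lost".
    /Number of probes\/acks lost/ {
        for (i = 1; i < NF; i++) if ($i == "lost") lost = $(i + 1)
        pct = (sent > 0) ? 100 * lost / sent : 0
        printf "%s sent=%d lost=%d loss=%.1f%%\n", ifname, sent, lost, pct
    }'
}

# Usage:
#   pkill -USR1 mpathd; sleep 1; tail -40 /var/adm/messages | probe_loss
```

A steadily growing loss figure on one interface, while the other members of the group stay at zero, points at the path between that interface and its probe targets rather than at the targets themselves.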

If the configuration is link-based (i.e. no interface shows the NOFAILOVER flag in the “ifconfig -a” output), skip to step #6.

STEP 4. Are systems on the subnet able to respond to all-hosts multicast?

For Solaris, use netstat and check for the interfaces’ membership in 224.0.0.1 OR ALL-SYSTEMS.MCAST.NET:

solarishost# netstat -g | grep ALL-SYSTEMS.MCAST.NET
lo0 ALL-SYSTEMS.MCAST.NET 1
hme0 ALL-SYSTEMS.MCAST.NET 1

solarishost# netstat -gn | grep 224.0.0.1
lo0 224.0.0.1 1
hme0 224.0.0.1 1

If the netstat -gn output shows interfaces that are not members of the ALL-SYSTEMS multicast group, or if systems on the subnet cannot respond to all-hosts multicast, the probe targets MUST be set up using “host routes”.
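
As a rough illustration of the host-route approach, the sketch below prints one “route add -host” command per probe target so the commands can be reviewed before they are run. print_host_routes is a hypothetical helper name and the addresses in the usage line are placeholders. The targets should be hosts on the same subnet as the IPMP interfaces that reliably answer ICMP echo requests. Routes added this way do not survive a reboot, so they are usually placed in a startup script as well.

```shell
# Sketch only: print the host-route commands for a list of probe targets.
# print_host_routes is a hypothetical helper; nothing is changed on the system.
print_host_routes() {
    for target in "$@"; do
        # Destination and gateway are both the address of the target host.
        echo "route add -host $target $target -static"
    done
}

# Usage (placeholder addresses):
#   print_host_routes 192.168.10.1 192.168.10.2
```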

STEP 5. Is Veritas “Multi-NIC” in use along with IPMP?

To determine:

# ps -ef|grep -i multi
# grep -i LLT /var/adm/messages
# grep -i GAB /var/adm/messages

Identify and clear any errors for LLT and/or GAB.

Consult Symantec for information and assistance with MultiNIC.

STEP 6. Gather troubleshooting and configuration data specified below and contact Sun Support.

At this point, if you have validated that each troubleshooting step above is true for your environment and the issue still exists, further troubleshooting is required.

I. Capture packets using the “snoop” command. Follow these steps:

a. snoop -d (first interface in the group) -o /tmp/<interface name or instance> -s 54 -q

b. snoop -d (second interface in the group) -o /tmp/<interface name or instance> -s 54 -q

c. Monitor for the error condition in messages, or otherwise reproduce the failure:

# tail -f /var/adm/messages

d. Then stop the snoop commands with Control-C and provide the output files /tmp/<interface name or instance> for each network interface in the IPMP group.

Note: explorer should be run with the “-w localzones” option to collect information on any configured local zones.
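
Steps a to d above can also be driven from a single shell session instead of one terminal per interface. The following is a minimal sketch under stated assumptions: start_captures, stop_captures, SNOOP and PIDS are hypothetical names introduced here, and ce0 and ce1 in the usage lines stand in for the interfaces of your IPMP group. It starts one background snoop per interface with the same options as above and stops them all on request.

```shell
# Sketch only: run one snoop capture per IPMP interface in the background.
SNOOP=${SNOOP:-snoop}   # hypothetical override, e.g. a full path to snoop
PIDS=""                 # process IDs of the running captures

start_captures() {
    for ifname in "$@"; do
        # Same options as steps a and b: 54-byte snap length, quiet, one file per interface.
        "$SNOOP" -d "$ifname" -o "/tmp/$ifname" -s 54 -q &
        PIDS="$PIDS $!"
    done
}

stop_captures() {
    # Equivalent of pressing Control-C in each snoop session.
    [ -n "$PIDS" ] && kill $PIDS
    wait
    PIDS=""
}

# Usage (placeholder interface names):
#   start_captures ce0 ce1
#   tail -f /var/adm/messages     # watch for or reproduce the failure, then Control-C
#   stop_captures                 # capture files are /tmp/ce0 and /tmp/ce1
```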

II. Collect the following outputs to files using these commands:

# dladm show-dev > show-dev.out
# dladm show-link > show-link.out
# dladm show-aggr -L > show-aggr.out

For machines earlier than Solaris 10 Update 4, the following output files are collected:
1. dladm_show-link.out
2. dladm_show-dev.out
3. dladm_show-aggr_-L.out

For machines running Solaris 10 Update 4 onwards, the following output files are collected:
1. dladm_show-link.out
2. dladm_show-dev.out
3. dladm_show-aggr_-L.out
4. dladm_show-linkprop.out
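
If the files have to be gathered by hand, a small wrapper keeps the names consistent with the list above. This is only a sketch: collect_dladm and the DLADM variable are hypothetical names, and the output directory is whatever you pass in. On releases without “dladm show-linkprop”, the last file will simply contain the error message.

```shell
# Sketch only: save dladm outputs under the file names listed above.
DLADM=${DLADM:-dladm}   # hypothetical override, e.g. /usr/sbin/dladm

collect_dladm() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # Errors are captured in the same file so that a failed command is visible.
    "$DLADM" show-dev      > "$outdir/dladm_show-dev.out" 2>&1
    "$DLADM" show-link     > "$outdir/dladm_show-link.out" 2>&1
    "$DLADM" show-aggr -L  > "$outdir/dladm_show-aggr_-L.out" 2>&1
    "$DLADM" show-linkprop > "$outdir/dladm_show-linkprop.out" 2>&1
}

# Usage (placeholder directory):
#   collect_dladm /var/tmp/ipmp-data
```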

Ramdev


3 Responses

  1. Manoj says:

    Hi Ram,

    This is Manoj one of your follower working as solaris admin

    encountered an issue with ipmp which is under sun cluster 3.2

    there held a fail over in a primary nic and moved on to secondary nic and the server is working under secondary nic right now, but the primary is not getting failed back from secondary

    The particular primary and secondary are in same group

    Now my query is there is no fail back happening from secondary to primary

    In primary nic there is repair is detected but no fail back

    What might be the issue and can you please guide me to resolve this particular issue

    Thanks & Regards
    Manoj
