OpenSolaris

Bug ID 6726235
Synopsis IPMP group failure can sometimes lead to an extra failover
State 10-Fix Delivered (Fix available in build)
Category:Subcategory network:ipmp
Keywords clearview
Responsible Engineer Peter Memishian
Reported Against snv_90
Duplicate Of
Introduced In solaris_9
Commit to Fix snv_107
Fixed In snv_107
Release Fixed solaris_nevada(snv_107)
Related Bugs 6653933 , 6783149
Submit Date 16-July-2008
Last Update Date 28-January-2009
Description
While fixing CR 6653933, we found an issue that appears to be caused by IPMP. The details are:

During our project testing, some cases of the IPMP test suite failed. Essentially the same set of cases (2 or 3 cases) failed in v46 and v6 modes. Specifically:
	Number of Tests : 145
	PASS            : 140
	FAIL            : 5
	TP 4 tc_failover_back FAIL
	TP 4 tc_failover_backv46 FAIL
	TP 15 tc_failover_backv46 FAIL
	TP 4 tc_failover_backv6 FAIL
	TP 15 tc_failover_backv6 FAIL     
The core issue seems to be that in.mpathd does a failover when it shouldn't. It could be a test suite bug or an in.mpathd bug. Until we're sure it's an in.mpathd bug, we are just recording it here.
*** (#1 of 1): 2008-01-23 10:56:01 CST  xxxxx@xxxxx.com

I added some print statements to capture more details. I found that the
failover of the first interface really did happen when we simulated
failure on both interfaces with "hit -d" at almost the same time. So if
the test's assertion is right, there may be a real issue here.

I ran the cases tp_004_ti2linkfail and tp_015_ti3linkfail in v4, v6 and
v46 modes; both failed with the same symptom. Here is some output from
the journal file and the console:

    520|0 4 176237 1 10|earthtone 09:51:47 ti2linkfail: Simulate
    failures on interface(s) e1000g1 e1000g2
    520|0 4 176237 1 11|earthtone 09:51:47 ti2linkfail: e1000g1 failed
    in 0 seconds
    520|0 4 176237 1 12|earthtone 09:51:47 ti2linkfail: e1000g2 failed
    in 0 seconds
    520|0 4 176237 1 13|earthtone 09:51:47 ti2linkfail: Check that
    e1000g1 is not failed over
    520|0 4 176237 1 14|earthtone 09:51:47 ti2linkfail: Check that
    2007:56::214:4fff:fe82:5961 is on interface e1000g1
    520|0 4 176237 1 15|earthtone 09:51:47 ti2linkfail: FAILURE:
    2007:56::214:4fff:fe82:5961 not on e1000g1


From the journal file, we can see that "FAILURE:
2007:56::214:4fff:fe82:5961 not on e1000g1" happened at 09:51:47.

    Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down
    on e1000g1
    Jul 3 09:51:47 earthtone in.mpathd[176404]: NIC failure detected on
    e1000g1 of group tester
    Jul 3 09:51:47 earthtone in.mpathd[176404]: Successfully failed over
    from NIC e1000g1 to NIC e1000g2
    Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down
    on e1000g2
    Jul 3 09:51:47 earthtone in.mpathd[176404]: All Interfaces in group
    tester have failed
    Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on
    e1000g2
    Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on
    e1000g1

At the same time, in.mpathd reported "Successfully failed over from NIC
e1000g1 to NIC e1000g2". The machine info is:

    Sun Microsystems Inc. SunOS 5.11 onnv-gate:2008-07-02 Jul. 02, 2008
    SunOS Internal Development: gk 2008-07-02 [onnv-gate]
    bfu'ed from /ws/onnv-gate/archives/sparc/nightly on 2008-07-03
    Sun Microsystems Inc. SunOS 5.11 snv_72 October 2007
*** (#2 of 4): 2008-07-16 16:49:48 CST  xxxxx@xxxxx.com

And the reply from xxxxx@xxxxx.COM:
Yes, I think there's basically a problem where group failure doesn't
always work right, leading to an extra failover.  However, the whole group
still ends up failed and basically the right thing happens from an
end-user perspective.
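
To spot this pattern in other runs, the syslog output can be scanned for a "Successfully failed over" message that is followed, within about a second, by "All Interfaces in group ... have failed". The following is only a rough sketch, not part of the test suite or of in.mpathd: the function name find_extra_failovers, the one-second window, the hard-coded year, and the assumption that the log covers a single IPMP group are all illustrative choices. The sample input is the syslog excerpt from this report with its wrapped lines rejoined.

```python
import re
from datetime import datetime

# Syslog excerpt from this report, with wrapped lines rejoined.
SYSLOG = """\
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down on e1000g1
Jul 3 09:51:47 earthtone in.mpathd[176404]: NIC failure detected on e1000g1 of group tester
Jul 3 09:51:47 earthtone in.mpathd[176404]: Successfully failed over from NIC e1000g1 to NIC e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down on e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: All Interfaces in group tester have failed
Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on e1000g2
Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on e1000g1
"""

# Timestamp, hostname, in.mpathd[pid], then the message text.
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ in\.mpathd\[\d+\]: (.*)$")
FAILOVER = re.compile(r"Successfully failed over from NIC (\S+) to NIC (\S+)")
GROUP_FAILED = re.compile(r"All Interfaces in group (\S+) have failed")


def parse_time(stamp, year=2008):
    # syslog timestamps carry no year; the default here is an assumption.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_extra_failovers(text, window_seconds=1):
    """Return (from_nic, to_nic, group) for each failover that is followed
    by a whole-group failure within window_seconds.

    Assumes the log describes a single IPMP group, because the failover
    message itself does not name the group.
    """
    failovers = []  # (time, from_nic, to_nic) seen so far
    hits = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        when = parse_time(m.group(1))
        msg = m.group(2)
        fo = FAILOVER.search(msg)
        if fo:
            failovers.append((when, fo.group(1), fo.group(2)))
            continue
        gf = GROUP_FAILED.search(msg)
        if gf:
            for t, src, dst in failovers:
                if 0 <= (when - t).total_seconds() <= window_seconds:
                    hits.append((src, dst, gf.group(1)))
    return hits


if __name__ == "__main__":
    for src, dst, group in find_extra_failovers(SYSLOG):
        print(f"extra failover {src} -> {dst} before group {group} failed")
```

On the excerpt above this flags the e1000g1 to e1000g2 failover, since it is logged in the same second as the group failure. A wider window may be needed if the two link failures are injected further apart.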
Work Around
N/A
Comments
N/A