|
Description
|
While fixing CR 6653933, we found an issue that appears to be caused by IPMP. The details are:
During our project testing, some cases of the IPMP test suite failed. Essentially the same set of cases (2 or 3) failed in v46 and v6 modes. Specifically:
Number of Tests : 145
PASS : 140
FAIL : 5
TP 4 tc_failover_back FAIL
TP 4 tc_failover_backv46 FAIL
TP 15 tc_failover_backv46 FAIL
TP 4 tc_failover_backv6 FAIL
TP 15 tc_failover_backv6 FAIL
The core issue seems to be that in.mpathd performs a failover when it shouldn't. It could be a test suite bug or an in.mpathd bug; until we're sure which, we're recording it here.
*** (#1 of 1): 2008-01-23 10:56:01 CST xxxxx@xxxxx.com
I added some print statements to capture more detail. I found that the
failover of the first interface really did happen when we simulated
failure on both interfaces using "hit -d" at almost the same time. So
if the test assertion is right, there may be a real issue.
I ran the cases tp_004_ti2linkfail and tp_015_ti3linkfail in v4, v6 and
v46 modes; both failed with the same symptom. Here is some output from
the journal file and console:
520|0 4 176237 1 10|earthtone 09:51:47 ti2linkfail: Simulate
failures on interface(s) e1000g1 e1000g2
520|0 4 176237 1 11|earthtone 09:51:47 ti2linkfail: e1000g1 failed
in 0 seconds
520|0 4 176237 1 12|earthtone 09:51:47 ti2linkfail: e1000g2 failed
in 0 seconds
520|0 4 176237 1 13|earthtone 09:51:47 ti2linkfail: Check that
e1000g1 is not failed over
520|0 4 176237 1 14|earthtone 09:51:47 ti2linkfail: Check that
2007:56::214:4fff:fe82:5961 is on interface e1000g1
520|0 4 176237 1 15|earthtone 09:51:47 ti2linkfail: FAILURE:
2007:56::214:4fff:fe82:5961 not on e1000g1
From the journal file log, we can see that "FAILURE:
2007:56::214:4fff:fe82:5961 not on e1000g1" was reported at 09:51:47.
The console shows:
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down
on e1000g1
Jul 3 09:51:47 earthtone in.mpathd[176404]: NIC failure detected on
e1000g1 of group tester
Jul 3 09:51:47 earthtone in.mpathd[176404]: Successfully failed over
from NIC e1000g1 to NIC e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down
on e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: All Interfaces in group
tester have failed
Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on
e1000g2
Jul 3 09:51:48 earthtone in.mpathd[176404]: The link has come up on
e1000g1
At the same time (09:51:47), in.mpathd reported "Successfully failed
over from NIC e1000g1 to NIC e1000g2". The machine info is:
Sun Microsystems Inc. SunOS 5.11 onnv-gate:2008-07-02 Jul. 02, 2008
SunOS Internal Development: gk 2008-07-02 [onnv-gate]
bfu'ed from /ws/onnv-gate/archives/sparc/nightly on 2008-07-03
Sun Microsystems Inc. SunOS 5.11 snv_72 October 2007
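The pattern in the console output above (a failover message followed by a
group failure within the same second) can be checked mechanically. The
following is only a rough sketch, not part of the test suite: the function
name and regular expressions are hypothetical, the message formats are taken
from the log lines in this report, and the "same timestamp" rule is an
assumption, since syslog timestamps here have only one-second resolution.

```python
import re

# Console lines copied from this report, with wrapped lines rejoined.
LOG = """\
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down on e1000g1
Jul 3 09:51:47 earthtone in.mpathd[176404]: NIC failure detected on e1000g1 of group tester
Jul 3 09:51:47 earthtone in.mpathd[176404]: Successfully failed over from NIC e1000g1 to NIC e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: The link has gone down on e1000g2
Jul 3 09:51:47 earthtone in.mpathd[176404]: All Interfaces in group tester have failed
"""

# Hypothetical patterns, based only on the message text seen above.
LINE = re.compile(r"^(\w+\s+\d+\s+[\d:]+)\s+\S+\s+in\.mpathd\[\d+\]:\s+(.*)$")
FAILOVER = re.compile(r"Successfully failed over from NIC (\S+) to NIC (\S+)")
GROUP_FAILED = re.compile(r"All Interfaces in group (\S+) have failed")


def find_suspect_failovers(log_text):
    """Return (timestamp, src, dst, group) for every failover message that
    is followed by a group-failure message carrying the same timestamp."""
    events = []
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if m:
            events.append((m.group(1), m.group(2)))

    suspects = []
    for i, (stamp, msg) in enumerate(events):
        fo = FAILOVER.search(msg)
        if not fo:
            continue
        # Only look at later events that share the failover's timestamp.
        for later_stamp, later_msg in events[i + 1:]:
            if later_stamp != stamp:
                break
            gf = GROUP_FAILED.search(later_msg)
            if gf:
                suspects.append((stamp, fo.group(1), fo.group(2), gf.group(1)))
                break
    return suspects


if __name__ == "__main__":
    for stamp, src, dst, group in find_suspect_failovers(LOG):
        print(f"{stamp}: failover {src} -> {dst} just before group {group} failed")
```

A hit from this check only shows that the extra failover and the group
failure landed in the same second; it does not by itself say whether
in.mpathd or the test suite is at fault.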
*** (#2 of 4): 2008-07-16 16:49:48 CST xxxxx@xxxxx.com
And the reply from xxxxx@xxxxx.COM:
Yes, I think there's basically a problem where group failure doesn't
always work right, leading to an extra failover. However, the whole group
still ends up failed and basically the right thing happens from an
end-user perspective.
*
|