Description
During IPMP stress testing, we'd occasionally see in.mpathd crash with the
following message:
in.mpathd[100391]: phyint_inst_timer: invalid state 4
This means that we're attempting to send probes through a phyint in the
PI_OFFLINE (4) state, which should never happen. After instrumenting
in.mpathd to provide CTF data and digging through the source, the issue
became clear: when select_test_ifs() is called (to find a test address to
use for probing), IFF_OFFLINE may already have been cleared and a test
address brought IFF_UP while in.mpathd still has the phyint itself in
PI_OFFLINE. This could happen because an external program cleared
IFF_OFFLINE, or because setting the flags via SIOCSLIFFLAGS is itself not
atomic and the IFF_OFFLINE flag got lost in the process.
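
The inconsistency can be modelled in a few lines of C. This is a minimal
sketch, not in.mpathd's code: model_probe_would_abort(), model_self_check()
and the MODEL_ macros are hypothetical names. The value of PI_OFFLINE (4)
comes from the error message, and the IFF_UP and IFF_OFFLINE bits are
inferred from the ifconfig flags words shown in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits inferred from the ifconfig flags words in this report. */
#define	MODEL_IFF_UP		0x00000001ULL
#define	MODEL_IFF_OFFLINE	0x80000000ULL

/* State number taken from the "invalid state 4" message. */
#define	MODEL_PI_OFFLINE	4

/*
 * Returns 1 when the flags make the interface look usable for probing
 * (IFF_OFFLINE clear, IFF_UP set) while the daemon's own state for the
 * phyint is still PI_OFFLINE. That is the combination this report
 * describes. Returns 0 otherwise.
 */
int
model_probe_would_abort(int pi_state, uint64_t flags)
{
	int selectable = (flags & MODEL_IFF_OFFLINE) == 0 &&
	    (flags & MODEL_IFF_UP) != 0;

	return (selectable && pi_state == MODEL_PI_OFFLINE);
}

/* Usage example based on the under1 flags word from this report. */
void
model_self_check(void)
{
	/* under1 as shown while offline: OFFLINE set, not UP. */
	assert(model_probe_would_abort(MODEL_PI_OFFLINE, 0x289000842ULL) == 0);

	/* Same word with OFFLINE cleared and UP set behind the daemon's back. */
	assert(model_probe_would_abort(MODEL_PI_OFFLINE, 0x209000843ULL) == 1);
}
```

The point of the model is only that the flags and the daemon's state are
tracked separately, so nothing forces them to agree.
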
Indeed, it's easy to prove this theory by writing a small program that
clears the IFF_OFFLINE flag but does not tell in.mpathd to bring the
interface online. For instance, on Nevada we configure a two-interface
group, assign a test address to under1, and take it offline, resulting in:
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232
        inet 127.0.0.1 netmask ff000000
under0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500
        inet 10.8.57.34 netmask ffffff00 broadcast 10.8.57.255
        groupname a
        ether 0:3:ba:94:3b:74
under1: flags=289000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,OFFLINE,CoS>
        inet 10.8.57.202 netmask ffffff00 broadcast 10.8.57.255
        groupname a
        ether 0:3:ba:94:3b:75
We then clear offline and bring under1 back up:
# /tmp/clear-offline under1
# ifconfig under1 up
Jan 5 22:14:04 purple-198 in.mpathd[100391]: phyint_inst_timer: invalid state 4
#
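
The source of /tmp/clear-offline is not included in this report. A program
of that kind might look roughly like the sketch below, assuming the Solaris
SIOCGLIFFLAGS and SIOCSLIFFLAGS ioctls on a struct lifreq. The function
names and the SKETCH_IFF_OFFLINE macro are hypothetical, and the bit value
is inferred from the under1 flags word above. The ioctl part is only
compiled on Solaris-derived systems; the pure helper builds anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit inferred from flags=289000842 (OFFLINE set) in the output above. */
#define	SKETCH_IFF_OFFLINE	0x80000000ULL

/* Pure helper: return the flags word with only IFF_OFFLINE cleared. */
uint64_t
sketch_clear_offline(uint64_t flags)
{
	return (flags & ~SKETCH_IFF_OFFLINE);
}

#if defined(__sun)
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sockio.h>
#include <net/if.h>
#include <stropts.h>
#include <unistd.h>

/*
 * Read the interface flags, clear IFF_OFFLINE, and write them back.
 * Nothing here notifies in.mpathd. Returns 0 on success, -1 on error.
 */
int
sketch_clear_offline_on(const char *ifname)
{
	struct lifreq lifr;
	int rc = -1;
	int s;

	assert(ifname != NULL);
	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		return (-1);

	(void) memset(&lifr, 0, sizeof (lifr));
	(void) strlcpy(lifr.lifr_name, ifname, sizeof (lifr.lifr_name));

	if (ioctl(s, SIOCGLIFFLAGS, &lifr) != -1) {
		lifr.lifr_flags = sketch_clear_offline(lifr.lifr_flags);
		if (ioctl(s, SIOCSLIFFLAGS, &lifr) != -1)
			rc = 0;
	}
	(void) close(s);
	return (rc);
}
#endif
```

A main() that passes argv[1] to sketch_clear_offline_on() would complete the
command. Note that the update is a separate get followed by a set, so a
flag changed by someone else between the two ioctls can be overwritten.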