OpenSolaris

Bug ID 6760922
Synopsis devname doesn't handle stale dev_t's in sdev_node cache entries
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:devname
Keywords rtiq_reviewed
Responsible Engineer Phil Kirk
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_103
Fixed In snv_103
Release Fixed solaris_nevada(snv_103)
Related Bugs 6413127 , 4085089
Submit Date 17-October-2008
Last Update Date 2-November-2009
Description
Seemingly at random, tests in the suite start failing because snoop -I fails to open the interface, with an error message like this:

snoop: cannot open "e1000g1": DLPI link does not exist

Here's a console session showing what this looked like.  Notice that snoop starts working again after some time (a minute or so?).  Also, I'm really curious, since ls /dev/ipnet reported the correct devices and snoop worked immediately after that ... maybe listing the /dev/ipnet directory is actually what fixes the problem?

15,0$ ifconfig -a
e1000g0: flags=201004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,CoS> mtu 1500 index 2
        inet 10.8.57.93 netmask ffffff00 broadcast 10.8.57.255
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000 
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone mathesar_tz1
        inet 127.0.0.1 netmask ff000000 
lo0:2: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        zone mathesar_tz2
        inet 127.0.0.1 netmask ff000000 
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128 
lo0:1: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        zone mathesar_tz1
        inet6 ::1/64 
lo0:2: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        zone mathesar_tz2
        inet6 ::1/64 
e1000g0: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 2
        inet6 fe80::214:4fff:fe20:8224/10 
e1000g0:1: flags=202080841<UP,RUNNING,MULTICAST,ADDRCONF,IPv6,CoS> mtu 1500 index 2
        inet6 2002:a08:39f0:1:214:4fff:fe20:8224/64 
e1000g1: flags=202004841<UP,RUNNING,MULTICAST,DHCP,IPv6,CoS> mtu 1500 index 19
        inet6 fe80::214:4fff:fe20:8225/10 
e1000g1:1: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 19
        inet6 2000:1::123:feed:129/64 
e1000g1:2: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 19
        zone mathesar_tz1
        inet6 2000:1::123:feed:130/64 
e1000g1:3: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 19
        zone mathesar_tz2
        inet6 2000:1::123:feed:131/64 
e1000g2: flags=202004841<UP,RUNNING,MULTICAST,DHCP,IPv6,CoS> mtu 1500 index 20
        inet6 fe80::214:4fff:fe20:822a/10 
e1000g2:1: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 20
        inet6 2000:2::123:feed:129/64 
e1000g2:2: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 20
        zone mathesar_tz1
        inet6 2000:2::123:feed:130/64 
e1000g2:3: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 20
        zone mathesar_tz2
        inet6 2000:2::123:feed:131/64 
mathesar(...138328/personal/ipobs-rewhack/src/suites/net/ipobs)
16,0$ snoop -I e1000g1
snoop: cannot open "e1000g1": DLPI link does not exist
mathesar(...138328/personal/ipobs-rewhack/src/suites/net/ipobs)
17,0$ su     
Password: 
# snoop -I e1000g1
snoop: cannot open "e1000g1": DLPI link does not exist
# snoop -I e1000g2 
snoop: cannot open "e1000g2": DLPI link does not exist
# ls /dev/ipnet 
e1000g0  e1000g1  e1000g2  lo0
# snoop -I e1000g1
Using device ipnet/e1000g1 (promiscuous mode)
fe80::214:4fff:fe20:7a75 -> ff02::1      ICMPv6 Neighbor advertisement
2000:1::123:feed:132 -> ff02::1      ICMPv6 Neighbor advertisement
fe80::214:4fff:fe20:8225 -> ff02::1      ICMPv6 Neighbor advertisement
John, is there a crash dump stashed away from an induced panic while this problem is happening?
It turns out that this is quite easily reproducible; one just needs to execute the right sequence of operations.  On my test machine with a bge1 interface, the second iteration through the following loop triggers this situation:

while /bin/true; do
        ifconfig bge1 plumb
        ifconfig bge1 11.0.0.1/24 up
        sleep 1
        ls /dev/ipnet/bge1
        ifconfig bge1 unplumb
done

After that point, "ls /dev/ipnet/bge1" always fails until one runs "ls /dev/ipnet", after which "ls /dev/ipnet/bge1" works again:

bash-3.2# ls /dev/ipnet/bge1
/dev/ipnet/bge1: No such file or directory
bash-3.2# ls /dev/ipnet
bge0  bge1  lo0
bash-3.2# ls /dev/ipnet/bge1
/dev/ipnet/bge1
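
The transcript above can be turned into a small diagnostic probe. The following is only a sketch, not a recommended workaround: probe_with_refresh() is a hypothetical helper name, and it assumes (as the transcript suggests, but does not prove) that reading the parent directory is what makes the entry visible again. It tries the path directly, and on ENOENT reads the whole directory the way "ls" would before retrying.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of any Solaris API).
 * Returns  0 if dir/name can be stat()ed directly,
 *          1 if it only became visible after reading the directory,
 *         -1 if it is still missing or another error occurred.
 */
int
probe_with_refresh(const char *dir, const char *name)
{
	char path[1024];
	struct stat st;
	DIR *dp;
	int n;

	n = snprintf(path, sizeof (path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof (path))
		return (-1);		/* path would not fit */

	if (stat(path, &st) == 0)
		return (0);		/* direct lookup worked */
	if (errno != ENOENT)
		return (-1);

	/* Read every entry in the directory, as "ls dir" would. */
	if ((dp = opendir(dir)) == NULL)
		return (-1);
	while (readdir(dp) != NULL)
		;
	(void) closedir(dp);

	/* Retry the direct lookup after the directory read. */
	return (stat(path, &st) == 0 ? 1 : -1);
}
```

On a machine in the failing state, probe_with_refresh("/dev/ipnet", "bge1") returning 1 would match the behavior shown in the transcript; a return of 0 means the direct lookup was never failing.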

I'm still trying to root-cause this, but a reproducible test case is a huge step forward.
Work Around
N/A
Comments
I believe I've found the problem.  The issue relates to the validation
of sdev cache nodes in devname_lookup_func().  Consider the following
scenario:

1. The ipnet module allocates a minor number and dev_t for a "bge1"
interface, which can be opened out of /dev/ipnet.

2. /dev/ipnet/bge1 is opened, and the bge1 dev_t is stored in the vattr
associated with the bge1 sdev cache entry.

3. The /dev/ipnet/bge1 device is closed.

4. The bge1 interface is unplumbed, and the ipnet module frees the minor
number for bge1.

5. The bge1 interface is plumbed again, and the ipnet module allocates a
new minor number (and thus a new dev_t) for bge1.

6. /dev/ipnet/bge1 is opened again.  Herein lies the problem.
devipnet_lookup() gets called, which in turn calls
devname_lookup_func().  devname_lookup_func() finds the existing (old)
sdev cache entry for bge1, and calls devipnet_validate() to validate it,
which returns SDEV_VTOR_INVALID because the dev_t of the old cache entry
doesn't match the dev_t of the existing "bge1" node.

Here, instead of deleting this cache entry and trying to re-create it by
calling callback(), devname_lookup_func() gives up and returns failure,
leaving the stale entry in the cache until it times out.
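
To make the scenario concrete, here is a toy user-space model of the two behaviors. It is a sketch under stated assumptions, not the devname source: every type and function name prefixed with model_ is hypothetical, the cache holds a single entry, and the "driver" is reduced to one name with a current dev_t. model_lookup_reported() mirrors the behavior described above (fail on an invalid entry and leave it cached), while model_lookup_recreate() shows the alternative of discarding the entry and re-creating it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these are real kernel types or functions. */
typedef unsigned long model_dev_t;
#define	MODEL_NODEV	((model_dev_t)0)

/* What the driver currently exports for one interface name. */
struct model_device {
	const char	*name;
	model_dev_t	dev;	/* MODEL_NODEV while unplumbed */
};

/* One cached node, remembering the dev_t seen when it was created. */
struct model_cache_entry {
	bool		in_use;
	model_dev_t	dev;
};

/* Validator: the entry is valid only if its dev_t matches the device. */
bool
model_validate(const struct model_cache_entry *ce,
    const struct model_device *d)
{
	return (d->dev != MODEL_NODEV && ce->dev == d->dev);
}

/* Stand-in for the create callback: build an entry from current state. */
bool
model_create(struct model_cache_entry *ce, const struct model_device *d)
{
	if (d->dev == MODEL_NODEV)
		return (false);
	ce->in_use = true;
	ce->dev = d->dev;
	return (true);
}

/* Reported behavior: an invalid entry fails the lookup and stays cached. */
bool
model_lookup_reported(struct model_cache_entry *ce,
    const struct model_device *d, model_dev_t *out)
{
	if (!ce->in_use) {
		if (!model_create(ce, d))
			return (false);
	} else if (!model_validate(ce, d)) {
		return (false);		/* stale entry left in place */
	}
	*out = ce->dev;
	return (true);
}

/* Alternative: drop the invalid entry, then re-create it. */
bool
model_lookup_recreate(struct model_cache_entry *ce,
    const struct model_device *d, model_dev_t *out)
{
	if (ce->in_use && !model_validate(ce, d))
		ce->in_use = false;	/* discard stale entry */
	if (!ce->in_use && !model_create(ce, d))
		return (false);
	*out = ce->dev;
	return (true);
}
```

In this model, if an entry is cached while the device has dev 101 and the device is then re-plumbed with dev 102, model_lookup_reported() fails and keeps 101 cached, whereas model_lookup_recreate() succeeds and returns 102.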

It needs to be verified that this is indeed a bug in devnames and not an intentional behavior that we're simply running into and need to work around.