If a system uses a configured e1000g network port and the link to that port is down,
a CPU can end up looping in the kernel until the link comes back up. This can
hang the entire system if it has few CPUs.
This is easily reproducible on a system with a configured e1000g port:
vha-v40zc# ifconfig e1000g1
e1000g1: flags=1008803<UP,BROADCAST,MULTICAST,PRIVATE,IPv4> mtu 1500 index 3
inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
ether 0:4:23:b5:b:df
- take the link down, for example by unplugging the network cable, and you see the
link down message:
vha-v40zc e1000g: NOTICE: pci8086,1079 - e1000g[1] : Adapter copper link is down.
- ping a remote host using that network port:
vha-v40zc# ping -s 172.16.1.2 32000
PING 172.16.1.2: 32000 data bytes
- ping will never return, but after a while a CPU starts looping in the
kernel; this can be seen with mdb -k:
mdb> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc22ae0 1b 0 0 59 no no t-0 ffffffff8c1911a0 mdb
1 ffffffff87a91000 1b 0 0 -1 no no t-0 fffffe80002abc80 (idle)
2 ffffffff87c05000 1b 0 0 60 yes no t-9349 fffffe8000185c80 sched
3 ffffffff87da4000 1b 0 0 -1 no no t-0 fffffe80006f0c80 (idle)
CPU 2 starts looping. After several hours we have:
mdb> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc22ae0 1b 0 0 59 no no t-0 ffffffff8c1911a0 mdb
1 ffffffff87a91000 1b 0 0 -1 no no t-0 fffffe80002abc80 (idle)
2 ffffffff87c05000 1b 0 0 60 yes no t-3856340 fffffe8000185c80 sched
3 ffffffff87da4000 1b 0 0 -1 no no t-0 fffffe80006f0c80 (idle)
The same thread is still onproc, looping. If the network cable is plugged back in,
the thread immediately stops looping and the system goes back to normal.
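
The comparison of the two ::cpuinfo snapshots above can be automated. The following is only a sketch, not part of the original report: it assumes the column layout shown above, reads the SWITCH column as ticks since the last context switch, and flags a CPU when the same non-idle thread is on it in both snapshots with a growing SWITCH value. The names parse_cpuinfo, stuck_cpus and min_ticks are hypothetical, and the min_ticks threshold is arbitrary.

```python
import re

# One data row of ::cpuinfo, in the column layout shown in this report:
# ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+[0-9a-f]+\s+\S+\s+\d+\s+\d+\s+-?\d+\s+"
    r"\S+\s+\S+\s+t-(?P<ticks>\d+)\s+(?P<thread>[0-9a-f]+)\s+(?P<proc>.+?)\s*$"
)

def parse_cpuinfo(text):
    """Return {cpu_id: (switch_ticks, thread_addr, proc)}; header lines are skipped."""
    cpus = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            cpus[int(m["id"])] = (int(m["ticks"]), m["thread"], m["proc"])
    return cpus

def stuck_cpus(before, after, min_ticks=1000):
    """CPUs running the same non-idle thread in both snapshots with SWITCH growing.

    min_ticks is an arbitrary threshold chosen for illustration.
    """
    old, new = parse_cpuinfo(before), parse_cpuinfo(after)
    stuck = []
    for cpu, (ticks, thread, proc) in sorted(new.items()):
        if cpu not in old:
            continue
        old_ticks, old_thread, _ = old[cpu]
        if (thread == old_thread and proc != "(idle)"
                and ticks >= min_ticks and ticks > old_ticks):
            stuck.append(cpu)
    return stuck

# The two snapshots from this report.
BEFORE = """\
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc22ae0 1b 0 0 59 no no t-0 ffffffff8c1911a0 mdb
1 ffffffff87a91000 1b 0 0 -1 no no t-0 fffffe80002abc80 (idle)
2 ffffffff87c05000 1b 0 0 60 yes no t-9349 fffffe8000185c80 sched
3 ffffffff87da4000 1b 0 0 -1 no no t-0 fffffe80006f0c80 (idle)
"""
AFTER = """\
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc22ae0 1b 0 0 59 no no t-0 ffffffff8c1911a0 mdb
1 ffffffff87a91000 1b 0 0 -1 no no t-0 fffffe80002abc80 (idle)
2 ffffffff87c05000 1b 0 0 60 yes no t-3856340 fffffe8000185c80 sched
3 ffffffff87da4000 1b 0 0 -1 no no t-0 fffffe80006f0c80 (idle)
"""

if __name__ == "__main__":
    print(stuck_cpus(BEFORE, AFTER))  # prints [2]
```

With the two snapshots from this report, only CPU 2 is flagged: it keeps running thread fffffe8000185c80 and its SWITCH value grows from t-9349 to t-3856340.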
The impact is critical when e1000g interfaces are used as the Sun Cluster interconnect
on small systems (like Galaxy), because this can hang cluster nodes and make HA services
unavailable.