OpenSolaris

Bug ID 6405012
Synopsis dld_wsrv() can hog a cpu when driver runs out of tx resources
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:gld
Keywords e1000g | nemo | nge
Responsible Engineer Eric Cheng
Reported Against s10_56 , s10_57 , snv_37 , s10u1_19 , s10u2_fcs , solaris_10u3
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_50
Fixed In snv_50
Release Fixed solaris_nevada(snv_50) , solaris_10u4(s10u4_06) (Bug ID:2142798)
Related Bugs 6317659 , 6337966 , 6404881 , 6422939 , 6492263 , 6516096 , 6626082
Submit Date 28-March-2006
Last Update Date 2-February-2007
Description
If a system has a configured e1000g network port and the link on that port goes
down, a cpu can end up looping in the kernel until the link comes back up. This can
hang the entire system if it has few cpus.

This is easily reproducible on a system with an e1000g port:

vha-v40zc# ifconfig e1000g1
e1000g1: flags=1008803<UP,BROADCAST,MULTICAST,PRIVATE,IPv4> mtu 1500 index 3
        inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
        ether 0:4:23:b5:b:df 

- take the link down, for example by unplugging the network cable, and you see the
  link down message:

  vha-v40zc e1000g: NOTICE: pci8086,1079 - e1000g[1] : Adapter copper link is down.

- ping a remote host using that network port:

  vha-v40zc# ping -s 172.16.1.2 32000
  PING 172.16.1.2: 32000 data bytes

- as expected, ping never returns, but after a while a cpu starts looping in the
  kernel; this can be seen with mdb -k:

mdb> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 fffffffffbc22ae0  1b    0    0  59   no    no t-0    ffffffff8c1911a0 mdb
  1 ffffffff87a91000  1b    0    0  -1   no    no t-0    fffffe80002abc80 (idle)
  2 ffffffff87c05000  1b    0    0  60  yes    no t-9349 fffffe8000185c80 sched
  3 ffffffff87da4000  1b    0    0  -1   no    no t-0    fffffe80006f0c80 (idle)

cpu2 starts looping. After several hours we have:

mdb> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 fffffffffbc22ae0  1b    0    0  59   no    no t-0    ffffffff8c1911a0 mdb
  1 ffffffff87a91000  1b    0    0  -1   no    no t-0    fffffe80002abc80 (idle)
  2 ffffffff87c05000  1b    0    0  60  yes    no t-3856340 fffffe8000185c80 sched
  3 ffffffff87da4000  1b    0    0  -1   no    no t-0    fffffe80006f0c80 (idle)

The same thread is still on-proc and looping. If the network cable is plugged back in,
the thread immediately stops looping and the system goes back to normal.
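
The comparison of the two ::cpuinfo snapshots can be scripted. The following is
only a rough sketch, not part of the original report: the helper names
(parse_cpuinfo, stuck_cpus) are hypothetical, it assumes the SWITCH column is a
tick count since the last context switch, and the conversion to hours assumes a
100 Hz clock tick, which may not match the system under test.

```python
# Sketch: flag cpus whose on-proc thread has not switched between two
# ::cpuinfo snapshots. Helper names are hypothetical, not an mdb API.

FIRST = """\
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 fffffffffbc22ae0  1b    0    0  59   no    no t-0    ffffffff8c1911a0 mdb
  1 ffffffff87a91000  1b    0    0  -1   no    no t-0    fffffe80002abc80 (idle)
  2 ffffffff87c05000  1b    0    0  60  yes    no t-9349 fffffe8000185c80 sched
  3 ffffffff87da4000  1b    0    0  -1   no    no t-0    fffffe80006f0c80 (idle)
"""

SECOND = """\
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 fffffffffbc22ae0  1b    0    0  59   no    no t-0    ffffffff8c1911a0 mdb
  1 ffffffff87a91000  1b    0    0  -1   no    no t-0    fffffe80002abc80 (idle)
  2 ffffffff87c05000  1b    0    0  60  yes    no t-3856340 fffffe8000185c80 sched
  3 ffffffff87da4000  1b    0    0  -1   no    no t-0    fffffe80006f0c80 (idle)
"""

ASSUMED_HZ = 100  # assumption: clock ticks per second


def parse_cpuinfo(text):
    """Map cpu id -> (thread address, ticks since last switch)."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a full cpu row.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        if not fields[8].startswith("t-"):
            continue
        rows[int(fields[0])] = (fields[9], int(fields[8][2:]))
    return rows


def stuck_cpus(first, second):
    """Return cpus running the same thread in both snapshots with a
    non-zero, growing SWITCH value."""
    a, b = parse_cpuinfo(first), parse_cpuinfo(second)
    return sorted(
        cpu
        for cpu, (thread, ticks) in b.items()
        if cpu in a and a[cpu][0] == thread and ticks > a[cpu][1] > 0
    )


def ticks_to_hours(ticks, hz=ASSUMED_HZ):
    """Convert a tick count to hours under the assumed tick rate."""
    return ticks / hz / 3600


if __name__ == "__main__":
    for cpu in stuck_cpus(FIRST, SECOND):
        thread, ticks = parse_cpuinfo(SECOND)[cpu]
        print(f"cpu{cpu}: thread {thread} on-proc for ~{ticks_to_hours(ticks):.1f} h")
```

Under the 100 Hz assumption, the t-3856340 value on cpu2 corresponds to roughly
10.7 hours without a context switch, which is consistent with "several hours".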

The impact is critical when e1000g interfaces are used as the Sun Cluster interconnect
on small systems (such as Galaxy), because this can hang cluster nodes and make HA
services unavailable.
Work Around
N/A
Comments
N/A