OpenSolaris

Bug ID 6827967
Synopsis x86 hangs in net boot of osol_0906_111 and snv_112
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:gld
Keywords crossbow1
Responsible Engineer Eric Cheng
Reported Against snv_106 , snv_112
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_113
Fixed In snv_111a
Release Fixed solaris_nevada(snv_111a)
Related Bugs 6798461 , 6802926 , 6827290 , 6827291 , 6828133
Submit Date 8-April-2009
Last Update Date 21-April-2009
Description
This bug is seen in osol and it used to be tracked under bugzilla:

http://defect.opensolaris.org/bz/show_bug.cgi?id=6630


Now we're seeing this on an X8420 blade (oaf602), which has four e1000g NICs.

Loading kmdb...
SunOS Release 5.11 Version snv_111 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
[.. Hang ..]
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ scsi_vhci mac uppc neti sd ufs unix cpu_ms.AuthenticAMD.15 
krtld s1394 uhci hook genunix ip usba specfs pcplusmp cpu.generic sctp arp 
sockfs ]
[0]> ::ptree
fffffffffbc2c030  sched
     ffffff01d3274a48  fsflush
     ffffff01d32756a8  pageout
     ffffff01d3276308  init
          ffffff01d3270008  dlmgmtd
          ffffff01d3272528  svc.configd
          ffffff01d3273188  svc.startd
               ffffff01d3273de8  net-physical
                    ffffff01d326e6b0  netstrategy
[0]> :c
According to Sean, this is also seen on x4600 with the following configuration:

With osol_0906-109 it's still hanging around the same place.

Some more investigation shows it could be related to the network
 interfaces on this box.

booting again gets us here:
.
..
installing namefs, module id 153.
load 'sys/portfs' id 154 loaded @ 0xfffffffff7ed6000/0xffffffffc004bfd0 size
28032/304
installing portfs, module id 154.
Booting to milestone "milestone/single-user:default".
load 'exec/intpexec' id 155 loaded @ 0xfffffffff7e659b0/0xffffffffc0040a48 size
1456/136
installing intpexec, module id 155.
load 'drv/sysevent' id 156 loaded @ 0xfffffffff7e233e8/0xffffffffc004c100 size
4448/368
installing sysevent, module id 156.
/pci@0,0/pci108e,cb84@2/storage@4/disk@0,0 (sd0) online


at this point the last process running was netstrategy:

[4]> ::ptree
fffffffffbc2ba70  sched
     ffffff08ef051a48  fsflush
     ffffff08ef0526a8  pageout
     ffffff08ef053308  init
          ffffff08ef04dc68  dlmgmtd
          ffffff08ef050de8  svc.configd
          ffffff08ef050188  svc.startd
               ffffff08ef04b6b0  net-physical
                    ffffff08ef04aa50  netstrategy

[4]> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 fffffffffbc2ba70 sched
R      3      0      0      0      0 0x00020001 ffffff08ef051a48 fsflush
R      2      0      0      0      0 0x00020001 ffffff08ef0526a8 pageout
R      1      0      0      0      0 0x4a004000 ffffff08ef053308 init
R     16      1     16     16     15 0x42000000 ffffff08ef04dc68 dlmgmtd
R      9      1      9      9      0 0x42000000 ffffff08ef050de8 svc.configd
R      7      1      7      7      0 0x42000000 ffffff08ef050188 svc.startd
R     17      7      7      7      0 0x42014000 ffffff08ef04b6b0 net-physical
R     19     17      7      7      0 0x4a004000 ffffff08ef04aa50 netstrategy

and netstrategy seems to be waiting for a nic to come back:

[4]> 0t19::pid2proc | ::walk thread | ::findstack
stack pointer for thread ffffff08ef6b1a80: ffffff003ca26520
[ ffffff003ca26520 _resume_from_idle+0xf1() ]
  ffffff003ca26550 swtch+0x160()
  ffffff003ca265b0 cv_wait_sig+0x14b()
  ffffff003ca26610 str_cv_wait+0xbc()
  ffffff003ca266c0 strwaitq+0x1fe()
  ffffff003ca267d0 kstrgetmsg+0x3dc()
  ffffff003ca26820 ldi_getmsg+0x9b()
  ffffff003ca268b0 dl_op+0x63()
  ffffff003ca26910 dl_bind+0x8f()
  ffffff003ca26970 strplumb`getmacaddr+0xec()
  ffffff003ca269c0 strplumb`matchmac+0x87()
  ffffff003ca26a30 walk_devs+0x4f()
  ffffff003ca26aa0 walk_devs+0xff()
[4]> 
[4]> ffffff003ca26970-10
0xffffff003ca26960:             0xffffff003ca26988 0xffffff08e87f7138
                0xffffff003ca269c0 strplumb`matchmac+0x87
[4]> 0xffffff08e87f7138 ::whatis

ffffff08e87f7138 is ffffff08e87f7138+0, allocated from dev_info_node_cache
[4]> 0xffffff08e87f7138 ::print -t struct dev_info
{
    struct dev_info *devi_parent = 0xffffff08e0a44ae0
    struct dev_info *devi_child = 0
    struct dev_info *devi_sibling = 0xffffff08e87f6ec8
    char *devi_binding_name = 0xffffff08e0bf4ac5 "pciex8086,105e"
    char *devi_addr = 0xffffff08ea30ee00 "0"
    int devi_nodeid = 0x3a
    int devi_instance = 0
    struct dev_ops *devi_ops = e1000g`ws_ops
    void *devi_parent_data = 0xffffff08e8c3a000
    void *devi_driver_data = 0xffffff08e0bb6000
    ddi_prop_t *devi_drv_prop_ptr = 0xffffff08ea43d5f8
    ddi_prop_t *devi_sys_prop_ptr = 0
    struct ddi_minor_data *devi_minor = 0xffffff08e8c53380
    struct dev_info *devi_next = 0xffffff08e87f6ec8
    kmutex_t devi_lock = {
        void *[1] _opaque = [ 0 ]
    }
.
.
.

 So it's waiting for a response from an e1000g NIC.

 This x4600 has 8 x e1000g, 1 x ixgb and 2 x nxge NICs in it; it's a heavy
  networking rig:

dladm show-phys from snv_108:
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
e1000g4      Ethernet             up         1000   full      e1000g4
nxge0        Ethernet             up         10000  full      nxge0
e1000g1      Ethernet             up         1000   full      e1000g1
e1000g5      Ethernet             up         1000   full      e1000g5
e1000g0      Ethernet             up         1000   full      e1000g0
ixgb0        Ethernet             up         10000  full      ixgb0
e1000g2      Ethernet             up         1000   full      e1000g2
e1000g6      Ethernet             up         1000   full      e1000g6
e1000g3      Ethernet             up         1000   full      e1000g3
e1000g7      Ethernet             up         1000   full      e1000g7
nxge1        Ethernet             unknown    0      unknown   nxge1
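
The NIC inventory quoted above (8 x e1000g, 1 x ixgb, 2 x nxge) can be cross-checked against the dladm show-phys listing. The following is only a minimal sketch in Python, assuming the listing has been pasted into a string; the helper name count_by_driver is hypothetical and not part of any Solaris tool.

```python
import re
from collections import Counter

# dladm show-phys output from snv_108, copied from this report.
DLADM_OUTPUT = """\
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
e1000g4      Ethernet             up         1000   full      e1000g4
nxge0        Ethernet             up         10000  full      nxge0
e1000g1      Ethernet             up         1000   full      e1000g1
e1000g5      Ethernet             up         1000   full      e1000g5
e1000g0      Ethernet             up         1000   full      e1000g0
ixgb0        Ethernet             up         10000  full      ixgb0
e1000g2      Ethernet             up         1000   full      e1000g2
e1000g6      Ethernet             up         1000   full      e1000g6
e1000g3      Ethernet             up         1000   full      e1000g3
e1000g7      Ethernet             up         1000   full      e1000g7
nxge1        Ethernet             unknown    0      unknown   nxge1
"""


def count_by_driver(text):
    """Tally physical links per driver (hypothetical helper).

    The driver name is taken to be the DEVICE column with its trailing
    instance number removed, e.g. "e1000g4" -> "e1000g".
    """
    counts = Counter()
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        driver = re.sub(r"\d+$", "", fields[-1])
        counts[driver] += 1
    return counts


if __name__ == "__main__":
    for driver, n in sorted(count_by_driver(DLADM_OUTPUT).items()):
        print(f"{n} x {driver}")
```

For the listing above this tallies 8 e1000g, 1 ixgb and 2 nxge links, which agrees with the description of the box.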
Work Around
N/A
Comments
Seeing this now on three machines (x4600, X8420 and a SuperMicro x86 box).
 All three boxes have e1000g NICs.

  The SuperMicro box could previously install osol_0906-109 fine.

from the x8420 (oaf602) we have it hung during net boot of snv_112

from ::stacks -m e1000g we see one of the nics looks to be hung here:
 (or in some strange interrupt loop?)

[0]> ffffff000801fc60 ::findstack -v
dblk_lastfree+0x70(ffffff01f1436220, ffffff01f1433cc0)
freemsg+0x84(ffffff01f1436220)
freemsgchain+0x21(ffffff01d1bcdc60)
mac`mac_rx+0x206(ffffff01cb16ea98, 0, ffffff01d1bcdc60)
mac`mac_rx_ring+0x4c(ffffff01cb16ea98, 0, ffffff01d1bcdc60, 1)
e1000g`e1000g_intr_pciexpress+0x17e(fffffffffb828184)
0x36fb89d12b()
dispatch_hardint+0x41(36, 2)
switch_sp_and_call+0x13()
0xffffff01ce60a580()
[0]>

but there's nothing blocking this thread:
[0]> ffffff000801fc60 ::thread -b
            ADDR            WCHAN               TS             PITS    SOBJ OPS
ffffff000801fc60                0 ffffff01d34923e0                0           0
[0]>


More debug output below:

^[kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
[0]> ::ptree
fffffffffbc2c370  sched
     ffffff01d3252a48  fsflush
     ffffff01d32536a8  pageout
     ffffff01d3254308  init
          ffffff01d324adf0  devfsadm
          ffffff01d324ec68  dlmgmtd
          ffffff01d3251de8  svc.configd
          ffffff01d3251188  svc.startd
               ffffff01d3250528  install-discover
                    ffffff01d324d310  cut
                         ffffff01d3243df8  netstrategy
                    ffffff01d324e008  dial
[0]> ffffff01d3243df8 ::walk thread | ::findstack -v
stack pointer for thread ffffff01d37193c0: ffffff0008313520
[ ffffff0008313520 _resume_from_idle+0xf1() ]
  ffffff0008313550 swtch+0x147()
  ffffff00083135b0 cv_wait_sig+0x14b(ffffff01f13eadd2, ffffff01f143de28)
  ffffff0008313610 str_cv_wait+0xbc(ffffff01f13eadd2, ffffff01f143de28, 
  ffffffffffffffff, 0)
  ffffff00083136c0 strwaitq+0x1fe(ffffff01f143dda8, 8, 0, 0, ffffffffffffffff, 
  ffffff000831376c)
  ffffff00083137d0 kstrgetmsg+0x3dc(ffffff01f13ef080, ffffff0008313848, 0, 
  ffffff00083137f7, ffffff00083137f0, ffffffffffffffff, ffffff00083137f8)
  ffffff0008313820 ldi_getmsg+0x9b(ffffff01e3d57018, ffffff0008313848, 0)
  ffffff00083138b0 dl_op+0x63(ffffff01e3d57018, ffffff00083138c8, 4, 18, 0, 0)
  ffffff0008313910 dl_bind+0x8f(ffffff01e3d57018, 800, 0)
  ffffff0008313970 strplumb`getmacaddr+0xec(ffffff01cdb977b0, ffffff0008313988)
  ffffff00083139c0 strplumb`matchmac+0x87(ffffff01cdb977b0, ffffff0008313bc8)
  ffffff0008313a30 walk_devs+0x4f(ffffff01cdb977b0, fffffffff79b4190, 
  ffffff0008313bc8, 1)
  ffffff0008313aa0 walk_devs+0xff(ffffff01cd278508, fffffffff79b4190, 
  ffffff0008313bc8, 1)
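
The arguments in the dl_bind and dl_op frames above hint at what netstrategy is waiting for. What follows is a rough sketch in Python, not a statement of how the kernel code is written: it assumes that the second argument of dl_bind is the SAP and that the third and fourth arguments of dl_op are the expected DLPI primitive and the minimum reply length. Those roles are inferred from the values only. The primitive numbers are the standard ones from <sys/dlpi.h>, and parse_frame and describe are hypothetical helper names.

```python
import re

# Subset of DLPI primitive numbers from <sys/dlpi.h>.
DL_PRIMITIVES = {
    0x00: "DL_INFO_REQ",
    0x01: "DL_BIND_REQ",
    0x02: "DL_UNBIND_REQ",
    0x03: "DL_INFO_ACK",
    0x04: "DL_BIND_ACK",
    0x05: "DL_ERROR_ACK",
    0x06: "DL_OK_ACK",
}
ETHERTYPES = {0x0800: "ETHERTYPE_IP", 0x0806: "ETHERTYPE_ARP"}
# dl_bind_ack_t holds six 32-bit fields.
DL_BIND_ACK_SIZE = 6 * 4

FRAME_RE = re.compile(r"(?P<func>[\w`]+)\+0x[0-9a-f]+\((?P<args>[^)]*)\)")


def parse_frame(text):
    """Split one ::findstack -v frame into (function, [hex args as ints])."""
    m = FRAME_RE.search(text)
    if m is None:
        raise ValueError("not a stack frame: %r" % text)
    args = [int(a.strip(), 16) for a in m.group("args").split(",") if a.strip()]
    return m.group("func"), args


def describe(text):
    """Give a best-guess reading of a dl_bind or dl_op frame."""
    func, args = parse_frame(text)
    if func == "dl_bind":
        sap = args[1]  # assumed: second argument is the SAP
        return "bind to SAP 0x%04x (%s)" % (sap, ETHERTYPES.get(sap, "unknown"))
    if func == "dl_op":
        prim, minlen = args[2], args[3]  # assumed: expected primitive, min length
        return "wait for %s, at least %d bytes" % (
            DL_PRIMITIVES.get(prim, "unknown"), minlen)
    return func


if __name__ == "__main__":
    print(describe("dl_bind+0x8f(ffffff01e3d57018, 800, 0)"))
    print(describe("dl_op+0x63(ffffff01e3d57018, ffffff00083138c8, 4, 18, 0, 0)"))
```

Under those assumptions the thread has sent a bind request for the IP ethertype and is sleeping in ldi_getmsg until a DL_BIND_ACK comes back from the e1000g instance, which fits the observation that it never returns.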

[0]> ffffff01cdb977b0::whatis
ffffff01cdb977b0 is ffffff01cdb977b0+0, allocated from dev_info_node_cache

[0]> ffffff01cdb977b0::devinfo
ffffff01cdb977b0 pciex8086,105e, instance #0 (driver name: e1000g)
        Driver properties at ffffff01ce667210:
            name='fm-accchk-capable' type=any items=0
            name='fm-dmachk-capable' type=any items=0
            name='fm-errcb-capable' type=any items=0
            name='fm-ereport-capable' type=any items=0
        Hardware properties at ffffff01ce667120:
            name='pci-msi-capid-pointer' type=int items=1
                value=000000d0
            name='acpi-namespace' type=string items=1
                value='\_SB_.PCI0.P0PE.S1F0'
            name='assigned-addresses' type=int items=15
                value=820f0010.00000000.8db80000.00000000.00020000.820f0014.0000
                0000.8db60000.00000000.00020000.810f0018.00000000.0000b800.0000
                0000.00000020
            name='reg' type=int items=20
                value=000f0000.00000000.00000000.00000000.00000000.020f0010.0000
                0000.00000000.00000000.00020000.020f0014.00000000.00000000.0000
                0000.00020000.010f0018.00000000.00000000.00000000.00000020
            name='compatible' type=string items=13
                value='pciex8086,105e.8086.105e.6' + 'pciex8086,105e.8086.105e
                ' + 'pciex8086,105e.6' + 'pciex8086,105e' + 'pciexclass,020000
                ' + 'pciexclass,0200' + 'pci8086,105e.8086.105e.6' + '
...
...


[0]> ffffff01cdb977b0 ::print -t struct dev_info
{
    struct dev_info *devi_parent = 0xffffff01cd274010
    struct dev_info *devi_child = 0
    struct dev_info *devi_sibling = 0xffffff01cdb97530
    char *devi_binding_name = 0xffffff01caee1385 "pciex8086,105e"
    char *devi_addr = 0xffffff01ce33a600 "0"
    int devi_nodeid = 0x21
    int devi_instance = 0
    struct dev_ops *devi_ops = e1000g`ws_ops
    void *devi_parent_data = 0xffffff01cdf8f440
    void *devi_driver_data = 0xffffff01cc3d4000
    ddi_prop_t *devi_drv_prop_ptr = 0xffffff01ce667210
    ddi_prop_t *devi_sys_prop_ptr = 0
    struct ddi_minor_data *devi_minor = 0xffffff01cdf884b8
    struct dev_info *devi_next = 0xffffff01cdb97530
    kmutex_t devi_lock = {
        void *[1] _opaque = [ 0 ]
    }
 .....

[0]> 0xffffff01cc3d4000::print -t e1000g_t
{
    int instance = 0
    dev_info_t *dip = 0xffffff01cdb977b0
    dev_info_t *priv_dip = 0xffffff01ce54d800
    private_devi_list_t *priv_devi_node = 0xffffff01ce65e7f8
    mac_handle_t mh = 0xffffff01cb16f8c8
    mac_resource_handle_t mrh = 0
    struct e1000_hw shared = {
        void *back = 0xffffff01cc3d8330
        u8 *hw_addr = 0xffffff0186f1c000
        u8 *flash_address = 0
        unsigned long io_base = 0xb800
        struct e1000_mac_info mac = {
            struct e1000_mac_operations ops = {
                int (*)() init_params = e1000g`e1000_init_mac_params_82571
                int (*)() blink_led = e1000g`e1000_blink_led_generic
                int (*)() check_for_link = 
e1000g`e1000_check_for_copper_link_generic
                int (*)() check_mng_mode = e1000g`e1000_check_mng_mode_generic
                int (*)() cleanup_led = e1000g`e1000_cleanup_led_generic
                int (*)() clear_hw_cntrs = e1000g`e1000_clear_hw_cntrs_82571
                int (*)() clear_vfta = e1000g`e1000_clear_vfta_82571
                int (*)() get_bus_info = e1000g`e1000_get_bus_info_pcie_generic

....

[0]> ::stacks -m e1000g
THREAD           STATE    SOBJ                COUNT
ffffff000801fc60 FREE     <NONE>                  1
                 dblk_lastfree+0x70
                 freemsg+0x84
                 freemsgchain+0x21
                 mac`mac_rx+0x206
                 mac`mac_rx_ring+0x4c
                 e1000g`e1000g_intr_pciexpress+0x17e
                 0x36fb89d12b
                 dispatch_hardint+0x41

ffffff0007f6bc60 FREE     <NONE>                  1
                 e1000g`e1000g_check_dma_handle+0x1e
                 e1000g`e1000g_receive+0x6c
                 e1000g`e1000g_intr_pciexpress+0x159
                 0x33fb89d12b
                 dispatch_hardint+0x41

ffffff0008025c60 FREE     <NONE>                  1
                 e1000g`e1000g_check_dma_handle+0x1e
                 e1000g`e1000g_receive+0x6c
                 e1000g`e1000g_intr_pciexpress+0x159
                 0x36fb89d12b
                 dispatch_hardint+0x41

[0]> 

[0]> ffffff000801fc60::findstack -v
stack pointer for thread ffffff000801fc60 (TS_FREE): ffffff000801fa40
  ffffff000801fa70 dblk_lastfree+0x70(ffffff01f1436220, ffffff01f1433cc0)
  ffffff000801faa0 freemsg+0x84(ffffff01f1436220)
  ffffff000801fac0 freemsgchain+0x21(ffffff01d1bcdc60)
  ffffff000801fb10 mac`mac_rx+0x206(ffffff01cb16ea98, 0, ffffff01d1bcdc60)
  ffffff000801fb50 mac`mac_rx_ring+0x4c(ffffff01cb16ea98, 0, ffffff01d1bcdc60, 1
  )
  ffffff000801fbb0 e1000g`e1000g_intr_pciexpress+0x17e(fffffffffb828184)
  ffffff000801fc00 0x36fb89d12b()
  ffffff000801fc40 dispatch_hardint+0x41(36, 2)
  ffffff0008025ad0 switch_sp_and_call+0x13()
  ffffff01ce60a580 0xffffff01ce60a580()
[0]> 

[0]> ffffff000801fc60 ::thread -b
            ADDR            WCHAN               TS             PITS    SOBJ OPS
ffffff000801fc60                0 ffffff01d34923e0                0           0
[0]>
Pegasus+ (Sun Blade X6440) with igb hits the same problem, so it's not e1000g specific.
Petr, are you saying that you see the same kernel stack for a hung netstrategy process?
David, yep.
[7]> ::ptree
fffffffffbc2c370  sched
     ffffff04eae5da48  fsflush
     ffffff04eae5e6a8  pageout
     ffffff04eae5f308  init
          ffffff04eae538d0  devfsadm
          ffffff04eae59008  dlmgmtd
          ffffff04eae5a8c8  svc.configd
          ffffff04eae5c188  svc.startd
               ffffff04eae58310  install-discover
                    ffffff04eae56a50  cut
                         ffffff04eae51318  netstrategy
                    ffffff04eae576b0  dial

[7]> ::threadlist
            ADDR             PROC              LWP CMD/LWPID
ffffff04eb05a3a0 ffffff04eae51318 ffffff04eaf10b50 netstrategy/1
[7]> ffffff04eb05a3a0::findstack
stack pointer for thread ffffff04eb05a3a0: ffffff001f910520
[ ffffff001f910520 _resume_from_idle+0xf1() ]
  ffffff001f910550 swtch+0x147()
  ffffff001f9105b0 cv_wait_sig+0x14b()
  ffffff001f910610 str_cv_wait+0xbc()
  ffffff001f9106c0 strwaitq+0x1fe()
  ffffff001f9107d0 kstrgetmsg+0x3dc()
  ffffff001f910820 ldi_getmsg+0x9b()
  ffffff001f9108b0 dl_op+0x63()
  ffffff001f910910 dl_bind+0x8f()
  ffffff001f910970 strplumb`getmacaddr+0xec()
  ffffff001f9109c0 strplumb`matchmac+0x87()
  ffffff001f910a30 walk_devs+0x4f()
  ffffff001f910aa0 walk_devs+0xff()
[7]> 
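
To answer the "same kernel stack?" question mechanically, the ::findstack outputs can be reduced to their function names and compared. This is only a sketch in Python, assuming the two stacks are pasted in as text; frame_functions is a hypothetical helper. Offsets are ignored on purpose, since the machines run different builds (swtch+0x160 on the x4600 versus swtch+0x147 here).

```python
import re

# netstrategy stack from the x4600, copied from this report.
X4600_STACK = r"""
stack pointer for thread ffffff08ef6b1a80: ffffff003ca26520
[ ffffff003ca26520 _resume_from_idle+0xf1() ]
  ffffff003ca26550 swtch+0x160()
  ffffff003ca265b0 cv_wait_sig+0x14b()
  ffffff003ca26610 str_cv_wait+0xbc()
  ffffff003ca266c0 strwaitq+0x1fe()
  ffffff003ca267d0 kstrgetmsg+0x3dc()
  ffffff003ca26820 ldi_getmsg+0x9b()
  ffffff003ca268b0 dl_op+0x63()
  ffffff003ca26910 dl_bind+0x8f()
  ffffff003ca26970 strplumb`getmacaddr+0xec()
  ffffff003ca269c0 strplumb`matchmac+0x87()
  ffffff003ca26a30 walk_devs+0x4f()
  ffffff003ca26aa0 walk_devs+0xff()
"""

# netstrategy stack from the X6440 (igb), copied from this report.
X6440_STACK = r"""
stack pointer for thread ffffff04eb05a3a0: ffffff001f910520
[ ffffff001f910520 _resume_from_idle+0xf1() ]
  ffffff001f910550 swtch+0x147()
  ffffff001f9105b0 cv_wait_sig+0x14b()
  ffffff001f910610 str_cv_wait+0xbc()
  ffffff001f9106c0 strwaitq+0x1fe()
  ffffff001f9107d0 kstrgetmsg+0x3dc()
  ffffff001f910820 ldi_getmsg+0x9b()
  ffffff001f9108b0 dl_op+0x63()
  ffffff001f910910 dl_bind+0x8f()
  ffffff001f910970 strplumb`getmacaddr+0xec()
  ffffff001f9109c0 strplumb`matchmac+0x87()
  ffffff001f910a30 walk_devs+0x4f()
  ffffff001f910aa0 walk_devs+0xff()
"""

# Matches "module`function+0xoff(" and captures the function part.
FRAME_RE = re.compile(r"([\w`]+)\+0x[0-9a-f]+\(")


def frame_functions(findstack_text):
    """Return the function names of a ::findstack listing, top frame first."""
    return [m.group(1) for m in FRAME_RE.finditer(findstack_text)]


if __name__ == "__main__":
    a = frame_functions(X4600_STACK)
    b = frame_functions(X6440_STACK)
    print("same call chain" if a == b else "call chains differ")
```

For the two stacks quoted in this report the function sequences are identical, so the igb machine is blocked at the same point in dl_bind as the e1000g machines.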
[7]> ::interrupts
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s) 
4    0xb0 12  ISA    Edg Fixed  7   1     0x0/0x4   asy`asyintr
9    0x81 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpica`acpi_wrapper_isr
21   0x84 9   PCI    Lvl Fixed  8   1     0x0/0x15  ehci`ehci_intr
22   0x85 9   PCI    Lvl Fixed  9   1     0x0/0x16  ohci`ohci_intr
48   0x82 7   PCI    Edg MSI    2   1     -         pcie_pci`pepb_intr_handler
49   0x83 7   PCI    Edg MSI    2   1     -         pcie_pci`pepb_intr_handler
50   0x60 6   PCI    Edg MSI-X  3   1     -         igb`igb_intr_tx_other
51   0x61 6   PCI    Edg MSI-X  4   1     -         igb`igb_intr_rx
52   0x62 6   PCI    Edg MSI-X  4   1     -         igb`igb_intr_tx_other
53   0x63 6   PCI    Edg MSI-X  6   1     -         igb`igb_intr_rx
54   0x30 4   PCI    Edg MSI    10  1     -         pcie_pci`pepb_intr_handler
55   0x31 4   PCI    Edg MSI    10  1     -         pcie_pci`pepb_intr_handler
56   0x86 7   PCI    Edg MSI    11  1     -         pcie_pci`pepb_intr_handler
57   0x87 7   PCI    Edg MSI    11  1     -         pcie_pci`pepb_intr_handler
58   0x32 4   PCI    Edg MSI    12  1     -         pcie_pci`pepb_intr_handler
59   0x33 4   PCI    Edg MSI    12  1     -         pcie_pci`pepb_intr_handler
60   0x88 7   PCI    Edg MSI    13  1     -         pcie_pci`pepb_intr_handler
61   0x89 7   PCI    Edg MSI    13  1     -         pcie_pci`pepb_intr_handler
62   0x40 5   PCI    Edg MSI    0   1     -         emlxs`emlxs_sli3_msi_intr
63   0x41 5   PCI    Edg MSI    0   1     -         emlxs`emlxs_sli3_msi_intr
64   0x42 5   PCI    Edg MSI    1   1     -         emlxs`emlxs_sli3_msi_intr
65   0x43 5   PCI    Edg MSI    1   1     -         emlxs`emlxs_sli3_msi_intr
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
192  0xc0 13         Edg IPI    all 1     -         xc_serv
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd1 14         Edg IPI    all 1     -         cbe_fire
210  0xd3 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         pcplusmp`apic_error_intr

I can give you the console, but a dump device is not configured, so I can't provide
you with the core.
I can give