Description
This bug is seen in osol and was previously tracked in Bugzilla:
http://defect.opensolaris.org/bz/show_bug.cgi?id=6630
We are now seeing it on an X8420 blade (oaf602), which has four e1000g NICs.
Loading kmdb...
SunOS Release 5.11 Version snv_111 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
[.. Hang ..]
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ scsi_vhci mac uppc neti sd ufs unix cpu_ms.AuthenticAMD.15
krtld s1394 uhci hook genunix ip usba specfs pcplusmp cpu.generic sctp arp
sockfs ]
[0]> ::ptree
fffffffffbc2c030 sched
ffffff01d3274a48 fsflush
ffffff01d32756a8 pageout
ffffff01d3276308 init
ffffff01d3270008 dlmgmtd
ffffff01d3272528 svc.configd
ffffff01d3273188 svc.startd
ffffff01d3273de8 net-physical
ffffff01d326e6b0 netstrategy
[0]> :c
According to Sean, this is also seen on an x4600 with the following configuration:
With osol_0906-109 it's still hanging at around the same place.
Further investigation suggests it could be related to the network
interfaces on this box.
Booting again gets us here:
.
..
installing namefs, module id 153.
load 'sys/portfs' id 154 loaded @ 0xfffffffff7ed6000/0xffffffffc004bfd0 size
28032/304
installing portfs, module id 154.
Booting to milestone "milestone/single-user:default".
load 'exec/intpexec' id 155 loaded @ 0xfffffffff7e659b0/0xffffffffc0040a48 size
1456/136
installing intpexec, module id 155.
load 'drv/sysevent' id 156 loaded @ 0xfffffffff7e233e8/0xffffffffc004c100 size
4448/368
installing sysevent, module id 156.
/pci@0,0/pci108e,cb84@2/storage@4/disk@0,0 (sd0) online
At this point the last process running was netstrategy:
[4]> ::ptree
fffffffffbc2ba70 sched
ffffff08ef051a48 fsflush
ffffff08ef0526a8 pageout
ffffff08ef053308 init
ffffff08ef04dc68 dlmgmtd
ffffff08ef050de8 svc.configd
ffffff08ef050188 svc.startd
ffffff08ef04b6b0 net-physical
ffffff08ef04aa50 netstrategy
[4]> ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 0 0 0 0 0 0x00000001 fffffffffbc2ba70 sched
R 3 0 0 0 0 0x00020001 ffffff08ef051a48 fsflush
R 2 0 0 0 0 0x00020001 ffffff08ef0526a8 pageout
R 1 0 0 0 0 0x4a004000 ffffff08ef053308 init
R 16 1 16 16 15 0x42000000 ffffff08ef04dc68 dlmgmtd
R 9 1 9 9 0 0x42000000 ffffff08ef050de8 svc.configd
R 7 1 7 7 0 0x42000000 ffffff08ef050188 svc.startd
R 17 7 7 7 0 0x42014000 ffffff08ef04b6b0 net-physical
R 19 17 7 7 0 0x4a004000 ffffff08ef04aa50 netstrategy
And netstrategy seems to be waiting for a NIC to come back:
[4]> 0t19::pid2proc | ::walk thread | ::findstack
stack pointer for thread ffffff08ef6b1a80: ffffff003ca26520
[ ffffff003ca26520 _resume_from_idle+0xf1() ]
ffffff003ca26550 swtch+0x160()
ffffff003ca265b0 cv_wait_sig+0x14b()
ffffff003ca26610 str_cv_wait+0xbc()
ffffff003ca266c0 strwaitq+0x1fe()
ffffff003ca267d0 kstrgetmsg+0x3dc()
ffffff003ca26820 ldi_getmsg+0x9b()
ffffff003ca268b0 dl_op+0x63()
ffffff003ca26910 dl_bind+0x8f()
ffffff003ca26970 strplumb`getmacaddr+0xec()
ffffff003ca269c0 strplumb`matchmac+0x87()
ffffff003ca26a30 walk_devs+0x4f()
ffffff003ca26aa0 walk_devs+0xff()
[4]>
[4]> ffffff003ca26970-10
0xffffff003ca26960: 0xffffff003ca26988 0xffffff08e87f7138
0xffffff003ca269c0 strplumb`matchmac+0x87
[4]> 0xffffff08e87f7138 ::whatis
ffffff08e87f7138 is ffffff08e87f7138+0, allocated from dev_info_node_cache
[4]> 0xffffff08e87f7138 ::print -t struct dev_info
{
struct dev_info *devi_parent = 0xffffff08e0a44ae0
struct dev_info *devi_child = 0
struct dev_info *devi_sibling = 0xffffff08e87f6ec8
char *devi_binding_name = 0xffffff08e0bf4ac5 "pciex8086,105e"
char *devi_addr = 0xffffff08ea30ee00 "0"
int devi_nodeid = 0x3a
int devi_instance = 0
struct dev_ops *devi_ops = e1000g`ws_ops
void *devi_parent_data = 0xffffff08e8c3a000
void *devi_driver_data = 0xffffff08e0bb6000
ddi_prop_t *devi_drv_prop_ptr = 0xffffff08ea43d5f8
ddi_prop_t *devi_sys_prop_ptr = 0
struct ddi_minor_data *devi_minor = 0xffffff08e8c53380
struct dev_info *devi_next = 0xffffff08e87f6ec8
kmutex_t devi_lock = {
void *[1] _opaque = [ 0 ]
}
.
.
.
So it's waiting for a response from an e1000g NIC.
This x4600 has 8 x e1000g, 1 x ixgb and 2 x nxge NICs in it; it's a heavy
networking rig:
dladm show-phys from snv_108:
LINK MEDIA STATE SPEED DUPLEX DEVICE
e1000g4 Ethernet up 1000 full e1000g4
nxge0 Ethernet up 10000 full nxge0
e1000g1 Ethernet up 1000 full e1000g1
e1000g5 Ethernet up 1000 full e1000g5
e1000g0 Ethernet up 1000 full e1000g0
ixgb0 Ethernet up 10000 full ixgb0
e1000g2 Ethernet up 1000 full e1000g2
e1000g6 Ethernet up 1000 full e1000g6
e1000g3 Ethernet up 1000 full e1000g3
e1000g7 Ethernet up 1000 full e1000g7
nxge1 Ethernet unknown 0 unknown nxge1
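The NIC counts quoted above can be cross-checked against the dladm show-phys listing. The following is a minimal sketch in Python; the helper name count_by_driver is hypothetical, and it assumes the driver name is the DEVICE column with its trailing instance number stripped.

```python
import re
from collections import Counter

# dladm show-phys output from snv_108, copied from the listing above.
DLADM_SHOW_PHYS = """\
LINK MEDIA STATE SPEED DUPLEX DEVICE
e1000g4 Ethernet up 1000 full e1000g4
nxge0 Ethernet up 10000 full nxge0
e1000g1 Ethernet up 1000 full e1000g1
e1000g5 Ethernet up 1000 full e1000g5
e1000g0 Ethernet up 1000 full e1000g0
ixgb0 Ethernet up 10000 full ixgb0
e1000g2 Ethernet up 1000 full e1000g2
e1000g6 Ethernet up 1000 full e1000g6
e1000g3 Ethernet up 1000 full e1000g3
e1000g7 Ethernet up 1000 full e1000g7
nxge1 Ethernet unknown 0 unknown nxge1
"""


def count_by_driver(text):
    """Count physical links per driver (hypothetical helper).

    Assumes six whitespace-separated columns and that the driver name
    is the DEVICE field without its trailing instance number.
    """
    counts = Counter()
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore anything that is not a data row
        driver = re.sub(r"\d+$", "", fields[5])
        counts[driver] += 1
    return counts


nic_counts = count_by_driver(DLADM_SHOW_PHYS)
for driver, count in sorted(nic_counts.items()):
    print(driver, count)
```

Run against the listing, the tally agrees with the 8 x e1000g, 1 x ixgb and 2 x nxge stated above.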
Comments
Seeing this now on three machines: x4600, X8420 and a SuperMicro x86 box.
All three boxes have e1000g NICs.
The SuperMicro box could previously install osol_0906-109 fine.
On the X8420 (oaf602) it hung during a net boot of snv_112.
From ::stacks -m e1000g, one of the NICs looks to be hung here
(or in some strange interrupt loop?):
[0]> ffffff000801fc60 ::findstack -v
dblk_lastfree+0x70(ffffff01f1436220, ffffff01f1433cc0)
freemsg+0x84(ffffff01f1436220)
freemsgchain+0x21(ffffff01d1bcdc60)
mac`mac_rx+0x206(ffffff01cb16ea98, 0, ffffff01d1bcdc60)
mac`mac_rx_ring+0x4c(ffffff01cb16ea98, 0, ffffff01d1bcdc60, 1
e1000g`e1000g_intr_pciexpress+0x17e(fffffffffb828184)
0x36fb89d12b()
dispatch_hardint+0x41(36, 2)
switch_sp_and_call+0x13()
0xffffff01ce60a580()
[0]>
But there's nothing blocking this thread:
[0]> ffffff000801fc60 ::thread -b
ADDR WCHAN TS PITS SOBJ OPS
ffffff000801fc60 0 ffffff01d34923e0 0 0
[0]>
More debug output below:
^[kmdb: target stopped at:
kmdb_enter+0xb: movq %rax,%rdi
[0]> ::ptree
fffffffffbc2c370 sched
ffffff01d3252a48 fsflush
ffffff01d32536a8 pageout
ffffff01d3254308 init
ffffff01d324adf0 devfsadm
ffffff01d324ec68 dlmgmtd
ffffff01d3251de8 svc.configd
ffffff01d3251188 svc.startd
ffffff01d3250528 install-discover
ffffff01d324d310 cut
ffffff01d3243df8 netstrategy
ffffff01d324e008 dial
[0]> ffffff01d3243df8 ::walk thread | ::findstack -v
stack pointer for thread ffffff01d37193c0: ffffff0008313520
[ ffffff0008313520 _resume_from_idle+0xf1() ]
ffffff0008313550 swtch+0x147()
ffffff00083135b0 cv_wait_sig+0x14b(ffffff01f13eadd2, ffffff01f143de28)
ffffff0008313610 str_cv_wait+0xbc(ffffff01f13eadd2, ffffff01f143de28,
ffffffffffffffff, 0)
ffffff00083136c0 strwaitq+0x1fe(ffffff01f143dda8, 8, 0, 0, ffffffffffffffff,
ffffff000831376c)
ffffff00083137d0 kstrgetmsg+0x3dc(ffffff01f13ef080, ffffff0008313848, 0,
ffffff00083137f7, ffffff00083137f0, ffffffffffffffff, ffffff00083137f8)
ffffff0008313820 ldi_getmsg+0x9b(ffffff01e3d57018, ffffff0008313848, 0)
ffffff00083138b0 dl_op+0x63(ffffff01e3d57018, ffffff00083138c8, 4, 18, 0, 0)
ffffff0008313910 dl_bind+0x8f(ffffff01e3d57018, 800, 0)
ffffff0008313970 strplumb`getmacaddr+0xec(ffffff01cdb977b0, ffffff0008313988)
ffffff00083139c0 strplumb`matchmac+0x87(ffffff01cdb977b0, ffffff0008313bc8)
ffffff0008313a30 walk_devs+0x4f(ffffff01cdb977b0, fffffffff79b4190,
ffffff0008313bc8, 1)
ffffff0008313aa0 walk_devs+0xff(ffffff01cd278508, fffffffff79b4190,
ffffff0008313bc8, 1)
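Two of the raw arguments in the verbose stack above are worth decoding. This is a small Python sketch under stated assumptions: that the third argument of str_cv_wait() is a signed 64-bit timeout, and that the second argument of dl_bind() is the SAP. Neither signature is shown in this report, and as_signed64 is a hypothetical helper name.

```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value & (1 << 63) else value


# Third argument of str_cv_wait() as printed by ::findstack -v.
# ASSUMPTION: this is the timeout parameter.
timeout = as_signed64(0xffffffffffffffff)

# Second argument of dl_bind() as printed by ::findstack -v (hex 800).
# ASSUMPTION: this is the SAP being bound.
sap = 0x800

print("timeout =", timeout)
print("sap = 0x%04x" % sap)
```

If those assumptions hold, a timeout of -1 would be consistent with an unbounded wait (the thread is sitting in cv_wait_sig rather than a timed variant), and 0x0800 is the IPv4 EtherType, so netstrategy would be blocked indefinitely waiting for the reply to its bind request on e1000g instance 0.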
[0]> ffffff01cdb977b0::whatis
ffffff01cdb977b0 is ffffff01cdb977b0+0, allocated from dev_info_node_cache
[0]> ffffff01cdb977b0::devinfo
ffffff01cdb977b0 pciex8086,105e, instance #0 (driver name: e1000g)
Driver properties at ffffff01ce667210:
name='fm-accchk-capable' type=any items=0
name='fm-dmachk-capable' type=any items=0
name='fm-errcb-capable' type=any items=0
name='fm-ereport-capable' type=any items=0
Hardware properties at ffffff01ce667120:
name='pci-msi-capid-pointer' type=int items=1
value=000000d0
name='acpi-namespace' type=string items=1
value='\_SB_.PCI0.P0PE.S1F0'
name='assigned-addresses' type=int items=15
value=820f0010.00000000.8db80000.00000000.00020000.820f0014.0000
0000.8db60000.00000000.00020000.810f0018.00000000.0000b800.0000
0000.00000020
name='reg' type=int items=20
value=000f0000.00000000.00000000.00000000.00000000.020f0010.0000
0000.00000000.00000000.00020000.020f0014.00000000.00000000.0000
0000.00020000.010f0018.00000000.00000000.00000000.00000020
name='compatible' type=string items=13
value='pciex8086,105e.8086.105e.6' + 'pciex8086,105e.8086.105e
' + 'pciex8086,105e.6' + 'pciex8086,105e' + 'pciexclass,020000
' + 'pciexclass,0200' + 'pci8086,105e.8086.105e.6' + '
...
...
[0]> ffffff01cdb977b0 ::print -t struct dev_info
{
struct dev_info *devi_parent = 0xffffff01cd274010
struct dev_info *devi_child = 0
struct dev_info *devi_sibling = 0xffffff01cdb97530
char *devi_binding_name = 0xffffff01caee1385 "pciex8086,105e"
char *devi_addr = 0xffffff01ce33a600 "0"
int devi_nodeid = 0x21
int devi_instance = 0
struct dev_ops *devi_ops = e1000g`ws_ops
void *devi_parent_data = 0xffffff01cdf8f440
void *devi_driver_data = 0xffffff01cc3d4000
ddi_prop_t *devi_drv_prop_ptr = 0xffffff01ce667210
ddi_prop_t *devi_sys_prop_ptr = 0
struct ddi_minor_data *devi_minor = 0xffffff01cdf884b8
struct dev_info *devi_next = 0xffffff01cdb97530
kmutex_t devi_lock = {
void *[1] _opaque = [ 0 ]
}
.....
[0]> 0xffffff01cc3d4000::print -t e1000g_t
{
int instance = 0
dev_info_t *dip = 0xffffff01cdb977b0
dev_info_t *priv_dip = 0xffffff01ce54d800
private_devi_list_t *priv_devi_node = 0xffffff01ce65e7f8
mac_handle_t mh = 0xffffff01cb16f8c8
mac_resource_handle_t mrh = 0
struct e1000_hw shared = {
void *back = 0xffffff01cc3d8330
u8 *hw_addr = 0xffffff0186f1c000
u8 *flash_address = 0
unsigned long io_base = 0xb800
struct e1000_mac_info mac = {
struct e1000_mac_operations ops = {
int (*)() init_params = e1000g`e1000_init_mac_params_82571
int (*)() blink_led = e1000g`e1000_blink_led_generic
int (*)() check_for_link =
e1000g`e1000_check_for_copper_link_generic
int (*)() check_mng_mode = e1000g`e1000_check_mng_mode_generic
int (*)() cleanup_led = e1000g`e1000_cleanup_led_generic
int (*)() clear_hw_cntrs = e1000g`e1000_clear_hw_cntrs_82571
int (*)() clear_vfta = e1000g`e1000_clear_vfta_82571
int (*)() get_bus_info = e1000g`e1000_get_bus_info_pcie_generic
....
[0]> ::stacks -m e1000g
THREAD STATE SOBJ COUNT
ffffff000801fc60 FREE <NONE> 1
dblk_lastfree+0x70
freemsg+0x84
freemsgchain+0x21
mac`mac_rx+0x206
mac`mac_rx_ring+0x4c
e1000g`e1000g_intr_pciexpress+0x17e
0x36fb89d12b
dispatch_hardint+0x41
ffffff0007f6bc60 FREE <NONE> 1
e1000g`e1000g_check_dma_handle+0x1e
e1000g`e1000g_receive+0x6c
e1000g`e1000g_intr_pciexpress+0x159
0x33fb89d12b
dispatch_hardint+0x41
ffffff0008025c60 FREE <NONE> 1
e1000g`e1000g_check_dma_handle+0x1e
e1000g`e1000g_receive+0x6c
e1000g`e1000g_intr_pciexpress+0x159
0x36fb89d12b
dispatch_hardint+0x41
[0]>
[0]> ffffff000801fc60::findstack -v
stack pointer for thread ffffff000801fc60 (TS_FREE): ffffff000801fa40
ffffff000801fa70 dblk_lastfree+0x70(ffffff01f1436220, ffffff01f1433cc0)
ffffff000801faa0 freemsg+0x84(ffffff01f1436220)
ffffff000801fac0 freemsgchain+0x21(ffffff01d1bcdc60)
ffffff000801fb10 mac`mac_rx+0x206(ffffff01cb16ea98, 0, ffffff01d1bcdc60)
ffffff000801fb50 mac`mac_rx_ring+0x4c(ffffff01cb16ea98, 0, ffffff01d1bcdc60, 1
)
ffffff000801fbb0 e1000g`e1000g_intr_pciexpress+0x17e(fffffffffb828184)
ffffff000801fc00 0x36fb89d12b()
ffffff000801fc40 dispatch_hardint+0x41(36, 2)
ffffff0008025ad0 switch_sp_and_call+0x13()
ffffff01ce60a580 0xffffff01ce60a580()
[0]>
[0]> ffffff000801fc60 ::thread -b
ADDR WCHAN TS PITS SOBJ OPS
ffffff000801fc60 0 ffffff01d34923e0 0 0
[0]>
Pegasus+ (Sun Blade X6440) with igb hits the same problem, so it's not e1000g-specific.
Petr, are you saying that you see the same kernel stack for a hung netstrategy process?
David, yep.
[7]> ::ptree
fffffffffbc2c370 sched
ffffff04eae5da48 fsflush
ffffff04eae5e6a8 pageout
ffffff04eae5f308 init
ffffff04eae538d0 devfsadm
ffffff04eae59008 dlmgmtd
ffffff04eae5a8c8 svc.configd
ffffff04eae5c188 svc.startd
ffffff04eae58310 install-discover
ffffff04eae56a50 cut
ffffff04eae51318 netstrategy
ffffff04eae576b0 dial
[7]> ::threadlist
ADDR PROC LWP CMD/LWPID
ffffff04eb05a3a0 ffffff04eae51318 ffffff04eaf10b50 netstrategy/1
[7]> ffffff04eb05a3a0::findstack
stack pointer for thread ffffff04eb05a3a0: ffffff001f910520
[ ffffff001f910520 _resume_from_idle+0xf1() ]
ffffff001f910550 swtch+0x147()
ffffff001f9105b0 cv_wait_sig+0x14b()
ffffff001f910610 str_cv_wait+0xbc()
ffffff001f9106c0 strwaitq+0x1fe()
ffffff001f9107d0 kstrgetmsg+0x3dc()
ffffff001f910820 ldi_getmsg+0x9b()
ffffff001f9108b0 dl_op+0x63()
ffffff001f910910 dl_bind+0x8f()
ffffff001f910970 strplumb`getmacaddr+0xec()
ffffff001f9109c0 strplumb`matchmac+0x87()
ffffff001f910a30 walk_devs+0x4f()
ffffff001f910aa0 walk_devs+0xff()
[7]>
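The "same kernel stack" question can be checked mechanically by reducing each ::findstack output to its list of function names, ignoring frame addresses and instruction offsets (which differ between builds, for example swtch+0x160 versus swtch+0x147). A minimal sketch in Python follows; stack_signature is a hypothetical helper, and the two inputs are the x4600 and X6440 stacks copied from this report.

```python
import re

# Matches "module`function+0xNN(" or "function+0xNN(" in a ::findstack line.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*`)?([A-Za-z_]\w*)\+0x[0-9a-f]+\(")

X4600_STACK = """\
stack pointer for thread ffffff08ef6b1a80: ffffff003ca26520
[ ffffff003ca26520 _resume_from_idle+0xf1() ]
ffffff003ca26550 swtch+0x160()
ffffff003ca265b0 cv_wait_sig+0x14b()
ffffff003ca26610 str_cv_wait+0xbc()
ffffff003ca266c0 strwaitq+0x1fe()
ffffff003ca267d0 kstrgetmsg+0x3dc()
ffffff003ca26820 ldi_getmsg+0x9b()
ffffff003ca268b0 dl_op+0x63()
ffffff003ca26910 dl_bind+0x8f()
ffffff003ca26970 strplumb`getmacaddr+0xec()
ffffff003ca269c0 strplumb`matchmac+0x87()
ffffff003ca26a30 walk_devs+0x4f()
ffffff003ca26aa0 walk_devs+0xff()
"""

X6440_STACK = """\
stack pointer for thread ffffff04eb05a3a0: ffffff001f910520
[ ffffff001f910520 _resume_from_idle+0xf1() ]
ffffff001f910550 swtch+0x147()
ffffff001f9105b0 cv_wait_sig+0x14b()
ffffff001f910610 str_cv_wait+0xbc()
ffffff001f9106c0 strwaitq+0x1fe()
ffffff001f9107d0 kstrgetmsg+0x3dc()
ffffff001f910820 ldi_getmsg+0x9b()
ffffff001f9108b0 dl_op+0x63()
ffffff001f910910 dl_bind+0x8f()
ffffff001f910970 strplumb`getmacaddr+0xec()
ffffff001f9109c0 strplumb`matchmac+0x87()
ffffff001f910a30 walk_devs+0x4f()
ffffff001f910aa0 walk_devs+0xff()
"""


def stack_signature(findstack_text):
    """Return the ordered function names of a stack, without offsets."""
    signature = []
    for line in findstack_text.splitlines():
        match = FRAME_RE.search(line)
        if match:  # the "stack pointer for thread" header has no frame
            signature.append((match.group(1) or "") + match.group(2))
    return signature


same_stack = stack_signature(X4600_STACK) == stack_signature(X6440_STACK)
print("same call path:", same_stack)
```

Comparing names only is a deliberate simplification: it shows that both machines block on the same call path through dl_bind, not that the kernels are identical.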
[7]> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
4 0xb0 12 ISA Edg Fixed 7 1 0x0/0x4 asy`asyintr
9 0x81 9 PCI Lvl Fixed 1 1 0x0/0x9 acpica`acpi_wrapper_isr
21 0x84 9 PCI Lvl Fixed 8 1 0x0/0x15 ehci`ehci_intr
22 0x85 9 PCI Lvl Fixed 9 1 0x0/0x16 ohci`ohci_intr
48 0x82 7 PCI Edg MSI 2 1 - pcie_pci`pepb_intr_handler
49 0x83 7 PCI Edg MSI 2 1 - pcie_pci`pepb_intr_handler
50 0x60 6 PCI Edg MSI-X 3 1 - igb`igb_intr_tx_other
51 0x61 6 PCI Edg MSI-X 4 1 - igb`igb_intr_rx
52 0x62 6 PCI Edg MSI-X 4 1 - igb`igb_intr_tx_other
53 0x63 6 PCI Edg MSI-X 6 1 - igb`igb_intr_rx
54 0x30 4 PCI Edg MSI 10 1 - pcie_pci`pepb_intr_handler
55 0x31 4 PCI Edg MSI 10 1 - pcie_pci`pepb_intr_handler
56 0x86 7 PCI Edg MSI 11 1 - pcie_pci`pepb_intr_handler
57 0x87 7 PCI Edg MSI 11 1 - pcie_pci`pepb_intr_handler
58 0x32 4 PCI Edg MSI 12 1 - pcie_pci`pepb_intr_handler
59 0x33 4 PCI Edg MSI 12 1 - pcie_pci`pepb_intr_handler
60 0x88 7 PCI Edg MSI 13 1 - pcie_pci`pepb_intr_handler
61 0x89 7 PCI Edg MSI 13 1 - pcie_pci`pepb_intr_handler
62 0x40 5 PCI Edg MSI 0 1 - emlxs`emlxs_sli3_msi_intr
63 0x41 5 PCI Edg MSI 0 1 - emlxs`emlxs_sli3_msi_intr
64 0x42 5 PCI Edg MSI 1 1 - emlxs`emlxs_sli3_msi_intr
65 0x43 5 PCI Edg MSI 1 1 - emlxs`emlxs_sli3_msi_intr
160 0xa0 0 Edg IPI all 0 - poke_cpu
192 0xc0 13 Edg IPI all 1 - xc_serv
208 0xd0 14 Edg IPI all 1 - kcpc_hw_overflow_intr
209 0xd1 14 Edg IPI all 1 - cbe_fire
210 0xd3 14 Edg IPI all 1 - cbe_fire
240 0xe0 15 Edg IPI all 1 - xc_serv
241 0xe1 15 Edg IPI all 1 - pcplusmp`apic_error_intr
I can give you the console, but a dump device is not configured, so I can't provide you
with the core.