|
Description
|
One of our machines hung during our stress tests. Some commands
would work, but anything that required opening a device would
hang in an unkillable state.
The first thread was attempting to close a file descriptor:
stack pointer for thread ffffff051ae4e020: ffffff001f5f0cb0
[ ffffff001f5f0cb0 _resume_from_idle+0xf1() ]
ffffff001f5f0cf0 swtch+0x221()
ffffff001f5f0d20 cv_wait+0x73()
ffffff001f5f0d60 holdlwps+0xfe()
ffffff001f5f0e00 closeandsetf+0x27b()
ffffff001f5f0e40 doorfs`door_revoke+0x9f()
ffffff001f5f0eb0 doorfs`doorfs32+0xae()
ffffff001f5f0f00 sys_syscall32+0x1fc()
It successfully stopped a second thread:
stack pointer for thread ffffff069d967020: ffffff001fcd4b60
[ ffffff001fcd4b60 _resume_from_idle+0xf1() ]
ffffff001fcd4ba0 swtch+0x221()
ffffff001fcd4c10 stop+0x904()
ffffff001fcd4cd0 issig_forreal+0x20f()
ffffff001fcd4d00 issig+0x24()
ffffff001fcd4db0 doorfs`door_upcall+0x326()
ffffff001fcd4e10 doorfs`door_ki_upcall_cred+0x51()
ffffff001fcd4e50 stubs_common_code+0x51()
ffffff001fcd4f10 dls`i_dls_mgmt_upcall+0xf5()
ffffff001fcd50b0 dls`dls_mgmt_create+0xad()
ffffff001fcd5220 softmac`softmac_create+0x2ee()
ffffff001fcd5270 net_dacf`net_postattach+0x37()
ffffff001fcd52f0 dacf_op_invoke+0x1f0()
ffffff001fcd5350 dacf_process_rsrvs+0xa6()
ffffff001fcd5390 dacfc_postattach+0x3f()
ffffff001fcd53d0 postattach_node+0x3c()
ffffff001fcd5420 i_ndi_config_node+0xa2()
ffffff001fcd5450 i_ddi_attachchild+0x67()
ffffff001fcd5490 devi_attach_node+0xfd()
ffffff001fcd5560 devi_config_one+0x2c6()
ffffff001fcd55d0 ndi_devi_config_one+0xd8()
ffffff001fcd5690 devfs`dv_find+0x1e0()
ffffff001fcd5710 devfs`devfs_lookup+0x69()
ffffff001fcd57b0 fop_lookup+0xf2()
ffffff001fcd5a00 lookuppnvp+0x351()
ffffff001fcd5aa0 lookuppnat+0x125()
ffffff001fcd5b80 lookupnameat+0x82()
ffffff001fcd5d20 vn_openat+0x1c9()
ffffff001fcd5e80 copen+0x33e()
ffffff001fcd5eb0 open64+0x30()
ffffff001fcd5f00 sys_syscall32+0x1fc()
But the call to holdlwps() never completed, because a third thread
was blocked in ndi_devi_enter(), waiting for the devi node that the
second thread had entered:
stack pointer for thread ffffff051ae58a80: ffffff001f2d83e0
[ ffffff001f2d83e0 _resume_from_idle+0xf1() ]
ffffff001f2d8420 swtch+0x221()
ffffff001f2d8450 cv_wait+0x73()
ffffff001f2d8490 ndi_devi_enter+0xbe()
ffffff001f2d8560 devi_config_one+0x16d()
ffffff001f2d85d0 ndi_devi_config_one+0xd8()
ffffff001f2d8690 devfs`dv_find+0x212()
ffffff001f2d8710 devfs`devfs_lookup+0x69()
ffffff001f2d87b0 fop_lookup+0xf2()
ffffff001f2d8a00 lookuppnvp+0x351()
ffffff001f2d8aa0 lookuppnat+0x125()
ffffff001f2d8b80 lookupnameat+0xb9()
ffffff001f2d8d20 vn_openat+0x1c9()
ffffff001f2d8e80 copen+0x33e()
ffffff001f2d8eb0 open32+0x2b()
ffffff001f2d8f00 sys_syscall32+0x1fc()
At this point the process (and system) is deadlocked in
an unkillable state. What's worse, it's grabbed a system
resource (the devi structure) that it cannot release, so
the whole box needs to be rebooted.
Summary: door_revoke (A) suspended B:ffffff069d967020 while B was about to
perform a door upcall from dacf with an active ndi_devi_enter. A's
door_revoke is now waiting for C:ffffff051ae58a80 to suspend, but that
never completes because C is blocked on the ndi_devi_enter held by
suspended thread B.
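For illustration only, the three-way dependency in the summary can be written as a wait-for graph and checked for a cycle. This is a minimal sketch, not kernel code: the labels A, B, C follow the summary, `WAITS_FOR` and `find_cycle` are hypothetical names, and the B -> A edge is an inference (the report only says that A suspended B; the sketch assumes B stays stopped until A's hold completes and releases it).

```python
# Wait-for graph for the three threads in the dump. Each edge reads
# "waiter -> thread it is waiting on". Labels follow the summary.
WAITS_FOR = {
    "A": ["C"],  # holdlwps() in A cannot return until C stops
    "C": ["B"],  # C sleeps in ndi_devi_enter() on the devi B has entered
    "B": ["A"],  # assumption: B stays stopped until A releases the hold
}


def find_cycle(graph, start):
    """Return a cycle reachable from start as a list of nodes, or None."""
    path = []        # nodes on the current depth-first path, in order
    on_path = set()  # same nodes, for O(1) membership tests

    def visit(node):
        if node in on_path:
            # Reached a node already on the path: the tail of the path
            # from that node onwards is the cycle.
            return path[path.index(node):] + [node]
        on_path.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        return None

    return visit(start)


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "A"))  # ['A', 'C', 'B', 'A']
```

Removing any one edge breaks the cycle. For example, if B could not be stopped while it held the ndi_devi_enter, the B -> A edge would disappear and C would eventually get the devi.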
I think that the 'dacf->door-up-call->ISSIG' code path is new, added by Crossbow.
The A,B,C threads are all associated with /usr/lib/ak/akd, which is FW specific:
B is doing a lookup of /dev/../devices/pci@0,0/pci10de,cb84@9:nge1
C is doing a lookup of /dev/scsi/ses/../../../devices/pci@0,0/pci10de,375@f/pci1000,3150@0/ses@18,0:ses
when A's door_revoke occurs.
Reassigning to Crossbow folks, as per Chris's comments.
A minor correction here: Clearview, not Crossbow, introduced this
'dacf->door-up-call->ISSIG' code path. See more in the "comments" section.
|