OpenSolaris

Bug ID 6696737
Synopsis ndi_devi_enter() can deadlock with stopped threads, hanging system
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:gld
Keywords
Responsible Engineer Cathy Zhou
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_97
Fixed In snv_97
Release Fixed solaris_nevada(snv_97)
Related Bugs 6548905, 6577618, 6787558
Submit Date 1-May-2008
Last Update Date 13-October-2008
Description
One of our machines hung during our stress tests.  Some commands
would work, but anything that required opening a device would
hang in an unkillable state.

The first thread was attempting to close a file descriptor:

stack pointer for thread ffffff051ae4e020: ffffff001f5f0cb0
[ ffffff001f5f0cb0 _resume_from_idle+0xf1() ]
  ffffff001f5f0cf0 swtch+0x221()
  ffffff001f5f0d20 cv_wait+0x73()
  ffffff001f5f0d60 holdlwps+0xfe()
  ffffff001f5f0e00 closeandsetf+0x27b()
  ffffff001f5f0e40 doorfs`door_revoke+0x9f()
  ffffff001f5f0eb0 doorfs`doorfs32+0xae()
  ffffff001f5f0f00 sys_syscall32+0x1fc()

It successfully stopped a second thread:

stack pointer for thread ffffff069d967020: ffffff001fcd4b60
[ ffffff001fcd4b60 _resume_from_idle+0xf1() ]
  ffffff001fcd4ba0 swtch+0x221()
  ffffff001fcd4c10 stop+0x904()
  ffffff001fcd4cd0 issig_forreal+0x20f()
  ffffff001fcd4d00 issig+0x24()
  ffffff001fcd4db0 doorfs`door_upcall+0x326()
  ffffff001fcd4e10 doorfs`door_ki_upcall_cred+0x51()
  ffffff001fcd4e50 stubs_common_code+0x51()
  ffffff001fcd4f10 dls`i_dls_mgmt_upcall+0xf5()
  ffffff001fcd50b0 dls`dls_mgmt_create+0xad()
  ffffff001fcd5220 softmac`softmac_create+0x2ee()
  ffffff001fcd5270 net_dacf`net_postattach+0x37()
  ffffff001fcd52f0 dacf_op_invoke+0x1f0()
  ffffff001fcd5350 dacf_process_rsrvs+0xa6()
  ffffff001fcd5390 dacfc_postattach+0x3f()
  ffffff001fcd53d0 postattach_node+0x3c()
  ffffff001fcd5420 i_ndi_config_node+0xa2()
  ffffff001fcd5450 i_ddi_attachchild+0x67()
  ffffff001fcd5490 devi_attach_node+0xfd()
  ffffff001fcd5560 devi_config_one+0x2c6()
  ffffff001fcd55d0 ndi_devi_config_one+0xd8()
  ffffff001fcd5690 devfs`dv_find+0x1e0()
  ffffff001fcd5710 devfs`devfs_lookup+0x69()
  ffffff001fcd57b0 fop_lookup+0xf2()
  ffffff001fcd5a00 lookuppnvp+0x351()
  ffffff001fcd5aa0 lookuppnat+0x125()
  ffffff001fcd5b80 lookupnameat+0x82()
  ffffff001fcd5d20 vn_openat+0x1c9()
  ffffff001fcd5e80 copen+0x33e()
  ffffff001fcd5eb0 open64+0x30()
  ffffff001fcd5f00 sys_syscall32+0x1fc()

But the call to holdlwps() never completed, because a third
thread was blocked in ndi_devi_enter(), waiting for the devi
held by the (now stopped) second thread:

stack pointer for thread ffffff051ae58a80: ffffff001f2d83e0
[ ffffff001f2d83e0 _resume_from_idle+0xf1() ]
  ffffff001f2d8420 swtch+0x221()
  ffffff001f2d8450 cv_wait+0x73()
  ffffff001f2d8490 ndi_devi_enter+0xbe()
  ffffff001f2d8560 devi_config_one+0x16d()
  ffffff001f2d85d0 ndi_devi_config_one+0xd8()
  ffffff001f2d8690 devfs`dv_find+0x212()
  ffffff001f2d8710 devfs`devfs_lookup+0x69()
  ffffff001f2d87b0 fop_lookup+0xf2()
  ffffff001f2d8a00 lookuppnvp+0x351()
  ffffff001f2d8aa0 lookuppnat+0x125()
  ffffff001f2d8b80 lookupnameat+0xb9()
  ffffff001f2d8d20 vn_openat+0x1c9()
  ffffff001f2d8e80 copen+0x33e()
  ffffff001f2d8eb0 open32+0x2b()
  ffffff001f2d8f00 sys_syscall32+0x1fc()

At this point the process (and system) is deadlocked in
an unkillable state.  What's worse, it's grabbed a system
resource (the devi structure) that it cannot release, so
the whole box needs to be rebooted.
Summary: door_revoke (A) suspended B:ffffff069d967020 while B was about to
perform a door upcall from dacf with an active ndi_devi_enter.  A's
door_revoke is now waiting for C:ffffff051ae58a80 to suspend, but that
never completes because C is blocked in ndi_devi_enter on the devi held
further down the stack of suspended thread B.
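
The circular wait described in the summary can be sketched as a small
wait-for graph. This is only an illustrative model, not kernel code: the
labels A, B and C come from the summary, while the dictionary layout and the
find_cycle() helper are hypothetical names introduced here. The B -> A edge
assumes that B stays stopped until A's holdlwps() completes and A resumes it.

```python
# Illustrative model only; not kernel code. Thread labels follow the summary.
# Each entry maps a waiting thread to (thread it waits on, reason).
waits_for = {
    "A": ("C", "holdlwps() waits for every other lwp to stop"),
    "C": ("B", "ndi_devi_enter() waits for the devi held by B"),
    # Assumption: B remains stopped until A finishes and resumes it.
    "B": ("A", "stopped in stop() until A's holdlwps() caller resumes it"),
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = {}  # thread -> index in path where it was first visited
    node = start
    while node in graph:
        if node in seen:
            # Reached a thread already on the path: that closes the cycle.
            return path[seen[node]:] + [node]
        seen[node] = len(path)
        path.append(node)
        node = graph[node][0]
    return None  # chain ended at a thread that is not waiting on anyone


cycle = find_cycle(waits_for, "A")
print(" -> ".join(cycle))
for thread in cycle[:-1]:
    print(thread, "waits:", waits_for[thread][1])
```

Removing any one edge breaks the cycle. For example, if B could not be
stopped while holding the devi, the chain would end at B and find_cycle()
would return None.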

I think that the 'dacf->door-up-call->ISSIG' code path is new, added by crossbow.

The A,B,C threads are all associated with /usr/lib/ak/akd, which is FW specific:
 B is doing a lookup of /dev/../devices/pci@0,0/pci10de,cb84@9:nge1
 C is doing a lookup of /dev/scsi/ses/../../../devices/pci@0,0/pci10de,375@f/pci1000,3150@0/ses@18,0:ses
when A's door_revoke occurs.
Reassigning to Crossbow folks, as per Chris's comments.
A minor correction here: Clearview, not Crossbow, introduced this
'dacf->door-up-call->ISSIG' code path. See more in the "comments" section.
Work Around
N/A
Comments
N/A