OpenSolaris

Bug ID 6789874
Synopsis ipnet_nicevent_cb() may call taskq_dispatch() on a bogus taskq
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:tcp-ip
Keywords
Responsible Engineer Peter Memishian
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_107
Fixed In snv_107
Release Fixed solaris_nevada(snv_107)
Related Bugs 6794035, 4085089
Submit Date 3-January-2009
Last Update Date 28-January-2009
Description
During Clearview IPMP stress testing, we hit the following panic: 
 
  panic[cpu1]/thread=ffffff000783fc60:  
  assertion failed: tq != NULL, file: ../../common/os/taskq.c, line: 832 
   
    > $c 
    vpanic() 
    assfail+0x7e(fffffffffbf5e6d8, fffffffffbf5e830, 340) 
    taskq_dispatch+0x4f1(0, fffffffff8077050, ffffff0208389568, 1) 
    ddi_taskq_dispatch+0x25(0, fffffffff8077050, ffffff0208389568, 1) 
    ipnet_nicevent_cb+0xa3(ffffff01cf9566c0, ffffff01d51066e8, ... 
    hook_run+0xa3(ffffff01d1bfcb80, ffffff01cf9566c0, ffffff01d51066e8) 
    ip_ne_queue_func+0x5f(ffffff01d51066e0) 
    taskq_thread+0x1b5(ffffff01cfa8ed20) 
    thread_start+8() 
 
It seems the above thread attempted to call ddi_taskq_dispatch() with a 
NULL taskq (first argument), which isn't surprising given that another 
thread is still creating the taskq that ipnet_nicevent_cb() is trying to 
dispatch to: 
 
  stack pointer for thread ffffff01fac45020: ffffff000863ecc0 
    ffffff000863ed40 plcnt_inc_dec+0x12e(ffffff0002cedb48, 1, 0, ... 
    ffffff000863ede0 page_ctr_sub_internal+0xa9(fffffffffb8278a5, 1, ... 
    ffffff000863ee30 cpus() 
    ffffff000863ee90 do_interrupt+0x120(ffffff000863eea0, fffffffffbc196f8) 
    ffffff000863eea0 _interrupt+0x1ec() 
    ffffff000863efb0 mutex_enter+0x10() 
    ffffff000863eff0 hment_mapcnt+0x1d(ffffff0002cedb48) 
    ffffff000863f010 hat_page_getshare+0x16(ffffff0002cedb48) 
    ffffff000863f0e0 page_create_va+0x3e5(fffffffffbc3bf00, ... 
    ffffff000863f1c0 segkp_get_internal+0x59a(fffffffffbc3d160, 5000, e, ... 
    ffffff000863f210 segkp_cache_get+0xe7(1) 
    ffffff000863f2a0 thread_create+0x104(0, 0, fffffffffbe78980, ... 
    ffffff000863f350 taskq_create_common+0x251(fffffffff8077ec8, 0, 1, 3c, 
    ffffff000863f3e0 taskq_create_instance+0x73(fffffffff8077ec8, 0, 1, 3c, 
->  ffffff000863f470 ddi_taskq_create+0xaf(0, fffffffff8077ec8, 1, ffffffff, 
->  ffffff000863f490 ipnet`_init+0x78() 
    ffffff000863f4c0 modinstall+0x115(ffffff01fbba6178) 
    ffffff000863f4f0 mod_hold_stub+0x12b(fffffffffbc0ef68) 
    ffffff000863f540 stubs_common_code+0x1f() 
    ffffff000863f560 devipnet_validate+0x21(ffffff0206104a10) 
    ffffff000863f6b0 devname_lookup_func+0x49a(ffffff01d1ed08c8, ... 
    ffffff000863f710 devipnet_lookup+0x4e(ffffff01d1ece600, 
    ffffff000863f7b0 fop_lookup+0xed(ffffff01d1ece600, ffffff000863f870, 
    ffffff000863f9f0 lookuppnvp+0x3a3(ffffff000863fab0, 0, 1, 0, 
    ffffff000863fa90 lookuppnat+0x12c(ffffff000863fab0, 0, 1, 0, 
    ffffff000863fb70 lookupnameat+0x91(80467a0, 0, 1, 0, ffffff000863fbf0, 
    ffffff000863fd20 vn_openat+0x235(80467a0, 0, 3, 6cc, ffffff000863fd68, 
    ffffff000863fe80 copen+0x418(ffd19553, 80467a0, 3, fedf46cc) 
    ffffff000863feb0 open32+0x2f(80467a0, 2, fedf46cc) 
    ffffff000863ff00 sys_syscall32+0x1fc() 
 
Looking at ipnet`_init(), the problem is clear: we call netstack_register() (which 
indirectly registers ipnet_nicevent_cb() to be called back) before we create the 
taskqs: 

        netstack_register(NS_IPNET, ipnet_stack_init, NULL, ipnet_stack_fini); 
        /* 
         * We call ddi_taskq_create() with nthread == 1 to ensure in-order 
         * delivery of packets to clients. 
         */ 
        ipnet_taskq = ddi_taskq_create(NULL, "ipnet", 1, TASKQ_DEFAULTPRI, 0); 
        ipnet_nicevent_taskq = ddi_taskq_create(NULL, "ipnet_nic_event_queue", 
            1, TASKQ_DEFAULTPRI, 0); 
 
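Clearly, the order needs to be reversed: the taskqs must exist before 
netstack_register() makes ipnet_nicevent_cb() reachable.  There's a similar 
problem in _fini(), which destroys the taskqs before unregistering: 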
 
        ddi_taskq_destroy(ipnet_nicevent_taskq); 
        ddi_taskq_destroy(ipnet_taskq); 
        netstack_unregister(NS_IPNET); 
 
... and indeed, we are aware of at least one instance of an infinite loop in 
ddi_taskq_destroy() because a message was dispatched while the taskq was in the 
process of being destroyed (this also occurred during IPMP stress testing).
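
The ordering problem described above can be illustrated with a small userland
model. This is only a sketch, not the delivered fix: every model_* name is
hypothetical, nothing here is a real kernel or DDI interface, and the _fini()
failure is simplified to a dispatch on a NULL taskq, whereas the report
describes an infinite loop in ddi_taskq_destroy(). A single event is injected
between the two steps of each routine to stand in for the race window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; none of these are real kernel APIs. */
typedef struct model_taskq {
	int ntasks;			/* tasks successfully dispatched */
} model_taskq_t;

static model_taskq_t model_storage;
static model_taskq_t *model_nicevent_taskq;	/* NULL until "created" */
static int model_registered;		/* is the callback reachable? */
static int model_bad_dispatches;	/* dispatches to a NULL taskq */

/* Models a NIC event; it is only delivered while the callback is registered. */
static void
model_nic_event(void)
{
	if (!model_registered)
		return;
	if (model_nicevent_taskq == NULL) {
		/* The kernel would trip the tq != NULL assertion here. */
		model_bad_dispatches++;
		return;
	}
	model_nicevent_taskq->ntasks++;
}

/* Models _init(); one event arrives between the two steps. */
static void
model_init(int fixed_order)
{
	if (fixed_order) {
		model_nicevent_taskq = &model_storage;	/* create first */
		model_nic_event();
		model_registered = 1;			/* then register */
	} else {
		model_registered = 1;			/* register first */
		model_nic_event();
		model_nicevent_taskq = &model_storage;	/* create too late */
	}
}

/* Models _fini(); one event arrives between the two steps. */
static void
model_fini(int fixed_order)
{
	if (fixed_order) {
		model_registered = 0;		/* unregister first */
		model_nic_event();
		model_nicevent_taskq = NULL;	/* then destroy */
	} else {
		model_nicevent_taskq = NULL;	/* destroy first */
		model_nic_event();
		model_registered = 0;		/* unregister too late */
	}
}

/* Runs one init/fini cycle and returns the number of bad dispatches. */
int
model_run(int fixed_order)
{
	assert(fixed_order == 0 || fixed_order == 1);
	model_storage.ntasks = 0;
	model_nicevent_taskq = NULL;
	model_registered = 0;
	model_bad_dispatches = 0;
	model_init(fixed_order);
	model_fini(fixed_order);
	return (model_bad_dispatches);
}
```

In this model, model_run(0) (the ordering quoted in this report) records a bad
dispatch in both routines, while model_run(1) records none. The analogous
change in ipnet would presumably be to move both ddi_taskq_create() calls ahead
of netstack_register() in _init(), and netstack_unregister() ahead of both
ddi_taskq_destroy() calls in _fini().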
Work Around
N/A
Comments
N/A