|
Description
|
During Clearview IPMP stress testing, we hit the following panic:
panic[cpu1]/thread=ffffff000783fc60:
assertion failed: tq != NULL, file: ../../common/os/taskq.c, line: 832
> $c
vpanic()
assfail+0x7e(fffffffffbf5e6d8, fffffffffbf5e830, 340)
taskq_dispatch+0x4f1(0, fffffffff8077050, ffffff0208389568, 1)
ddi_taskq_dispatch+0x25(0, fffffffff8077050, ffffff0208389568, 1)
ipnet_nicevent_cb+0xa3(ffffff01cf9566c0, ffffff01d51066e8, ...
hook_run+0xa3(ffffff01d1bfcb80, ffffff01cf9566c0, ffffff01d51066e8)
ip_ne_queue_func+0x5f(ffffff01d51066e0)
taskq_thread+0x1b5(ffffff01cfa8ed20)
thread_start+8()
It seems the above thread attempted to call ddi_taskq_dispatch() with a
NULL taskq (first argument), which isn't surprising given that another
thread is still creating the taskq that ipnet_nicevent_cb() is trying to
dispatch to:
stack pointer for thread ffffff01fac45020: ffffff000863ecc0
ffffff000863ed40 plcnt_inc_dec+0x12e(ffffff0002cedb48, 1, 0, ...
ffffff000863ede0 page_ctr_sub_internal+0xa9(fffffffffb8278a5, 1, ...
ffffff000863ee30 cpus()
ffffff000863ee90 do_interrupt+0x120(ffffff000863eea0, fffffffffbc196f8)
ffffff000863eea0 _interrupt+0x1ec()
ffffff000863efb0 mutex_enter+0x10()
ffffff000863eff0 hment_mapcnt+0x1d(ffffff0002cedb48)
ffffff000863f010 hat_page_getshare+0x16(ffffff0002cedb48)
ffffff000863f0e0 page_create_va+0x3e5(fffffffffbc3bf00, ...
ffffff000863f1c0 segkp_get_internal+0x59a(fffffffffbc3d160, 5000, e, ...
ffffff000863f210 segkp_cache_get+0xe7(1)
ffffff000863f2a0 thread_create+0x104(0, 0, fffffffffbe78980, ...
ffffff000863f350 taskq_create_common+0x251(fffffffff8077ec8, 0, 1, 3c,
ffffff000863f3e0 taskq_create_instance+0x73(fffffffff8077ec8, 0, 1, 3c,
-> ffffff000863f470 ddi_taskq_create+0xaf(0, fffffffff8077ec8, 1, ffffffff,
-> ffffff000863f490 ipnet`_init+0x78()
ffffff000863f4c0 modinstall+0x115(ffffff01fbba6178)
ffffff000863f4f0 mod_hold_stub+0x12b(fffffffffbc0ef68)
ffffff000863f540 stubs_common_code+0x1f()
ffffff000863f560 devipnet_validate+0x21(ffffff0206104a10)
ffffff000863f6b0 devname_lookup_func+0x49a(ffffff01d1ed08c8, ...
ffffff000863f710 devipnet_lookup+0x4e(ffffff01d1ece600,
ffffff000863f7b0 fop_lookup+0xed(ffffff01d1ece600, ffffff000863f870,
ffffff000863f9f0 lookuppnvp+0x3a3(ffffff000863fab0, 0, 1, 0,
ffffff000863fa90 lookuppnat+0x12c(ffffff000863fab0, 0, 1, 0,
ffffff000863fb70 lookupnameat+0x91(80467a0, 0, 1, 0, ffffff000863fbf0,
ffffff000863fd20 vn_openat+0x235(80467a0, 0, 3, 6cc, ffffff000863fd68,
ffffff000863fe80 copen+0x418(ffd19553, 80467a0, 3, fedf46cc)
ffffff000863feb0 open32+0x2f(80467a0, 2, fedf46cc)
ffffff000863ff00 sys_syscall32+0x1fc()
Looking at ipnet`_init(), the problem is clear: we call netstack_register() (which
indirectly registers ipnet_nicevent_cb() to be called back) before we create the
taskqs:
	netstack_register(NS_IPNET, ipnet_stack_init, NULL, ipnet_stack_fini);
	/*
	 * We call ddi_taskq_create() with nthread == 1 to ensure in-order
	 * delivery of packets to clients.
	 */
	ipnet_taskq = ddi_taskq_create(NULL, "ipnet", 1, TASKQ_DEFAULTPRI, 0);
	ipnet_nicevent_taskq = ddi_taskq_create(NULL, "ipnet_nic_event_queue",
	    1, TASKQ_DEFAULTPRI, 0);
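Clearly, the order needs to be reversed so that both taskqs exist before
netstack_register() is called. There's a similar problem in _fini(), which
destroys the taskqs while ipnet_nicevent_cb() can still be called back: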
	ddi_taskq_destroy(ipnet_nicevent_taskq);
	ddi_taskq_destroy(ipnet_taskq);
	netstack_unregister(NS_IPNET);
... and indeed, we are aware of at least one instance of an infinite loop in
ddi_taskq_destroy() because a message was dispatched while the taskq was in the
process of being destroyed (this also occurred during IPMP stress testing).
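To make the two race windows concrete, here is a minimal user-space sketch of the ordering problem. It is not the ipnet code and not a proposed patch: every name in it (model_taskq_t, model_event(), model_lifecycle(), and so on) is a hypothetical stand-in, and it assumes that no callbacks fire before registration or after unregistration completes, which would need to be confirmed for netstack_register()/netstack_unregister(). model_event() plays the role of ipnet_nicevent_cb() firing between each pair of steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are the real kernel APIs. */
typedef struct model_taskq {
	int alive;
} model_taskq_t;

static model_taskq_t tq_storage;
static model_taskq_t *nicevent_taskq;	/* stands in for ipnet_nicevent_taskq */
static int registered;			/* nonzero while callbacks may fire */
static int bad_dispatches;		/* dispatches to a NULL or dead taskq */

static void
model_taskq_create(void)
{
	tq_storage.alive = 1;
	nicevent_taskq = &tq_storage;
}

static void
model_taskq_destroy(void)
{
	assert(nicevent_taskq != NULL);
	nicevent_taskq->alive = 0;
}

/* Stands in for ipnet_nicevent_cb() firing inside a race window. */
static void
model_event(void)
{
	if (!registered)
		return;		/* no callback registered, nothing dispatched */
	if (nicevent_taskq == NULL || !nicevent_taskq->alive)
		bad_dispatches++;
}

/* Run _init then _fini in the chosen order; return the bad dispatch count. */
static int
model_lifecycle(int fixed)
{
	nicevent_taskq = NULL;
	registered = 0;
	bad_dispatches = 0;

	if (fixed) {
		model_taskq_create();
		model_event();
		registered = 1;
	} else {
		registered = 1;
		model_event();	/* window that matches the panic above */
		model_taskq_create();
	}
	model_event();		/* steady state: taskq exists, dispatch is safe */

	if (fixed) {
		registered = 0;
		model_event();
		model_taskq_destroy();
	} else {
		model_taskq_destroy();
		model_event();	/* window that matches the destroy-time hang */
		registered = 0;
	}
	return (bad_dispatches);
}
```

In this model, model_lifecycle(0) (the current ordering) counts one bad dispatch in _init and one in _fini, while model_lifecycle(1) (create before register, unregister before destroy) counts none.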
|