OpenSolaris

Bug ID 6787091
Synopsis assertion failure in ipcl_conn_cleanup() due to non-NULL conn_ilg
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:tcp-ip
Keywords
Responsible Engineer Peter Memishian
Reported Against
Duplicate Of
Introduced In solaris_10
Commit to Fix snv_107
Fixed In snv_107
Release Fixed solaris_nevada(snv_107)
Related Bugs 6783149, 6787102, 6787918, 4898645
Submit Date 18-December-2008
Last Update Date 28-January-2009
Description
We hit an interesting panic today when running the IPMP test suite:

   > ::status
   debugging crash dump vmcore.3 (64-bit) from earthnews
   operating system: 5.11 clearview-ipmp-build:12/16/08 (sun4u)
   panic message: 
   assertion failed: connp->conn_ilg == NULL, file:
   ../../common/inet/ip/ipclassifier.c, line: 2286
   dump content: kernel pages only

Indeed, we have a conn_t with a valid conn_ilg:

   > $c
   vpanic(12f4da0, 7bb6a090, 7bb69400, 8ee, 600303231e0, 600303231e0)
   assfail+0x74(7bb6a090, 7bb69400, 8ee, 1854c00, 12f4c00, 0)
   ipcl_conn_cleanup+0x128(300f65cb700, 3, 0, 300f65cb700, 3004f5cda10,
   ipcl_conn_destroy+0x460(300f65cb700, 300f65cb700, 300f65cbb80, 7bb69400,
   rawip_do_close+0xe0(300f65cb700, 1, 80001150, 1000, 0, 1)
   rawip_close+4(300f65cb700, 3, 60030cd3e70, 7bf8c000, 0, 0)
   so_close+0x1c(3007dd65a58, 3, 60030cd3e70, 3007dd65a78, 3007dd65a58,
   fop_close+0x48(306b9541100, 3, 1, 0, 60030cd3e70, 0)
   closef+0xa4(600305ef060, 30058025880, 6003006e2a0, 1800c1515b9, 12eb3d0,
   closeandsetf+0x41c(8, 0, 600305ef060, 180c000, 60030df24d0, 200)
   close+8(8, 0, ffbfa198, 68bc0, 1010101, 0)
   syscall_trap32+0x1e8(8, 0, ffbfa198, 68bc0, 1010101, 0)

   > 300f65cb700::print conn_t conn_ilg conn_ilg_inuse
   conn_ilg = 0x6003090ef08
   conn_ilg_inuse = 0x1

This ilg was likely very recently allocated, as the thread that did the
allocation is still busy draining its IPSQ via ip_rput_process_notdata()
-> ipsq_exit():

  > 0x6003090ef08::whatis -b
  6003090ef08 is 6003090ef00+8, bufctl 3008ebaa298 allocated from
  kmem_alloc_896
  > 3008ebaa298$<bufctl_audit
              ADDR          BUFADDR        TIMESTAMP           THREAD
                              CACHE          LASTLOG         CONTENTS
       3008ebaa298      6003090ef00     39df4701c27c      2a101295ca0
                        3000004f3a0      3000db85f00      300434bfb40
                   kmem_cache_alloc+0x90
                   kmem_alloc+0x2c
                   mi_alloc+0xc
                   mi_zalloc+8
                   conn_ilg_alloc+0x58
                   ilg_add_v6+0x2b0
                   ip_opt_add_group_v6+0x2bc
                   ip_opt_set+0x1344
                   svr4_optcom_req+0x660
                   ip_restart_optmgmt+0x84
                   ipsq_drain+0x11c
                   ipsq_exit+0x80
                   ip_rput_process_notdata+0x1b4
                   ip_input+0x220
                   putnext+0x390

  > 2a101295ca0::findstack
  stack pointer for thread 2a101295ca0: 2a101293461
  [ 000002a101293461 panic_idle+0x1c() ]
    000002a101293511 ktl0+0x48()
    000002a101293661 mac_flow_lookup+0xd0()
    000002a101293771 mac_tx_classify+0x14()
    000002a101293831 mac_tx_send+0x368()
    000002a101293931 mac_tx_single_ring_mode+0x108()
    000002a101293a01 mac_tx+0x364()
    000002a101293ac1 str_mdata_fastpath_put+0xb0()
    000002a101293b71 dld_wput+0xe0()
    000002a101293c21 putnext+0x390()
    000002a101293cd1 hbwputmod+0x90()
    000002a101293d81 putnext+0x390()
    000002a101293e31 ip_xmit_v4+0x3dc()
    000002a101293ef1 ip_wput_ire+0x2cc8()
    000002a101294341 igmpv3_sendrpt+0x410()
    000002a101294461 igmp_joingroup+0x174()
    000002a101294511 ip_addmulti+0x214()
    000002a1012945d1 ilg_add+0x3b4()
    000002a1012946a1 ip_opt_add_group+0x154()
    000002a101294781 ip_opt_set+0x9e0()
    000002a1012948c1 svr4_optcom_req+0x660()
    000002a1012949a1 ip_restart_optmgmt+0x84()
    000002a101294a51 ipsq_drain+0x11c()
    000002a101294b01 ipsq_exit+0x80()
    000002a101294bb1 ip_rput_process_notdata+0x1b4()
    000002a101294c61 ip_input+0x220()
    000002a101294e41 putnext+0x390()
    000002a101294ef1 hbrput+0x164()
    000002a101294fa1 putnext+0x390()
    000002a101295051 proto_disabmulti_req+0x11c()
    000002a101295111 dld_wput_nondata_task+0x74()
    000002a1012951c1 taskq_d_thread+0xbc()
    000002a101295291 thread_start+4()

Looking at the code, there seems to be an issue with CONN_CLOSING.
In particular, the comment above the setting of CONN_CLOSING in
ip_quiesce_conn() states that conn_ilg cannot be set once the conn
has been marked as closing:

        /*
         * Mark the conn as closing, and this conn must not be
         * inserted in future into any list. Eg. conn_drain_insert(),
         * won't insert this conn into the conn_drain_list.
         * Similarly ill_pending_mp_add() will not add any mp to
         * the pending mp list, after this conn has started closing.
         *
-->      * conn_idl, conn_pending_ill, conn_down_pending_ill, conn_ilg
         * cannot get set henceforth.
         */

However, I don't see any code to enforce this in the conn_ilg_alloc()
path.  Without such a check, conn_ilg can still be set by another thread
(e.g., a taskq thread as shown above) after ip_quiesce_conn() has set
ilg_cleanup_reqd but before the refcounts drop at the end of
ip_quiesce_conn().  The close path then trips the assertion in
ipcl_conn_cleanup(), as shown above.
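
To make the missing enforcement concrete, here is a minimal user-space
sketch of the kind of check the comment implies.  It is not the kernel
code and not necessarily how the fix was delivered: every sim_* name,
the flag value, and the use of a pthread mutex in place of conn_lock
are assumptions made for illustration.  The idea is only that the
allocation path tests the closing flag under the same lock the close
path holds when it sets that flag, and refuses to allocate afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel definitions (assumptions). */
#define	SIM_CONN_CLOSING	0x1

typedef struct sim_ilg {
	int	ilg_placeholder;	/* real ilg_t contents omitted */
} sim_ilg_t;

typedef struct sim_conn {
	pthread_mutex_t	conn_lock;		/* protects the fields below */
	unsigned int	conn_state_flags;
	sim_ilg_t	*conn_ilg;
} sim_conn_t;

void
sim_conn_init(sim_conn_t *connp)
{
	pthread_mutex_init(&connp->conn_lock, NULL);
	connp->conn_state_flags = 0;
	connp->conn_ilg = NULL;
}

/*
 * Allocate (or return the existing) ilg for this conn.  Returns NULL
 * if the conn is already closing or if memory allocation fails.
 */
sim_ilg_t *
sim_conn_ilg_alloc(sim_conn_t *connp)
{
	sim_ilg_t *ilg = NULL;

	pthread_mutex_lock(&connp->conn_lock);
	if (!(connp->conn_state_flags & SIM_CONN_CLOSING)) {
		if (connp->conn_ilg == NULL)
			connp->conn_ilg = calloc(1, sizeof (sim_ilg_t));
		ilg = connp->conn_ilg;
	}
	pthread_mutex_unlock(&connp->conn_lock);
	return (ilg);
}

/*
 * Mark the conn as closing and release any ilg it holds.  The flag and
 * the pointer are updated under the same lock hold, so no allocation
 * can slip in between them.
 */
void
sim_conn_quiesce(sim_conn_t *connp)
{
	sim_ilg_t *old;

	pthread_mutex_lock(&connp->conn_lock);
	connp->conn_state_flags |= SIM_CONN_CLOSING;
	old = connp->conn_ilg;
	connp->conn_ilg = NULL;
	pthread_mutex_unlock(&connp->conn_lock);
	free(old);
}

/* Mirrors the condition asserted in the panic: conn_ilg must be NULL. */
bool
sim_conn_cleanup_ok(sim_conn_t *connp)
{
	bool ok;

	pthread_mutex_lock(&connp->conn_lock);
	ok = (connp->conn_ilg == NULL);
	pthread_mutex_unlock(&connp->conn_lock);
	return (ok);
}
```

In this sketch, a late option-management request that calls
sim_conn_ilg_alloc() after sim_conn_quiesce() gets NULL back and would
have to fail the request, so the cleanup condition still holds.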

Crash dump is at /net/mdb.eng/cores/meem/ilg/*.3
Work Around
N/A
Comments
Although this bug was exposed by Volo, it has existed since Fire
Engine integrated; setting "Introduced in Release" and "Introduced
in Build" accordingly.