OpenSolaris

Bug ID 6372094
Synopsis zil_commit() may be called recursively and deadlock with itself
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:zfs
Keywords deadlock | onnv_triage | z-triage
Responsible Engineer Neil Perrin
Reported Against snv_32
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_33
Fixed In snv_33
Release Fixed solaris_nevada(snv_33)
Related Bugs 6356243 , 6372702
Submit Date 12-January-2006
Last Update Date 8-February-2006
Description
One of the processes doing I/O to a ZFS volume hung indefinitely with the following stack
trace:

> ::ps!grep get 
R  13530  23761  23761    622  86710 0x42004000 ffffffff8c0f6520 get
> ffffffff8c0f6520::ps -t
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  13530  23761  23761    622  86710 0x42004000 ffffffff8c0f6520 get
        T  0xffffffff89c5d0c0 <TS_SLEEP>
> 0xffffffff89c5d0c0::findstack -v
stack pointer for thread ffffffff89c5d0c0: fffffe800013d9c0
[ fffffe800013d9c0 _resume_from_idle+0xde() ]
  fffffe800013da00 swtch+0x185()
  fffffe800013da30 cv_wait+0x6f(ffffffff8558cbec, ffffffff8558cb80)
  fffffe800013dab0 zil_commit+0x8d(ffffffff8558cb80, 12353, 40)
  fffffe800013db60 zfs_putapage+0x280(fffffe80d6984940, fffffffffa2283a0, 0, 0, 
  10000, ffffffffa89dc0c8)
  fffffe800013dc20 pvn_vplist_dirty+0x342(fffffe80d6984940, 0, fffffffff427a33f
  , 10000, ffffffffa89dc0c8)
  fffffe800013dc70 zfs_inactive+0x72(fffffe80d6984940, ffffffffa89dc0c8)
  fffffe800013dc90 fop_inactive+0x20(fffffe80d6984940, ffffffffa89dc0c8)
  fffffe800013dcc0 vn_rele+0x66(fffffe80d6984940)
  fffffe800013dd20 zfs_get_data+0x172(ffffffff85524a00, ffffffff87e74b88)
  fffffe800013dd90 zil_lwb_commit+0x89(ffffffff8558cb80, ffffffff87e74b68, 
  ffffffffacf7c600)
  fffffe800013de10 zil_commit+0x1b2(ffffffff8558cb80, 12345, 10)
  fffffe800013de60 zfs_fsync+0x54(fffffe80c4d13bc0, 10, ffffffffa89dc0c8)
  fffffe800013de90 fop_fsync+0x24(fffffe80c4d13bc0, 10, ffffffffa89dc0c8)
  fffffe800013dec0 fdsync+0x3b(b, 10)
  fffffe800013df10 sys_syscall32+0x101()

Note that zil_commit() appears twice on this stack: it has been called recursively.

Here is the zilog_t structure:

{
    zl_lock = {
        _opaque = [ 0 ]
    }
    zl_dmu_pool = 0xffffffff82a058c0
    zl_spa = 0xffffffff82f68900
    zl_header = 0xffffffff83d8c200
    zl_os = 0xffffffff83daa368
    zl_get_data = zfs_get_data
    zl_itx_seq = 0x1235b
    zl_ss_seq = 0x1235b
    zl_destroy_txg = 0
    zl_replay_seq = [ 0, 0, 0, 0 ]
    zl_suspend = 0
    zl_cv_write = {
        _opaque = 0x2
    }
    zl_cv_seq = {
        _opaque = 0
    }
    zl_stop_replay = 0
    zl_stop_sync = 0
    zl_writer = 0x1
    zl_log_error = 0
    zl_itx_list = {
        list_size = 0x40
        list_offset = 0
        list_head = {
            list_next = 0xffffffff8558cc08
            list_prev = 0xffffffff8558cc08
        }
    }
    zl_itx_list_sz = 0xc0
    zl_cur_used = 0x5e8
    zl_prev_used = 0x11f0
    zl_lwb_list = {
        list_size = 0xd0
        list_offset = 0xb8
        list_head = {
            list_next = 0xffffffffacf7c6b8
            list_prev = 0xffffffffacf7c6b8
        }
    }
    zl_vdev_list = {
        list_size = 0x20
        list_offset = 0x10
        list_head = {
            list_next = 0xffffffff8558cc60
            list_prev = 0xffffffff8558cc60
        }
    }
    zl_clean_taskq = 0xffffffff8565fdb0
    zl_dva_tree = {
        avl_root = 0
        avl_compar = 0
        avl_offset = 0
        avl_numnodes = 0
        avl_size = 0
    }
    zl_destroy_lock = {
        _opaque = [ 0 ]
    }
}

Note that there are two waiters on zl_cv_write and that zl_writer is set (as expected, by
the earlier zil_commit() further up the stack). The code waits for zl_writer to be dropped:

        for (;;) {
                ...
                if (zilog->zl_writer == B_FALSE) /* no one writing, do it */
                        break;

                cv_wait(&zilog->zl_cv_write, &zilog->zl_lock);
        }

That will never happen, because the waiting thread is itself the writer: zl_writer was set
by the outer zil_commit() frame on the same stack. Hence a deadlock.
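
To make the self-wait concrete, here is a minimal user-space sketch of the same flag-and-condition-variable pattern, written with POSIX threads rather than the kernel primitives. Everything prefixed with model_ is hypothetical, and the owner field does not exist in zilog_t; it is added only so the model can report re-entry (with EDEADLK) at the point where the loop above would sleep forever. This is an illustration of the failure mode, not the fix delivered in snv_33.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the zl_lock / zl_cv_write / zl_writer trio. */
typedef struct model_zilog {
	pthread_mutex_t	lock;		/* plays the role of zl_lock */
	pthread_cond_t	cv_write;	/* plays the role of zl_cv_write */
	bool		writer;		/* plays the role of zl_writer */
	pthread_t	owner;		/* NOT in zilog_t: valid only while writer is set */
} model_zilog_t;

static void
model_init(model_zilog_t *zl)
{
	pthread_mutex_init(&zl->lock, NULL);
	pthread_cond_init(&zl->cv_write, NULL);
	zl->writer = false;
}

/*
 * Become the writer. Returns 0 on success, or EDEADLK when the caller
 * already is the writer, which is where the loop in the report blocks.
 */
static int
model_writer_enter(model_zilog_t *zl)
{
	pthread_mutex_lock(&zl->lock);
	while (zl->writer) {
		if (pthread_equal(zl->owner, pthread_self())) {
			/* Waiting here could only be ended by ourselves. */
			pthread_mutex_unlock(&zl->lock);
			return (EDEADLK);
		}
		pthread_cond_wait(&zl->cv_write, &zl->lock);
	}
	zl->writer = true;
	zl->owner = pthread_self();
	pthread_mutex_unlock(&zl->lock);
	return (0);
}

/* Drop the writer flag and wake one waiter, as zil_commit() does on return. */
static void
model_writer_exit(model_zilog_t *zl)
{
	pthread_mutex_lock(&zl->lock);
	assert(zl->writer && pthread_equal(zl->owner, pthread_self()));
	zl->writer = false;
	pthread_cond_signal(&zl->cv_write);
	pthread_mutex_unlock(&zl->lock);
}
```

In this model a second model_writer_enter() from the thread that already holds the flag returns EDEADLK instead of sleeping; a call from any other thread waits until model_writer_exit() signals the condition variable.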

A side question - what is the second waiting thread? This is the bringover command which
is also waiting on the same condition variable:

stack pointer for thread ffffffffa6160b80: fffffe800122bac0
[ fffffe800122bac0 _resume_from_idle+0xde() ]
  fffffe800122bb00 swtch+0x185()
  fffffe800122bb30 cv_wait+0x6f()
  fffffe800122bbb0 zil_commit+0x8d()
  fffffe800122bc60 zfs_putapage+0x280()
  fffffe800122bd20 pvn_vplist_dirty+0x342()
  fffffe800122bd70 zfs_inactive+0x72()
  fffffe800122bd90 fop_inactive+0x20()
  fffffe800122bdc0 vn_rele+0x66()
  fffffe800122be00 closef+0x7e()
  fffffe800122bea0 closeandsetf+0x47f()
  fffffe800122bec0 close+0x16()
  fffffe800122bf10 sys_syscall32+0x101()

> ffffffffa6160b80::print kthread_t t_procp|::ps -t
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  23761    622  23761    622  86710 0x42004000 ffffffffa5ebaca0 bringover
        T  0xffffffffa6160b80 <TS_SLEEP>

Note that zil_commit() calls cv_signal(), which wakes only one waiter, but this seems fine
as long as zil_commit() always calls cv_signal() on return.

So, why is zil_commit() called recursively? At the end of zfs_get_data() there is
VN_RELE(ZTOV(zp)). vn_rele() sees that vp->v_count is one and calls VOP_INACTIVE(),
which in turn calls zfs_inactive(). zfs_inactive() sees that the vnode has cached pages
and calls pvn_vplist_dirty():

        /*
         * Attempt to push any data in the page cache.  If this fails
         * we will get kicked out later in zfs_zinactive().
         */
        if (vn_has_cached_data(vp))
                (void) pvn_vplist_dirty(vp, 0, zfs_putapage, B_INVAL, cr);

This, in turn, calls zfs_putapage(), which calls zil_commit() again. And we have a 
deadlock!
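
The call chain above can be reduced to a small single-threaded model: dropping the last hold on a vnode with dirty pages re-enters the commit path. The sketch below is hypothetical throughout (every model_ name is invented for illustration) and, instead of blocking, it counts the re-entry that would deadlock. It also shows one conceivable way to avoid the recursion, deferring the release until the writer flag has been dropped. That deferral is an assumption made for the example; this report does not describe the fix delivered in snv_33.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct model_vnode {
	int			v_count;	/* hold count, like vp->v_count */
	bool			dirty_pages;	/* stands in for vn_has_cached_data() */
	struct model_vnode	*next_deferred;	/* hypothetical deferred-release list */
} model_vnode_t;

static bool in_commit;			/* stands in for zl_writer */
static int reentry_attempts;		/* calls that would have blocked forever */
static int commits_done;		/* commits that ran to completion */
static model_vnode_t *deferred;		/* holds to drop after the commit */

static int model_commit(model_vnode_t *held, bool defer_rele);

/* Like zfs_inactive(): push dirty pages, which requires a commit. */
static void
model_inactive(model_vnode_t *vp)
{
	if (vp->dirty_pages && model_commit(NULL, false) == 0)
		vp->dirty_pages = false;
}

/* Like vn_rele(): dropping the last hold triggers the inactive path. */
static void
model_rele(model_vnode_t *vp)
{
	if (vp->v_count == 1)
		model_inactive(vp);
	vp->v_count--;
}

/* Returns 0 on success, -1 where the real code would wait on itself. */
static int
model_commit(model_vnode_t *held, bool defer_rele)
{
	if (in_commit) {
		reentry_attempts++;
		return (-1);
	}
	in_commit = true;
	if (held != NULL) {
		if (defer_rele) {
			/* Queue the hold; drop it once we stop being the writer. */
			held->next_deferred = deferred;
			deferred = held;
		} else {
			model_rele(held);	/* mirrors the VN_RELE() in the report */
		}
	}
	in_commit = false;
	commits_done++;
	while (deferred != NULL) {
		model_vnode_t *vp = deferred;
		deferred = vp->next_deferred;
		model_rele(vp);		/* safe: writer flag is already clear */
	}
	return (0);
}
```

Calling model_commit(&vp, false) on a vnode with one hold and dirty pages records a re-entry attempt and leaves the pages dirty, which corresponds to the hang in the report. Calling model_commit(&vp, true) completes the outer commit first and then lets the release run its own, non-nested commit.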
Work Around
N/A
Comments
N/A