OpenSolaris

Bug ID: 6762445
Synopsis: defer signals while locking a mutex and setting its owner field
State: 10-Fix Delivered (Fix available in build)
Category:Subcategory: library:libc
Keywords:
Responsible Engineer: Roger Faulkner
Reported Against:
Duplicate Of:
Introduced In: solaris_8
Commit to Fix: snv_102
Fixed In: snv_102
Release Fixed: solaris_nevada(snv_102)
Related Bugs:
Submit Date: 22-October-2008
Last Update Date: 6-November-2008
Description
A user-level mutex contains three fields that must all be set properly
when it is locked:

mutex_lockw	the lock byte, non-zero if locked
mutex_owner	the thread pointer (ulwp_t) of the owning thread
mutex_ownerpid	the owning thread's process-id, if the mutex is process-shared

The mutex_lockw and mutex_ownerpid fields are set and unset together atomically
by locking code in libc and the kernel, so they are always self-consistent.

However, the mutex_owner field is only set after the lock byte
has been set (and is cleared before the lock byte is cleared).
This opens up a race condition window such that if a UNIX signal
interrupts the lock or unlock operation in the window, the thread
can end up in a signal handler with its mutex locked but the
owner field set to zero.

This self-inconsistency is bothersome in general.
However, in the case of recursive mutexes it leads to deadlock
if the recursive mutex is acquired both at main level and
within a signal handler, a scenario that should work.
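
The window can be illustrated with a small single-threaded model.
This is only a sketch: toy_mutex_t and the toy_* functions are
hypothetical names invented here, not libc's real mutex layout or
locking code, and the rcount field is an assumption added so that
recursion can be shown. The model splits locking into the two steps
described above so that a caller can stop between them, which is
where a signal handler would run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the fields named above; NOT libc's real mutex layout. */
typedef struct {
    uint8_t lockw;   /* the lock byte, non-zero if locked */
    void   *owner;   /* owning thread pointer, NULL if none */
    int     rcount;  /* recursion count (assumed for this sketch) */
} toy_mutex_t;

/* Step 1 of locking: set the lock byte only. Returns 1 on success. */
static int toy_set_lock_byte(toy_mutex_t *m)
{
    if (m->lockw != 0)
        return 0;
    m->lockw = 1;
    return 1;
}

/* Step 2 of locking: record the owner, only after the lock byte is set. */
static void toy_set_owner(toy_mutex_t *m, void *self)
{
    assert(m->lockw != 0);
    m->owner = self;
}

/*
 * A recursive lock attempt: if the caller already owns the mutex,
 * bump the recursion count; otherwise try to take the lock byte.
 * Returns 1 if acquired, 0 if the caller would have to block.
 */
static int toy_recursive_trylock(toy_mutex_t *m, void *self)
{
    if (m->lockw != 0 && m->owner == self) {
        m->rcount++;
        return 1;
    }
    if (toy_set_lock_byte(m)) {
        toy_set_owner(m, self);
        return 1;
    }
    return 0;
}
```

If toy_set_lock_byte() has run but toy_set_owner() has not, a call to
toy_recursive_trylock() by the same thread returns 0: the lock byte is
set, the owner field is still zero, so the thread does not recognize
the mutex as its own. In real blocking code, that thread would wait on
a lock that only it can release.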

See the attached interrupt_test.c test case.
Compile it this way:

    cc -O -D_REENTRANT -o interrupt_test interrupt_test.c

Run it this way:

    ./interrupt_test [-rs]
        default: the recursive mutex is normal process-private
        -r       make the recursive mutex be robust
        -s       make the recursive mutex be process-shared

Do the runs on a two-processor test machine.
The test prints a stream of dots while it is making progress;
the output will quickly stop.
Then examine the process with pstack and mdb.
You will see that the threads are deadlocked on the recursive
mutex and that the mutex's owner field is zero.

The fix to this problem is to defer signals around the combined
operations of setting the lock byte and setting mutex_owner.
(There are many places in libc's locking code where this happens.)
There will still be a small window in which the mutex is not fully
self-consistent, but no signal handler will be invoked in that window.

See the suggested fix.
With the fix in place, the test case continues producing dots forever.
Work Around
N/A
Comments
N/A