Description
A user-level mutex contains three fields that must all be set properly
when it is locked:
    mutex_lockw     the lock byte, non-zero if locked
    mutex_owner     the thread pointer (ulwp_t) of the owning thread
    mutex_ownerpid  the owning thread's process ID, if the mutex is process-shared
The mutex_lockw and mutex_ownerpid fields are set and unset together atomically
by locking code in libc and the kernel, so they are always self-consistent.
However, the mutex_owner field is only set after the lock byte
has been set (and is cleared before the lock byte is cleared).
This opens a race window: if a UNIX signal interrupts the lock or
unlock operation inside that window, the thread can end up in a
signal handler with its mutex locked but the owner field set to zero.
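The window can be illustrated with a toy model. This is only a sketch, not libc's code: toy_mutex_t, toy_trylock(), toy_unlock() and the hook parameter are hypothetical names, the field types are assumptions, and the hook merely stands in for a signal handler that happens to run between the two stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for the user-level mutex (field types are assumptions). */
typedef struct {
    atomic_uchar mutex_lockw;   /* lock byte, non-zero if locked */
    void *mutex_owner;          /* owning thread pointer, NULL if none */
} toy_mutex_t;

/* Hypothetical hook: plays the part of a signal handler in the window. */
typedef void (*window_hook_t)(toy_mutex_t *);

static int locked_without_owner;    /* times the bad state was observed */

static void observe(toy_mutex_t *m)
{
    if (atomic_load(&m->mutex_lockw) != 0 && m->mutex_owner == NULL)
        locked_without_owner++;
}

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static int toy_trylock(toy_mutex_t *m, void *self, window_hook_t hook)
{
    if (atomic_exchange(&m->mutex_lockw, 1) != 0)
        return 0;
    if (hook != NULL)
        hook(m);                /* window: lock byte set, owner still NULL */
    m->mutex_owner = self;
    return 1;
}

static void toy_unlock(toy_mutex_t *m, window_hook_t hook)
{
    m->mutex_owner = NULL;
    if (hook != NULL)
        hook(m);                /* window: owner cleared, lock byte still set */
    atomic_store(&m->mutex_lockw, 0);
}

/* Locks and unlocks once, returning how often the bad state was seen. */
int demo_window(void)
{
    toy_mutex_t m;
    int self;                   /* its address stands in for a ulwp_t pointer */
    int ok;

    atomic_init(&m.mutex_lockw, 0);
    m.mutex_owner = NULL;
    locked_without_owner = 0;

    ok = toy_trylock(&m, &self, observe);
    assert(ok == 1);
    toy_unlock(&m, observe);
    return locked_without_owner;
}
```

In this model the hook sees the locked-but-ownerless state once on the lock path and once on the unlock path.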
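This inconsistency is undesirable in general.
In the case of recursive mutexes, however, it leads to deadlock
if the recursive mutex is acquired both at main level and
within a signal handler, a scenario that should work.
With the owner field zero, the lock attempt in the handler cannot
tell that its own thread already holds the mutex, so it blocks on
a lock that only the interrupted code could release.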
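For reference, the usage pattern in question looks roughly like the sketch below. It is not the attached test case; the names rmutex, handler() and run_recursive_in_handler() are made up for illustration. Note that POSIX does not list pthread_mutex_lock() as async-signal-safe, so this relies on the behavior this report argues libc should provide. Because raise() delivers the signal synchronously after the lock is fully taken, this sketch never lands in the race window; hitting the window requires asynchronous signals, as in the real test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

static pthread_mutex_t rmutex;              /* the recursive mutex */
static volatile sig_atomic_t handler_runs;  /* successful re-acquisitions */

/* Re-acquires the mutex that the interrupted main-level code holds. */
static void handler(int sig)
{
    (void)sig;
    if (pthread_mutex_lock(&rmutex) == 0) {
        handler_runs++;
        pthread_mutex_unlock(&rmutex);
    }
}

/* Returns how many handler invocations re-acquired the mutex. */
int run_recursive_in_handler(int iterations)
{
    pthread_mutexattr_t attr;
    struct sigaction sa;
    int rc;

    handler_runs = 0;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    assert(rc == 0);
    rc = pthread_mutex_init(&rmutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    memset(&sa, 0, sizeof (sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    for (int i = 0; i < iterations; i++) {
        pthread_mutex_lock(&rmutex);    /* main-level acquisition */
        raise(SIGUSR1);                 /* handler locks it recursively */
        pthread_mutex_unlock(&rmutex);
    }

    pthread_mutex_destroy(&rmutex);
    return handler_runs;
}
```

With a consistent owner field every handler invocation succeeds, so the function returns the iteration count.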
See the attached interrupt_test.c test case.
Compile it this way:
    cc -O -D_REENTRANT -o interrupt_test interrupt_test.c
Run it this way:
    ./interrupt_test [-rs]
        default: the recursive mutex is a normal process-private mutex
        -r       make the recursive mutex robust
        -s       make the recursive mutex process-shared
Do the runs on a two-processor test machine.
The test normally prints a stream of dots; it will quickly stop
producing any output.
Then examine the process with pstack and mdb.
You will see that the threads are deadlocked on the recursive
mutex and that the mutex's owner field is zero.
The fix to this problem is to defer signals around the combined
operations of setting the lock byte and setting mutex_owner.
(There are many places in libc's locking code where this happens.)
There will still be a small window in which the mutex is not fully
self-consistent, but no signal handler will be invoked in that window.
See the suggested fix.
With the fix in place, the test case continues producing dots forever.
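The deferral idea can be sketched as follows. This is not the suggested fix attached to this report: pthread_sigmask() is used here only as a portable stand-in for whatever deferral mechanism libc uses internally, and toy_mutex_t, toy_trylock_deferred() and demo_deferred() are hypothetical names. The raise() call inside the lock path exists purely to simulate a signal arriving in the window.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the user-level mutex (field types are assumptions). */
typedef struct {
    atomic_uchar mutex_lockw;       /* lock byte, non-zero if locked */
    void *_Atomic mutex_owner;      /* owning thread pointer, NULL if none */
} toy_mutex_t;

static toy_mutex_t m;                           /* mutex under test */
static volatile sig_atomic_t handler_calls;
static volatile sig_atomic_t inconsistent_seen;

/* Records whether the handler ever sees "locked but no owner". */
static void handler(int sig)
{
    (void)sig;
    handler_calls++;
    if (atomic_load(&m.mutex_lockw) != 0 &&
        atomic_load(&m.mutex_owner) == NULL)
        inconsistent_seen = 1;
}

/* Sets lock byte and owner with signals deferred; returns 1 on success. */
static int toy_trylock_deferred(toy_mutex_t *mp, void *self)
{
    sigset_t all, saved;
    int got;

    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &saved);   /* defer signals */
    got = (atomic_exchange(&mp->mutex_lockw, 1) == 0);
    if (got) {
        raise(SIGUSR1);     /* demo only: signal arrives in the window */
        atomic_store(&mp->mutex_owner, self);
    }
    /* The pending signal is delivered here, after the owner is set. */
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return got;
}

/* Returns 0 if the handler ran once and saw only a consistent mutex. */
int demo_deferred(void)
{
    struct sigaction sa;
    int self;               /* its address stands in for a ulwp_t pointer */
    int rc;

    atomic_store(&m.mutex_lockw, 0);
    atomic_store(&m.mutex_owner, NULL);
    handler_calls = 0;
    inconsistent_seen = 0;

    memset(&sa, 0, sizeof (sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    if (!toy_trylock_deferred(&m, &self))
        return -1;
    return (handler_calls == 1 && !inconsistent_seen) ? 0 : 1;
}
```

The unlock path would need the same treatment around clearing mutex_owner and clearing the lock byte. A real fix inside libc would presumably use a cheaper deferral mechanism than a pair of pthread_sigmask() calls on every lock operation.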