|
Comments
|
See Evaluation.
The following bugfixes need to be putback into Nevada addressing some issues with
the Callout subsystem:
1. There is an ASSERT in the Intel Cyclic Backend (cbe) that pertains to the
old implementation of callouts. This is no longer applicable to the new
implementation. This needs to be removed. It causes an ASSERT failure in
the debug Intel kernel under some race conditions.
2. The new callout implementation uses hrtime rather than ticks for processing
callouts. It uses TICK_TO_NSEC() to convert from ticks to nanoseconds. For
really high tick values, this can cause an overflow. The callout code should
really check for overflow and do the right thing. Otherwise, a timer that is
meant to expire in the distant future will end up expiring immediately.
3. The new callout implementation provides a facility to suspend callou processing
which is used during CPR as well as while dropping into KMDB. The facility
does not allow nested suspends. We need to fix it so suspends can be nested.
4. delay(9F) is a function that is used by drivers and the kernel to wait for a
specified number of clock ticks. This function is implemented using a normal
callout. Since normal callouts are processed at CY_LOW_LEVEL, the delay(9F)
function should not be called at higher than CY_LOW_LEVEL. Instead, drv_usecwait()
should be used. The man pages for these functions are clear on their usage.
However, there are some violators of this rule. E.g., the serial driver,
se_driver.c, calls delay() from interrupt context on the V880. This problem
was fixed in the new callout implementation by executing the normal cyclic
at CY_LOCK_LEVEL instead of CY_LOW_LEVEL. But we now feel that that was a
big hammer. In this bugfix, we will fix delay(9F) to use drv_usecwait()
internally for PIL > 0. Any other violators of this rule need to found
and fixed.
5. In the new callout implementation, normal callout cyclics are executed at
CY_LOCK_LEVEL as explained in point 4. This is causing performance issues because
of the extra activity at lock PIL. This needs to be at CY_LOW_LEVEL as well. E.g.,
TCP-based workloads will suffer from this. Since the delay(9F) issue is going
to be fixed a different way, the cyclic's level should be restored to
CY_LOW_LEVEL.
6. When a CPU is onlined, its callout cyclics are created, if necessary, and
moved to the CPU. But they are not bound. So, when the no-intr option is
set on a CPU, its unbound cyclics get juggled away. So, callouts issued
from the CPU from that point involve X-calls as their cyclics reside on
another CPU. This results in performance degradation. This manifests itself
especially when a processor set is created with many processors and
the no-intr option is set on them. Applications running on the processor
set will experience performance degradation when they generate callouts. The
solution is to bind the cyclics to the CPU to prevent the juggling. However,
during CPU offline, the cyclics need to be unbound so that the cyclic subsystem
can juggle the cyclics away.
7. A deadlock was discovered in the changes made to the condition variable code
by the callout putback in NV 103. When a timed wait call is made, the
condition variable code issues a realtime timeout and blocks the calling thread
by putting it on a sleepq. If the realtime callout fires before the block
happens, then the thread will never wakeup. So, the realtime handler must
wait for the block to complete. To do this wait, the realtime callout handler
acquires and releases the mutex associated with the condition variable to
synchronize. This causes a deadlock in a weird scenario:
1. CPU offline is called for a CPU. That, in turn, calls cyclic_offline().
cyclic_offline() juggles cyclics away from the CPU. One of the cyclics,
a callout cyclic, is executing callouts. So, cyclic_offline() waits for
that to finish.
2. The callout cyclic is executing the handler cv_wakeup() to wake up some thread
that is sleeping on a condition variable. cv_wakeup() attempts to lock
the CV mutex to sync with the CV timed wait code, finds the lock busy.
So, it waits for the lock either spinning or sleeping.
3. Now, condition variables are always used along with a mutex. The mutex also
typically protects some data structures apart from the condition variable
itself. So, some other thread grabs the same mutex for something else. After
grabbing the mutex, it acquires cpu_lock for something. But cpu_lock is held
by step 1. So, deadlock.
The solution is to cut the dependency loop. So, the fix will use a per-thread
mutex to synchronize rather than the CV mutex. This deadlock can happen any time
a cyclic is juggled away, not just during CPU offline. It will cause callout
processing to cease for a CPU and cause application hangs.
|