OpenSolaris

Printable Version Enter a New Search
Bug ID 6784948
Synopsis Bug fixes to the scalable callout implementation
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:time
Keywords kt-scalability | rtiq_regression | sps-scale | spsw-see-s10u8-goal
Responsible Engineer Madhavan Venkataraman
Reported Against snv_103
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_107
Fixed In snv_107
Release Fixed solaris_nevada(snv_107) , solaris_10u8(s10u8_04) (Bug ID:2175853)
Related Bugs 6311743 , 6565503 , 6771686 , 6775360 , 6784915 , 6785225 , 6792081 , 6800617 , 6802283 , 6810916 , 6860107
Submit Date 15-December-2008
Last Update Date 12-June-2009
Description
The Callout subsystem was redesigned and putback in SNV 103. Some bugfixes relating to
that need to be put back.
See Public Comments for details.
Work Around
N/A
Comments
See Evaluation.
The following bugfixes need to be putback into Nevada addressing some issues with
the Callout subsystem:

1. There is an ASSERT in the Intel Cyclic Backend (cbe) that pertains to the
   old implementation of callouts. This is no longer applicable to the new
   implementation. This needs to be removed. It causes an ASSERT failure in
   the debug Intel kernel under some race conditions.

2. The new callout implementation uses hrtime rather than ticks for processing
   callouts. It uses TICK_TO_NSEC() to convert from ticks to nanoseconds. For
   really high tick values, this can cause an overflow. The callout code should
   really check for overflow and do the right thing. Otherwise, a timer that is
   meant to expire in the distant future will end up expiring immediately.

3. The new callout implementation provides a facility to suspend callou processing
   which is used during CPR as well as while dropping into KMDB. The facility
   does not allow nested suspends. We need to fix it so suspends can be nested.

4. delay(9F) is a function that is used by drivers and the kernel to wait for a
   specified number of clock ticks. This function is implemented using a normal
   callout. Since normal callouts are processed at CY_LOW_LEVEL, the delay(9F)
   function should not be called at higher than CY_LOW_LEVEL. Instead, drv_usecwait()
   should be used. The man pages for these functions are clear on their usage.
   However, there are some violators of this rule. E.g., the serial driver, 
   se_driver.c, calls delay() from interrupt context on the V880. This problem
   was fixed in the new callout implementation by executing the normal cyclic
   at CY_LOCK_LEVEL instead of CY_LOW_LEVEL. But we now feel that that was a
   big hammer. In this bugfix, we will fix delay(9F) to use drv_usecwait()
   internally for PIL > 0. Any other violators of this rule need to found
   and fixed.

5. In the new callout implementation, normal callout cyclics are executed at
   CY_LOCK_LEVEL as explained in point 4. This is causing performance issues because
   of the extra activity at lock PIL. This needs to be at CY_LOW_LEVEL as well. E.g.,
   TCP-based workloads will suffer from this. Since the delay(9F) issue is going
   to be fixed a different way, the cyclic's level should be restored to
   CY_LOW_LEVEL.

6. When a CPU is onlined, its callout cyclics are created, if necessary, and
   moved to the CPU. But they are not bound. So, when the no-intr option is
   set on a CPU, its unbound cyclics get juggled away. So, callouts issued
   from the CPU from that point involve X-calls as their cyclics reside on
   another CPU. This results in performance degradation. This manifests itself
   especially when a processor set is created with many processors and
   the no-intr option is set on them. Applications running on the processor
   set will experience performance degradation when they generate callouts. The
   solution is to bind the cyclics to the CPU to prevent the juggling. However,
   during CPU offline, the cyclics need to be unbound so that the cyclic subsystem
   can juggle the cyclics away.

7. A deadlock was discovered in the changes made to the condition variable code
   by the callout putback in NV 103. When a timed wait call is made, the
   condition variable code issues a realtime timeout and blocks the calling thread
   by putting it on a sleepq. If the realtime callout fires before the block
   happens, then the thread will never wakeup. So, the realtime handler must
   wait for the block to complete. To do this wait, the realtime callout handler
   acquires and releases the mutex associated with the condition variable to
   synchronize. This causes a deadlock in a weird scenario:

   1. CPU offline is called for a CPU. That, in turn, calls cyclic_offline().
      cyclic_offline() juggles cyclics away from the CPU. One of the cyclics,
      a callout cyclic, is executing callouts. So, cyclic_offline() waits for
      that to finish.

   2. The callout cyclic is executing the handler cv_wakeup() to wake up some thread
      that is sleeping on a condition variable. cv_wakeup() attempts to lock
      the CV mutex to sync with the CV timed wait code, finds the lock busy.
      So, it waits for the lock either spinning or sleeping.

   3. Now, condition variables are always used along with a mutex. The mutex also
      typically protects some data structures apart from the condition variable
      itself. So, some other thread grabs the same mutex for something else. After
      grabbing the mutex, it acquires cpu_lock for something. But cpu_lock is held
      by step 1. So, deadlock.

   The solution is to cut the dependency loop. So, the fix will use a per-thread
   mutex to synchronize rather than the CV mutex. This deadlock can happen any time
   a cyclic is juggled away, not just during CPU offline. It will cause callout
   processing to cease for a CPU and cause application hangs.