OpenSolaris

Bug ID 4468181
Synopsis low priority TS threads on a sleep queue can be victimised
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:sched
Keywords TS_SLEEP | csp | lock | priority | queue | reader | rwlock | sleep | starvation | ts_update | ts_update_list | victimised | writer
Responsible Engineer Andrew Tucker
Reported Against 2.6 , 5.6 , 5.7 , 5.8
Duplicate Of
Introduced In solaris_2.2
Commit to Fix s81_45
Fixed In s81_45
Release Fixed solaris_9(s81_45)
Related Bugs 4251223 , 4491031 , 4655980 , 6698028
Submit Date 11-June-2001
Last Update Date 12-January-2007
Description

The customer's Oracle database is terminating after an async i/o issued by
an Oracle dbwriter has not returned in 10 minutes. The error 'ora 27062'
is logged. 

A forced crash dump at the point where Oracle flags the error shows the 
following:

We have 10 dbwriters all issuing async i/o to the database which is on a 
vxfs filesystem. When the crash dump was taken, there were 1457 threads,
belonging to the 10 dbwriter processes, which were all waiting for an exclusive
writer lock on the vxfs lock which serialises writes to a file
( the lock is of type vx_rwsleep_rec, which is a vxfs implementation of a 
reader/writer lock ). When a thread needs to wait for a writer lock, we call 
cv_wait() which results in the thread being inserted in a sleep queue. When
the owner releases the lock, cv_signal() is called resulting in the first 
thread on the sleep queue being woken up.


The threads for dbwriters 2 to 10 are predominantly at priority 59, while
the threads for dbwriter 1 are at a variety of lower priorities ( I understand
the first dbwriter has some 'master' responsibilities, and hence probably uses
more cpu than dbwriters 2 to 10, which have a more dedicated i/o function ).

Three threads in particular, belonging to the first dbwriter, have been waiting
for 10 minutes on the sleep queue, at priority 17.

cv_wait()/cv_block() will call ts_sleep() to bump up the thread's priority a 
little ( using ts_slpret from the ts dispatch table ), before calling
sleepq_insert() to insert the thread onto the sleep queue in priority
order. A thread with low priority is put toward the back of the queue. 
Because the priority of threads is currently not adjusted while on the sleepq 
( ts_update_list() updates dispwait, but skips the recalculation of priority 
while a thread is in state TS_SLEEP ), low priority threads can be 
victimised when higher priority threads are continuously joining the same
sleep queue.
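
The starvation mechanism described above can be illustrated with a small
user-level simulation. This is only a sketch in Python, not the kernel code:
the SleepQueue class, the thread names and the arrival pattern are all invented
for illustration. The only behaviour borrowed from the description is that
insertion is in priority order, a wakeup takes the head of the queue, and
priorities are never recalculated while a thread sleeps.

```python
import bisect
import itertools


class SleepQueue:
    """Toy priority-ordered sleep queue (highest priority at the head)."""

    def __init__(self):
        self._entries = []             # sorted by (-priority, arrival order)
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def insert(self, name, priority):
        # Position depends only on the priority at insert time; nothing
        # ever re-sorts or ages an entry once it is on the queue.
        bisect.insort(self._entries, (-priority, next(self._seq), name))

    def wake_one(self):
        # Wake the thread at the head of the queue, if there is one.
        return self._entries.pop(0)[2] if self._entries else None

    def __contains__(self, name):
        return any(entry[2] == name for entry in self._entries)


def simulate(rounds, arrivals_per_round=1):
    """One low priority waiter competes with a stream of priority 59 waiters."""
    q = SleepQueue()
    q.insert("victim", 17)
    for i in range(5):
        q.insert(f"w{i}", 59)
    woken = []
    for r in range(rounds):
        woken.append(q.wake_one())
        for k in range(arrivals_per_round):
            q.insert(f"new{r}.{k}", 59)
    return woken, q


if __name__ == "__main__":
    woken, q = simulate(1000)
    print("victim woken:", "victim" in woken, "still queued:", "victim" in q)
```

With one new priority 59 waiter arriving per wakeup, the priority 17 thread is
still on the queue after 1000 wakeups. It is only woken once the arrivals stop
and the higher priority entries ahead of it have drained.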

james.mcpherson@aus 2001-07-10

The above condition does not just affect Oracle and async i/o! Under Solaris 2.6 KT Freetel's 
heavily loaded domain papa2 failed to respond to pings. Analysis of the core revealed that 
there were several threads on the sleepq waiting for the muxifier mutex and the thread which 
held muxifier was prioritised down to 5. Most of the other threads on the sleepq were 
prioritised way, way down (below 10) from what one might expect them to be at.

The result was therefore that the customer perceived the domain to have hung so they dropped 
it and generated a crash dump.

While trying to determine how pathological the scheduler / priority determination algorithm 
can get I came across bugids 4042155 and 4246211 which are earlier accounts of this current bug. 

This customer is particularly susceptible to this problem due to (1) incredibly aggressive 
system tuning and (2) heavily overloading the domain - in spite of APAC TSG and Geo SSE recommendations. 

cores and explorers are available at

/net/necrom.aus/tsg/calls/10096228

Radiance/APAC case number is 10096228.

 xxxxx@xxxxx.com 2001-07-12

Company               xxxxx 

[Bob Sneed wrote ...]
The Telephone Data Systems (TDS) case (Radiance #62562532) demonstrates
that a large number of threads are not required to provoke this thread
starvation issue. In the TDS case, a small set of Oracle client processes
(< 100) concurrently trying to use the same file for disk sorting resulted
in client process failures from ORA-27062.   
Also of great concern to the customer is that the same workload completes
with no client failures on their NT system, and that the variance of
client completion times is dramatically less under NT.

Threads with low priority on the sleep queue never get a chance to run while higher priority threads
keep joining it. A built-in mechanism is needed to give low priority threads a fair chance to run in a
timely manner.


Oracle data, explorer output and unix.1 and vmunix.1 files at /net/cores.central/cores/62562532.

System Configuration:  Sun Microsystems  sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 100 MHz
Memory size: 4096Mb
SunOS metaunix 5.7 Generic_106541-16
ORACLE RDBMS Version: 8.1.6.1.0.
The system was forced to panic to produce a core file while Oracle clients were logging 10 min timeouts in the trace file.
WARNING: aiowait timed out 1 times
Oracle is running with default (1) db_writer_process.
 
Threads summary:
 156   threads ran in the last second (106 user, 50 kernel)
  377   threads ran in the last minute (318 user, 59 kernel)
   19   runnable threads (16 user, 3 kernel)
    0   zombied threads
    1   stopped threads (0 user, 1 kernel)
   55   free_threads (0 user, 55 kernel)
    0   mutexes pending
    1*  rwlocks pending (1 user, 0 kernel)
  404   condition variables pending (285 user, 119 kernel)
    2*  semaphores pending (1 user, 1 kernel)
    0   user-level sobjs pending
   16   shuttle (doors) sobjs pending (16 user, 0 kernel)
    2*  threads in biowait (2 user, 0 kernel)
   19   threads in dispatch queues (16 user, 3 kernel)
    0   swapped threads
    0   interrupt threads running
 1067   total threads (876 user, 191 kernel)

There are 71 blocked Oracle threads, 68 via pwrite(), 3 via pread().
8 have run in the last 2.5 seconds (tspri>=49), while the other 63 (tspri<=46)
have not run for >= 598 seconds.

cmd: oracleTBSP tid: 0x300080cd9e0  pri: 58(TS)  idle: 0.03 sec pread+0x118
cmd: oracleTBSP tid: 0x300080cd760  pri: 58(TS)  idle: 0.03 sec pread+0x118
cmd: oracleTBSP tid: 0x30008072320  pri: 58(TS)  idle: 0.13 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007deafa0  pri: 58(TS)  idle: 0.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007ad79a0  pri: 58(TS)  idle: 0.08 sec pread+0x118
cmd: oracleTBSP tid: 0x300075f1d00  pri: 58(TS)  idle: 0.09 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300075c6160  pri: 58(TS)  idle: 0.09 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007656e40  pri: 49(TS)  idle: 2.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d5ea60  pri: 46(TS)  idle: 598.58 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300078fcde0  pri: 46(TS)  idle: 599.48 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300076ac700  pri: 46(TS)  idle: 600.04 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300076ac480  pri: 46(TS)  idle: 600.03 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e6fc80  pri: 45(TS)  idle: 598.39 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e6f780  pri: 45(TS)  idle: 598.39 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e374e0  pri: 45(TS)  idle: 598.48 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e37260  pri: 45(TS)  idle: 598.48 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e134c0  pri: 45(TS)  idle: 598.35 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e13240  pri: 45(TS)  idle: 598.35 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d90800  pri: 45(TS)  idle: 598.57 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d90580  pri: 45(TS)  idle: 598.57 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300075a5a40  pri: 45(TS)  idle: 601.24 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f47ac0  pri: 44(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f47840  pri: 44(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f461c0  pri: 44(TS)  idle: 598.37 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007edb560  pri: 44(TS)  idle: 598.08 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007edb2e0  pri: 44(TS)  idle: 598.08 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007eb1040  pri: 44(TS)  idle: 598.49 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007eb0dc0  pri: 44(TS)  idle: 598.49 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e7a8a0  pri: 44(TS)  idle: 598.49 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e7a620  pri: 44(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d411a0  pri: 44(TS)  idle: 598.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d40f20  pri: 44(TS)  idle: 598.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300079d7100  pri: 44(TS)  idle: 599.28 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300079d6e80  pri: 44(TS)  idle: 599.28 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000771aca0  pri: 44(TS)  idle: 600.35 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300074ab200  pri: 44(TS)  idle: 601.29 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f37320  pri: 43(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f370a0  pri: 43(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007ef5080  pri: 43(TS)  idle: 598.45 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e6e600  pri: 43(TS)  idle: 598.48 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007e6e380  pri: 43(TS)  idle: 598.49 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d522c0  pri: 43(TS)  idle: 598.57 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d52040  pri: 43(TS)  idle: 598.59 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007891ca0  pri: 43(TS)  idle: 599.71 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007868100  pri: 43(TS)  idle: 599.3 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300074f7c40  pri: 43(TS)  idle: 601.21 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300074b40a0  pri: 43(TS)  idle: 601.16 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300077c6820  pri: 38(TS)  idle: 600.05 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008047c00  pri: 36(TS)  idle: 598.29 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000802a060  pri: 36(TS)  idle: 598.3 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007bdba60  pri: 36(TS)  idle: 598.8 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007bdb7e0  pri: 36(TS)  idle: 598.4 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007ab0d00  pri: 36(TS)  idle: 599.09 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000757f520  pri: 36(TS)  idle: 601.27 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000757f2a0  pri: 36(TS)  idle: 601.27 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007fdcf20  pri: 35(TS)  idle: 598.31 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300080fba00  pri: 34(TS)  idle: 598.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300080fb780  pri: 34(TS)  idle: 598.21 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008046580  pri: 34(TS)  idle: 598.27 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008046300  pri: 34(TS)  idle: 598.26 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000802b960  pri: 34(TS)  idle: 598.31 sec pwrite+0x148
cmd: oracleTBSP tid: 0x3000802b6e0  pri: 34(TS)  idle: 598.31 sec pwrite+0x148
cmd: oracleTBSP tid: 0x300080011c0  pri: 34(TS)  idle: 598.31 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007faaa00  pri: 34(TS)  idle: 598.17 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007faa780  pri: 34(TS)  idle: 598.17 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f72480  pri: 34(TS)  idle: 598.19 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007f72200  pri: 34(TS)  idle: 598.19 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008092fc0  pri: 24(TS)  idle: 598.24 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008092d40  pri: 24(TS)  idle: 598.24 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007ef5d00  pri: 24(TS)  idle: 598.43 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007ef5a80  pri: 24(TS)  idle: 598.33 sec pwrite+0x148

All threads with pri <=46 are about to log the aiowait timed out error message
in the Oracle trace files.
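
A listing in this format lends itself to a quick script that splits threads
into "recently run" and "starved" groups. The following is a rough sketch in
Python, not part of any crash analysis tool: the function names and the 60
second idle threshold are assumptions made for illustration, and the sample
input is four lines copied from the listing above.

```python
import re
from collections import Counter

# Matches lines such as:
#   cmd: oracleTBSP tid: 0x30007d5ea60  pri: 46(TS)  idle: 598.58 sec pwrite+0x148
THREAD_RE = re.compile(
    r"cmd:\s+(?P<cmd>\S+)\s+tid:\s+(?P<tid>0x[0-9a-fA-F]+)\s+"
    r"pri:\s+(?P<pri>\d+)\((?P<cls>\w+)\)\s+"
    r"idle:\s+(?P<idle>\d+(?:\.\d+)?)\s+sec\s+(?P<func>\w+)\+0x[0-9a-fA-F]+"
)


def parse_threads(text):
    """Return one dict per thread line; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = THREAD_RE.search(line)
        if m:
            rows.append({
                "tid": m.group("tid"),
                "pri": int(m.group("pri")),
                "idle": float(m.group("idle")),
                "func": m.group("func"),
            })
    return rows


def summarize(rows, idle_threshold=60.0):
    """Split threads by idle time (threshold in seconds is an assumption)."""
    starved = [r for r in rows if r["idle"] >= idle_threshold]
    recent = [r for r in rows if r["idle"] < idle_threshold]
    return {
        "total": len(rows),
        "starved": len(starved),
        "max_starved_pri": max((r["pri"] for r in starved), default=None),
        "min_recent_pri": min((r["pri"] for r in recent), default=None),
        "by_func": dict(Counter(r["func"] for r in rows)),
    }


SAMPLE = """\
cmd: oracleTBSP tid: 0x300080cd9e0  pri: 58(TS)  idle: 0.03 sec pread+0x118
cmd: oracleTBSP tid: 0x30007656e40  pri: 49(TS)  idle: 2.22 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30007d5ea60  pri: 46(TS)  idle: 598.58 sec pwrite+0x148
cmd: oracleTBSP tid: 0x30008092fc0  pri: 24(TS)  idle: 598.24 sec pwrite+0x148
"""

if __name__ == "__main__":
    print(summarize(parse_threads(SAMPLE)))
```

A gap between max_starved_pri and min_recent_pri, as in this sample, is the
pattern reported here: every thread below a certain priority has been idle for
roughly the full timeout period.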

james.mcpherson@aus 2001-08-06

 xxxxx  had an ora-27062 fallover a few days ago. The first core
(0) showed a failfast timeout panic with the system mpsun10 having been up for
around 40 days. I checked the rt and ts classes. The RT class had 31 threads
in it, and these were for the usual suncluster processes. The TS class showed
one slightly interesting thread - 0x3001dea9040, pid 12812 for the reboot
command which was in suspend and idle for 153 ticks (1.53 seconds). The process 
table when viewed tree-style shows this:

fm3(vmcore.0):6> proc tree
0     sched
  3     fsflush
  2     pageout
  1     /etc/init -
    12833 /usr/lib/saf/sac -t 300
      12850 /usr/lib/saf/ttymon
    8601  ksh -o vi
      12812 reboot
    3037  /opt/openFT/ft/ftvfsm -a
    3033  /opt/bin/fta -sn
    1886  /opt/SUNWpnm/bin/pnmd -s -c mpsunc01 -l 0
    1427  /opt/SUNWcluster/bin/ccdd -f /etc/opt/SUNWcluster/conf/ccd.database.init
    1373  /opt/SUNWcluster/bin/clustd -f /etc/opt/SUNWcluster/conf/mpsunc01.cdb
      12813 /bin/ksh -p /opt/SUNWcluster/bin/reconf_ener cmmabort mpsunc01
        12861 /bin/ksh /opt/SUNWcluster/etc/reconf/conf.d/rcA.d/05_loghost
          12893 /opt/SUNWcluster/bin/timed_run 10 /opt/SUNWcluster/ha/oracle/oracle_fm_stop  cw
            12894 /bin/ksh /opt/SUNWcluster/ha/oracle/oracle_fm_stop  cwhlh2,cwhlh1 60
              12905 /bin/ksh /opt/SUNWcluster/bin/oracle_status_svcs -mode all -hosts cwhlh2,cwhlh1
                12916 tr \012  
                  12917 hareg -q oracle -H
                    12918 sh -c /opt/SUNWcluster/bin/ccdadm mpsunc01  -w
    1093  -sh
    1065  /opt/SUNWsma/bin/smad
      1066  /opt/SUNWsma/bin/smad
    783   /opt/etc/tnsxd
    60    /usr/lib/devfsadm/devfseventd
    18    vxconfigd -m boot


which I take to indicate that the cluster software was trying to shut down the
system because the reboot command had been issued. There are some hardware error
messages listed in the msgbuf for ssd14 (/sbus@2,0/SUNW,socal@d,10000/sf@0,0/ssd@w2100002037901d75,0)
which is c4t19d0. There are also some read and vxfs vx_nospace errors,
presumably as a result of c4t19d0's unrecoverable media errors.

I didn't see anything else of interest in core 0.

The second core (2) did show some interesting things. Firstly, there are three
Oracle instances running - DWHI, DMPOI and DWH_TEST. Both DWHI and DWH_TEST have
three dbwrs, and DMPOI only has one. Interestingly, each dbwr had 258 threads 
associated with it. DWHI's ckpt had 23 threads, its lgwr had 24. DMPOI's ckpt
and lgwr each had 11 threads. DWH_TEST's ckpt had 11 while its lgwr had 38 
threads. Out of the 203 processes in the system 88 were oracle-owned or 
oracle-related.

Three threads were in biowait, one listener and two dbwr but all for the DWHI instance:

thread: 0x30021c59820  pid:  6311  cmd: oracleDWHI (LOCAL=NO)
idle: 2 ticks (0.02 seconds)
age:  14 ticks (0.14 seconds)
buf @ 0x3000dec4730
  b_bcount: 49152
  b_edev: 209(vxio),85004
  b_vp: 0x0
  b_blkno: 0x3e610e0

thread: 0x3000ef68d40  pid:  5276  cmd: ora_dbw2_DWHI
idle: 0 ticks (0 seconds)
age:  49271 ticks (8 minutes 12.71 seconds)
buf @ 0x3000de347e8
  b_bcount: 16384
  b_edev: 209(vxio),85004
  b_vp: 0x0
  b_blkno: 0x3358400

thread: 0x30007d89a60  pid:  5276  cmd: ora_dbw2_DWHI
idle: 0 ticks (0 seconds)
age:  43595 ticks (7 minutes 15.95 seconds)
buf @ 0x300113077b8
  b_bcount: 16384
  b_edev: 209(vxio),85001
  b_vp: 0x0
  b_blkno: 0x4f8b4f0

   3 matching threads found.

threads in biowait() by device:
count   device
    2   209(vxio),85004
    1   209(vxio),85001
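
The idle and age figures above are given in clock ticks with the converted
time in parentheses. The pairs shown (for example 153 ticks as 1.53 seconds)
are consistent with 100 ticks per second. The helper below is a small Python
sketch that assumes that rate; the function name is made up for illustration.

```python
HZ = 100  # assumed tick rate, inferred from "153 ticks (1.53 seconds)"


def ticks_to_min_sec(ticks, hz=HZ):
    """Convert a tick count to (minutes, seconds)."""
    whole_seconds, leftover_ticks = divmod(ticks, hz)
    minutes, seconds = divmod(whole_seconds, 60)
    return minutes, seconds + leftover_ticks / hz


if __name__ == "__main__":
    for ticks in (153, 49271, 43595):
        minutes, seconds = ticks_to_min_sec(ticks)
        print(f"{ticks} ticks = {minutes} min {seconds:.2f} sec")
```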

And the same three threads are waiting on semaphores. The filesystems which 
have 209,8500[4|1] as major,minor are

0x30008316c18 0x300002cabc8/vx_vfs   209(vxio),85004 vxfs     /cwhlh2/(0x3000821dc48<VROOT>)/oradata05
0x300034dfeb8 0x30007709480/vx_vfs   209(vxio),85001 vxfs     /cwhlh2/(0x3000821dc48<VROOT>)/oradata02


For condition variables (cv) there are 1052 threads, with 651 waiting on the
same one: 0x3000e304fc8. Amazingly (not!) these are all for 
the DWHI instance:  

 228 ora_dbw0_DWHI
 231 ora_dbw1_DWHI
 188 ora_dbw2_DWHI
   4 oracleDWHI

And the list of those which were waiting more than 1 minute is
  thread        pri         idle   pid         wchan command
  0x3001706c000  31     10m1.41s  3462 0x3000e304fc8 oracleDWHI (LOCAL=NO)
  0x30011334800  31     10m1.39s  3462 0x3000e304fc8 oracleDWHI (LOCAL=NO)
  0x30016553280  31     10m1.38s  3462 0x3000e304fc8 oracleDWHI (LOCAL=NO)
  0x3001707e320  31     10m1.37s  3462 0x3000e304fc8 oracleDWHI (LOCAL=NO)
  0x3000ef41800  49    26m32.41s  5268 0x3000e304fc8 ora_dbw0_DWHI
  0x3000ee5f2e0  49    28m25.27s  5268 0x3000e304fc8 ora_dbw0_DWHI
  0x30003bcba40  49    34m55.72s  5268 0x3000e304fc8 ora_dbw0_DWHI
  0x3000ef002e0  49     2m55.90s  5268 0x3000e304fc8 ora_dbw0_DWHI
  0x3000dd7d820  49    14m42.80s  5272 0x3000e304fc8 ora_dbw1_DWHI
  0x3000ee20560  49    34m44.80s  5272 0x3000e304fc8 ora_dbw1_DWHI
  0x3000eefcae0  49     5m38.21s  5272 0x3000e304fc8 ora_dbw1_DWHI
  0x3000ee765c0  49     34m9.99s  5276 0x3000e304fc8 ora_dbw2_DWHI
  0x3000edf8820  49     34m9.73s  5276 0x3000e304fc8 ora_dbw2_DWHI
  0x3000eda6da0  49    25m38.95s  5276 0x3000e304fc8 ora_dbw2_DWHI


This core also shows the vx_nospace message, consistently for the /dev/vx/dsk/cwhlh2dg/vol01
filesystem which is mounted as /cwhlh2/oradata01.... Let's check the memory...

meminfo shows that priority_paging was not operational, desfree and lotsfree
were not set. They are using vxfs and have not tuned ncsize to be between 
50-80% of the vxfs_ninode value. With the defaults for this system ncsize is 
set at approx 7.9% of vxfs_ninode which is not quite good enough.
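
The ncsize guideline quoted above is simple arithmetic and can be checked with
a few lines of Python. This is only a sketch: the function names are invented,
and the numbers in the example are hypothetical values chosen to reproduce a
7.9% ratio, not the actual settings from this system.

```python
def ncsize_ratio(ncsize, vxfs_ninode):
    """Return ncsize as a fraction of vxfs_ninode."""
    if vxfs_ninode <= 0:
        raise ValueError("vxfs_ninode must be positive")
    return ncsize / vxfs_ninode


def suggested_ncsize_range(vxfs_ninode, low_pct=50, high_pct=80):
    """Return the (low, high) ncsize values for the 50-80% guideline."""
    return (vxfs_ninode * low_pct // 100, vxfs_ninode * high_pct // 100)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    ncsize, vxfs_ninode = 7900, 100000
    print(f"ratio: {ncsize_ratio(ncsize, vxfs_ninode):.1%}")
    print("suggested range:", suggested_ncsize_range(vxfs_ninode))
```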

The RT class (52 threads) shows the usual cluster processes (clustd, rpc.pnmd 
etc), nothing unusual or interesting. The TS class (2440 threads) shows that 
most of the threads it contains are at priority 58:

num  priority
   9 0
   5 10
   2 12
   4 20
   2 21
   5 24
  11 25
   2 28
   1 29
   4 30
  13 31
   6 32
   3 33
  30 34
  26 38
   1 42
   2 43
   1 44
   3 45
  31 48
  43 49
   5 50
   6 52
  18 53
   6 54
   2 55
2037 58
 161 59
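
The distribution above can be summarised numerically. The sketch below, in
Python, copies the counts from the table; the helper name is invented for
illustration. Note that the listed counts add up to 2439, one fewer than the
2440 TS threads quoted above; the sketch does not attempt to explain the
difference.

```python
# priority -> number of TS threads, copied from the table above
TS_HISTOGRAM = {
    0: 9, 10: 5, 12: 2, 20: 4, 21: 2, 24: 5, 25: 11, 28: 2, 29: 1,
    30: 4, 31: 13, 32: 6, 33: 3, 34: 30, 38: 26, 42: 1, 43: 2, 44: 1,
    45: 3, 48: 31, 49: 43, 50: 5, 52: 6, 53: 18, 54: 6, 55: 2,
    58: 2037, 59: 161,
}


def share_at_or_above(hist, priority):
    """Fraction of all listed threads at or above the given priority."""
    total = sum(hist.values())
    return sum(n for p, n in hist.items() if p >= priority) / total


if __name__ == "__main__":
    total = sum(TS_HISTOGRAM.values())
    print("threads listed:", total)
    print(f"at priority 58: {TS_HISTOGRAM[58] / total:.1%}")
    print(f"at priority 58 or 59: {share_at_or_above(TS_HISTOGRAM, 58):.1%}")
```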


Just as a frinstance if we check the threads at priority 48, then we see that
23 of these are lgwr threads for the DWH_TEST instance, and they have been idle 
for approx 2 days 17 hours 34 minutes and 43 seconds. I would not normally have 
said that priority 48 was all that low. Priority 58 has the interesting stuff - 
heaps of threads from DWH_TEST idle for more than 13 hours, and lots of threads 
from DWHI which have been idle for more than 1 day.

Each dbwr has some threads in the kaio stack but the majority of the kaio 
threads are all from the one instance - DWHI. Those threads are also all (except 
for one listener) at priority 58 and have been idle in aio_cleanup_thread for 
more than 10 minutes. The threads in aiowait have been idle for between 1 tick 
and 145 ticks.

Coming back to the fm thread summary:

  173   threads ran in the last second (105 user, 68 kernel)
  976   threads ran in the last minute (886 user, 90 kernel)
    2   runnable threads (0 user, 2 kernel)
    0   zombied threads
    1   stopped threads (0 user, 1 kernel)
   80   free_threads (0 user, 80 kernel)
    0   mutexes pending
    0   rwlocks pending
 1052   condition variables pending (926 user, 126 kernel)
    4*  semaphores pending (3 user, 1 kernel)
 1538   user-level sobjs pending (1538 user, 0 kernel)
    0   shuttle (doors) sobjs pending
    3*  threads in biowait (3 user, 0 kernel)
    2   threads in dispatch queues (0 user, 2 kernel)
    2*  threads in dispq of cpu running idle thread (0 user, 2 kernel)
    0   swapped threads
    0   interrupt threads running
 2712   total threads (2486 user, 226 kernel)


This clearly isn't a hung system - I'd say it was just overloaded ;|

Infineon's system is an e6500, 8x400MHz/8Mb cache cpus, 8Gb ram. The root/usr/var fs and swap are all mirrored with vxvm. The cluster filesystems (shared dgs etc) are all on vxfs.



Work Around
Use vxfs quick i/o which avoids the bottleneck of a lock serialising write
access to the file.

bob.sneed@East 2001-07-16

UFS forcedirectio (S8 @ U3) also avoids the single-writer lock, but also
removes any benefit Oracle might gain from use of the UNIX buffer cache, and
so cannot be recommended blindly.  VxFS QIO, beyond avoiding the single
writer lock, causes the KAIO call from libaio to succeed, thus avoiding the
LWP AIO inheritance of the TS scheduler bug by fully avoiding the LWP AIO
code path.  VxFS QIO incurs additional administrative overhead, and also
complicates the matter of growing file sizes.  Thus, each of these actions
has side effects that complicate promoting them as "workarounds".

While the broader class of ORA-27062 timeouts can only be worked around by
disabling AIO in Oracle, disabling AIO in Oracle requires alternate Oracle
tuning be used (eg: multiple DB writers or use of I/O slaves), and for all
but the most read-biased workloads cannot result in performance equivalent
to that obtainable with AIO.

In most cases with Oracle, optimal tuning and disk layout can dramatically 
reduce the probability of an ORA-27062 incident from this bug, but these measures
hardly qualify as a "workaround".  With or without optimal tuning, this bug
impacts Oracle users seemingly at random, and can be provoked by transient
I/O competition, soft hardware errors, etc - at I/O levels far below what one
would predict could possibly cause 10 minute timeouts.

The single biggest tunable in Oracle which throttles the probability of dying
due to this bug is db_writer_processes.  This should not be tuned past 1 when
using AIO unless measurable business benefit is attained by doing so.

The only real workaround to the TS scheduling bug per se (as it impacts
Oracle by the bug being inherited by the LWP-based AIO) is to manipulate
the scheduler so as to implement a fixed-priority scheduling scheme for
Oracle background processes.  This should prevent Oracle processes and
their associated AIO LWPs from priority shuffling that can lead to thread
starvation.  Cases such as the TDS case demonstrate that client processes
are also vulnerable to thread starvation per this bug.  One might approach
the client scenario by ensuring that the Oracle listener runs with a flat
scheduling priority scheme (perhaps at a priority just below that of the
Oracle background processes).  Beware however in the case of client processes
that local client processes can be children of various server-side processes
which may be launched from shells that rightfully belong in the TS class.
Thus, reining in all these shadow processes to a fixed priority scheme could
be problematic.
Comments
N/A