OpenSolaris

Bug ID 4518644
Synopsis I/O wait statistic is still misleading and should be dropped
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:other
Keywords SAE | observability | ssperf
Responsible Engineer Krister Johansen
Reported Against sunos
Duplicate Of
Introduced In
Commit to Fix s10_60
Fixed In s10_60
Release Fixed solaris_10(s10_60)
Related Bugs 4116873 , 6199237 , 6215332 , 6218160
Submit Date 24-October-2001
Last Update Date 24-November-2005
Description
In the past, it was possible for an otherwise idle machine with many CPUs
to show 100% I/O wait if just one thread was blocked on long-term I/O ...

  e.g. mt -f /dev/rmt/<N> rewind

This was addressed by the fix provided for ...

  4116873: I/O wait statistic misleading

However, this fix is incomplete, and I/O wait continues to be a misleading
and confusing statistic to many customers.

For example, let's take the case of an otherwise idle, 64-way E10K which
just happens to have 64 tape drives ...

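  #!/bin/ksh
  # script to show 100% I/O wait on 64-way with 64 tape drives

  n=0
  while (( n < 64 ))
  do
    pbind -b $n $$                # bind this shell (and its children) to CPU n
    mt -f /dev/rmt/$n rewind &    # background, so all 64 rewinds overlap
    (( n += 1 ))
  done
  wait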


  #!/bin/ksh
  # script to show 1-2% I/O wait on 64-way with 64 tape drives

  n=0
  while (( n < 64 ))
  do
    pbind -b 0 $$                 # every rewind is submitted from CPU 0
    mt -f /dev/rmt/$n rewind &    # background, so all 64 rewinds overlap
    (( n += 1 ))
  done
  wait

Ok, this is contrived ... but it does show that the same I/O load could
exhibit 1% to 100% I/O wait on the same system ... it just depends on
which CPU(s) the I/O was submitted from.
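
The range can be checked with a little arithmetic. The following is only a
sketch, assuming that each otherwise idle CPU with at least one I/O outstanding
against it is accounted as waiting; the function name iowait_pct is made up for
illustration and is not part of any Solaris utility.

```shell
#!/bin/sh
# Hypothetical helper: system-wide I/O wait percentage when $1 otherwise
# idle CPUs (out of $2 in total) each have at least one I/O outstanding.
iowait_pct() {
    awk -v k="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * k / n }'
}

iowait_pct 64 64   # rewinds submitted from 64 different CPUs
iowait_pct 1 64    # all 64 rewinds submitted from CPU 0
```

Under that assumption the first case reports 100% and the second about 1.6%,
although the tape drives are doing exactly the same work in both.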

Also, a system with one application seeing 100% I/O wait could have this
masked by another, unrelated application which was 100% user ...

  #!/bin/ksh
  # script to consume 100% user on 64-way machine

  n=0
  while (( n < 64 ))
  do
    ( while :; do :; done ) &     # one spinning subshell per CPU
    (( n += 1 ))
  done
  wait

This script would have no impact on the 64 tapes rewinding, but the system
would then show 0% I/O wait (instead of 1% to 100%).
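
The masking follows from the order in which a CPU's time is classified. The
sketch below is a simplified model, not the kernel's code: it assumes a CPU is
charged to I/O wait only when it is idle and has I/O outstanding, so any
running thread takes precedence. The function cpu_state and its arguments are
hypothetical.

```shell
#!/bin/sh
# Hypothetical model of how one CPU sample is classified.
#   $1 = 1 if a thread is running on the CPU, 0 if the CPU is idle
#   $2 = number of I/Os outstanding against the CPU
cpu_state() {
    if [ "$1" -eq 1 ]; then
        echo user      # a running thread hides any pending I/O
    elif [ "$2" -gt 0 ]; then
        echo wait      # idle CPU with I/O outstanding
    else
        echo idle
    fi
}

cpu_state 0 1   # tape rewinding, CPU otherwise idle
cpu_state 1 1   # same rewind, plus a spinning loop on that CPU
```

In this model the rewind is reported as wait in the first call and disappears
into user in the second, even though the I/O itself is unchanged.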

I/O wait continues to cause confusion to customers. Here is a recent
example:

Customer: My system runs faster when I turn off CPUs.

Me:       How are you measuring this?

Customer: I start with 4 CPUs and see 60% I/O wait and 0% idle. Then I
          turn off 2 CPUs and get just 20% I/O wait and 0% idle.

Me:       There is no difference ... 60% I/O wait and 0% idle indicates
          40% utilisation. This would correspond to 80% utilisation in a
          2 CPU machine ... which is where the 20% figure comes from.
          Did you measure any difference in application performance?

Customer: No.

This is not uncommon. The confusion would be avoided if the current I/O
wait statistic was incorporated into the idle statistic.
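
The arithmetic in the exchange above can be reproduced as follows. This is a
sketch that assumes the application does the same amount of CPU work before and
after the CPUs are turned off; the helper name rescale_util is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: rescale a utilisation percentage from one CPU count
# to another, assuming the same amount of CPU work is being done.
#   $1 = busy percentage, $2 = old CPU count, $3 = new CPU count
rescale_util() {
    echo $(( $1 * $2 / $3 ))
}

busy=$(( 100 - 60 - 0 ))                 # 4 CPUs: 60% I/O wait, 0% idle
new_busy=$(rescale_util "$busy" 4 2)     # same work spread over 2 CPUs
echo "utilisation on 2 CPUs: ${new_busy}%"
echo "left for I/O wait and idle: $(( 100 - new_busy ))%"
```

The 40% utilisation on 4 CPUs becomes 80% on 2 CPUs, leaving the 20% that the
customer saw reported as I/O wait.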

There is no need to change the utilities - indeed they should probably
be left unchanged for the time being for compatibility reasons. This fix
can easily be applied at the kstat level simply by changing CPU_WAIT and
CPU_IDLE.
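
The effect on what the utilities would report can be illustrated as below. This
is not the kernel change itself, only a sketch of the resulting numbers: it
assumes four per-CPU tick counters (idle, user, kernel, wait) have already been
obtained by some means, and the function name fold_wait is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: fold the wait counter into idle and report
# wait as zero. Arguments are tick counts: idle user kernel wait.
fold_wait() {
    echo "$(( $1 + $4 )) $2 $3 0"
}

# idle=0 user=30 kernel=10 wait=60
fold_wait 0 30 10 60
```

With these inputs the output is "60 30 10 0": the utilities keep their wait
column, but it always reads zero and the time appears as idle.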
Work Around
N/A
Comments
N/A