OpenSolaris

Printable Version Enter a New Search
Bug ID 6498304
Synopsis too much CPU time winding up in LMS_WAIT_CPU
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:sched
Keywords accounting | caps | microstate
Responsible Engineer Krister Johansen
Reported Against
Duplicate Of
Introduced In
Commit to Fix snv_57
Fixed In snv_57
Release Fixed solaris_nevada(snv_57) , solaris_10u5(s10u5_02) (Bug ID:2151108)
Related Bugs 6327235
Submit Date 28-November-2006
Last Update Date 24-September-2007
Description
While running a few different microbenchmarks, (among them, "fork" in libmicro), it was observed that while ~ 90% of the CPU time was being chalked up to "sys", the lwp microstate data (via prstat -m) reflected that ~50% was winding up in sys, and another ~40% was going into LMS_WAIT_CPU. In this case, almost half of the CPUs time is winding up in an lwp microstate bucket that ideally, would have no "real" CPU time in it.

This may be an artifact of a curious little window of time where a thread is both on the run queue, and running on the CPU. Because threads manage themselves in the dispatcher, there is a window during a context switch where it has enqueued itself on a run queue somewhere, but hasen't made it to swtch()/resume() (to actually yield). By the nature of these microbenchmarks, more time is being spent in the dispatcher, so we have many more of these little windows, potentially contributing a non-trivial amount of time to the wrong place.

Ideally, the begin time for "LMS_WAIT_CPU" shouldn't begin until the thread actually switches off, because:
	- the thread isn't yet waiting, it's still running. So that time should go to sys
	- LMS_WAIT_CPU implies that the euqueued thread is runnable, and ready...but
	  during this window it isn't.
This bug is very relevant for the CPU caps project which uses LMS_USER+LMS_SYSTEM times
as a measure of CPU time used by a thread. When this time is under-accounted because
a bunch of time is reported as LMS_WAIT_CPU, the project/zone will get more CPU time
than allowed by a CPU cap.

DTrace experiments confirmed that overall time on CPU as measured by on-cpu/off-cpu
probes is larger than microstate-based adjustments based on LMS_USER+LMS_SYSTEM times
for fork/exit micro-benchmark.
You can also see the same problem by running the following little program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	for (;;)
		yield();
}

Running prstat will show that CPU time is noticeably less than the actual CPU time
used by this program.
The program calling yield() in a loop on a single-CPU system uses 100% of CPU time, 
while prstat reports only 56% of CPU time (and prstat -m shows 44% in the LAT column).

The following piece of code in setbackdq()/setfrontdq() is responsible for it:

        if (tp->t_waitrq == 0) {
                hrtime_t curtime;

                curtime = gethrtime_unscaled();
                (void) cpu_update_pct(tp, curtime);
                tp->t_waitrq = curtime;
        } 

In this case we over-account for the wait time.
Work Around
N/A
Comments
N/A