OpenSolaris

Bug ID 6811996
Synopsis acpi_cpu_cstate() in idle thread context is trying to print a message on offlined/quiesced CPU
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:arch-x86
Keywords
Responsible Engineer Bill Holler
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_111
Fixed In snv_111
Release Fixed solaris_nevada(snv_111) , solaris_10u8(s10u8_02) (Bug ID:2175201)
Related Bugs 6700904 , 6807891
Submit Date 2-March-2009
Last Update Date 25-March-2009
Description
On a Nehalem system, I was able to reproduce this panic while running togproc:

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb

WARNING: _CST: cs_type c66d44f8 bad asid type bb


panic[cpu13]/thread=c68a6dc0:
assertion failed: (cp->cpu_flags & CPU_QUIESCED) == 0, file: ../../common/disp/disp.c, line: 1259


c68a6968 genunix:assfail+5a (fe8f7134, fe8f72f0,)
c68a69b8 unix:setbackdq+4b4 (c765fdc0)
c68a69d8 genunix:sleepq_wakeone_chan+67 (fec971f0, c5116a98,)
c68a6a08 genunix:cv_signal+95 (c5116a98, f6a8c04b)
c68a6a48 genunix:taskq_bucket_dispatch+c4 (c39d5f0c, fea40d6c,)
c68a6a98 genunix:taskq_dispatch+f2 (c4f19848, fea40d6c,)
c68a6ac8 genunix:qenable_locked+148 (cac95930, c68a6aec,)
c68a6b08 genunix:putq+3aa (cac95930, c8efc280,)
c68a6b68 genunix:log_sendmsg+2d8 (c5107bc0, 0, 5b, 0)
c68a6ca8 genunix:cprintf+3c9 (fe8ec55c, c68a6cf8,)
c68a6ce8 genunix:cmn_err+4b (2, fe8ec55c, c66d44)
c68a6d48 unix:acpi_cpu_cstate+324 (c64db840, 20018962,)
c68a6d88 unix:cpu_acpi_idle+123 (c7810980, c63fb240,)
c68a6d98 unix:cpu_idle_adaptive+12 (0, 0, c68a6db8, fe8)
c68a6da8 unix:idle+56 (0, 0)
c68a6db8 unix:thread_start+8 ()

[13]>
The dump file shows acpi_cpu_cstate() was called with a
cpu_acpi_cstate_t structure one past the end of the cstate array.
The array element before this is the C3 element.

acpi_cpu_cstate+0x324(c64db840, 20018962, c68a6d88, fe80674e)
cpu_acpi_idle+0x123()
cpu_idle_adaptive+0x12()
idle+0x56(0, 0)
thread_start+8()
> c64db840 ::print -a cpu_acpi_cstate_t
{
    c64db840 cs_addrspace_id = 0xfeedfabb
    c64db844 cs_address = 0x3ec1
    c64db848 cs_type = 0xc66d44f8
    c64db84c cs_latency = 0x677d8c15
    c64db850 cs_power = 0xbaddcafe
    c64db854 promotion = 0xbaddcafe
    c64db858 demotion = 0xbaddcafe
    c64db85c cs_ksp = 0xbaddcafe
}
> c64db820 ::print cpu_acpi_cstate_t
{
    cs_addrspace_id = 0x1
    cs_address = 0x415
    cs_type = 0x3       <---- ACPI C3 state
    cs_latency = 0xf5
    cs_power = 0x15e
    promotion = 0
    demotion = 0
    cs_ksp = 0xc62bb000
}


What is going on is:
    cpu_acpi_idle() determined the CPU should enter the C3 idle state.
    cpu_acpi_idle() assumes the structure for C3 is in
    cstate[CPU_ACPI_C3 -1].

This system (nehalem2) does not have an ACPI C2 state.  The cstate array
was initialized as:
	cstate[0] = ACPI C1 info	// correct
	cstate[1] = ACPI C3 info	// incorrect {should be C2 info}
	cstate[2] = Uninitialized	// incorrect {should be C3 info}

This panic and bugid 6807891 have the same cause: Solaris C-state support
expects both C2 and C3 to exist.
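
The indexing assumption above can be illustrated with a small user-space
sketch.  This is not the kernel code and not necessarily what the fix does:
sketch_cstate_t is a hypothetical, trimmed-down stand-in for
cpu_acpi_cstate_t, both lookup helpers are invented names, and the C1
latency of 1 is a placeholder (only the C3 latency 0xf5 comes from the dump).
It contrasts a positional lookup (type - 1) with a lookup that searches the
table by cs_type, on a table holding only C1 and C3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ACPI C-state type numbers: C1 = 1, C2 = 2, C3 = 3. */
enum { CPU_ACPI_C1 = 1, CPU_ACPI_C2 = 2, CPU_ACPI_C3 = 3 };

/* Hypothetical, trimmed-down stand-in for cpu_acpi_cstate_t. */
typedef struct sketch_cstate {
	uint32_t cs_type;
	uint32_t cs_latency;
} sketch_cstate_t;

/*
 * Positional lookup: assumes every state C1..Cn is present, so that
 * state Cn lives at index n - 1.  The bounds check is added for this
 * sketch; without it, asking for C3 on a two-entry table reads one
 * element past the end of the array.
 */
static const sketch_cstate_t *
cstate_by_index(const sketch_cstate_t *tbl, size_t count, uint32_t type)
{
	assert(type >= CPU_ACPI_C1);
	size_t idx = type - 1;
	return (idx < count ? &tbl[idx] : NULL);
}

/* Lookup by type: does not care whether intermediate states exist. */
static const sketch_cstate_t *
cstate_by_type(const sketch_cstate_t *tbl, size_t count, uint32_t type)
{
	for (size_t i = 0; i < count; i++) {
		if (tbl[i].cs_type == type)
			return (&tbl[i]);
	}
	return (NULL);
}

/* Table shaped like the one on this system: C1 and C3, no C2. */
static const sketch_cstate_t nehalem_tbl[] = {
	{ CPU_ACPI_C1, 1 },	/* latency is a placeholder value */
	{ CPU_ACPI_C3, 0xf5 },	/* latency taken from the dump above */
};
```

With this table, cstate_by_index() for CPU_ACPI_C2 returns the C3 entry and
for CPU_ACPI_C3 falls outside the table, while cstate_by_type() returns the
C3 entry for CPU_ACPI_C3 and NULL for CPU_ACPI_C2.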


------------------------------------------------------------------------
Idle threads cannot call cmn_err() because they cannot block.
The cmn_err() that caused this panic was removed during development
because it no longer served its purpose after code re-arrangement
and because idle threads cannot block.  I am not sure how this
cmn_err() got back in here.  :-(
We did not think acpi_cpu_idle() could ever be called with this bogus
cpus[2] entry because the latency for the ACPI C2 state, "cs_C2_latency",
was initialized to a very large value, CPU_CSTATE_LATENCY_UNDEF, that
would keep the c-state selection algorithm from ever getting past it.
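
One common way to respect the "idle threads cannot block" rule is to have
the idle path only record that something went wrong and leave the printing
to a context that is allowed to block.  The following is a user-space sketch
of that pattern using C11 atomics, not the change made for this bug; both
function names and the message text are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Count of bad C-state entries seen from the (simulated) idle path. */
static atomic_uint bad_cstate_events;

/* Idle-path side: a single atomic increment, no locks, no logging. */
static void
idle_note_bad_cstate(void)
{
	atomic_fetch_add_explicit(&bad_cstate_events, 1,
	    memory_order_relaxed);
}

/*
 * Reporting side: runs in a context that may block.  Fetches and
 * clears the counter, prints one summary line if anything was
 * recorded, and returns how many events were reported.
 */
static unsigned
report_bad_cstate(FILE *out)
{
	assert(out != NULL);
	unsigned n = atomic_exchange_explicit(&bad_cstate_events, 0,
	    memory_order_relaxed);
	if (n != 0)
		fprintf(out, "WARNING: %u bad C-state entries seen\n", n);
	return (n);
}
```

The idle side never touches the logging machinery, so it cannot end up in
the taskq_dispatch()/cv_signal() path shown in the panic stack.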

The c-state selection algorithm keeps track of data about the CPU's
idle duration etc.  This data is not cleared when the CPU goes offline
and then online, as was the case on this system, which is running a
CPU online/offline test.  (This is normally not a problem.)  The c-state
selection code counted the offline time as idle time in its
statistics calculations.  That is why the CPU was thought to be able
to go idle long enough to select C2 with latency CPU_CSTATE_LATENCY_UNDEF.
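
The effect can be modelled with a deliberately simplified sketch.  Everything
here is an assumption made for illustration: the running-average formula, the
selection rule (pick the deepest state whose latency is below the average
idle time), SKETCH_LATENCY_UNDEF (a placeholder, not the real value of
CPU_CSTATE_LATENCY_UNDEF) and all function names.  It only shows how one
huge "idle" sample can push the selection past a sentinel latency, and how
clearing the statistics avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder sentinel; NOT the real CPU_CSTATE_LATENCY_UNDEF value. */
#define	SKETCH_LATENCY_UNDEF	1000000u

/* Hypothetical per-CPU idle statistics. */
typedef struct sketch_idle_stats {
	uint64_t avg_idle;	/* running average idle time, arbitrary units */
} sketch_idle_stats_t;

/* Fold one idle sample into the running average. */
static void
stats_record_idle(sketch_idle_stats_t *s, uint64_t duration)
{
	s->avg_idle = (s->avg_idle + duration) / 2;
}

/* Clear the statistics, e.g. when a CPU comes back online. */
static void
stats_reset(sketch_idle_stats_t *s)
{
	s->avg_idle = 0;
}

/*
 * Return the index of the deepest state whose latency is below the
 * average idle time.  Stops at the first state that is too expensive,
 * so a sentinel latency normally blocks everything behind it.
 */
static int
select_cstate(const sketch_idle_stats_t *s, const uint32_t *latency, int n)
{
	assert(n > 0);
	int chosen = 0;
	for (int i = 1; i < n; i++) {
		if (s->avg_idle > latency[i])
			chosen = i;
		else
			break;
	}
	return (chosen);
}
```

With latencies { 1, SKETCH_LATENCY_UNDEF, 0xf5 }, ordinary samples keep the
selection at index 0.  A single sample of 10000000 units (standing in for
time spent offline) lifts the average above the sentinel, so the selection
reaches index 2.  Calling stats_reset() before the next sample brings it
back to index 0.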
Webrev is here:
	http://cr.opensolaris.org/~bholler/6807891wr/index.html

hg pdiffs are attached in file "my_pdiffs".
Work Around
If CPUs are to be offlined and then brought online again, first disable
deep C-states by adding this entry to /etc/system:
	set idle_cpu_no_deep_idle=1
and then run pmconfig.
Comments
N/A