OpenSolaris

Bug ID 6617465
Synopsis Pentium IIIs die with divide error trap
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:fm
Keywords fma_s10u5
Responsible Engineer Gavin Maltby
Reported Against snv_76
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_77
Fixed In snv_77
Release Fixed solaris_nevada(snv_77), solaris_10u5(s10u5_05) (Bug ID:2155806)
Related Bugs 6619031, 6620537
Submit Date 16-October-2007
Last Update Date 13-November-2007
Description
After bfu'ing pretty much the b76 bits, both PIII systems I use failed rather
spectacularly:

features: 1003fff<cpuid,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>                                                                              
mem = 3145212K (0xbff7f000)                                                     
                                                                                
panic[cpu0]/thread=fec1f7a0: BAD TRAP: type=0 (#de Divide error) rp=fec3881c addr=fec20a38                                                                      
                                                                                
#de Divide error                                                                
addr=0xfec20a38                                                                 
pid=0, pc=0xfe8042b8, sp=0xfec1f7a0, eflags=0x10046                             
cr0: 80050011<pg,wp,et,pe> cr4: 98<pge,pse,de>                                  
cr2: 0cr3: 2538000                                                              
         gs: c43c01b0  fs:                                                      
panic[cpu0]/thread=fec1f7a0: BAD TRAP: type=8 (#df Double fault) rp=fec24a4c addr=0                                                                             

The panic appears to be here:

uint_t
cmi_ntv_hwcoreid(cpu_t *cp)
{
        return (cpuid_get_coreid(cp) % cpuid_get_ncore_per_chip(cp));
}

On inspection, the code never sets the number of cores per chip on some systems:

        if (cpi->cpi_xmaxeax & 0x80000000) {
                ....
                cpi->cpi_ncore_per_chip = ....
        }

Unfortunately, if the first test fails, the number of cores per chip remains 0 and
the divide error panic is triggered.
Oops, test base did not include PIII :-(
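
To make the failure mode concrete, here is a minimal user-space sketch of the
conditional initialization described above.  It is not the kernel code: the
struct, both functions, and the reported_cores parameter are hypothetical
stand-ins, and the real computation elided above is not reproduced.  The
sketch only assumes that the state starts out zeroed and that the assignment
happens solely inside the cpi_xmaxeax test.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's per-CPU cpuid state. */
struct sketch_cpuid_info {
        uint32_t cpi_xmaxeax;           /* highest extended cpuid leaf */
        unsigned cpi_ncore_per_chip;    /* stays 0 unless assigned below */
};

/*
 * Model of the conditional initialization.  reported_cores is an assumed
 * input standing in for whatever the elided kernel code computes.
 */
static void
sketch_cpuid_pass(struct sketch_cpuid_info *cpi, uint32_t xmaxeax,
    unsigned reported_cores)
{
        memset(cpi, 0, sizeof (*cpi));
        cpi->cpi_xmaxeax = xmaxeax;
        if (cpi->cpi_xmaxeax & 0x80000000) {
                cpi->cpi_ncore_per_chip = reported_cores;
        }
}

/* Returns 1 when a later "% cpi_ncore_per_chip" would divide by zero. */
static int
sketch_divisor_is_zero(const struct sketch_cpuid_info *cpi)
{
        return (cpi->cpi_ncore_per_chip == 0);
}
```

With xmaxeax set to 0 (no extended leaves, as assumed here for a PIII-like
CPU) the field is left at 0, which is the value that then reaches the modulus
in cmi_ntv_hwcoreid.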

It looks like cpi_ncpu_per_chip *is* always initialized in cpuid.c, but
cpi_ncore_per_chip is only initialized in certain cases and otherwise
left as 0.  So callers of cpuid_get_ncpu_per_chip (which include the new
cmi_ntv_hwstrandid and a number of long-established callers) are ok,
but it appears that my recent putback is the first actual caller of
cpuid_get_ncore_per_chip so it's unhappy when it finds 0.

Obviously we can protect against this in cmi_ntv_hwcoreid and
cmi_ntv_hwstrandid, but it also seems correct to default
the number of cores per chip to 1 - not sure if the value 0
has any magic powers there, however.
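
The two options just mentioned can be sketched as follows.  This is only an
illustration under stated assumptions, not the change that was putback: the
struct and both function names are hypothetical, and treating a 0 count as
"one core" in the guarded caller is an assumption made for this sketch.

```c
/* Hypothetical stand-in for the kernel's per-CPU cpuid state. */
struct sketch_cpi {
        unsigned cpi_ncore_per_chip;
};

/* Option 1: default to one core per chip before any conditional update. */
static void
sketch_cpi_init(struct sketch_cpi *cpi)
{
        cpi->cpi_ncore_per_chip = 1;
}

/* Option 2: guard the caller; a 0 count is assumed to mean one core. */
static unsigned
sketch_hwcoreid(unsigned coreid, unsigned ncore_per_chip)
{
        if (ncore_per_chip == 0)
                ncore_per_chip = 1;
        return (coreid % ncore_per_chip);
}
```

Option 1 fixes the value at its source, so any future caller of
cpuid_get_ncore_per_chip is covered; option 2 only protects the one call site,
but keeps working even if some other path leaves the field at 0.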
Work Around
The easiest way to boot an affected system was:

	- boot under kmdb (add -kd to the unix line in the grub menu)
	- when stopping in the debugger:
		> use_mp/W0  (disable all but one CPU)
		use_mp:         0x1             =       0x0

		> cpuid_info0  ::print -a cpi_ncore_per_chip |/W1
		cpuid_info0+0x188:              0x0             =       0x1

Then continue (:c) to boot up properly with a single CPU; then install
a fixed kernel.
Comments
N/A