OpenSolaris

Bug ID 6618751
Synopsis Include memboard in T5440 FBR/FBU diagnosis
State 10-Fix Delivered (Fix available in build)
Category:Subcategory fma:mem
Keywords batoka-fma-interest
Responsible Engineer Louis Tsien
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_103
Fixed In snv_103
Release Fixed solaris_nevada(snv_103) , solaris_10u8(s10u8_01) (Bug ID:2169383)
Related Bugs 6612334 , 2169378
Submit Date 18-October-2007
Last Update Date 30-June-2009
Description
Current diagnosis of FBDIMM channel errors on T5440 does not include the memory
board as a suspect. This board is part of the channel pathway and should be
included.
T5440 running S10U6 with kernel patch 138888-08.
Here is more system information:

Sun System Firmware 7.1.6.g 2008/10/20 22:18
Host flash versions:
   Hypervisor 1.6.7.a 2008/08/30 05:20
   OBP 4.29.0.a 2008/09/15 12:02
   POST 4.29.0.a 2008/09/15 12:35


The system had a memory failure, and as soon as fmd sees the fault it dumps core repeatedly. Two core files were created. The pstack output and mdb showed the following:

*CORE FILES & PSTACK:*
core.fmd.4101
core.fmd.4101.pstack.txt
core.fmd.4113
core.fmd.4113.pstack.txt 

**  mdb showed this for both core files:
 > $C
fde7b800 libc.so.1`strncpy+0x134(fde7b8a0, 42523000, 4, 1, 464677, 80808080)
fde7b840 cpumem-diagnosis.so`cmd_bank_fault+0x6c(17e180, 45a440, 0, 177a0, ffb9c0e9, 0)
fde7bca0 cpumem-diagnosis.so`cmd_ue_common+0x1f0(0, 0, 0, 0, 0, 0)
 > ::stack
libc.so.1`strncpy+0x134(fde7b8a0, 42523000, 4, 1, 464677, 80808080)
cpumem-diagnosis.so`cmd_bank_fault+0x6c(17e180, 45a440, 0, 177a0, ffb9c0e9, 0)
cpumem-diagnosis.so`cmd_ue_common+0x1f0(0, 0, 0, 0, 0, 0)
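
When several cores need to be compared, the `::stack` frames above can be checked mechanically. The following is a minimal sketch in Python, assuming each frame follows the module`function+0xoffset(args) layout shown above. The names `parse_stack_line` and `top_frames` are hypothetical helpers for illustration and are not part of mdb or fmd.

```python
import re

# One line of mdb ::stack output, for example:
#   libc.so.1`strncpy+0x134(fde7b8a0, 42523000, 4, 1, 464677, 80808080)
# Assumption: module`function+0xOFFSET(comma-separated hex args).
FRAME_RE = re.compile(
    r"^(?P<module>[^`\s]+)`(?P<function>\w+)\+0x(?P<offset>[0-9a-fA-F]+)"
    r"\((?P<args>[^)]*)\)$"
)


def parse_stack_line(line):
    """Parse one ::stack frame line; return a dict, or None if it does not match."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    # mdb prints the arguments as bare hexadecimal values.
    args = [int(a.strip(), 16) for a in match.group("args").split(",") if a.strip()]
    return {
        "module": match.group("module"),
        "function": match.group("function"),
        "offset": int(match.group("offset"), 16),
        "args": args,
    }


def top_frames(lines):
    """Return module`function names for every line that parses as a frame."""
    frames = (parse_stack_line(line) for line in lines)
    return ["{}`{}".format(f["module"], f["function"]) for f in frames if f]


# The three frames reported for both core files.
STACK = [
    "libc.so.1`strncpy+0x134(fde7b8a0, 42523000, 4, 1, 464677, 80808080)",
    "cpumem-diagnosis.so`cmd_bank_fault+0x6c(17e180, 45a440, 0, 177a0, ffb9c0e9, 0)",
    "cpumem-diagnosis.so`cmd_ue_common+0x1f0(0, 0, 0, 0, 0, 0)",
]
```

Running `top_frames` on the stack from each core file and comparing the resulting lists is one way to confirm that two cores share the same call path.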


** pstack showed this in both outputs:
# pstack core.fmd.4101.pstack.txt
-----------------  lwp# 12 / thread# 12  --------------------
 fefb3774 strncpy  (fde7b8a0, 42523000, 4, 1, 464677, 80808080) + 134
 fdd4e8b8 cmd_bank_fault (17e180, 45a440, 0, 177a0, ffb9c0e9, 0) + 6c
 fdd4ebe4 cmd_ue_common (0, 0, 0, 0, 0, 0) + 1f0


The core files and pstack output are saved here:
/net/cores.central/cores/dir18/71162614/FMD_corefiles_prior_mem_replacement/oldfm


The customer replaced the two bad memory cards, but with an incorrect part number.
As soon as the customer re-enabled fmd, it dumped core, creating 2 core files, and then stopped.

Core files:
-rwxrwxrwx   1 root     root     19501989 Jun  9 20:36 core.fmd.1016*
-rwxrwxrwx   1 root     root     19553545 Jun  9 20:45 core.fmd.1021* 
Pstack:
-rwxrwxrwx   1 root     other      13639 Jun 10 14:29 pstack.core1016.txt*
-rwxrwxrwx   1 root     other      12603 Jun 10 14:29 pstack.core1021.txt* 

Files are located here:
/net/cores.central/cores/dir18/71162614/fmd_corefiles_pstack

Steve Hanson analyzed them and pointed out that the customer is hitting
CR 6716862, the fix for which is included in the latest S10 kernel patch, 139555-08.
Work Around
N/A
Comments
N/A