OpenSolaris

Bug ID 6355260
Synopsis galaxy machines hang post install on any nevada gate after november 15th
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:driver-mpt-x86
Keywords onnv_triage
Responsible Engineer Jie Cao
Reported Against snv_28
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_32
Fixed In snv_32
Release Fixed solaris_nevada(snv_32) , solaris_10u2(s10u2_06) (Bug ID:2135482)
Related Bugs 2131383 , 6210716
Submit Date 24-November-2005
Last Update Date 23-January-2006
Description
Galaxy machines hang during the first boot after install when using the nevada nightly gate. The console shows the following output:

Copyright 1983-2005  xxxxx , Inc.  All rights reserved.
Use is subject to license terms.
WARNING: Time of Day clock error: reason [Stalled]. --  Stopped tracking Time Of Day clock.
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
        SCSI transport failed: reason 'reset': retrying command

WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
        SCSI transport failed: reason 'reset': giving up

WARNING: Error reading ufs log
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
        SCSI transport failed: reason 'reset': giving up

WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
        SCSI transport failed: reason 'reset': giving up

vn_rdwr failed with error 0x5
procfs error reading sections
WARNING: Cannot mount /proc

The above is from a p1 galaxy. The following is from an rr galaxy:

Use is subject to license terms.
WARNING: Time of Day clock error: reason [Stalled]. --  Stopped tracking Time Of Day clock.
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@0,0 (sd0):
        SCSI transport failed: reason 'reset': retrying command

WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@0,0 (sd0):
        SCSI transport failed: reason 'reset': giving up

WARNING: Error reading ufs log
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3 (mpt0):
        mpt_send_handshake_msg task 3 failed

WARNING: mpt0: fault detected in device; service unavailable
WARNING: mpt0: hard reset failed
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3 (mpt0):
        mpt_send_handshake_msg task 4 failed


The machine then hangs. This is not caused by broken hardware in the machine:
the hang has been reproduced on several machines, all of which then install previous gates fine.
We first noticed the problem using the daily build of the 18th.

When running under kmdb, the system panics as follows:

panic[cpu0]/thread=fec1e520: BAD TRAP: type=e (#pf Page fault) rp=fec34b94 addr=ffe8ab occurred in module "procfs" due to an illegal access to a user address

#pf Page fault
Bad kernel fault at addr=0xffe8ab
pid=0, pc=0xfeb122d4, sp=0xfe996340, eflags=0x10202
cr0: 80050033<pg,wp,ne,et,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: ffe8ab cr3: 1033e000
         gs: fec401b0  fs: c8350000  es:      160  ds: c1950160
        edi:        2 esi: c1957248 ebp: fec34bf4 esp: fec34bcc
        ebx: feb122c6 edx: c18e01d0 ecx: feb122c6 eax:        0
        trp:        e err:        2 eip: feb122d4  cs:      158
        efl:    10202 usp: fe996340  ss: c811de00

fec34aec unix:die+98 (e, fec34b94, ffe8ab)
fec34b80 unix:trap+1169 (fec34b94, ffe8ab, 0)
fec34b94 unix:cmntrap+9b (fec401b0, c8350000,)
fec34bf4 procfs:_info+e (c1957248, c811de00)
fec34c38 genunix:mod_load+118 (c1957248, 1)
fec34c50 genunix:mod_hold_installed_mod+53 (c1950b90, 1, fec34c)
fec34c8c genunix:modrload+c1 (fec4f514, fec4f524,)
fec34ca4 genunix:modload+13 (fec4f514, fec4f524)
fec34cc8 genunix:vfs_getvfssw+5e (fec34cf0)
fec34d70 genunix:domount+f8 (0, fec34d90, c194a0)
fec34dc0 genunix:vfs_mountfs+5d (fec4f5b0, fec4f5a8,)
fec34df0 genunix:vfs_mountroot+188 (fe800000, 1010af8, )
fec34e04 genunix:main+87 ()

panic: entering debugger (no dump device, continue to reboot)

The problem is reproducible on all galaxy machines we have available.
It does not reproduce on a v20z; I have not managed to test any other machines.
The different disk type (SAS on galaxy) may be why the problem shows up
on galaxy only.

I have tracked the problem to a putback which occurred on the 15th and
am working to establish which putback is the cause. I'll update
and reassign this bug if necessary once the problem is narrowed to a particular putback.
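
Narrowing the hang to a single putback can be approached as a binary search over the putbacks from the 15th. The following is only a minimal sketch of that idea, not the procedure actually used for this bug. The function name, the `is_bad` callback, and the putback labels are all hypothetical. It assumes the putbacks can be ordered oldest to newest, that a build containing all of them hangs, and that once the hang appears it persists in every later build.

```python
from typing import Callable, Sequence


def first_bad_putback(putbacks: Sequence[str],
                      is_bad: Callable[[int], bool]) -> str:
    """Return the earliest putback whose cumulative build shows the hang.

    putbacks is ordered oldest to newest. is_bad(i) must report whether a
    build containing putbacks[0] through putbacks[i] hangs on first boot.
    Assumption: the build with every putback applied is bad, and the
    fault never disappears once introduced.
    """
    lo, hi = 0, len(putbacks) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # fault is at mid or earlier
        else:
            lo = mid + 1    # fault was introduced after mid
    return putbacks[lo]


# Hypothetical example: five placeholder putbacks, with the hang
# introduced by the fourth one.
example_putbacks = ["pb-a", "pb-b", "pb-c", "pb-d", "pb-e"]
culprit = first_bad_putback(example_putbacks, lambda i: i >= 3)
```

In practice each `is_bad` check would mean installing the corresponding build on a galaxy machine and watching the first boot, so the value of the search is that it needs roughly log2(n) installs rather than n.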
Work Around
N/A
Comments
N/A