Description
Galaxy machines hang during the first boot after installing from the nevada nightly gate. The console shows the following output:
Copyright 1983-2005 xxxxx , Inc. All rights reserved.
Use is subject to license terms.
WARNING: Time of Day clock error: reason [Stalled]. -- Stopped tracking Time Of Day clock.
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
SCSI transport failed: reason 'reset': giving up
WARNING: Error reading ufs log
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
SCSI transport failed: reason 'reset': giving up
WARNING: /pci@0,0/pci1022,7450@2/pci1000,1000@3/sd@0,0 (sd2):
SCSI transport failed: reason 'reset': giving up
vn_rdwr failed with error 0x5
procfs error reading sections
WARNING: Cannot mount /proc
The output above is from a p1 galaxy. The following is from an rr galaxy:
Use is subject to license terms.
WARNING: Time of Day clock error: reason [Stalled]. -- Stopped tracking Time Of Day clock.
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@0,0 (sd0):
SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@0,0 (sd0):
SCSI transport failed: reason 'reset': giving up
WARNING: Error reading ufs log
WARNING: ufs log for / changed state to Error
WARNING: Please umount(1M) / and run fsck(1M)
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3 (mpt0):
mpt_send_handshake_msg task 3 failed
WARNING: mpt0: fault detected in device; service unavailable
WARNING: mpt0: hard reset failed
WARNING: /pci@0,0/pci1022,7450@2/pci1000,3060@3 (mpt0):
mpt_send_handshake_msg task 4 failed
The machine then hangs. This is not due to faulty hardware in the machine:
the hang has been reproduced on several machines, each of which then installs previous gates without problems.
We first noticed the problem with the daily build of the 18th.
When run under kmdb, the system panics as follows:
panic[cpu0]/thread=fec1e520: BAD TRAP: type=e (#pf Page fault) rp=fec34b94 addr=ffe8ab occurred in module "procfs" due to an illegal access to a user address
#pf Page fault
Bad kernel fault at addr=0xffe8ab
pid=0, pc=0xfeb122d4, sp=0xfe996340, eflags=0x10202
cr0: 80050033<pg,wp,ne,et,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: ffe8ab cr3: 1033e000
gs: fec401b0 fs: c8350000 es: 160 ds: c1950160
edi: 2 esi: c1957248 ebp: fec34bf4 esp: fec34bcc
ebx: feb122c6 edx: c18e01d0 ecx: feb122c6 eax: 0
trp: e err: 2 eip: feb122d4 cs: 158
efl: 10202 usp: fe996340 ss: c811de00
fec34aec unix:die+98 (e, fec34b94, ffe8ab)
fec34b80 unix:trap+1169 (fec34b94, ffe8ab, 0)
fec34b94 unix:cmntrap+9b (fec401b0, c8350000,)
fec34bf4 procfs:_info+e (c1957248, c811de00)
fec34c38 genunix:mod_load+118 (c1957248, 1)
fec34c50 genunix:mod_hold_installed_mod+53 (c1950b90, 1, fec34c)
fec34c8c genunix:modrload+c1 (fec4f514, fec4f524,)
fec34ca4 genunix:modload+13 (fec4f514, fec4f524)
fec34cc8 genunix:vfs_getvfssw+5e (fec34cf0)
fec34d70 genunix:domount+f8 (0, fec34d90, c194a0)
fec34dc0 genunix:vfs_mountfs+5d (fec4f5b0, fec4f5a8,)
fec34df0 genunix:vfs_mountroot+188 (fe800000, 1010af8, )
fec34e04 genunix:main+87 ()
panic: entering debugger (no dump device, continue to reboot)
The problem is reproducible on all galaxy machines we have available.
The problem does not reproduce on v20z; I have not managed to test any other machines.
The different disk type (SAS on galaxy) may be why the problem
shows up on galaxy only.
I have tracked the problem to a putback which occurred on the 15th and
am working to establish which putback is the cause. I'll update
and reassign this bug if necessary once the problem is narrowed to a particular putback.
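One way to narrow this down is a binary search over the putbacks from the 15th. The following is only a sketch, assuming the putbacks can be listed in chronological order and that a build containing each candidate can be installed and boot-tested on a galaxy. The names `putbacks` and `boots_ok` are hypothetical placeholders, not existing tooling:

```python
from typing import Callable, Sequence


def first_bad_putback(putbacks: Sequence[str],
                      boots_ok: Callable[[str], bool]) -> str:
    """Return the earliest putback whose build fails to boot.

    Assumptions (hypothetical, not taken from the gate tooling):
    - putbacks are in chronological order,
    - the build just before putbacks[0] boots fine,
    - the build including putbacks[-1] hangs,
    - boots_ok(p) installs the build up to and including p and
      reports whether the first boot succeeded.
    """
    if not putbacks:
        raise ValueError("need at least one candidate putback")
    lo, hi = 0, len(putbacks) - 1  # putbacks[hi] is bad by assumption
    while lo < hi:
        mid = (lo + hi) // 2
        if boots_ok(putbacks[mid]):
            lo = mid + 1  # culprit is a later putback
        else:
            hi = mid      # culprit is this putback or an earlier one
    return putbacks[lo]
```

With N putbacks on that day this needs about log2(N) install-and-boot cycles rather than N, which matters given how long each galaxy install takes.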