|
Description
|
System hung while running snv_97.
A crash was forced so that we could see why the system hung.
The crash dump can be accessed via NFS at:
/net/irperf.ireland/export/work/crash_images/Wed.Sep.03.2008.oaf-x.specweb99-102_SJWS61_SP9.snv_97
OS Details:
SunOS oaf-x 5.11 snv_97 sun4u sparc SUNW,Sun-Fire-V240
Solaris Express Community Edition snv_97 SPARC
Copyright 2008 xxxxx Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 27 August 2008
Sun Microsystems Inc. SunOS 5.11 snv_97 November 2008
We have the following console messages:
oaf-x had hung for 4 days
/dev/rdsk/c0t1d0s2 is clean
/dev/rdsk/c0t1d0s0 is clean
/dev/rdsk/c0t1d0s1 is clean
Aug 31 22:02:44 svc.startd[7]: svc:/system/power:default: Method or service exit timed out. Killing contract 23.
Aug 31 22:02:44 svc.startd[7]: svc:/system/power:default: Method "/lib/svc/method/svc-power start" failed due to signal KILL.
Aug 31 22:03:45 svc.startd[7]: svc:/system/power:default: Method or service exit timed out. Killing contract 42.
Aug 31 22:03:45 svc.startd[7]: svc:/system/power:default: Method "/lib/svc/method/svc-power start" failed due to signal KILL.
Aug 31 22:04:44 svc.startd[7]: svc:/system/metasync:default: Method or service exit timed out. Killing contract 20.
Aug 31 22:04:44 svc.startd[7]: svc:/system/metasync:default: Method "/lib/svc/method/svc-metasync" failed due to signal KILL.
Aug 31 22:04:45 svc.startd[7]: svc:/system/power:default: Method or service exit timed out. Killing contract 44.
Aug 31 22:04:45 svc.startd[7]: svc:/system/power:default: Method "/lib/svc/method/svc-power start" failed due to signal KILL.
Aug 31 22:04:45 svc.startd[7]: system/power:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Aug 31 22:07:45 svc.startd[7]: svc:/system/metasync:default: Method or service exit timed out. Killing contract 46.
Aug 31 22:07:45 svc.startd[7]: svc:/system/metasync:default: Method "/lib/svc/method/svc-metasync" failed due to signal KILL.
Aug 31 22:10:46 svc.startd[7]: svc:/system/metasync:default: Method or service exit timed out. Killing contract 50.
Aug 31 22:10:46 svc.startd[7]: svc:/system/metasync:default: Method "/lib/svc/method/svc-metasync" failed due to signal KILL.
Aug 31 22:10:46 svc.startd[7]: system/metasync:default failed: transitioned to maintenance (see 'svcs -xv' for details)
[advise:pts/634 connecting to advise:pts/235]
sc>
Extract from mdb.log
---------------------
mdb: logging to "mdb.log"
> :> :> s> t> a> t> u> s>
debugging crash dump vmcore.0 (64-bit) from oaf-x
operating system: 5.11 snv_97 (sun4u)
panic message: sync initiated
dump content: kernel pages only
> :> :> c> p> u> i> n> f> o>
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 00001838078 1b 2 0 105 no no t-7 2a10038fca0 sched
1 0000180c000 1d 1 0 -1 no no t-0 2a10001fca0 (idle)
> $> c>
0xf004609c(600108f604e, 6, 6, 600108f65c7, a, c)
mutex_vector_enter+0x20(0, 1812e30, ffffffffffffffff, 2, 3, 10c5180)
Re-assigning to solaris->kernel->pm as it hung trying to start power services.
While testing changes for 6673281, I had to test enable/disable mpxio on SVM mirrored
root. This failed for me, repeatedly, and in the same fashion every time.
The core provided by IR perf shows these threads as KILLed::
SolarisCAT(vmcore.0/11U)> tlist killed
==== user (LWP_SYS) thread: 0x30001101420 PID: 133 ====
cmd: /usr/sbin/metadevadm -l -r
fmri: svc:/system/metasync:default
t_wchan: 0x30000df1d7c sobj: condition var (from genunix:ndi_devi_enter+0x54)
t_procp: 0x60010b44010
p_as: 0x60010a39190 size: 3399680 RSS: 1703936
hat: 0x300002a6380 cpusran: 0,1
zone: global
t_stk: 0x2a1009d5ae0 sp: 0x2a1009d4bd1 t_stkbase: 0x2a1009d0000
t_pri: 59(TS) pctcpu: 0.093727
t_lwp: 0x600114119b0 machpcb: 0x2a1009d5ae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 2 days 17 hours 14 minutes 9.2138443 seconds earlier
ms_start: 2 days 17 hours 14 minutes 13.2970679 seconds earlier
psrset: 0 last CPU: 1
idle: 23477944 ticks (2 days 17 hours 12 minutes 59.44 seconds)
start: Sun Aug 31 14:01:44 2008
age: 234785 seconds (2 days 17 hours 13 minutes 5 seconds)
syscall: #152 modctl(, 0xffbffa28) ()
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:cv_wait+0x3c: call unix:swtch
genunix:cv_wait+0x3c(0x30000df1d7c, 0x30000df1c90, 0x5f6, 0x2fb, 0x368)
genunix:ndi_devi_enter+0x54(0x30000df1c28, 0x2a1009d5608, , 0x0)
genunix:devi_config_one+0x138(0x30000df1c28, 0x60012716800, 0x2a1009d5788, , 0x0)
genunix:ndi_devi_config_one+0xb0(0x30000df1c28, 0x60012716800, 0x2a1009d5788, 0x4080)
genunix:resolve_pathname+0x148(0x60012dc4940, 0x2a1009d5870, 0x0, 0x0, , 0xffffffffffffffff)
genunix:e_ddi_hold_devi_by_path+0x10(0x60012dc4940, 0x0, 0x60012dc4940, 0x100, 0x20, 0x157)
genunix:ddi_dev_pathname+0xc0(, , 0x60012dc4940)
genunix:modctl_devid2paths+0x2a4(0x25980, 0x25680, 0x0, 0xffbffa94, 0x0)
genunix:modctl+0x3b0(, , 0x25680, 0x0, 0xffbffa94)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
==== user (LWP_SYS) thread: 0x3000112db80 PID: 152 ====
cmd: /usr/sbin/pmconfig
fmri: svc:/system/power:default
t_wchan: 0x60012e1d634 sobj: condition var (from genunix:pm_dispatch_to_dep_thread+0x104)
t_procp: 0x60012c28020
p_as: 0x60010a38c50 size: 2146304 RSS: 1433600
hat: 0x30001f696c0 cpusran: 0,1
zone: global
t_stk: 0x2a1004adae0 sp: 0x2a1004ac9c1 t_stkbase: 0x2a1004a8000
t_pri: 59(TS) pctcpu: 0.067734
t_lwp: 0x60010bbe770 machpcb: 0x2a1004adae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 2 days 17 hours 14 minutes 12.3965233 seconds earlier
ms_start: 2 days 17 hours 14 minutes 12.8756538 seconds earlier
psrset: 0 last CPU: 1
idle: 23478263 ticks (2 days 17 hours 13 minutes 2.63 seconds)
start: Sun Aug 31 14:01:44 2008
age: 234785 seconds (2 days 17 hours 13 minutes 5 seconds)
syscall: #54 ioctl(, 0xffbffce8) ()
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:cv_wait+0x3c: call unix:swtch
genunix:cv_wait+0x3c(0x60012e1d634, 0x18c05e8, 0x0, 0x0, 0x0)
genunix:pm_dispatch_to_dep_thread+0x104(0x7, 0x2a1004ad678, 0x2a1004ad7b0, 0x1, 0x0, 0x0)
pm:pm_ioctl+0xb88(, 0x32)
specfs:spec_ioctl(0x60012e53d00, 0x32, 0xffbffd50, 0x100003, , 0x2a1004adadc, 0x0) - frame recycled
genunix:fop_ioctl+0x58(0x60012e53d00, 0x32, 0xffbffd50, 0x100003, 0x0, 0x2a1004adadc, 0x0)
genunix:ioctl+0x16c()
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
==== user (LWP_SYS) thread: 0x3000112c4c0 PID: 338 ====
cmd: /usr/sbin/pmconfig
fmri: svc:/system/power:default
t_wchan: 0x30000df1d7c sobj: condition var (from genunix:ndi_devi_enter+0x54)
t_procp: 0x60010be0018
p_as: 0x60010a39350 size: 1892352 RSS: 1146880
hat: 0x300002a61c0 cpusran: 1
zone: global
t_stk: 0x2a100a3dae0 sp: 0x2a100a3cc21 t_stkbase: 0x2a100a38000
t_pri: 59(TS) pctcpu: 0.000000
t_lwp: 0x600114a87d0 machpcb: 0x2a100a3dae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 2 days 17 hours 13 minutes 12.8160047 seconds earlier
ms_start: 2 days 17 hours 13 minutes 12.8197786 seconds earlier
psrset: 0 last CPU: 1
idle: 23472305 ticks (2 days 17 hours 12 minutes 3.05 seconds)
start: Sun Aug 31 14:02:44 2008
age: 234725 seconds (2 days 17 hours 12 minutes 5 seconds)
syscall: #54 ioctl(, 0xffbff888) ()
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:cv_wait+0x3c: call unix:swtch
genunix:cv_wait+0x3c(0x30000df1d7c, 0x30000df1c90, 0x600114a87d0, 0x0, 0xf0)
genunix:ndi_devi_enter+0x54(0x30000df1c28, 0x60013608ec0, , 0x6)
devinfo:di_copytree+0x6c(0x30000df1c28, 0x30001cca020, 0x60014872ac0)
devinfo:di_snapshot+0x180(0x60014872ac0)
devinfo:di_snapshot_and_clean+0xc(0x60014872ac0, 0x4)
devinfo:di_ioctl+0x5cc()
specfs:spec_ioctl(0x600148b6b40, 0xdf04, 0xffbff8ec, 0x100001, , 0x2a100a3dadc, 0x0) - frame recycled
genunix:fop_ioctl+0x58(0x600148b6b40, 0xdf04, 0xffbff8ec, 0x100001, 0x0, 0x2a100a3dadc, 0x0)
genunix:ioctl+0x16c()
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
==== user (LWP_SYS) thread: 0x30001124b00 PID: 344 ====
cmd: /usr/sbin/pmconfig
fmri: svc:/system/power:default
t_wchan: 0x30000df1d7c sobj: condition var (from genunix:ndi_devi_enter+0x54)
t_procp: 0x60012dc1088
p_as: 0x60010a38d30 size: 1892352 RSS: 1146880
hat: 0x30001f68e00 cpusran: 1
zone: global
t_stk: 0x2a100639ae0 sp: 0x2a100638c21 t_stkbase: 0x2a100634000
t_pri: 59(TS) pctcpu: 0.015948
t_lwp: 0x600114a8f10 machpcb: 0x2a100639ae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 2 days 17 hours 12 minutes 12.2113141 seconds earlier
ms_start: 2 days 17 hours 12 minutes 12.2151854 seconds earlier
psrset: 0 last CPU: 1
idle: 23466244 ticks (2 days 17 hours 11 minutes 2.44 seconds)
start: Sun Aug 31 14:03:45 2008
age: 234664 seconds (2 days 17 hours 11 minutes 4 seconds)
syscall: #54 ioctl(, 0xffbff888) ()
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:cv_wait+0x3c: call unix:swtch
genunix:cv_wait+0x3c(0x30000df1d7c, 0x30000df1c90, 0x600114a8f10, 0x0, 0xf0)
genunix:ndi_devi_enter+0x54(0x30000df1c28, 0x60013608980, , 0x6)
devinfo:di_copytree+0x6c(0x30000df1c28, 0x30001cc2020, 0x60010a80080)
devinfo:di_snapshot+0x180(0x60010a80080)
devinfo:di_snapshot_and_clean+0xc(0x60010a80080, 0x4)
devinfo:di_ioctl+0x5cc()
specfs:spec_ioctl(0x600148b7340, 0xdf04, 0xffbff8ec, 0x100001, , 0x2a100639adc, 0x0) - frame recycled
genunix:fop_ioctl+0x58(0x600148b7340, 0xdf04, 0xffbff8ec, 0x100001, 0x0, 0x2a100639adc, 0x0)
genunix:ioctl+0x16c()
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
4 threads found.
#####
note that the pmconfig thread is the youngest of the lot. I seriously
doubt that kernel/pm is the right category.
(continues)
My example, on nsvr-533.central, does not show the pmconfig thread, but the
SVM-related threads are showing very similar problems - all stuck in devid-related
functions. Example:
0000030004d2f100 600249ad098 60022ef0048 1 59 30001677d7c
PC: cv_wait+0x3c CMD: /usr/sbin/mdmonitord
stack pointer for thread 30004d2f100: 2a1023c85b1
[ 000002a1023c85b1 cv_wait+0x3c() ]
ndi_devi_enter+0x54()
devi_config_one+0x138()
ndi_devi_config_one+0xb0()
resolve_pathname+0x148()
e_ddi_hold_devi_by_path+0x10()
e_devid_cache_to_devt_list+0x2a8()
ddi_lyr_devid_to_devlist+0x10()
md`mddb_devid_validate+0x24()
md`ridev+0x80()
md`mddb_configure+0x158()
md`md_base_ioctl+0x8dc()
md`md_admin_ioctl+0x14()
md`mdioctl+0xf4()
fop_ioctl+0x58()
ioctl+0x16c()
syscall_trap32+0xcc()
0000030004f9cc20 600272318a0 60022f7b298 1 60 30004100350
PC: cv_wait+0x3c CMD: /usr/bin/savecore /var/crash/nsvr-533
stack pointer for thread 30004f9cc20: 2a1023a8731
[ 000002a1023a8731 cv_wait+0x3c() ]
md_mirror`mirror_mark_resync_region+0x118()
md_mirror`mirror_write_strategy+0x55c()
md`mdstrategy+0xc8()
ufs`lufs_write_strategy+0x118()
ufs`ufs_putapage+0x4b0()
ufs`ufs_putpages+0x2a8()
fop_putpage+0x48()
segmap_release+0x350()
ufs`wrip+0x99c()
ufs`ufs_trans_write+0x44()
ufs`ufs_write+0x548()
fop_write+0x48()
pwrite+0x140()
syscall_trap+0xac()
0000030004c633a0 60021fd7838 60021fdf228 0 60 1964a18
PC: turnstile_block+0x394 CMD: fsflush
stack pointer for thread 30004c633a0: 2a101ef0881
[ 000002a101ef0881 turnstile_block+0x394() ]
mutex_vector_enter+0x3f4()
md`mddb_setenter+0x2c()
md`mddb_commitrec+0x18()
md`mddb_commitrec_wrapper+0xc()
md_mirror`mirror_mark_resync_region+0x278()
md_mirror`mirror_write_strategy+0x55c()
md`mdstrategy+0xc8()
ufs`lufs_write_strategy+0x118()
ufs`ufs_putapage+0x4b0()
ufs`ufs_putpages+0x2a8()
fop_putpage+0x48()
fsflush_do_pages+0x3f8()
fsflush+0x3d0()
thread_start+4()
Something looks really rotten in SVM-land. There were some putbacks to snv_95
which touched this area last, 6480431 is the first one to check.
I saw this problem on nsvr-533 without MPxIO enabled, and *with* MPxIO enabled.
I also saw it with SAS as well as FC disks.
Another salient point - every time I'd boot up, my d0={d1|d2} mirror was in a
'needs maintenance' state. Running "metasync 2048 d0" always resulted in a clean
mirror in a short period of time, but it is always an item of concern to have
this state on boot.
On our systems we are seeing this crash on a ufs root with a SVM partition. We are seeing this across a number of V240's.
|