|
Description
|
Cluster nice2have, a 4-node SunFire-15K, was running net_stress with faults. After pnice2have4 was rebooted by the fault server and rejoined the cluster, the metadb was lost from all 4 nodes. Nodes 1-3 panicked due to "ucmmd died", and node 4 couldn't bring any data service online. ds1 and ds2 are ordinary sets; oban is a multi-owner set.
Fault injected:
[Jul 29 20:19:46]
REPORT:START:5291561:ucmm_fault pnice2have4 panic:3:02
Jul 29 20:21:21 pnice2have1 cl_runtime: WARNING: Path pnice2have1:ce3 -
pnice2have4:ce3 initiation encountered errors, errno = 62. Remote node may be
down or unreachable through this path.
Jul 29 20:22:13 pnice2have1 metaclust: Timeout expired in step2: 120:000:866
Jul 29 20:22:15 pnice2have1 SUNWscucm.ucmm_reconf: svm requests reconfiguration
in step cmmstep3
Jul 29 20:22:15 pnice2have1 Cluster.OPS.UCMMD: prog <ucmm_reconf> failed on step
<cmmstep3> retcode <205>
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce0 -
pnice2have4:ce0 being initiated
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce3 -
pnice2have4:ce3 being initiated
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Node pnice2have4 (nodeid:
4, incarnation #: 1122693831) has become reachable.
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce0 -
pnice2have4:ce0 online
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce3 -
pnice2have4:ce3 online
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Node pnice2have4 (nodeid =
4) is up; new incarnation number = 1122693831.
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Cluster members:
pnice2have1 pnice2have2 pnice2have3 pnice2have4.
Jul 29 20:24:28 pnice2have1 cl_runtime: NOTICE: CMM: node reconfiguration #2089
completed.
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Load balancer setting
distribution on net_test_r0:
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have1: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have2: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have3: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have4: weight 1
Jul 29 20:26:45 pnice2have1 cl_runtime: NOTICE: Scalable service instance
[TCP,10.6.176.180,1500] registered on node pnice2have4.
Jul 29 20:26:51 pnice2have1 metaclust: pnice2have4: Synchronization of user
records in set oban failed
Jul 29 20:26:51 pnice2have1 : there are no existing databases
Jul 29 20:26:51 pnice2have1 metaclust: exiting with 1
Jul 29 20:26:51 pnice2have1 SUNWscucm.ucmm_reconf: svm exited with error 1 in
step cmmstep3
Jul 29 20:27:51 pnice2have2 metaclust: pnice2have4: there are no existing
databases
Jul 29 20:27:51 pnice2have2 metaclust: exiting with 1
Jul 29 20:27:54 pnice2have2 SUNWscucm.ucmm_reconf: svm exited with error 1 in
step cmmstep2
Jul 29 20:27:18 pnice2have3 metaclust: pnice2have4: there are no existing
databases
Jul 29 20:27:18 pnice2have3 metaclust: exiting with 1
Jul 29 20:27:19 pnice2have3 SUNWscucm.ucmm_reconf: svm exited with error 1 in
step cmmstep2
Jul 29 20:28:28 pnice2have4 Cluster.Framework: stderr: metaset: pnice2have4:
there are no existing databases
Jul 29 20:28:31 pnice2have4 last message repeated 15 times
Jul 29 20:28:31 pnice2have4
SC[SUNW.HAStoragePlus:2,nfsrg,hastp-nfs,hastorageplus_prenet_start_private]:
Global service ds2 associated with path /global/ufs is unable to become a
primary on node 4.
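The excerpts above are grouped by node rather than by time (the pnice2have2 messages at 20:27:51 are listed before the pnice2have3 messages at 20:27:18), and long messages are wrapped onto continuation lines. A minimal sketch for putting such an excerpt into one timeline is given below. The helper names `unwrap` and `parse` are hypothetical, the year 2005 is an assumption (syslog lines carry no year), and the ordering is only meaningful if the node clocks are reasonably in sync.

```python
import re
from datetime import datetime

# Start of a syslog entry, e.g. "Jul 29 20:22:13 pnice2have1 metaclust: ..."
ENTRY_START = re.compile(
    r"^([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) ?(.*)$"
)

def unwrap(lines):
    """Join wrapped continuation lines onto the entry that precedes them."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries and line.strip():
            entries[-1] += " " + line.strip()
    return entries

def parse(entry, year=2005):
    """Split an unwrapped entry into (timestamp, node, message).

    The year is an assumption supplied by the caller.
    """
    stamp, node, message = ENTRY_START.match(entry).groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, node, message

# Two wrapped entries copied from the excerpt above, in their listed order.
sample = [
    "Jul 29 20:27:51 pnice2have2 metaclust: pnice2have4: there are no existing",
    "databases",
    "Jul 29 20:27:18 pnice2have3 metaclust: pnice2have4: there are no existing",
    "databases",
]

# Sort chronologically; ties fall back to node name, then message text.
timeline = sorted(parse(e) for e in unwrap(sample))
for when, node, message in timeline:
    print(when.strftime("%H:%M:%S"), node, message)
```

Run against the full messages files from all four nodes, the same approach would show the order in which each node first reported "there are no existing databases".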
pnice2have4# ls /var/sadm/patch
118551-01 118822-13 119042-02 IDR120361-01
118553-01 119015-02 119578-06
pnice2have4# cat /etc/release
Solaris 10 3/05 s10_74L2a SPARC
pnice2have4# scstat -D
-- Device Group Servers --
                          Device Group   Primary   Secondary
                          ------------   -------   ---------
  Device group servers:   ds1            -         -
  Device group servers:   ds2            -         -
-- Device Group Status --
                          Device Group   Status
                          ------------   ------
  Device group status:    ds1            Offline
  Device group status:    ds2            Offline
-- Multi-owner Device Groups --
                              Device Group   Online Status
                              ------------   -------------
  Multi-owner device group:   oban           pnice2have4
pnice2have4# metaset -s oban
metaset: pnice2have4: system/mdmonitor:default: service not online in SMF
pnice2have4# svcs -a | grep md
legacy_run 20:25:46 lrc:/etc/rc2_d/S95SUNWmd_binddevs
online 20:24:25 svc:/network/rpc/mdcomm:default
online 20:24:26 svc:/network/rpc/scadmd:default
online 20:24:26 svc:/network/rpc/scrcmd:default
online 20:24:31 svc:/system/fmd:default
maintenance 20:24:33 svc:/system/mdmonitor:default
pnice2have4# egrep "ds1|ds2|oban" /etc/cluster/ccr/*
/etc/cluster/ccr/dcs_service_521:DCS_ServiceName oban
/etc/cluster/ccr/dcs_service_521.bak:DCS_ServiceName oban
/etc/cluster/ccr/dcs_service_522:DCS_ServiceName ds1
/etc/cluster/ccr/dcs_service_522.bak:DCS_ServiceName ds1
/etc/cluster/ccr/dcs_service_523:DCS_ServiceName ds2
/etc/cluster/ccr/dcs_service_523.bak:DCS_ServiceName ds2
pnice2have4# ps -ef |grep rpc
daemon 228 1 0 20:24:20 console 0:00 /usr/sbin/rpcbind
root 2938 287 0 20:26:17 console 0:00 /usr/sbin/rpc.metad
root 2052 1 0 20:25:49 console 0:01 /usr/cluster/lib/sc/rpc.fed
root 2146 1 0 20:25:51 console 0:00 /usr/cluster/lib/sc/sparcv9/rpc.pmfd
|