OpenSolaris

Bug ID 6304858
Synopsis S10 cluster lost metadb
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:svm-sc
Keywords cmdlib | onnv_triage | s10u1-req
Responsible Engineer Susan Kamm-worrell
Reported Against s10u1_07, s10_b74l2a
Duplicate Of
Introduced In solaris_9
Commit to Fix s10u1_18
Fixed In s10u1_18
Release Fixed solaris_10u1(s10u1_18), solaris_nevada(snv_26) (Bug ID:2129670), solaris_9(s9patch) (Bug ID:2129671)
Related Bugs 4394256
Submit Date 1-August-2005
Last Update Date 12-June-2006
Description
Cluster nice2have, a 4-node SunFire-15K, was running net_stress with faults. When pnice2have4 was rebooted by the fault server and rejoined the cluster, the metadb was lost from all 4 nodes. Nodes 1-3 panicked due to "ucmmd died", and node 4 couldn't bring any data service online. ds1/ds2 are ordinary sets; oban is a multi-owner set.

fault injected:
[Jul 29 20:19:46] REPORT:START:5291561:ucmm_fault pnice2have4 panic:3:02

Jul 29 20:21:21 pnice2have1 cl_runtime: WARNING: Path pnice2have1:ce3 - pnice2have4:ce3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Jul 29 20:22:13 pnice2have1 metaclust: Timeout expired in step2:  120:000:866
Jul 29 20:22:15 pnice2have1 SUNWscucm.ucmm_reconf: svm requests reconfiguration in step cmmstep3
Jul 29 20:22:15 pnice2have1 Cluster.OPS.UCMMD: prog <ucmm_reconf> failed on step <cmmstep3> retcode <205>
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce0 - pnice2have4:ce0 being initiated
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce3 - pnice2have4:ce3 being initiated
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Node pnice2have4 (nodeid: 4, incarnation #: 1122693831) has become reachable.
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce0 - pnice2have4:ce0 online
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: clcomm: Path pnice2have1:ce3 - pnice2have4:ce3 online
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Node pnice2have4 (nodeid = 4) is up; new incarnation number = 1122693831.
Jul 29 20:24:26 pnice2have1 cl_runtime: NOTICE: CMM: Cluster members: pnice2have1 pnice2have2 pnice2have3 pnice2have4.
Jul 29 20:24:28 pnice2have1 cl_runtime: NOTICE: CMM: node reconfiguration #2089 completed.
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Load balancer setting distribution on net_test_r0:
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have1: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have2: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have3: weight 1
Jul 29 20:26:35 pnice2have1 cl_runtime: NOTICE: Node pnice2have4: weight 1
Jul 29 20:26:45 pnice2have1 cl_runtime: NOTICE: Scalable service instance [TCP,10.6.176.180,1500] registered on node pnice2have4.
Jul 29 20:26:51 pnice2have1 metaclust: pnice2have4: Synchronization of user records in set oban failed
Jul 29 20:26:51 pnice2have1 : there are no existing databases
Jul 29 20:26:51 pnice2have1 metaclust: exiting with 1
Jul 29 20:26:51 pnice2have1 SUNWscucm.ucmm_reconf: svm exited with error 1 in step cmmstep3

Jul 29 20:27:51 pnice2have2 metaclust: pnice2have4: there are no existing databases
Jul 29 20:27:51 pnice2have2 metaclust: exiting with 1
Jul 29 20:27:54 pnice2have2 SUNWscucm.ucmm_reconf: svm exited with error 1 in step cmmstep2

Jul 29 20:27:18 pnice2have3 metaclust: pnice2have4: there are no existing databases
Jul 29 20:27:18 pnice2have3 metaclust: exiting with 1
Jul 29 20:27:19 pnice2have3 SUNWscucm.ucmm_reconf: svm exited with error 1 in step cmmstep2

Jul 29 20:28:28 pnice2have4 Cluster.Framework: stderr: metaset: pnice2have4: there are no existing databases
Jul 29 20:28:31 pnice2have4 last message repeated 15 times
Jul 29 20:28:31 pnice2have4 SC[SUNW.HAStoragePlus:2,nfsrg,hastp-nfs,hastorageplus_prenet_start_private]: Global service ds2 associated with path /global/ufs is unable to become a primary on node 4.
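
The per-node excerpts above all contain the message "there are no existing databases". To confirm which hosts logged it and when, the messages files can be scanned mechanically. The following is only a minimal sketch, not part of the original report: it assumes the syslog layout seen in the excerpts ("Mon DD HH:MM:SS host text") and treats any line without a leading timestamp as a wrapped continuation of the previous one. The names unwrap and nodes_reporting are hypothetical helpers introduced for illustration.

```python
import re
from collections import defaultdict

# Syslog prefix as seen in the excerpts: "Jul 29 20:26:51 pnice2have1 ..."
STAMP = re.compile(r"^([A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) (\S+) ?(.*)$")
MARKER = "there are no existing databases"


def unwrap(lines):
    """Join wrapped continuation lines onto the timestamped line before them."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def nodes_reporting(lines, marker=MARKER):
    """Map each logging host to the timestamps at which it logged the marker."""
    hits = defaultdict(list)
    for record in unwrap(lines):
        m = STAMP.match(record)
        if m and marker in m.group(3):
            hits[m.group(2)].append(m.group(1))
    return dict(hits)


if __name__ == "__main__":
    # A few lines copied from the excerpts, including one wrapped message.
    sample = [
        "Jul 29 20:26:51 pnice2have1 : there are no existing databases",
        "Jul 29 20:26:51 pnice2have1 metaclust: exiting with 1",
        "Jul 29 20:27:51 pnice2have2 metaclust: pnice2have4: there are no existing ",
        "databases",
    ]
    for host, stamps in sorted(nodes_reporting(sample).items()):
        print(host, stamps)
```

The host reported is the one that wrote the syslog entry, which is not necessarily the node named inside the message text (pnice2have2 and pnice2have3 both log the error on behalf of pnice2have4).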

pnice2have4#  ls /var/sadm/patch
118551-01     118822-13     119042-02     IDR120361-01
118553-01     119015-02     119578-06
pnice2have4# cat /etc/release
                         Solaris 10 3/05 s10_74L2a SPARC
pnice2have4# scstat -D

-- Device Group Servers --

                         Device Group        Primary             Secondary
                         ------------        -------             ---------
  Device group servers:  ds1                 -                   -
  Device group servers:  ds2                 -                   -


-- Device Group Status --

                              Device Group        Status              
                              ------------        ------              
  Device group status:        ds1                 Offline
  Device group status:        ds2                 Offline


-- Multi-owner Device Groups --

                              Device Group        Online Status
                              ------------        -------------
  Multi-owner device group:   oban                pnice2have4

pnice2have4# metaset -s oban
metaset: pnice2have4: system/mdmonitor:default: service not online in SMF

pnice2have4# svcs -a | grep md
legacy_run     20:25:46 lrc:/etc/rc2_d/S95SUNWmd_binddevs
online         20:24:25 svc:/network/rpc/mdcomm:default
online         20:24:26 svc:/network/rpc/scadmd:default
online         20:24:26 svc:/network/rpc/scrcmd:default
online         20:24:31 svc:/system/fmd:default
maintenance    20:24:33 svc:/system/mdmonitor:default
pnice2have4# egrep "ds1|ds2|oban" /etc/cluster/ccr/*
/etc/cluster/ccr/dcs_service_521:DCS_ServiceName        oban
/etc/cluster/ccr/dcs_service_521.bak:DCS_ServiceName    oban
/etc/cluster/ccr/dcs_service_522:DCS_ServiceName        ds1
/etc/cluster/ccr/dcs_service_522.bak:DCS_ServiceName    ds1
/etc/cluster/ccr/dcs_service_523:DCS_ServiceName        ds2
/etc/cluster/ccr/dcs_service_523.bak:DCS_ServiceName    ds2

pnice2have4#  ps -ef |grep rpc
  daemon   228     1   0 20:24:20 console     0:00 /usr/sbin/rpcbind
    root  2938   287   0 20:26:17 console     0:00 /usr/sbin/rpc.metad
    root  2052     1   0 20:25:49 console     0:01 /usr/cluster/lib/sc/rpc.fed
    root  2146     1   0 20:25:51 console     0:00 /usr/cluster/lib/sc/sparcv9/rpc.pmfd
Work Around
N/A
Comments
N/A