This was found while verifying the binary fix for vxfs bug 6227073 (I/O error while vxfs FS full). An ha-nfs load test was running on drevil1 with the io-largefile, dirtree and io-sync stressers, using scate with faults. An nfsrg switchover left nfsrg in the stop failed state.
Fault used to trigger this problem:
[Jun 21 21:41:15] Calling interface standard::rg_switchover
[Jun 21 21:41:15]
REPORT:START:5212063:rg_switchover nfsrg phys-drevil1-2
[Jun 21 21:41:15] Switching resource group nfsrg over to phys-drevil1-2...
[Jun 21 21:41:15] Connected to phys-drevil1-2:9876
[Jun 21 21:51:47] Resource Group switchover failed!
Please see the fault server logs on phys-drevil1-2 for more details.
FATAL ERROR: Fault client stopped due to rg_switchover failure
            Group Name   Resources
            ----------   ---------
 Resources: nfsrg        drevil1-173-1 hastp nfsrs

-- Resource Groups --

            Group Name   Node Name        State
            ----------   ---------        -----
     Group: nfsrg        phys-drevil1-1   Offline
     Group: nfsrg        phys-drevil1-2   Error--stop failed

-- Resources --

            Resource Name   Node Name        State                      Status Message
            -------------   ---------        -----                      --------------
  Resource: drevil1-173-1   phys-drevil1-1   Offline                    Offline - LogicalHostname offline.
  Resource: drevil1-173-1   phys-drevil1-2   Online but not monitored   Online - LogicalHostname online.
  Resource: hastp           phys-drevil1-1   Offline                    Offline
  Resource: hastp           phys-drevil1-2   Online but not monitored   Online
  Resource: nfsrs           phys-drevil1-1   Offline                    Offline - Completed successfully.
  Resource: nfsrs           phys-drevil1-2   Stop failed                Faulted
Terminated
Jun 21 21:23:00 phys-drevil1-2 nfssrv: NOTICE: nfs_server: server is now quiesced; NFSv4 state has been preserved
Jun 21 21:23:01 phys-drevil1-2 ip: TCP_IOC_ABORT_CONN: local = 010.006.173.174:0, remote = 000.000.000.000:0, start = -2, end = 6
Jun 21 21:23:01 phys-drevil1-2 ip: TCP_IOC_ABORT_CONN: aborted 0 connection
Jun 21 21:23:04 phys-drevil1-2 nfssrv: NOTICE: nfs_server: server was previously quiesced; existing NFSv4 state will be re-used
Jun 21 21:29:11 phys-drevil1-2 /usr/lib/nfs/nfsd[3857]: _nfssys(NFS_SVC_REQUEST_QUIESCE) failed: No such file or directory
Jun 21 21:34:08 phys-drevil1-2 Cluster.RGM.rgmd: Method <nfs_svc_start> on resource <nfsrs>, resource group <nfsrg>, is_frozen=<0>: Method timed out.
Jun 21 21:34:09 phys-drevil1-2 /usr/lib/nfs/nfsd[3857]: _nfssys(NFS_SVC_REQUEST_QUIESCE) failed: No such file or directory
Jun 21 21:34:52 phys-drevil1-2 ufs: NOTICE: alloc: /global/ufs: file system full
Jun 21 21:36:51 phys-drevil1-2 last message repeated 4 times
Jun 21 21:37:14 phys-drevil1-2 vxfs: NOTICE: msgcnt 1 mesg 001: V-2-1: vx_nospace - /dev/vx/dsk/dg1/vol01 file system full (1 block extent)
/dev/vx/dsk/dg1/vol04   15482443   15327622         0   100%   /global/ufs
/dev/vx/dsk/dg1/vol02    5242880      18367   4897988     1%   /global/vxfs2
/dev/vx/dsk/dg1/vol01    5242880    5242880         0   100%   /global/vxfs
/dev/did/dsk/d10s3        494235       3656    441156     1%   /global/.devices/node@2
/dev/vx/dsk/dg1/vol03    5242880    2246675   2815159    45%   /local/vxfs
xxxxx@xxxxx.com 2005-06-22 07:17:28 GMT
Capturing the previous synopsis before changing to new:
nfs_svc_stop timeout with NFSv4: _nfssys(NFS_SVC_REQUEST_QUIESCE) failed: No such file or directory
Work Around
As Jingco noted, if NFSv4 is not being used, limiting the server to v3 will avoid
the problem (as the quiesce will be ignored). Set the following in
/etc/default/nfs:
NFS_SERVER_VERSMAX=3
and then restart the NFS service (or reboot).
xxxxx@xxxxx.com 2005-06-26 14:33:44 GMT
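The edit to /etc/default/nfs described in the workaround can be scripted. The following is only a minimal sketch, assuming the file uses plain KEY=value lines with optional commented-out defaults such as "#NFS_SERVER_VERSMAX=4". The function name set_nfs_default is hypothetical and not part of any Sun tool. It only transforms the file text; writing the file back and restarting the NFS service (or rebooting) are still separate steps.

```python
import re


def set_nfs_default(text, key="NFS_SERVER_VERSMAX", value="3"):
    """Return the file text with key=value set (hypothetical helper).

    Assumption: entries look like KEY=value, optionally commented out
    with a leading '#'. The first matching entry is replaced, later
    active duplicates are dropped, and the entry is appended if the
    key is not present at all.
    """
    pattern = re.compile(r"^\s*#?\s*" + re.escape(key) + r"=")
    new_line = "%s=%s" % (key, value)
    out = []
    replaced = False
    for line in text.splitlines():
        if pattern.match(line):
            if not replaced:
                # Replace the first entry, commented out or not.
                out.append(new_line)
                replaced = True
            elif line.lstrip().startswith("#"):
                # Keep later commented-out copies untouched.
                out.append(line)
            # Later active duplicates are dropped so they cannot
            # override the value that was just set.
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "# NFS defaults\n#NFS_SERVER_VERSMAX=4\nNFS_CLIENT_VERSMAX=4\n"
    print(set_nfs_default(sample), end="")
```

Only the server-side maximum is changed; client settings such as NFS_CLIENT_VERSMAX are left alone, in line with the workaround, which limits only the server to v3.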