OpenSolaris

Printable Version Enter a New Search
Bug ID 6602575
Synopsis process stuck in close() when closing TCP socket
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:tcp-ip
Keywords rtiq_regression | test-stopper
Responsible Engineer Jonathan Anderson
Reported Against 3.2u1_08 , 3.2u1_09
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_79
Fixed In snv_79
Release Fixed solaris_nevada(snv_79) , solaris_10u5(s10u5_07) (Bug ID:2155971)
Related Bugs 6240605 , 6624459
Submit Date 9-September-2007
Last Update Date 5-December-2007
Description
The cluster was running rac load test.
SW stacks: s10u4_12b, sc3.2u1_08, SunNas, udlm 3349, rac 10.2.0.3 (5749953, 5769259).
Increased udlm_step1_timeout=200, it doesn't help.

Sep  9 00:10:09 ppuka1 nscd[341]: yp_all: failed to get server's name
Sep  9 00:11:08 ppuka1 last message repeated 32 times
Sep  9 00:13:31 ppuka1 nscd[341]: yp_all: failed to get server's name
Sep  9 00:13:32 ppuka1 last message repeated 1 time
Sep  9 00:16:50 ppuka1 ID[SUNWudlm.udlmctl]: transition 'step1' timed out for cluster, forcing reconfiguration.
Sep  9 00:16:52 ppuka1 nscd[341]: yp_all: failed to get server's name
Sep  9 00:16:53 ppuka1 last message repeated 1 time
Sep  9 00:20:12 ppuka1 ID[SUNWudlm.udlmctl]: transition 'step1' timed out for cluster, forcing reconfiguration.

From ucmm_reconfig.log file : 
Sun Sep  9 00:03:07 PDT 2007 SUNWscucm.ucmm_reconf Step: cmmstep9 CURRNODES=0 1
2
Sun Sep  9 00:03:08 PDT 2007 SUNWscucm.ucmm_reconf Step: cmmstep10 CURRNODES=0 1
 2
Sun Sep  9 00:10:05 PDT 2007 SUNWscucm.ucmm_reconf Step: cmmreturn CURRNODES=0 1
 2
Sun Sep  9 00:10:07 PDT 2007 SUNWscucm.ucmm_reconf Step: cmmstep1 CURRNODES=0 1
2 3
Sun Sep  9 00:10:07 PDT 2007 SUNWscucm.ucmm_reconf ucmm reconfiguration step 1 s
tarted
Sun Sep  9 00:10:07 PDT 2007 SUNWscucm.ucmm_reconf reserve started in cmmstep1
Sun Sep  9 00:10:07 PDT 2007 SUNWscucm.ucmm_reconf reserve completed successfull
y in cmmstep1
Sun Sep  9 00:10:07 PDT 2007 SUNWscucm.ucmm_reconf ucmm reconfiguration step 1 c
ompleted
Sun Sep  9 00:10:08 PDT 2007 SUNWscucm.ucmm_reconf Step: cmmstep2 CURRNODES=0 1
2 3
Sun Sep  9 00:10:08 PDT 2007 SUNWscucm.ucmm_reconf ucmm reconfiguration step 2 s
tarted
On my 16 nodes cluster, sc3.2fcs, the problem showed up intermittently after these patches were installed, 120011-14, 119254-42, 122660-10, 125369-10, 125503-02, 125547-02, 126419-01, 126897-02.

The patches were needed to solve problem found in this cr, 6580729

>> below is Robert Bart's analysis of the issue
I did a savecore to capture the kernel state, files are:

    /var/crash/pbank2/*.12

I then did a gcore on the UDLM processes, these cores
are in /var/tmp/rbart.

Looking at the dlmmon process core, I see that it's waiting
on a close() call.

root@pbank2-1545-> pstack core.2918
core 'core.2918' of 2918:       dlmmon -n 2 -c /opt/SUNWudlm/etc/udlm.conf -i 1
 ffffffff7dcd3d9c close    (d)
 ffffffff7f316a30 lmcsini (d, ffffffffffffffff, ffffffff7f425d10, 0, ffffffff7ffff22c, 165c) + b10
 ffffffff7f318890 lkrcfg (ffffffff7f42bc30, 12c, 6, 2, ffffffff7f42bc48, ffffffff7f14c628) + bf0
 ffffffff7f319494 lkpcln (10, 3, 12004440, ffffffff7f425d10, 12000000, 0) + 94
 ffffffff7f309574 lkinit (ffffffff7ffff910, 12000000, 12004e18, 0, 4e20, 4c00) + 254
 ffffffff7f30a194 main (36b0, 0, ffffffff7f14c628, 0, 36b0, ae5) + 394
 0000000100000afc _start (0, 0, 0, 0, 0, 0) + 17c

The kernel shows that this close() is blocked in cv_wait_sig(), which explains
why doing the gcore causes the process to continue.

root@pbank2-1547-> mdb -k /var/crash/pbank2/*.12
Loading modules: [ unix krtld genunix dtrace specfs ufs sd px ssd fcp fctl qlc ip hook neti sctp arp usba nca md lofs zfs random logindmux ptm cpc sppp crypto nfs ipc ]
> ::pgrep lkmgr
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   2918      1   2918   2918      0 0x4a014000 0000060019e398c8 lkmgr
Z   2992   2918   2918   2918      0 0x42010002 00000600193f8c68 lkmgr
R   2931   2918   2918   2918      0 0x4a014000 000006001accc4f0 lkmgr
R   2930   2918   2918   2918      0 0x4a014000 000006001ad098e0 lkmgr
> 0000060019e398c8::walk thread
300045aad00
> 300045aad00::findstack -v
stack pointer for thread 300045aad00: 2a104c5cc91
[ 000002a104c5cc91 cv_wait_sig+0x114() ]
  000002a104c5cd41 tcp_close+0xb8(6001c303058, 6001c30d080, 7b6b0ee8, 0, 6001c30d280, 0)
  000002a104c5cdf1 qdetach+0x90(6001c303058, 704cf530, 83, 600194d66f0, 0, 20204032)
  000002a104c5cea1 strclose+0x3d4(6001c382d00, 6001c303058, 6001c3033e0, 200000, 40000, 6001c3032e8)
  000002a104c5cf71 socktpi_close+0x184(6001c382d00, 83, 0, 2a0000013b, 600194d66f0, 6001c30b080)
  000002a104c5d021 fop_close+0x20(6001c382d00, 83, 1, 0, 600194d66f0, 7bfc6768)
  000002a104c5d0d1 closef+0x4c(600198a3118, d, 0, 0, 600198a3118, 0)
  000002a104c5d181 closeandsetf+0x3bc(d, 0, 0, 1ff, ffffffff7ffff22c, 0)
  000002a104c5d231 close+8(d, 1, 0, 0, ffffffff7ffff22c, 0)
  000002a104c5d2e1 syscall_trap+0xac(d, 1, 0, 0, ffffffff7ffff22c, 0)
> 0000060019e398c8::pfiles
FD   TYPE            VNODE INFO
   0  CHR 0000060001d05800 /devices/pseudo/mm@0:null
   1  CHR 0000060001d05800 /devices/pseudo/mm@0:null
   2  CHR 0000060001d05800 /devices/pseudo/mm@0:null
   3  REG 000006001c379880 /var/cluster/ucmm/dlm_pbank2/logs/dlmstart.log
   4 DOOR 000006001944c440 /var/run/name_service_door [door to 'nscd' (proc=60003e40040)]
   5  REG 000006001c378680 /var/cluster/ucmm/dlm_pbank2/pipes/lmpid
   6 SOCK 000006001c008300 socket: AF_UNIX /var/cluster/ucmm/dlm_pbank2/pipes/lmep
   7  CHR 0000060019b0dc00 /devices/pseudo/tl@0:ticlts
   8  CHR 000006001acc25c0 /devices/pseudo/tl@0:ticlts
   9  REG 000006001c37d2c0 /var/cluster/ucmm/dlm_pbank2/logs/dlm.log
  10 DOOR 000006001c3482c0 [door to 'failfastd' (proc=60003d40480)]
  11  REG 000006001c37c4c0 /var/cluster/ucmm/dlm_pbank2/logs/dlm.trc
  12 SOCK 000006001c349dc0 socket: AF_INET 172.16.4.2 6012
  16 SOCK 000006001c246600
Other diagnostic info:

The following symptoms are seen in the dlm log files when this bug is hit on a RAC cluster.

  On the nodes where the lkmgr is not blocked in tcp_close()

23:56:19-00000-08427- Reconfiguration started: Thu Nov 15 23:56:19 2007
23:56:19-00000-08427- List of nodes: 0,1,2,3,
23:56:19-00000-08427- NO_1_TO_2_RCFG start: was only node?0, # members= 4
23:56:19-00000-08427-  Lock DB frozen
23:56:19-00000-08427-  Non-local Process blocks cleaned out
23:56:19-00000-08427-  Non-local Group blocks cleaned out
23:56:19-00000-08427-  Resources and locks cleaned out
23:56:19-00000-08427-  na2-db1-2-sjl connected to node 172.16.4.1, port (5002)
23:56:19-00000-08427-  na2-db1-2-sjl connected to node 172.16.4.4, port (5002)
23:56:30-00000-08427-  Connect completion call to 172.16.4.3 failed 25 times:Connection refused
23:56:40-00000-08427-  Connect completion call to 172.16.4.3 failed 50 times:Connection refused
23:56:51-00000-08427-  Connect completion call to 172.16.4.3 failed 75 times:Connection refused
23:57:02-00000-08427-  Connect completion call to 172.16.4.3 failed 100 times:Connection refused
23:57:12-00000-08427-  Connect completion call to 172.16.4.3 failed 125 times:Connection refused
23:57:23-00000-08427-  Connect completion call to 172.16.4.3 failed 150 times:Connection refused
23:57:34-00000-08427-  Connect completion call to 172.16.4.3 failed 175 times:Connection refused
23:57:45-00000-08427-  Connect completion call to 172.16.4.3 failed 200 times:Connection refused
23:57:55-00000-08427-  Connect completion call to 172.16.4.3 failed 225 times:Connection refused

  where the IP address identifies the node where the reconfiguration is stuck.

  these last entries continue to be logged till step1 times out and the udlm reconfigures.
A TCP endpoint can hang indefinitely. It only happens under the following
circumstances:

1) The process is in the RT class and has a higher absolute priority than
SYS class threads.
2) The process is bound to the same CPU as the squeue servicing the
connection.
3) There are messages already on the squeue when tcp_close() is called and/or
the squeue is currently being processed by another thread.
4) There is a signal pending.

The problem exhibits itself as an infinite loop in tcp_close(). The root
cause of this is the pending signal which causes cv_wait_sig() to return
having set the calling thread as runnable. The calling thread is higher
priority than the squeue worker but bound to the same CPU thus pinning it
and never allowing it to complete it's work of closing the connection.
Work Around
For the case where the Oracle RAC lkmgr is blocked during reconfiguration in a Sun Cluster 3.2 configuration, the steps described in CR 6624459 are a workaround for this issue.
Comments
N/A