OpenSolaris

Printable Version Enter a New Search
Bug ID 6626381
Synopsis rpc.nisd on subdomain-replica deadlocked when trying to access NIS+ admin group in cache
State 10-Fix Delivered (Fix available in build)
Category:Subcategory nis_plus:libnis
Keywords rtiq_reviewed
Responsible Engineer Satish Murugesan
Reported Against
Duplicate Of
Introduced In solaris_2.6
Commit to Fix snv_105
Fixed In snv_105
Release Fixed solaris_nevada(snv_105) , solaris_10u8(s10u8_01) (Bug ID:2174256)
Related Bugs
Submit Date 6-November-2007
Last Update Date 13-March-2009
Description
Customer detected that 1 (among 3) of the NIS+ replica servers in one of his NIS+ subdomains did no  longer reply to client request at some time. Perhaps the rpc.nisd on this replica may have been deadlocked in that status already 1-2 weeks before the customer noticed this issue. This replica is running S10, based on KJP 118833-36.


When the customer had reported this issue to us, I asked him to:
- Create an application core of the running rpc.nisd process
  by using gcore(1)
- Start trusses on the rpc.nisd, rpcbind and keyserv daemons on that
  replica, and then run the commands
      # rpcinfo -T tcp <replica-hostname> 100300 3
  and # rpcinfo -T tcp <replica-hostname> 100300 3
  on any NIS+ client
  These client-side rpcinfo commands timed out.
- Then use gcore again to create a 2nd application coredump of
  the rpc.nisd.

After collecting these debug information the customer disabled/enabled the NIS+ service in SMF on this replica, and then that newly started rpc.nisd daemon process was responsive again to NIS+ client requests.


The truss of the rpc.nisd that was captured during this test, did not show any processing of that process, which means it was somehow deadlocked or "hanging", although the rpcbind daemon had forwarded
the client requests to the rpc.nisd daemon.

The 2 application cores, that we took immediatelly before and after this test, show more or less the same.


The rpc.nisd daemon was running 5 LWPs:
(dbx) lwps
  l@1 running          in __lwp_park()
  l@2 running          in __lwp_park()
  l@2760263 running          in __lwp_park()
  l@2760264 running          in __lwp_park()
 >l@2760262 running          in __lwp_park()


3 of these LWPs show exactly the same stack trace where the functions rwlock_lock() and rw_rdlock_impl() got called with exactly the same pointer 0xff2a7780 as their first arguement, that is pointing to a rwlock_t structure which was handed over from the calling function __do_ismember() as the variable g_cache_lock. Example:
(dbx) where
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x0, 0xfebb2030, 0x1), at 0xff1c0594 
  [2] rwlock_lock(0xff2a7780, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xff1b2140 
  [3] rw_rdlock_impl(0xff2a7780, 0x0, 0x0, 0xfeb30960, 0xfeb30800, 0xff1e8298), at 0xff1b231c 
  [4] __do_ismember(0xee6fba30, 0x4bfce0, 0x405e0, 0x80808080, 0x342b4, 0x2e), at 0xff26eaf0 
  [5] __can_do(0x100, 0x40400, 0x4bfce0, 0xee6fba30, 0x2, 0x1), at 0x26918 
  [6] nis_iblist_svc(0xee6fbe9c, 0xe6f88, 0x21ca6e0, 0x0, 0x0, 0x40c00), at 0x1ad44 
  [7] nis_prog_svc(0xe6f88, 0xb06f0, 0x1e0, 0xee6fbe9c, 0xf, 0x1ac84), at 0x33068 
  [8] _svc_prog_dispatch(0xb06f0, 0x59f840, 0xe6f88, 0x3, 0x32b64, 0x0), at 0xff24f7ec 
  [9] _svc_run_mt(0xb06f0, 0xff2ace4c, 0x0, 0xff2a2d90, 0xffffff39, 0xff2a706c), at 0xff254ab8 


The rwlock_t structure is defined as follows:
# vi /usr/include/synch.h
...
typedef struct _rwlock {
        int32_t         readers;        /* -1 == writer else # of readers */
        uint16_t        type;
        uint16_t        magic;
        mutex_t         mutex;          /* used to indicate ownership */
        cond_t          readercv;       /* unused */
        cond_t          writercv;       /* unused */
} rwlock_t;


If we look at this data, we'll see the 1st 32 bits contain xffffffff, so this is a writer, and the mutex is pointing to 0x00004d58.

(dbx) examine 0xff2a7780/32x
0xff2a7780: g_cache_lock       :         0xffff 0xffff 0x0000 0x5257 0x0000 0x0000 0x0000 0x4d58
0xff2a7790: g_cache_lock+0x0010:         0x0000 0x0000 0xff00 0x0001 0x0000 0x0000 0xfeca 0x2000
0xff2a77a0: g_cache_lock+0x0020:         0x0000 0x0000 0x0000 0x4356 0x0000 0x0000 0x0000 0x0000
0xff2a77b0: g_cache_lock+0x0030:         0x0000 0x0000 0x0000 0x4356 0x0000 0x0000 0x0000 0x0000


Here we can see these 3 LWPs are waiting on the same lock:
(dbx) sync -info 0xff2a7780
g_cache_lock (0xff2a7780): thread  read lock
Threads blocked by this lock are:
 >t@2760262  a l@2760262   _svc_run_mt()   sleep on 0xff2a7780  in __lwp_park()
  t@2760263  a l@2760263   _svc_run_mt()   sleep on 0xff2a7780  in __lwp_park()
  t@2760264  a l@2760264   _svc_run_mt()   sleep on 0xff2a7780  in __lwp_park()

Since all these 3 LWPs show up a __do_ismember() in their stack, we know the rpc.nisd was in progress of accessing a NIS+ admin group in it's cache.


When a rpc.nisd is running fine (replying 2 client requests) we should normally see it's 1st (main) LWP waiting for incoming client requests in any kind of poll() function.

If we now take a look into the 1st LWP of this process, we'll see also this LWP was waiting in a cond_sleep_queue() function. From here we see it started with a "normal" NIS+ client request, because the function nis_iblist_svc() is somewhere at the top of this stack. Then __do_ismember() got called
to lookup for the cached NIS+ admin group. Then the rpc.nisd had to refresh it's (client-side) binding, that resulted in communicating to itself again. ... and finally the local lookup for the NIS+ admin group resulted in a cond_sleep_queue().

(dbx) where
=>[1] __lwp_park(0x0, 0x0, 0x0, 0x0, 0xfebb2000, 0x1), at 0xff1c0594 
  [2] cond_sleep_queue(0xc1848, 0xc1830, 0x0, 0x0, 0x0, 0x0), at 0xff1ba50c 
  [3] cond_wait_queue(0xc1848, 0xc1830, 0x0, 0x0, 0x0, 0x0), at 0xff1ba628 
  [4] cond_wait(0xc1848, 0xc1830, 0x0, 0xff3ee980, 0x32380, 0xff3ee268), at 0xff1baba8 
  [5] pthread_cond_wait(0xc1848, 0xc1830, 0x0, 0x1000, 0x0, 0x0), at 0xff1babe4 
  [6] __nis_lock_hash_table(0xc1830, 0x1, 0xff0eae58, 0x3, 0xe, 0x200), at 0xff0a78ac 
  [7] __nis_scan_table_mt(0xc1830, 0x25540, 0xffbfcef8, 0x3a800, 0x37b8c, 0x40008), at 0xff0a858c 
  [8] 0x25708(0x1df5550, 0x144a390, 0xaf2dc8, 0xffbfcf60, 0x2, 0x40c00), at 0x25707 
  [9] __directory_object_msg(0x2, 0xef1e0, 0x0, 0xffbfcfcc, 0xef1e0, 0x1df5550), at 0x25d98 
  [10] nis_finddirectory_svc(0xffbfd040, 0x0, 0x45400, 0x4, 0x0, 0x0), at 0x1c5d4 
  [11] __nis_finddirectory(0xffbfd0b4, 0x1df5550, 0xf53ba0, 0x0, 0x40c00, 0x2), at 0x294f8 
  [12] NisCache::loadDirectory(0x489a0, 0xb40650, 0x0, 0x2, 0x1c84918, 0x4), at 0xff278da0 
  [13] NisCache::bindReplica(0xff2a7dc4, 0x5e1d60, 0xffbfd17c, 0x0, 0xff2a7dc4, 0x0), at 0xff2787c8 
  [14] NisCache::refreshBinding(0x489a0, 0x834258, 0xff2a7dc4, 0xff2792e4, 0x0, 0x0), at 0xff2793d8 
  [15] NisCache::bindReplica(0xff2a7dc4, 0xffbfd406, 0xffbfd30c, 0xff2a7b28, 0xff278708, 0x0), at 0xff278778 
  [16] __nis_CacheBindDir(0xffbfd406, 0xffbfd30c, 0x0, 0xff1edc44, 0xffbfd244, 0xffbfd406), at 0xff27bf90 
  [17] nis_bind_dir(0xffbfd400, 0x20001, 0xffbfd30c, 0x20001, 0x0, 0x489a0), at 0xff26392c 
  [18] __nis_core_lookup(0xffbfd370, 0x20001, 0x0, 0x0, 0x0, 0x40c00), at 0x29218 
  [19] nis_lookup(0xffbfd400, 0x20001, 0xffbfd400, 0x400, 0xff26edb4, 0xff2a2d90), at 0xff264404 
  [20] get_group(0xe8f78, 0x0, 0x405e0, 0xffbfd934, 0x113, 0x33f8c), at 0xff26ee28 
  [21] cached_group_entry(0xe8f78, 0x0, 0x405e0, 0xffbfd934, 0x0, 0xea090), at 0xff26e614 
  [22] do_ismember_2(0xffbfda08, 0xe8f78, 0x0, 0x405e0, 0xffbfd934, 0xff1e8298), at 0xff26e8f4 
  [23] __do_ismember(0xffbfda08, 0x4bfce0, 0x405e0, 0x80808080, 0x342b4, 0x2e), at 0xff26eb08 
  [24] __can_do(0x100, 0x40400, 0x4bfce0, 0xffbfda08, 0x2, 0x1), at 0x26918 
  [25] nis_iblist_svc(0xffbfde74, 0x5e1a70, 0x2352788, 0x0, 0x0, 0x40c00), at 0x1ad44 
  [26] nis_prog_svc(0x5e1a70, 0x5e1a08, 0x1e0, 0xffbfde74, 0xf, 0x1ac84), at 0x33068 
  [27] _svc_prog_dispatch(0x5e1a08, 0x5b8390, 0x5e1a70, 0x3, 0x32b64, 0x0), at 0xff24f7ec 
  [28] _svc_run_mt(0x5e1a08, 0xff2ace4c, 0x0, 0xff2a2d90, 0xffffff39, 0xff2a706c), at 0xff254ab8 
  [29] svc_run(0x0, 0xff2a7030, 0x40008, 0xff2a2d90, 0x4ecc0, 0x0), at 0xff254158 
  [30] main(0x42000, 0x0, 0xffbffd2c, 0x0, 0x18400, 0x0), at 0x183d8 

If we now look at the 1st aguement pointer in the cond_sleep_wait() function, we again see a mutex pointing to pointing to 0x00004d58.

(dbx) examine 0xc1848/32x
dbx: warning: unknown language, 'c' assumed
0x000c1848:      0x0003 0x0100 0x0000 0x4356 0x0000 0x0000 0x0000 0x0000
0x000c1858:      0x0004 0x0000 0x0000 0x4d58 0x0000 0x0000 0x0000 0x0000
0x000c1868:      0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x000c1878:      0x0000 0x0001 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
Work Around
N/A
Comments
N/A