OpenSolaris

Bug ID 6789732
Synopsis libdlpi may get stuck in i_dlpi_strgetmsg()
State 10-Fix Delivered (Fix available in build)
Category:Subcategory library:libdlpi
Keywords
Responsible Engineer Peter Memishian
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_107
Fixed In snv_107
Release Fixed solaris_nevada(snv_107), solaris_10u8(s10u8_03) (Bug ID:2178576)
Related Bugs 6512059
Submit Date 1-January-2009
Last Update Date 28-January-2009
Description
During IPMP stress testing, I occasionally found in.mpathd would go 
completely unresponsive; pstack(1M) would show: 
 
  libc_hwcap2.so.1`__getmsg+7(b, 8042500, 8044510, 80424f4) 
  libdlpi.so.1`i_dlpi_strgetmsg+0x1df(80c37d8, 1388, 8046608, 100, ... 
  libdlpi.so.1`i_dlpi_msg_common+0x68(80c37d8, 8046600, 8046608, 8, 0, ... 
  libdlpi.so.1`dlpi_enabnotify+0x10d(80c37d8, 1, 805487c, 80c5840, ... 
  phyint_link_init+0x129(80c5840, 8046b70, 8046778, 8054dd2) 
  phyint_create+0xbb(8046b70, 80b6060, 6, 9044843, 2, 80467d8) 
  phyint_inst_init_from_k+0x2c1(2, 8046b70, 8046bb8, 0) 
  pii_process+0xf0(2, 8046b70, 80469cc, 8059ac7) 
  initifs+0x2c7(0, 0, 0, 1, e030150, 10) 
  process_rtsock+0x189(4, 5, ffffffff, 805c3bf) 
  main+0x4d3(1, 8047450, 8047458, 804740c) 
  _start+0x7d(1, 8047580, 0, 8047590, 804759a, 80475d6) 
 
That is, in.mpathd is stuck in getmsg().  Looking at the libdlpi code, 
that shouldn't happen because poll() indicated that there was a message 
to read: 
                switch (poll(&pfd, 1, msec)) { 
                default: 
                        if (pfd.revents & POLLHUP) 
                                return (DL_SYSERR); 
                        break; 
                case 0: 
                        return (DLPI_ETIMEDOUT); 
                case -1: 
                        return (DL_SYSERR); 
                } 
 
                if ((retval = getmsg(fd, &ctl, &data, &flags)) < 0) 
                        return (DL_SYSERR); 
 
Looking at the kernel side of this process, there is indeed a message 
to read:

  > ::pgrep in.mpathd 
  S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME 
  R 101120      1 101119 101119      0 0x42000000 ffffff0240cc83a0 in.mpathd 
  > ffffff0240cc83a0::walk thread | ::findstack -v 
  stack pointer for thread ffffff01d3da8200: ffffff0007e7ba90 
  [ ffffff0007e7ba90 _resume_from_idle+0xf1() ] 
    ffffff0007e7bac0 swtch+0x200() 
    ffffff0007e7bb20 cv_wait_sig+0x162(ffffff020ba49dd2, ffffff0248fee550) 
    ffffff0007e7bb80 str_cv_wait+0xbc(ffffff020ba49dd2, ffffff0248fee550, ... 
    ffffff0007e7bc30 strwaitq+0x234(ffffff0248fee4d0, 8, 0, 3, ... 
    ffffff0007e7bd50 strgetmsg+0x378(ffffff020c551300, ffffff0007e7bdb0,  
    ffffff0007e7bdd0, ffffff0007e7be67, ffffff0007e7be60, 3, ... 
    ffffff0007e7be40 msgio32+0x2b8(b, 8042500, 8044510, ffffff0007e7be68, 1, 
    ffffff0007e7be67, ffffff0007e7be60) 
    ffffff0007e7beb0 getmsg32+0x96(b, 8042500, 8044510, 80424f4) 
    ffffff0007e7bf00 sys_syscall32+0x1fc() 
  > ffffff020c551300::print vnode_t v_stream->sd_wrq | ::q2otherq | ::queue 
              ADDR MODULE         FLAGS NBLK 
  ffffff020ba49d50 strrhead      044030   11 ffffff01daf1eb40 
 
However, the message is low-priority: 
 
  > ffffff01daf1eb40::print mblk_t b_datap->db_type 
  b_datap->db_type = 0x1 
 
... and back in userland, we can see that libdlpi has erroneously 
requested only high-priority messages, as per the final argument to 
getmsg() above: 
 
  > 80424f4/X 
  0x80424f4:      1                
 
Looking at the code again, the problem is clear: `flags' is initialized at 
the very top of i_dlpi_strgetmsg(), but never reinitialized.  So if it 
reads a high-priority message, `flags' will get set to 1 by that getmsg() 
invocation.  When it loops to read the next message, it'll then 
erroneously request only high-priority messages, and get stuck.  The fix 
is also clear: always initialize `flags' to 0 prior to calling getmsg().
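
The stale-`flags' behavior can be reproduced without a STREAMS device. The
sketch below is a simplified model and is not the libdlpi code: sim_getmsg(),
struct sim_stream, sim_read_messages(), SIM_RS_HIPRI and SIM_WOULD_BLOCK are
hypothetical names invented for illustration. It assumes only the documented
getmsg() contract: on entry, a flags value of RS_HIPRI means "accept only a
high-priority message", and on return flags reports the priority of the
message that was retrieved. A call that would sleep is modeled as a
SIM_WOULD_BLOCK return.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_RS_HIPRI    0x01  /* hypothetical stand-in for RS_HIPRI */
#define SIM_WOULD_BLOCK (-2)  /* hypothetical: models getmsg() sleeping */

/* Toy stream head: hipri[i] is 1 for a high-priority message, 0 otherwise. */
struct sim_stream {
    const int *hipri;
    size_t count;
    size_t next;
};

/*
 * Simplified model of getmsg() flag handling (an assumption, not the
 * kernel code): if the caller passes SIM_RS_HIPRI, only a high-priority
 * message is accepted; on success *flagsp is overwritten with the
 * priority of the message that was returned.
 */
int
sim_getmsg(struct sim_stream *s, int *flagsp)
{
    int is_hipri;

    assert(s != NULL && flagsp != NULL);
    if (s->next >= s->count)
        return (SIM_WOULD_BLOCK);       /* nothing queued */

    is_hipri = s->hipri[s->next];
    if (*flagsp == SIM_RS_HIPRI && !is_hipri)
        return (SIM_WOULD_BLOCK);       /* caller asked for hipri only */

    s->next++;
    *flagsp = is_hipri ? SIM_RS_HIPRI : 0;
    return (0);
}

/*
 * Try to read `want' messages and return how many were read before a
 * call would have blocked.  With reset_flags == 0, `flags' is set once
 * before the loop (the buggy pattern); with reset_flags != 0 it is
 * zeroed before every call (the fix described above).
 */
size_t
sim_read_messages(struct sim_stream *s, size_t want, int reset_flags)
{
    int flags = 0;
    size_t got = 0;

    while (got < want) {
        if (reset_flags)
            flags = 0;
        if (sim_getmsg(s, &flags) != 0)
            break;
        got++;
    }
    return (got);
}
```

With a queue holding a high-priority message followed by a normal one, the
once-only initialization stops after the first message, because the second
call inherits flags == 1. Resetting `flags' before each call reads both.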
Work Around
N/A
Comments
N/A