OpenSolaris

Printable Version Enter a New Search
Bug ID 6666472
Synopsis Network performance regression under heavy load (introduced by 4868863)
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:streams
Keywords Barcelona | rtiq_regression | spbc_s10X
Responsible Engineer Jonathan Anderson
Reported Against 5.10 , fw_30 , s9u7_fcs , s10u4_fcs , solaris_9u8 , solaris_10u3 , solaris_10u4
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_92
Fixed In snv_92
Release Fixed solaris_nevada(snv_92) , solaris_10u6(s10u6_06) (Bug ID:2159395)
Related Bugs 4868863 , 6471647 , 6668325 , 6707755 , 6715383 , 6718169 , 6721470 , 6791148
Submit Date 22-February-2008
Last Update Date 17-July-2008
Description
During testing of KU 127112-05 (and 127112-10), we see severe network performance degradation (characterized by a 2-orders-of-magnitude decrease in the number of packets per second as observed by "netstat -ni 1").  The two tests run are (1) a disk stresser test called "FilePat" (written By Greg Opp -- sources available by request) which consists of a workload of sequential reads, seeks, and sequential writes, and (2) corrupt_tcp (by Gerald Gibson, sources at /net/npweb/dsg-gate/src/suites/TCP_corrupt/src), a network test that transfers large amounts of data over TCP sockets (looking for any corruption in the transferred pattern).

To reproduce the problem, I've been running the following on an 8-core opteron system with 4 nge NICs, 8G of RAM, and 8 SAS disks.  7 of the SAS disks are configured as 7 single-disk zpools (named pool1, pool2, ..., and pool7; mounted in the default place (/pool1, /pool2, ..., /pool7).  I created directories /pool<N>/test (required for the FilePat test), then I fired up 6 FilePat  processes per "disk", with the following script:


---------------------8<-----------------------
#!/bin/tcsh

set c = 1;
set LOGDIR=/export/home/tests/FilePat/V005/Solaris/X86
set EXECDIR=/export/home/tests/FilePat/V005/Solaris/X86

foreach x ( /pool? )

	# 6 procs per pool:
        set numprocs = 6

        while ( $numprocs != 0 )
                ${EXECDIR}/FilePatternTest.bin -b 102928 -r 3 -f 3083 \
                -hours 234 -minutes 0 -dir ${x}/test -s pingpong \
                >& "${LOGDIR}/FilePatternTest-`printf %03d ${c}`-`date +%Y-%m-%d`.log" &

                @ c++;
                @ numprocs--;
        end

end
---------------------8<-----------------------

 After the FilePat tests were running, I kicked off 3 corrupt_tcp server processes on the same system:

./corrupt_tcp_svr -b 262144 -p 5650 -s 64095
./corrupt_tcp_svr -b 262144 -p 5651 -s 64095
./corrupt_tcp_svr -b 262144 -p 5648 -s 64000

(and on the crossover-connected client system, I fired up 3 clients:
./corrupt_tcp_cli -b 262144 -p 5651 -s 64095 -t 432000 192.168.4.1
./corrupt_tcp_cli -b 262144 -p 5650 -s 64095 -t 432000 192.168.3.1
./corrupt_tcp_cli -b 262144 -p 5648 -s 64000 -t 432000 192.168.1.1
)

 At the start of the test, the netstat output is consistent with performance we get from s10u4 FCS:
    input   e1000g    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls 
8075    0     2511    0     0      1240289189 0     621405498 0     0     
1       0     0       0     0      241535  0     121680  0     0     
3       0     0       0     0      248534  0     124522  0     0     
0       0     0       0     0      249430  0     125125  0     0     
0       0     0       0     0      249272  0     124638  0     0     
1       0     0       0     0      249185  0     125340  0     0     
0       0     0       0     0      248530  0     124514  0     0     
...


 But, within 15 minutes, the packet rate degrades.
 (I've noticed that the packet count varies wildly between ~250,000 and ~125,000 (in the first few minutes) before ultimately settling on the super-low throughput shown in the following output):

    input   e1000g    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls 
363     0     457     0     0      5077286910 0     2545551874 0     0     
0       0     0       0     0      32      0     30      0     0     
0       0     0       0     0      82      0     38      0     0     
0       0     0       0     0      61      0     34      0     0     
0       0     0       0     0      50      0     25      0     0     
0       0     0       0     0      47      0     25      0     0     
0       0     0       0     0      25      0     27      0     0     
0       0     0       0     0      97      0     46      0     0     
0       0     0       0     0      58      0     28      0     0     
0       0     0       0     0      70      0     49      0     0     
0       0     0       0     0      53      0     25      0     0     
0       0     0       0     0      54      0     36      0     0     
0       0     0       0     0      51      0     52      0     0     
0       0     0       0     0      6       0     17      0     0     
0       0     0       0     0      6       0     12      0     0     
0       0     0       0     0      9       0     31      0     0     
0       0     0       0     0      113     0     74      0     0     
0       0     0       0     0      105     0     57      0     0     
0       0     0       0     0      58      0     32      0     0     
0       0     0       0     0      6       0     21      0     0     
0       0     0       0     0      110     0     44      0     0     
0       0     0       0     0      59      0     33      0     0     
0       0     0       0     0      28      0     34      0     0     
0       0     0       0     0      111957  0     55986   0     0     
0       0     0       0     0      56378   0     28036   0     0     
0       0     0       0     0      115     0     31      0     0     
0       0     0       0     0      477     0     186     0     0     
0       0     0       0     0      179     0     280     0     0     
0       0     0       0     0      249     0     231     0     0     
0       0     0       0     0      137     0     185     0     0     
0       0     0       0     0      113     0     60      0     0     
0       0     0       0     0      186     0     96      0     0     
0       0     0       0     0      146     0     70      0     0     
0       0     0       0     0      149     0     61      0     0     
0       0     0       0     0      193     0     170     0     0     
0       0     0       0     0      70      0     96      0     0     
0       0     0       0     0      100     0     52      0     0     
0       0     0       0     0      55      0     30      0     0     
0       0     0       0     0      84      0     45      0     0     


 (Note that there are bursts of packets, but overall, the performance is poor)
(I have also reproduced this on an 8-core Intel system with 4 e1000g NICs under Nevada build 82).


 Dtracing the system (just with the profile provider shows the following most-encountered stack traces):

              genunix`putbq+0xee
              genunix`putback+0x5b
              genunix`kstrgetmsg+0x5e3
              sockfs`sotpi_recvmsg+0x260
              sockfs`socktpi_read+0x84
              genunix`fop_read+0x31
              genunix`read+0x188
              unix`sys_syscall+0x17b
             1463

              zfs`fletcher_2_native+0x16
              zfs`zio_checksum_verify+0x27
              zfs`zio_next_stage+0x65
              zfs`zio_wait_for_children+0x49
              zfs`zio_wait_children_done+0x15
              zfs`zio_next_stage+0x65
              zfs`zio_vdev_io_assess+0x84
              zfs`zio_next_stage+0x65
              zfs`vdev_disk_io_done+0x36
              zfs`vdev_io_done+0x12
              zfs`zio_vdev_io_done+0x1b
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
             2704

              unix`kcopy+0x2c
              genunix`uiomove+0x93
              zfs`dmu_read_uio+0x97
              zfs`zfs_read+0x13c
              genunix`fop_read+0x31
              genunix`read+0x188
              genunix`read32+0xe
              unix`sys_syscall32+0x101
             6799

              unix`cpu_halt+0x122
              unix`idle+0x89
              unix`thread_start+0x8
            26509


 (The server does show idle time, which explains the cpu_halt trace;  the system is also under the filesystem test, which would explain the kcopy and ZFS fletcher2 checksum calculation).  

I've also looked at read-side streams queues for the server process and they're almost always full:

> ::ps ! grep corr
R   1073      1   1070    870      0 0x4a004200 ffffffffbc8856c8 corrupt_tcp_svr
R   1074      1   1070    870      0 0x4a004200 ffffffffbc884a68 corrupt_tcp_svr
R   1085   1074   1070    870      0 0x42000200 ffffffffbc880028 corrupt_tcp_svr
R   1075      1   1070    870      0 0x4a004200 ffffffffbc883e08 corrupt_tcp_svr
R   1084   1075   1070    870      0 0x42000200 ffffffffbc880c88 corrupt_tcp_svr
R   1076      1   1070    870      0 0x4a004200 ffffffffbc8831a8 corrupt_tcp_svr
R   1086   1076   1070    870      0 0x42000200 ffffffffbc8818e8 corrupt_tcp_svr
> ffffffffbc880028::pfiles
FD   TYPE            VNODE INFO
   0  CHR ffffffff847c7480 /devices/pseudo/mm@0:null 
   1  REG ffffffffb7f0bac0 /export/home/tests/network/TCP/tcp_svr_p5648_2008-02-20-1539.log 
   2  REG ffffffffb7f0bac0 /export/home/tests/network/TCP/tcp_svr_p5648_2008-02-20-1539.log 
   4 SOCK ffffffffb7f0b8c0 socket: AF_INET 10.48.49.141 4118 remote: AF_INET 10.48.49.115 43662 
> ffffffffb7f0b8c0::print vnode_t v_stream | ::stream

+-----------------------+-----------------------+
| 0xffffffffa1a2c650    | 0xffffffffa1a2c558    | 
| strwhead              | strrhead              | 
|                       |                       |
| cnt = 0t0             | cnt = 0t538635794     | 
| flg = 0x00004022      | flg = 0x0004403c      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffffffbafe7e40    | 0xffffffffbafe7d48    | 
| tcp                   | tcp                   | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x20244022      | flg = 0x20204032      | 
+-----------------------+-----------------------+
> 0xffffffffa1a2c558::queue -v
            ADDR MODULE         FLAGS NBLK
ffffffffa1a2c558 strrhead      04403c    0 0000000000000000
                               |
                               +-->  QWANTW       Someone wants to write Q
                                     QFULL        Q is considered full
                                     QREADR       This is the reader (first) Q
                                     QUSE         This queue in use (allocation)
                                     QMTSAFE      stream module is MT-safe
                                     QEND         last queue in stream

 So somehow we're not processing the messages fast enough to clear the queues fast enough to allow the NICs to continue receiving at the previous high rates.

I've also seen using dtrace that putback() is taking a LONG time (relative to systems without the problem, with the same packet rate):

The script:
-------------------------------8<------------------------------
#!/usr/sbin/dtrace -Fs
tcp_rcv_drain:entry
{
  self->in = 1;
}

tcp_rcv_drain:return /self->in /
{
  self->in = 0;
}

putnext: entry, canputnext: entry, tcp_timeout_cancel:entry / self->in /
{
  self->times[probefunc] = timestamp;
}

putnext: return, canputnext: return, tcp_timeout_cancel:return
  /self->in
  && (self->times[probefunc]) /
{
  @[probefunc] = quantize (timestamp - self->times[probefunc]);
  self->times[probefunc] = 0;
}

-------------------------------8<------------------------------

On systems with the fix for 4868863:

tcp_timeout_cancel                                
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             442      
            4096 |@@@@@                                    77       
            8192 |@@@@@@@                                  109      
           16384 |                                         0        

  canputnext                                        
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@               547      
            4096 |@@                                       38       
            8192 |@                                        19       
           16384 |                                         1        
           32768 |                                         0        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         1        
          524288 |                                         0        
         1048576 |                                         0        
         2097152 |                                         0        
         4194304 |                                         0        
         8388608 |@@@@@@@@@@@                              232      
        16777216 |                                         4        
        33554432 |                                         0        

  putnext                                           
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@                                 169      
            4096 |@@@@@@@@@                                196      
            8192 |@@@@@@@@                                 169      
           16384 |@@@                                      57       
           32768 |                                         10       
           65536 |                                         1        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         0        
         1048576 |                                         0        
         2097152 |                                         0        
         4194304 |                                         0        
         8388608 |                                         3        
        16777216 |@                                        19       
        33554432 |@@@@@@@@@@                               219      
        67108864 |                      

(CONTINUED IN NEXT COMMENT)
(continued from previous comment)

 (Notice the huge amount of time taken by putnext in the previous comment's dtrace output).


(The following is output from a system WITHOUT the fix for 4868863:
  canputnext                                        
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 44418    
            4096 |                                         285      
            8192 |                                         18       
           16384 |                                         2        
           32768 |                                         7        
           65536 |                                         0        

  tcp_timeout_cancel                                
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  43971    
            4096 |@                                        747      
            8192 |                                         9        
           16384 |                                         0        
           32768 |                                         2        
           65536 |                                         0        

  putnext                                           
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@                                  7835     
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            32715    
            8192 |@@@                                      3772     
           16384 |                                         320      
           32768 |                                         74       
           65536 |                                         11       
          131072 |                                         1        
          262144 |                                         2        
          524288 |                                         0        

  (Here is what I would expect putnext's time distribution to look like on a system that is not cpu-saturated (as is the case in these tests) -- small spikes of 131k-262k nsecs, but mostly 4k-8k nsecs).

  To zero-in on 4868863, I brought over the s10u5 build 3 snapshot, did a full build and BFU'ed the test system with the build 3 non-debug archives, verified the problem was present, then backed out the fix for 4868863, and rebuilt genunix ONLY, copied over genunix, rebooted, and reran the tests.  The packet throughput was consistent with s10u4's throughput, even after 2 hours of testing (average time to poor performance with the fix for 4868863 is 15 minutes).
Looking at one of the corrupt_tcp server process's stream receive-side, a very large number of mblks are linked together (via b_cont):

> 0xffffffffa1a2c558::print queue_t
{
    q_qinfo = strdata
    q_first = 0xfffffe890a17c080
    q_last = 0xfffffe890a17c080
    q_next = 0
...
> 0xfffffe890a17c080::list mblk_t b_cont ! wc -l
 150670

 Analyzing the sizes of the individual mblks, there are around 144,200 of size 1460, but there are also sizes that range CONTINUOUSLY across the interval [1..1460) (multiples of each size).  Given that linkb() traverses the linked lists of mblks linearly, O(n^2) for a list of size 150,000 could easily lead to long putback() times.
This appears to reproduce a lot easier on Solaris 10 (and the worst-case appears worse also -- the packet rate devolves to around 100-200/second on a 4-cpu system with 3 e1000g's under test), e.g.: ("netstat -ni 1" output:

3       0     0       0     0      108     0     53      0     0     
2       0     0       0     0      109     0     56      0     0     
1       0     0       0     0      159     0     81      0     0     
2       0     0       0     0      116     0     60      0     0     
2       0     0       0     0      158     0     90      0     0     
2       0     0       0     0      161     0     72      0     0     
1       0     0       0     0      112     0     50      0     0     
1       0     0       0     0      165     0     83      0     0     
1       0     0       0     0      134     0     69      0     0     
3       0     2       0     0      158     0     79      0     0     
1       0     0       0     0      133     0     68      0     0     
1       0     0       0     0      110     0     68      0     0     
2       0     0       0     0      186     0     79      0     0     
2       0     0       0     0      162     0     84      0     0     
3       0     0       0     0      164     0     74      0     0     
1       0     0       0     0      149     0     99      0     0     
3       0     0       0     0      172     0     74      0     0     
1       0     0       0     0      169     0     92      0     0     
2       0     0       0     0      138     0     66      0     0     
1       0     0       0     0      130     0     68      0     0     
2       0     0       0     0      188     0     88      0     0     
1       0     0       0     0      148     0     82      0     0     
1       0     0       0     0      150     0     69      0     0     
1       0     0       0     0      150     0     72      0     0     

 ). (This is with 127112-10 installed on top of s10u4 on an X4500)  I've saved a live dump of the system while in this state.  It's available at /net/bergsoft.west/opt/dump/x4500.  From that dump, looking at the receive queue of one of the corrupt_tcp processes:

> 0xffffffffb305f590::print queue_t
{   
    q_qinfo = strdata
    q_first = 0
    q_last = 0
    q_next = 0
    q_link = 0
    q_ptr = 0xffffffffb3067c98
    q_count = 0x1d8347b3
    q_flag = 0x4403c
    q_minpsz = 0
    q_maxpsz = 0xffffffffffffffff
    q_hiwat = 0x40290
    q_lowat = 0x400
    q_bandp = 0
    q_lock = {
        _opaque = [ 0xffffffffb30f77e0 ]
    }
    q_stream = 0xffffffffb3067c98
    q_syncq = 0xffffffffb305f780
    q_nband = 0
    q_wait = {
        _opaque = 0
    }
    q_sync = {
        _opaque = 0
    }
    q_nfsrv = 0xffffffffb305f590
    q_nbsrv = 0xffffffffb305f300
    q_draining = 0
    q_struiot = 0xffff
    q_syncqmsgs = 0
    q_mblkcnt = 0x54432
    q_sqhead = 0
    q_sqtail = 0
 ...

  q_count is 495,142,835 (!!!!) and q_mblkcnt is 345,138.  Something is seriously interfering with proper flow control :).
Work Around
It helps if the receive buffer size is close to the read/recv size used in IO system
calls. i.e. if you are reading 16k at a time, don't use a receive window greater than
64k. This ameliorates the problem significantly.
Comments
N/A