|
Description
|
During testing of KU 127112-05 (and 127112-10), we see severe network performance degradation (characterized by a 2-orders-of-magnitude decrease in the number of packets per second as observed by "netstat -ni 1"). The two tests run are (1) a disk stresser test called "FilePat" (written By Greg Opp -- sources available by request) which consists of a workload of sequential reads, seeks, and sequential writes, and (2) corrupt_tcp (by Gerald Gibson, sources at /net/npweb/dsg-gate/src/suites/TCP_corrupt/src), a network test that transfers large amounts of data over TCP sockets (looking for any corruption in the transferred pattern).
To reproduce the problem, I've been running the following on an 8-core opteron system with 4 nge NICs, 8G of RAM, and 8 SAS disks. 7 of the SAS disks are configured as 7 single-disk zpools (named pool1, pool2, ..., and pool7; mounted in the default place (/pool1, /pool2, ..., /pool7). I created directories /pool<N>/test (required for the FilePat test), then I fired up 6 FilePat processes per "disk", with the following script:
---------------------8<-----------------------
#!/bin/tcsh
set c = 1;
set LOGDIR=/export/home/tests/FilePat/V005/Solaris/X86
set EXECDIR=/export/home/tests/FilePat/V005/Solaris/X86
foreach x ( /pool? )
# 6 procs per pool:
set numprocs = 6
while ( $numprocs != 0 )
${EXECDIR}/FilePatternTest.bin -b 102928 -r 3 -f 3083 \
-hours 234 -minutes 0 -dir ${x}/test -s pingpong \
>& "${LOGDIR}/FilePatternTest-`printf %03d ${c}`-`date +%Y-%m-%d`.log" &
@ c++;
@ numprocs--;
end
end
---------------------8<-----------------------
After the FilePat tests were running, I kicked off 3 corrupt_tcp server processes on the same system:
./corrupt_tcp_svr -b 262144 -p 5650 -s 64095
./corrupt_tcp_svr -b 262144 -p 5651 -s 64095
./corrupt_tcp_svr -b 262144 -p 5648 -s 64000
(and on the crossover-connected client system, I fired up 3 clients:
./corrupt_tcp_cli -b 262144 -p 5651 -s 64095 -t 432000 192.168.4.1
./corrupt_tcp_cli -b 262144 -p 5650 -s 64095 -t 432000 192.168.3.1
./corrupt_tcp_cli -b 262144 -p 5648 -s 64000 -t 432000 192.168.1.1
)
At the start of the test, the netstat output is consistent with performance we get from s10u4 FCS:
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
8075 0 2511 0 0 1240289189 0 621405498 0 0
1 0 0 0 0 241535 0 121680 0 0
3 0 0 0 0 248534 0 124522 0 0
0 0 0 0 0 249430 0 125125 0 0
0 0 0 0 0 249272 0 124638 0 0
1 0 0 0 0 249185 0 125340 0 0
0 0 0 0 0 248530 0 124514 0 0
...
But, within 15 minutes, the packet rate degrades.
(I've noticed that the packet count varies wildly between ~250,000 and ~125,000 (in the first few minutes) before ultimately settling on the super-low throughput shown in the following output):
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
363 0 457 0 0 5077286910 0 2545551874 0 0
0 0 0 0 0 32 0 30 0 0
0 0 0 0 0 82 0 38 0 0
0 0 0 0 0 61 0 34 0 0
0 0 0 0 0 50 0 25 0 0
0 0 0 0 0 47 0 25 0 0
0 0 0 0 0 25 0 27 0 0
0 0 0 0 0 97 0 46 0 0
0 0 0 0 0 58 0 28 0 0
0 0 0 0 0 70 0 49 0 0
0 0 0 0 0 53 0 25 0 0
0 0 0 0 0 54 0 36 0 0
0 0 0 0 0 51 0 52 0 0
0 0 0 0 0 6 0 17 0 0
0 0 0 0 0 6 0 12 0 0
0 0 0 0 0 9 0 31 0 0
0 0 0 0 0 113 0 74 0 0
0 0 0 0 0 105 0 57 0 0
0 0 0 0 0 58 0 32 0 0
0 0 0 0 0 6 0 21 0 0
0 0 0 0 0 110 0 44 0 0
0 0 0 0 0 59 0 33 0 0
0 0 0 0 0 28 0 34 0 0
0 0 0 0 0 111957 0 55986 0 0
0 0 0 0 0 56378 0 28036 0 0
0 0 0 0 0 115 0 31 0 0
0 0 0 0 0 477 0 186 0 0
0 0 0 0 0 179 0 280 0 0
0 0 0 0 0 249 0 231 0 0
0 0 0 0 0 137 0 185 0 0
0 0 0 0 0 113 0 60 0 0
0 0 0 0 0 186 0 96 0 0
0 0 0 0 0 146 0 70 0 0
0 0 0 0 0 149 0 61 0 0
0 0 0 0 0 193 0 170 0 0
0 0 0 0 0 70 0 96 0 0
0 0 0 0 0 100 0 52 0 0
0 0 0 0 0 55 0 30 0 0
0 0 0 0 0 84 0 45 0 0
(Note that there are bursts of packets, but overall, the performance is poor)
(I have also reproduced this on an 8-core Intel system with 4 e1000g NICs under Nevada build 82).
Dtracing the system (just with the profile provider shows the following most-encountered stack traces):
genunix`putbq+0xee
genunix`putback+0x5b
genunix`kstrgetmsg+0x5e3
sockfs`sotpi_recvmsg+0x260
sockfs`socktpi_read+0x84
genunix`fop_read+0x31
genunix`read+0x188
unix`sys_syscall+0x17b
1463
zfs`fletcher_2_native+0x16
zfs`zio_checksum_verify+0x27
zfs`zio_next_stage+0x65
zfs`zio_wait_for_children+0x49
zfs`zio_wait_children_done+0x15
zfs`zio_next_stage+0x65
zfs`zio_vdev_io_assess+0x84
zfs`zio_next_stage+0x65
zfs`vdev_disk_io_done+0x36
zfs`vdev_io_done+0x12
zfs`zio_vdev_io_done+0x1b
genunix`taskq_thread+0xbc
unix`thread_start+0x8
2704
unix`kcopy+0x2c
genunix`uiomove+0x93
zfs`dmu_read_uio+0x97
zfs`zfs_read+0x13c
genunix`fop_read+0x31
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0x101
6799
unix`cpu_halt+0x122
unix`idle+0x89
unix`thread_start+0x8
26509
(The server does show idle time, which explains the cpu_halt trace; the system is also under the filesystem test, which would explain the kcopy and ZFS fletcher2 checksum calculation).
I've also looked at read-side streams queues for the server process and they're almost always full:
> ::ps ! grep corr
R 1073 1 1070 870 0 0x4a004200 ffffffffbc8856c8 corrupt_tcp_svr
R 1074 1 1070 870 0 0x4a004200 ffffffffbc884a68 corrupt_tcp_svr
R 1085 1074 1070 870 0 0x42000200 ffffffffbc880028 corrupt_tcp_svr
R 1075 1 1070 870 0 0x4a004200 ffffffffbc883e08 corrupt_tcp_svr
R 1084 1075 1070 870 0 0x42000200 ffffffffbc880c88 corrupt_tcp_svr
R 1076 1 1070 870 0 0x4a004200 ffffffffbc8831a8 corrupt_tcp_svr
R 1086 1076 1070 870 0 0x42000200 ffffffffbc8818e8 corrupt_tcp_svr
> ffffffffbc880028::pfiles
FD TYPE VNODE INFO
0 CHR ffffffff847c7480 /devices/pseudo/mm@0:null
1 REG ffffffffb7f0bac0 /export/home/tests/network/TCP/tcp_svr_p5648_2008-02-20-1539.log
2 REG ffffffffb7f0bac0 /export/home/tests/network/TCP/tcp_svr_p5648_2008-02-20-1539.log
4 SOCK ffffffffb7f0b8c0 socket: AF_INET 10.48.49.141 4118 remote: AF_INET 10.48.49.115 43662
> ffffffffb7f0b8c0::print vnode_t v_stream | ::stream
+-----------------------+-----------------------+
| 0xffffffffa1a2c650 | 0xffffffffa1a2c558 |
| strwhead | strrhead |
| | |
| cnt = 0t0 | cnt = 0t538635794 |
| flg = 0x00004022 | flg = 0x0004403c |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffffffbafe7e40 | 0xffffffffbafe7d48 |
| tcp | tcp |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x20244022 | flg = 0x20204032 |
+-----------------------+-----------------------+
> 0xffffffffa1a2c558::queue -v
ADDR MODULE FLAGS NBLK
ffffffffa1a2c558 strrhead 04403c 0 0000000000000000
|
+--> QWANTW Someone wants to write Q
QFULL Q is considered full
QREADR This is the reader (first) Q
QUSE This queue in use (allocation)
QMTSAFE stream module is MT-safe
QEND last queue in stream
So somehow we're not processing the messages fast enough to clear the queues fast enough to allow the NICs to continue receiving at the previous high rates.
I've also seen using dtrace that putback() is taking a LONG time (relative to systems without the problem, with the same packet rate):
The script:
-------------------------------8<------------------------------
#!/usr/sbin/dtrace -Fs
tcp_rcv_drain:entry
{
self->in = 1;
}
tcp_rcv_drain:return /self->in /
{
self->in = 0;
}
putnext: entry, canputnext: entry, tcp_timeout_cancel:entry / self->in /
{
self->times[probefunc] = timestamp;
}
putnext: return, canputnext: return, tcp_timeout_cancel:return
/self->in
&& (self->times[probefunc]) /
{
@[probefunc] = quantize (timestamp - self->times[probefunc]);
self->times[probefunc] = 0;
}
-------------------------------8<------------------------------
On systems with the fix for 4868863:
tcp_timeout_cancel
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 442
4096 |@@@@@ 77
8192 |@@@@@@@ 109
16384 | 0
canputnext
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 547
4096 |@@ 38
8192 |@ 19
16384 | 1
32768 | 0
65536 | 0
131072 | 0
262144 | 1
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 |@@@@@@@@@@@ 232
16777216 | 4
33554432 | 0
putnext
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@ 169
4096 |@@@@@@@@@ 196
8192 |@@@@@@@@ 169
16384 |@@@ 57
32768 | 10
65536 | 1
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 3
16777216 |@ 19
33554432 |@@@@@@@@@@ 219
67108864 |
(CONTINUED IN NEXT COMMENT)
(continued from previous comment)
(Notice the huge amount of time taken by putnext in the previous comment's dtrace output).
(The following is output from a system WITHOUT the fix for 4868863:
canputnext
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 44418
4096 | 285
8192 | 18
16384 | 2
32768 | 7
65536 | 0
tcp_timeout_cancel
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 43971
4096 |@ 747
8192 | 9
16384 | 0
32768 | 2
65536 | 0
putnext
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@ 7835
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 32715
8192 |@@@ 3772
16384 | 320
32768 | 74
65536 | 11
131072 | 1
262144 | 2
524288 | 0
(Here is what I would expect putnext's time distribution to look like on a system that is not cpu-saturated (as is the case in these tests) -- small spikes of 131k-262k nsecs, but mostly 4k-8k nsecs).
To zero-in on 4868863, I brought over the s10u5 build 3 snapshot, did a full build and BFU'ed the test system with the build 3 non-debug archives, verified the problem was present, then backed out the fix for 4868863, and rebuilt genunix ONLY, copied over genunix, rebooted, and reran the tests. The packet throughput was consistent with s10u4's throughput, even after 2 hours of testing (average time to poor performance with the fix for 4868863 is 15 minutes).
Looking at one of the corrupt_tcp server process's stream receive-side, a very large number of mblks are linked together (via b_cont):
> 0xffffffffa1a2c558::print queue_t
{
q_qinfo = strdata
q_first = 0xfffffe890a17c080
q_last = 0xfffffe890a17c080
q_next = 0
...
> 0xfffffe890a17c080::list mblk_t b_cont ! wc -l
150670
Analyzing the sizes of the individual mblks, there are around 144,200 of size 1460, but there are also sizes that range CONTINUOUSLY across the interval [1..1460) (multiples of each size). Given that linkb() traverses the linked lists of mblks linearly, O(n^2) for a list of size 150,000 could easily lead to long putback() times.
This appears to reproduce a lot easier on Solaris 10 (and the worst-case appears worse also -- the packet rate devolves to around 100-200/second on a 4-cpu system with 3 e1000g's under test), e.g.: ("netstat -ni 1" output:
3 0 0 0 0 108 0 53 0 0
2 0 0 0 0 109 0 56 0 0
1 0 0 0 0 159 0 81 0 0
2 0 0 0 0 116 0 60 0 0
2 0 0 0 0 158 0 90 0 0
2 0 0 0 0 161 0 72 0 0
1 0 0 0 0 112 0 50 0 0
1 0 0 0 0 165 0 83 0 0
1 0 0 0 0 134 0 69 0 0
3 0 2 0 0 158 0 79 0 0
1 0 0 0 0 133 0 68 0 0
1 0 0 0 0 110 0 68 0 0
2 0 0 0 0 186 0 79 0 0
2 0 0 0 0 162 0 84 0 0
3 0 0 0 0 164 0 74 0 0
1 0 0 0 0 149 0 99 0 0
3 0 0 0 0 172 0 74 0 0
1 0 0 0 0 169 0 92 0 0
2 0 0 0 0 138 0 66 0 0
1 0 0 0 0 130 0 68 0 0
2 0 0 0 0 188 0 88 0 0
1 0 0 0 0 148 0 82 0 0
1 0 0 0 0 150 0 69 0 0
1 0 0 0 0 150 0 72 0 0
). (This is with 127112-10 installed on top of s10u4 on an X4500) I've saved a live dump of the system while in this state. It's available at /net/bergsoft.west/opt/dump/x4500. From that dump, looking at the receive queue of one of the corrupt_tcp processes:
> 0xffffffffb305f590::print queue_t
{
q_qinfo = strdata
q_first = 0
q_last = 0
q_next = 0
q_link = 0
q_ptr = 0xffffffffb3067c98
q_count = 0x1d8347b3
q_flag = 0x4403c
q_minpsz = 0
q_maxpsz = 0xffffffffffffffff
q_hiwat = 0x40290
q_lowat = 0x400
q_bandp = 0
q_lock = {
_opaque = [ 0xffffffffb30f77e0 ]
}
q_stream = 0xffffffffb3067c98
q_syncq = 0xffffffffb305f780
q_nband = 0
q_wait = {
_opaque = 0
}
q_sync = {
_opaque = 0
}
q_nfsrv = 0xffffffffb305f590
q_nbsrv = 0xffffffffb305f300
q_draining = 0
q_struiot = 0xffff
q_syncqmsgs = 0
q_mblkcnt = 0x54432
q_sqhead = 0
q_sqtail = 0
...
q_count is 495,142,835 (!!!!) and q_mblkcnt is 345,138. Something is seriously interfering with proper flow control :).
|