Description
bcopy and kcopy shouldn't use rep, smov
xxxxx@xxxxx.com 2005-06-29 12:27:27 GMT
Adding comments to the description to allow opensolaris.org contributors to view
this information.
OpenSolaris viewers can see the source for the x86 64-bit and 32-bit versions of
bcopy, kcopy, kcopy_nta, xcopyout_nta, and xcopyin_nta in copy.s here:
http://src.opensolaris.org/source/xref/onnv/aside/usr/src/uts/intel/ia32/ml/copy.s
As of 2007-05-10, kcopy_nta, xcopyout_nta, and xcopyin_nta have a 16-byte
"NTA_ALIGN_SIZE" alignment restriction. That needs to be fixed.
Is the requirement simply that these _nta versions work with 8-byte aligned
data, or do they need to be able to work with any alignment data?
There is no bcopy_nta routine in copy.s. There needs to be an nta version of bcopy. This is covered by CR 5070897.
xxxxx@xxxxx.com 2007-05-10 15:51:52 GMT
-------------------------------------------------------------------------
kcopy and bcopy are very popular routines in the kernel, yet they are implemented using:
	rep
	  smovq
	movq	%rdx, %rcx
	andq	$7, %rcx		/* bytes left over */
	rep
	  smovb
	xorl	%eax, %eax		/* return 0 (success) */
Conventional wisdom holds that this is suboptimal; see also bug 6281624.
I've implemented the libc/amd64 version, and this gives a ~25% reduction in cycles
to perform the same operation. It will also reduce cache pollution, as it uses nta
(non-temporal) moves for larger copies.
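For readers less familiar with the string instructions, the rep smovq / rep smovb sequence quoted above can be modelled roughly in C as below. This is only a sketch: the function name rep_style_copy is hypothetical, and the step that loads the quadword count (count / 8) before the first rep is not shown in the quoted fragment and is assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical C model of the quoted sequence: move count / 8 quadwords
 * (the "rep smovq" part), then the count & 7 leftover bytes (the
 * "rep smovb" part). Returns 0 on success, like "xorl %eax, %eax".
 */
static int
rep_style_copy(const void *from, void *to, size_t count)
{
	const unsigned char *src = from;
	unsigned char *dst = to;
	size_t words = count >> 3;	/* assumed: number of 8-byte moves */
	size_t tail = count & 7;	/* bytes left over */

	assert(from != NULL && to != NULL);

	while (words-- > 0) {
		uint64_t w;

		memcpy(&w, src, sizeof (w));	/* alignment-safe 8-byte load */
		memcpy(dst, &w, sizeof (w));	/* alignment-safe 8-byte store */
		src += 8;
		dst += 8;
	}
	while (tail-- > 0)
		*dst++ = *src++;

	return (0);
}
```

The model says nothing about performance; it only shows the work split that the two rep prefixes perform.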
Instrumented netperf, comparing onnv_17 against onnv_17 with the better kcopy:

Mean                    757049760.1     568378224
% Change                0.00%           -24.92%
Number of Runs          10              10
Std.Dev.                71576824.01     34645049.23
Std.Dev. %              9.45%           6.10%
False Alarm Level       10.00%
95% Confidence (+/-)    6.86%
T test                  0.00%

gate            new_kcopy
840840739       627030562
646927121       562845459
649583946       545172919
720428863       547099024
835514750       633144991
759572881       545403742
831075883       539929209
739690876       546986514
745233450       554840450
801629092       581329370

This was the amount of time spent in kcopy while running a network benchmark.
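The Mean and % Change rows can be rechecked from the ten raw runs listed above. The following is a small sketch; the helper names mean and pct_change are made up for illustration, and only the data values come from the report.

```c
#include <assert.h>
#include <stddef.h>

/* Raw per-run totals copied from the gate and new_kcopy columns. */
static const double gate[] = {
	840840739, 646927121, 649583946, 720428863, 835514750,
	759572881, 831075883, 739690876, 745233450, 801629092
};
static const double new_kcopy[] = {
	627030562, 562845459, 545172919, 547099024, 633144991,
	545403742, 539929209, 546986514, 554840450, 581329370
};

/* Arithmetic mean of n samples (hypothetical helper). */
static double
mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return (sum / (double)n);
}

/* Relative change of value against base, in percent (hypothetical helper). */
static double
pct_change(double base, double value)
{
	return ((value - base) / base * 100.0);
}
```

Calling pct_change(mean(gate, 10), mean(new_kcopy, 10)) should reproduce the roughly -24.9% figure reported in the table.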
I also used the lockstat kernel profile to measure the reduction in the number of
times the PC was caught in kcopy:

Mean                    515.3           396.3
% Change                0.00%           -23.09%
Variance                8777.04         1872.56
Number of Runs          30              30
Std.Dev.                93.69           43.27
Std.Dev. %              18.18%          10.92%
False Alarm Level       10.00%
95% Confidence (+/-)    7.29%
T test                  0.00%

gate            new_kcopy
518             398
644             411
395             337
411             491
470             355
557             354
402             353
369             422
432             369
451             366
381             432

Extra data removed for brevity. The two measurements line up rather nicely.
xxxxx@xxxxx.com 2005-06-29 12:27:27 GMT
Would you please be more specific on how you instrumented netperf?
netperf should already be able to take advantage of the new kcopy_nta.
The thresholds could probably be fine tuned some more to improve
netperf further.
The tricky part of modifying bcopy is that it is a very popular
routine. I examined many of the calls to bcopy, and the sizes vary
dramatically, many of them very small. Simply changing bcopy in
similar ways as I had done with kcopy_nta would certainly penalize more
than half of the callers. I was thinking about implementing bcopy_nta,
so callers can consciously take advantage of the nta effect. Obviously
I haven't gotten around to it.
xxxxx@xxxxx.com 2005-06-29 16:34:29 GMT
It was iperf, sorry
I instrumented it thus:
#!/usr/sbin/dtrace -Cs

/* Record the entry time of each copy routine called by iperf. */
kcopy:entry, kcopy_nta:entry, xcopyout_nta:entry, xcopyin_nta:entry
/execname == "iperf"/
{
	self->size = arg2;
	self->kcopy = 1;
	self->in = vtimestamp;
}

/* On return, accumulate the on-CPU time and count the calls. */
kcopy:return, kcopy_nta:return, xcopyout_nta:return, xcopyin_nta:return
/self->kcopy == 1/
{
	self->in2 = vtimestamp;
	self->kcopy = 0;
	@d["kcopy"] = sum(self->in2 - self->in);
	@a["total"] = count();
}
which gave me my ~22+%.
The profile of the copies is:
nta copies: 1296 bytes
normal copies: 1460 bytes (lots of them!); it looks like one for each packet, hence the 1460 number
xxxxx@xxxxx.com 2005-06-30 15:51:52 GMT
That's it. I have a 16-byte size alignment requirement to apply nta,
which 1460 doesn't fit. At one point I had a more sophisticated
implementation that took care of such cases, but the additional comparison
penalized other benchmarks. I guess it is time to figure out a good
solution.
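To make the arithmetic concrete: 1296 is 81 * 16, while 1460 leaves a remainder of 4, so only the former passes a 16-byte size check. The sketch below assumes the check is a simple mask test on the size; NTA_ALIGN_SIZE is the name used in this report, while NTA_ALIGN_MASK, size_fits_nta, and nta_tail are names assumed here for illustration and are not claimed to match copy.s.

```c
#include <assert.h>
#include <stddef.h>

#define	NTA_ALIGN_SIZE	16			/* from the report */
#define	NTA_ALIGN_MASK	(NTA_ALIGN_SIZE - 1)	/* assumed helper macro */

/* Nonzero if count is a multiple of NTA_ALIGN_SIZE (hypothetical check). */
static int
size_fits_nta(size_t count)
{
	/* The mask trick only works for a power-of-two alignment. */
	assert((NTA_ALIGN_SIZE & NTA_ALIGN_MASK) == 0);
	return ((count & NTA_ALIGN_MASK) == 0);
}

/* Bytes that would be left over after the last full 16-byte chunk. */
static size_t
nta_tail(size_t count)
{
	return (count & NTA_ALIGN_MASK);
}
```

One possible fix direction, not confirmed by this thread, would be to copy the nta_tail(count) bytes with ordinary moves and use nta for the rest; the cost of the extra comparison is exactly the concern raised above.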
xxxxx@xxxxx.com 2005-06-30 17:49:00 GMT
There are 3 bugs being discussed here. This CR is being split into 3 CRs.
1) bcopy can use NTA instructions for larger sizes.
This issue is covered by CR 5070897.
2) bcopy should not use rep, smov.
This CR 6292199 will fix this issue.
3) kcopy_nta does not handle sizes that are not multiples of 16 bytes.
New CR 6774407 has been filed to track this issue.
Extensive macrobenchmark and microbenchmark work has shown that
using a different algorithm than rep, smov works better on modern 64-bit
x86 systems. The following algorithm was settled on for 64-bit kcopy and
bcopy after extensive testing on both AMD and Intel 64-bit systems, from the
earliest 64-bit processors through the latest engineering samples.
If size >= BCOPY_DFLT_REP (768 on Nehalem and 128 on other processors)
    Move data with rep smovq loop
Else If size >= 80 bytes
    Align destination on 8 bytes and
    do loops of 64-byte copies using 8-byte registers.
Else // size < 80 bytes
    Jump to optimal unrolled move code.
There are different alignment versions of the unrolled move code.
The unrolled code for small sizes was a win in benchmarks and concurrent
testing despite the slightly increased ICache footprint.
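The size dispatch described above can be sketched in C as follows. This is an illustration of the decision logic only, not the assembly in copy.s: BCOPY_DFLT_REP and the 128, 768, and 80 values come from the text, while BCOPY_NEHALEM_REP, BCOPY_UNROLLED_MAX, the enum, and choose_copy_strategy are names assumed here.

```c
#include <assert.h>
#include <stddef.h>

#define	BCOPY_DFLT_REP		128	/* default threshold from the text */
#define	BCOPY_NEHALEM_REP	768	/* assumed name; value from the text */
#define	BCOPY_UNROLLED_MAX	80	/* assumed name; sizes below this unroll */

enum copy_strategy {
	COPY_UNROLLED,		/* size < 80: jump to unrolled move code */
	COPY_ALIGNED_LOOP,	/* align dest on 8 bytes, 64 bytes per loop */
	COPY_REP_SMOVQ		/* large copies: rep smovq loop */
};

/*
 * Pick a strategy for a copy of the given size. rep_threshold is
 * BCOPY_DFLT_REP or BCOPY_NEHALEM_REP depending on the processor.
 */
static enum copy_strategy
choose_copy_strategy(size_t size, size_t rep_threshold)
{
	assert(rep_threshold >= BCOPY_UNROLLED_MAX);

	if (size >= rep_threshold)
		return (COPY_REP_SMOVQ);
	if (size >= BCOPY_UNROLLED_MAX)
		return (COPY_ALIGNED_LOOP);
	return (COPY_UNROLLED);
}
```

For example, a 1460-byte copy would take the rep smovq path with either threshold, while a 128-byte copy would take it only with the default threshold.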