OpenSolaris

Bug ID 6292199
Synopsis bcopy and kcopy shouldn't use rep, smov
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:amd64
Keywords SFO | onnv_triage
Responsible Engineer Bill Holler
Reported Against s10u8_01
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_106
Fixed In snv_106
Release Fixed solaris_nevada(snv_106) , solaris_10u8(s10u8_03) (Bug ID:2177807)
Related Bugs 6281624 , 6774407 , 6793191 , 2177961 , 6226737
Submit Date 29-June-2005
Last Update Date 16-June-2009
Description
bcopy and kcopy shouldn't use rep, smov
 xxxxx@xxxxx.com 2005-06-29 12:27:27 GMT
Adding comments to the description to allow opensolaris.org contributors to view
this information.

Opensolaris viewers can see the source x86 64-bit and 32-bit versions of
bcopy, kcopy, kcopy_nta, xcopyout_nta, xcopyin_nta in copy.s here:
http://src.opensolaris.org/source/xref/onnv/aside/usr/src/uts/intel/ia32/ml/copy.s


As of 2007-05-10 kcopy_nta, xcopyout_nta, and xcopyin_nta have a 16-byte
"NTA_ALIGN_SIZE" alignment restriction.  That needs to be fixed.
Is the requirement simply that these _nta versions work with 8-byte aligned
data, or do they need to be able to work with any alignment data?

There is no bcopy_nta routine in copy.s.  There needs to be an nta version of bcopy. This is covered by CR 5070897.

 xxxxx@xxxxx.com 2007-05-10 15:51:52 GMT
-------------------------------------------------------------------------

kcopy and bcopy are very popular routines in the kernel, yet they are implemented using 

        rep
          smovq

        movq    %rdx, %rcx
        andq    $7, %rcx                /* bytes left over */
        rep
          smovb
        xorl    %eax, %eax              /* return 0 (success) */

Conventional wisdom holds that this is suboptimal; see also bug 6281624.
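For readers less familiar with the string instructions, here is a rough C model of what the excerpt above does. It is only a sketch: the function name copy_rep_model is made up for illustration, and it assumes the first rep smovq runs with the count divided by 8 (that setup is not shown in the excerpt). The real routines are hand-written assembly in copy.s.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical C model of the rep smovq / rep smovb sequence.
 * Argument order follows bcopy(src, dst, count).
 */
void
copy_rep_model(const void *src, void *dst, size_t count)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t qwords = count >> 3;	/* assumed: count / 8 for rep smovq */
	size_t tail = count & 7;	/* andq $7, %rcx: bytes left over */

	/* rep smovq: move 8 bytes per iteration */
	while (qwords-- > 0) {
		uint64_t q;

		memcpy(&q, s, sizeof (q));
		memcpy(d, &q, sizeof (q));
		s += sizeof (q);
		d += sizeof (q);
	}

	/* rep smovb: move the remaining 0..7 bytes one at a time */
	while (tail-- > 0)
		*d++ = *s++;
}
```

The model only shows the split into an 8-byte phase and a byte phase; it says nothing about how fast the microcoded rep instructions are on a given processor, which is the actual complaint in this CR.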

I've implemented the libc/amd64 version, and it gives a ~25% reduction in cycles
to perform the same operation. It will also reduce cache pollution, since it uses
nta for larger copies.

instrumented netperf, comparing onnv_17 and onnv_17 with better kcopy

                        gate            new_kcopy
Mean                    757049760.1     568378224
% Change                0.00%           -24.92%
Number of Runs          10              10
Std.Dev.                71576824.01     34645049.23
Std.Dev. %              9.45%           6.10%

False Alarm Level       10.00%
95% Confidence (+/-)    6.86%
T test                  0.00%
		
	gate		new_kcopy
	840840739	627030562
	646927121	562845459
	649583946	545172919
	720428863	547099024
	835514750	633144991
	759572881	545403742
	831075883	539929209
	739690876	546986514
	745233450	554840450
	801629092	581329370
	
this was the amount of time spent in kcopy while running a network benchmark
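As a cross-check of the summary figures, the mean and percent change can be recomputed from the ten runs listed above. This is just a sketch; the function and array names are invented here, and the values are copied from the gate and new_kcopy columns.

```c
#include <stddef.h>

/* Per-run totals copied from the "gate" column above. */
const double gate_runs[10] = {
	840840739, 646927121, 649583946, 720428863, 835514750,
	759572881, 831075883, 739690876, 745233450, 801629092
};

/* Per-run totals copied from the "new_kcopy" column above. */
const double new_kcopy_runs[10] = {
	627030562, 562845459, 545172919, 547099024, 633144991,
	545403742, 539929209, 546986514, 554840450, 581329370
};

/* Arithmetic mean of n samples (n must be nonzero). */
double
mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return (sum / (double)n);
}

/* Change of value relative to base, in percent. */
double
percent_change(double base, double value)
{
	return ((value - base) / base * 100.0);
}
```

Calling percent_change(mean(gate_runs, 10), mean(new_kcopy_runs, 10)) gives roughly -24.92, matching the "% Change" row.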

I also used lockstat kernel profile to measure the reduction in number of times
the pc was caught in kcopy...

                        gate            new kcopy
Mean                    515.3           396.3
% Change                0.00%           -23.09%
Variance                8777.04         1872.56
Number of Runs          30              30
Std.Dev.                93.69           43.27
Std.Dev. %              18.18%          10.92%

False Alarm Level       10.00%
95% Confidence (+/-)    7.29%
T test                  0.00%
		
	gate	new kcopy
	518	398
	644	411
	395	337
	411	491
	470	355
	557	354
	402	353
	369	422
	432	369
	451	366
	381	432

Extra data removed for brevity. The two measurements line up rather nicely.

 xxxxx@xxxxx.com 2005-06-29 12:27:27 GMT				
	
Would you please be more specific on how you instrumented netperf?
netperf should already be able to take advantage of the new kcopy_nta.
The thresholds could probably be fine tuned some more to improve
netperf further.

The tricky part of modifying bcopy is that it is a very popular
routine.  I examined many of the calls to bcopy, and the sizes vary
dramatically, many of them very small.  Simply changing bcopy in
similar ways as I had done with kcopy_nta would certainly penalize more
than half of the callers.  I was thinking about implementing bcopy_nta,
so callers can consciously take advantage of the nta effect.  Obviously
I haven't gotten around to it.

 xxxxx@xxxxx.com 2005-06-29 16:34:29 GMT

It was iperf, sorry

I instrumented it thus:

#!/usr/sbin/dtrace -Cs


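/* On entry to any of the copy routines (iperf only), record the start time. */
kcopy:entry, kcopy_nta:entry, xcopyout_nta:entry, xcopyin_nta:entry
/execname == "iperf"/
{
        self->size = arg2;              /* copy length in bytes */
        self->kcopy = 1;
        self->in = vtimestamp;          /* on-CPU time at entry */
}

/* On return, accumulate the elapsed on-CPU time and count the call. */
kcopy:return, kcopy_nta:return, xcopyout_nta:return, xcopyin_nta:return
/self->kcopy == 1/
{
        self->in2 = vtimestamp;
        self->kcopy = 0;
        @d["kcopy"] = sum(self->in2 - self->in);
        @a["total"] = count();
}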

which gave me my ~22+ %

the profile of the copies is:

nta copies:	1296 bytes
normal copies:  1460 bytes (lots of them!) it looks like 1 for each packet, hence the 1460 number

 xxxxx@xxxxx.com 2005-06-30 15:51:52 GMT

That's it.  I have a 16-byte size alignment requirement to apply nta,
which 1460 doesn't fit.  At one point I had a more sophisticated
implementation that took care of such cases, but the additional comparison
penalized other benchmarks.  I guess it is time to figure out a good
solution.
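To make the arithmetic concrete: 1296 is a multiple of 16, but 1460 leaves a remainder of 4, so the per-packet copies miss the nta path. A minimal sketch of such a size check follows. The helper name nta_size_ok is invented for illustration, and the real routines in copy.s may apply further conditions (thresholds, address alignment) not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

#define	NTA_ALIGN_SIZE	16	/* 16-byte requirement quoted in this CR */

/*
 * Hypothetical helper: true when the byte count is a multiple of
 * NTA_ALIGN_SIZE and so satisfies the size rule for the nta path.
 */
bool
nta_size_ok(size_t count)
{
	return ((count & (NTA_ALIGN_SIZE - 1)) == 0);
}
```

With the sizes from the profile above, nta_size_ok(1296) is true and nta_size_ok(1460) is false.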

 xxxxx@xxxxx.com 2005-06-30 17:49:00 GMT
There are 3 bugs being discussed here.  This CR is being split into 3 CRs.

1) bcopy can use NTA instructions for larger sizes. 
   This issue is covered by CR 5070897.

2) bcopy should not use rep, smov.
   This CR 6292199 will fix this issue.

3) kcopy_nta does not handle sizes that are not multiples of 16 bytes.
   New CR 6774407 has been filed to track this issue.
Extensive macrobenchmark and microbenchmark work has shown that
using a different algorithm than rep, smov works better on modern 64-bit
x86 systems.  The following algorithm was settled on for 64-bit kcopy and
bcopy after extensive testing on both AMD and Intel 64-bit systems from the
earliest 64-bit processors through the latest engineering samples.

If size >= BCOPY_DFLT_REP (768 on Nehalem and 128 on other processors)
	Move data with rep smovq loop
Else If size >= 80 bytes
	Align destination on 8-bytes and
	Do loops of 64-byte copies using 8-byte registers.
Else // size < 80 bytes
	Jump to optimal unrolled move code.
	There are different alignment versions of the unrolled move code.

The unrolled code for small sizes was a win in benchmarks and concurrent
testing despite the slightly increased ICache footprint.
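
The size dispatch described above can be sketched in C as follows. This is an illustration only, not the delivered code, which is assembly. The names copy_path_t, choose_copy_path, align8_prefix and BCOPY_SMALL_MAX are invented here; BCOPY_DFLT_REP and the 768/128/80 values come from the description.

```c
#include <stddef.h>
#include <stdint.h>

#define	BCOPY_DFLT_REP	128	/* 768 on Nehalem, 128 on other processors */
#define	BCOPY_SMALL_MAX	80	/* hypothetical name for the 80-byte cutoff */

typedef enum {
	COPY_UNROLLED,		/* size < 80: unrolled move code */
	COPY_ALIGNED_LOOP,	/* 80 <= size < threshold: 64-byte loops */
	COPY_REP_SMOVQ		/* size >= threshold: rep smovq */
} copy_path_t;

/* Pick the copy strategy for a given size and rep threshold. */
copy_path_t
choose_copy_path(size_t size, size_t rep_threshold)
{
	if (size >= rep_threshold)
		return (COPY_REP_SMOVQ);
	if (size >= BCOPY_SMALL_MAX)
		return (COPY_ALIGNED_LOOP);
	return (COPY_UNROLLED);
}

/* Bytes to copy first so that the destination becomes 8-byte aligned. */
size_t
align8_prefix(uintptr_t dst)
{
	return ((size_t)((8 - (dst & 7)) & 7));
}
```

For example, with the default threshold a 1460-byte copy selects COPY_REP_SMOVQ, while a 100-byte copy takes the aligned-loop path and a 40-byte copy goes to the unrolled code.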
Work Around
N/A
Comments
N/A