From owner-freebsd-hackers@FreeBSD.ORG Fri Apr 1 18:06:07 2005
Date: Fri, 1 Apr 2005 10:04:14 -0800 (PST)
From: Matthew Dillon
Message-Id: <200504011804.j31I4Ens059405@apollo.backplane.com>
To: Bruce Evans
References: <423C15C5.6040902@fsn.hu>
	<20050327133059.3d68a78c@Magellan.Leidinger.net>
	<5bbfe7d405032823232103d537@mail.gmail.com>
	<424A23A8.5040109@ec.rr.com>
	<20050330130051.GA4416@VARK.MIT.EDU>
	<200504010315.j313FGLn056122@apollo.backplane.com>
	<20050401215011.R24396@delplex.bde.org>
Cc: Peter Jeremy, David Schultz, hackers@freebsd.org, jason henson,
	bde@freebsd.org
Subject: Re: Fwd: 5-STABLE kernel build with icc broken
List-Id: Technical Discussions relating to FreeBSD

:>    The use of the XMM registers is a cpu optimization.  Modern CPUs,
:>    especially AMD Athlon and Opterons, are more efficient with 128 bit
:>    moves than with 64 bit moves.  I experimented with all sorts of
:>    configurations, including the use of special data caching
:>    instructions, but they had so many special cases and degenerate
:>    conditions that I found that simply using straight XMM instructions,
:>    reading as big a glob as possible, then writing the glob, was by far
:>    the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)
:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

    Yah, I'm pretty sure.  I tested the fully cached (L1), partially
    cached (L2), and the fully uncached cases.  I don't have a logic
    analyzer, but what I think is happening is that the cpu's write
    buffer is messing around with the reads and causing extra RAS cycles
    to occur.
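
    For concreteness, here is a minimal user-space sketch of the
    glob-style XMM copy being described, written with SSE2 intrinsics
    rather than the kernel's hand-rolled assembly.  The function name
    xmm_glob_copy_128 is made up for this sketch; it assumes 16-byte
    aligned, non-overlapping buffers and a length that is a multiple of
    128 bytes, and it only illustrates the idea, it is not the actual
    bcopy code.

/*
 * Sketch: copy in 128-byte globs through eight 128-bit XMM values.
 * All the loads are issued first, then all the stores, mirroring the
 * "read a big glob, then write it" scheme described above.
 * Assumes SSE2, 16-byte-aligned non-overlapping buffers, and
 * len % 128 == 0.  Build with e.g. "cc -O2 -msse2".
 */
#include <emmintrin.h>
#include <stddef.h>

void
xmm_glob_copy_128(void *dst, const void *src, size_t len)
{
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;

        while (len >= 128) {
                /* Read the whole 128-byte glob first... */
                __m128i r0 = _mm_load_si128(s + 0);
                __m128i r1 = _mm_load_si128(s + 1);
                __m128i r2 = _mm_load_si128(s + 2);
                __m128i r3 = _mm_load_si128(s + 3);
                __m128i r4 = _mm_load_si128(s + 4);
                __m128i r5 = _mm_load_si128(s + 5);
                __m128i r6 = _mm_load_si128(s + 6);
                __m128i r7 = _mm_load_si128(s + 7);
                /* ...then write it back out in one burst. */
                _mm_store_si128(d + 0, r0);
                _mm_store_si128(d + 1, r1);
                _mm_store_si128(d + 2, r2);
                _mm_store_si128(d + 3, r3);
                _mm_store_si128(d + 4, r4);
                _mm_store_si128(d + 5, r5);
                _mm_store_si128(d + 6, r6);
                _mm_store_si128(d + 7, r7);
                s += 8;
                d += 8;
                len -= 128;
        }
}

    Whether the compiler actually keeps all eight values in
    %xmm0-%xmm7 is up to its register allocator, which is one reason
    the kernel routine is written in assembly in the first place.
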
    I also tested using various combinations of movdqa, movntdq, and
    prefetchnta.  Carefully arranged non-temporal and/or prefetch
    instructions were much faster for the uncached case, but much, MUCH
    slower for the partially cached (L2) or fully cached (L1) case,
    making them unsuitable for a generic copy.  I am rather miffed that
    AMD screwed up the non-temporal instructions so badly.

    I also think there might be some odd instruction pipeline effects
    that skew the results when only one or two instructions are between
    the load into an %xmm register and the store from the same register.
    I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
    work the best.

    Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
    of the first Athlon 64's, so it has a 1MB L2 cache.)

:>    The key for fast block copying is to not issue any memory writes
:>    other than those related directly to the data being copied.  This
:>    avoids unnecessary RAS cycles which would otherwise kill copying
:>    performance.  In tests I found that copying multi-page blocks in a
:>    single loop was far more efficient than copying data page-by-page,
:>    precisely because page-by-page copying was too complex to be able
:>    to avoid extraneous writes to memory unrelated to the target buffer
:>    in between each page copy.
:
:By page-by-page, do you mean prefetch a page at a time into the L1
:cache?

    No, I meant that taking, e.g., a vm_page_t array and doing
    page-by-page mappings, copying in 4K chunks, seems to be a lot
    slower than doing a linear mapping of the entire vm_page_t array and
    doing one big copy.  Literally the same code, just rearranged a bit.
    Just writing to the stack in between each page was enough to throw
    it off.

:I've noticed strange loss (apparently) from extraneous reads or writes
:more for benchmarks that do just (very large) writes.  On at least old
:Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
:(unless you use movntq on AthlonXP's).  The caches are flushed to main
:memory some time later, apparently not very well, since some pages take
:more than twice as long to write as others (as seen by the writer
:filling the caches), and the slow case happens enough to affect the
:average write speed by up to 50%.  This problem can be reduced by
:putting memory bank bits in the page colors.  This is hard to get right
:even for the simple unrepresentative case of large writes.
:
:Bruce
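
    For comparison, here is a sketch of the same loop shape with
    non-temporal (movntdq/movntq-style) stores, which is the variant
    that wins the fully uncached benchmarks mentioned above but hurts
    the cached cases.  The name xmm_nt_copy_64 is made up for this
    sketch, the 64-byte glob is an arbitrary choice, and it assumes
    16-byte-aligned, non-overlapping buffers and a length that is a
    multiple of 64.  _mm_stream_si128 compiles to movntdq, and the
    trailing sfence orders the weakly-ordered streaming stores.

/*
 * Sketch: non-temporal variant.  Streaming stores bypass the caches
 * on the write side, which helps the fully uncached case and hurts
 * the partially/fully cached cases, as discussed in this thread.
 */
#include <emmintrin.h>
#include <stddef.h>

void
xmm_nt_copy_64(void *dst, const void *src, size_t len)
{
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;

        while (len >= 64) {
                __m128i r0 = _mm_load_si128(s + 0);
                __m128i r1 = _mm_load_si128(s + 1);
                __m128i r2 = _mm_load_si128(s + 2);
                __m128i r3 = _mm_load_si128(s + 3);
                /* movntdq-style stores: do not allocate cache lines. */
                _mm_stream_si128(d + 0, r0);
                _mm_stream_si128(d + 1, r1);
                _mm_stream_si128(d + 2, r2);
                _mm_stream_si128(d + 3, r3);
                s += 4;
                d += 4;
                len -= 64;
        }
        /* Make the streaming stores globally visible in order. */
        _mm_sfence();
}

    The choice between _mm_store_si128 and _mm_stream_si128 here is
    exactly the movdqa-versus-movntdq choice being debated in this
    thread.
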
    I've seen the same effects and come to the same conclusion.

    The copy code I eventually settled on was this (taken from my
    i386/bcopy.s).  It isn't as fast as using movntdq for the fully
    uncached case, but it seems to perform the best in the system
    because in real life the data tends to be accessed and already in
    the cache by someone (e.g. source data tends to be in the cache even
    if the device driver doesn't touch the target data).  I wish AMD had
    made movntdq work the same as movdqa for the case where the data was
    already in the cache; then movntdq would have been the clear winner.

    The prefetchnta I have commented out seemed to improve performance,
    but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
    for cpus with MMX but without 3dNOW.  Prefetching less than 128
    bytes did not help, and prefetching greater than 128 bytes (e.g.
    256(%esi)) seemed to cause extra RAS cycles.  It was unbelievably
    finicky, not at all what I expected.

    [ mmx_save_block does a 2048 check on the length and the FPU setup
      and kernel fpu lock bit ]

ENTRY(asm_xmm_bcopy)
        MMX_SAVE_BLOCK(asm_generic_bcopy)
        cmpl    %esi,%edi       /* if (edi < esi) fwd copy ok */
        jb      1f
        addl    %ecx,%esi
        cmpl    %esi,%edi       /* if (edi < esi + count) do bkwrds copy */
        jb      10f
        subl    %ecx,%esi
1:
        movl    %esi,%eax       /* skip xmm if the data is not aligned */
        andl    $15,%eax
        jnz     5f
        movl    %edi,%eax
        andl    $15,%eax
        jz      3f
        jmp     5f

        SUPERALIGN_TEXT
2:
        movdqa  (%esi),%xmm0
        movdqa  16(%esi),%xmm1
        movdqa  32(%esi),%xmm2
        movdqa  48(%esi),%xmm3
        movdqa  64(%esi),%xmm4
        movdqa  80(%esi),%xmm5
        movdqa  96(%esi),%xmm6
        movdqa  112(%esi),%xmm7
        /*prefetchnta 128(%esi) 3dNOW */
        addl    $128,%esi

        /*
         * movdqa or movntdq can be used.
         */
        movdqa  %xmm0,(%edi)
        movdqa  %xmm1,16(%edi)
        movdqa  %xmm2,32(%edi)
        movdqa  %xmm3,48(%edi)
        movdqa  %xmm4,64(%edi)
        movdqa  %xmm5,80(%edi)
        movdqa  %xmm6,96(%edi)
        movdqa  %xmm7,112(%edi)
        addl    $128,%edi
3:
        subl    $128,%ecx
        jae     2b
        addl    $128,%ecx
        jz      6f
        jmp     5f

        [ fall through to loop to handle blocks less than 128 bytes ]

        SUPERALIGN_TEXT
4:
        movq    (%esi),%mm0
        movq    8(%esi),%mm1
        movq    16(%esi),%mm2
        movq    24(%esi),%mm3
        ...

10:
        [ backwards copy code ... ]

                                        -Matt
                                        Matthew Dillon
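
    As a closing aside, here is a throwaway user-space harness for
    sanity-checking sketches like the two above against memcpy.  It
    assumes both sketch functions are compiled into the same program,
    relies on POSIX posix_memalign for 16-byte alignment, and has
    nothing to do with the original bcopy.s.

/*
 * Throwaway correctness harness for the copy sketches above.
 * Build with e.g. "cc -O2 -msse2 glob.c nt.c harness.c".
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void xmm_glob_copy_128(void *, const void *, size_t);
void xmm_nt_copy_64(void *, const void *, size_t);

int
main(void)
{
        size_t len = 1024 * 1024;       /* a multiple of both 128 and 64 */
        void *src, *dst, *ref;
        size_t i;

        /* movdqa/movntdq-style accesses require 16-byte alignment. */
        if (posix_memalign(&src, 16, len) != 0 ||
            posix_memalign(&dst, 16, len) != 0 ||
            posix_memalign(&ref, 16, len) != 0)
                return (1);

        for (i = 0; i < len; ++i)
                ((unsigned char *)src)[i] = (unsigned char)i;
        memcpy(ref, src, len);

        xmm_glob_copy_128(dst, src, len);
        printf("glob copy: %s\n", memcmp(dst, ref, len) ? "MISMATCH" : "ok");

        memset(dst, 0, len);
        xmm_nt_copy_64(dst, src, len);
        printf("nt copy:   %s\n", memcmp(dst, ref, len) ? "MISMATCH" : "ok");

        free(src);
        free(dst);
        free(ref);
        return (0);
}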