From owner-freebsd-hackers@FreeBSD.ORG Fri Apr 1 13:20:34 2005
Date: Fri, 1 Apr 2005 23:20:12 +1000 (EST)
From: Bruce Evans
X-X-Sender: bde@delplex.bde.org
To: Matthew Dillon
cc: Peter Jeremy, David Schultz, hackers@freebsd.org, jason henson, bde@freebsd.org
Subject: Re: Fwd: 5-STABLE kernel build with icc broken
In-Reply-To: <200504010315.j313FGLn056122@apollo.backplane.com>
Message-ID: <20050401215011.R24396@delplex.bde.org>
References: <423C15C5.6040902@fsn.hu> <20050327133059.3d68a78c@Magellan.Leidinger.net> <5bbfe7d405032823232103d537@mail.gmail.com> <424A23A8.5040109@ec.rr.com> <20050330130051.GA4416@VARK.MIT.EDU> <200504010315.j313FGLn056122@apollo.backplane.com>
List-Id: Technical Discussions relating to FreeBSD

On Thu, 31 Mar 2005, Matthew Dillon wrote:

I didn't mean to get into the kernel's use of the FPU, but...
> All I really did was implement a comment that DG had made many years
> ago in the PCB structure about making the FPU save area a pointer rather
> than hardwiring it into the PCB.

ISTR writing something like that.  dg committed most of my early work
since I didn't have commit access at the time.

>...
> The use of the XMM registers is a cpu optimization.  Modern CPUs,
> especially AMD Athlons and Opterons, are more efficient with 128-bit
> moves than with 64-bit moves.  I experimented with all sorts of
> configurations, including the use of special data caching instructions,
> but they had so many special cases and degenerate conditions that
> I found that simply using straight XMM instructions, reading as big
> a glob as possible, then writing the glob, was by far the best solution.

Are you sure about that?  The amd64 optimization manual says (essentially)
that big globs are bad, and my benchmarks confirm this.  The best glob
size is 128 bits according to my benchmarks.  This can be obtained using
2 64-bit reads of 64-bit registers followed by 2 64-bit writes of these
registers, or by a read-write of a single 128-bit register.  The 64-bit
registers can be either MMX or integer registers on 64-bit systems, but
the 128-bit registers must be XMM on all systems.

I get identical speeds of 12.9GB/sec (+-0.1GB/sec) on a fairly old and
slow Athlon64 system for copying 16K (fully cached) through MMX and XMM
128 bits at a time using the following instructions:

	# MMX:				# XMM:
	movq	(%0),%mm0		movdqa	(%0),%xmm0
	movq	8(%0),%mm1		movdqa	%xmm0,(%1)
	movq	%mm0,(%1)		...	# unroll same amount
	movq	%mm1,8(%1)
	...	# unroll to copy 64 bytes per iteration

Unfortunately (since I want to avoid using both MMX and XMM), I haven't
managed to make copying through 64-bit integer registers work as well.
Copying 128 bits at a time using 2 pairs of movq's through integer
registers gives only 7.9GB/sec.  movq through MMX is never that slow.
However, movdqu through XMM is even slower (7.4GB/sec).
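For readers who don't want to decode the asm: the 128-bits-at-a-time copy
above can be sketched in C with SSE2 intrinsics, which compile to the same
movdqa loads and stores.  This is a minimal illustration, not code from the
thread; the function name and the 64-bytes-per-iteration unroll are my own
choices, and it assumes both buffers are 16-byte aligned and the length is
a multiple of 64.

```c
#include <emmintrin.h>	/* SSE2: _mm_load_si128/_mm_store_si128 (movdqa) */
#include <stddef.h>

/*
 * Copy len bytes (len a multiple of 64, src and dst 16-byte aligned),
 * reading and writing 128 bits per instruction and 64 bytes per loop
 * iteration, mirroring the unrolled movdqa loop quoted above.
 */
static void
copy128(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 64; i++) {
		/* Read a 64-byte glob... */
		__m128i r0 = _mm_load_si128(s + 0);
		__m128i r1 = _mm_load_si128(s + 1);
		__m128i r2 = _mm_load_si128(s + 2);
		__m128i r3 = _mm_load_si128(s + 3);
		/* ...then write it. */
		_mm_store_si128(d + 0, r0);
		_mm_store_si128(d + 1, r1);
		_mm_store_si128(d + 2, r2);
		_mm_store_si128(d + 3, r3);
		s += 4;
		d += 4;
	}
}
```

The aligned forms trap on misaligned pointers; the unaligned movdqu
variant (_mm_loadu_si128/_mm_storeu_si128) avoids that but, as noted
above, benchmarks measurably slower.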
The fully cached case is too unrepresentative of normal use, and normal
(partially cached) use is hard to benchmark, so I normally benchmark the
fully uncached case.  For that, movnt* is best for benchmarks but not
for general use, and it hardly matters which registers are used.

> The key for fast block copying is to not issue any memory writes other
> than those related directly to the data being copied.  This avoids
> unnecessary RAS cycles which would otherwise kill copying performance.
> In tests I found that copying multi-page blocks in a single loop was
> far more efficient than copying data page-by-page precisely because
> page-by-page copying was too complex to be able to avoid extraneous
> writes to memory unrelated to the target buffer in between each page copy.

By page-by-page, do you mean prefetch a page at a time into the L1 cache?

I've noticed strange losses, apparently from extraneous reads or writes,
more for benchmarks that do just (very large) writes.  On at least old
Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
(unless you use movntq on AthlonXPs).  The caches are flushed to main
memory some time later, apparently not very well, since some pages take
more than twice as long to write as others (as seen by the writer
filling the caches), and the slow case happens often enough to affect
the average write speed by up to 50%.  This problem can be reduced by
putting memory bank bits in the page colors.  This is hard to get right
even for the simple unrepresentative case of large writes.

Bruce
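[The movnt* variant discussed above, for the uncached case, can be
sketched with SSE2 intrinsics as follows.  Again this is only an
illustration with made-up names, assuming 16-byte-aligned buffers and a
length that is a multiple of 16; the non-temporal stores bypass the
caches, which is what makes them good for write-once data and bad for
general use.]

```c
#include <emmintrin.h>	/* SSE2: _mm_stream_si128 (movntdq), _mm_sfence */
#include <stddef.h>

/*
 * Copy len bytes (len a multiple of 16, src and dst 16-byte aligned)
 * using non-temporal stores, so the destination is not pulled into
 * the caches and evicts nothing on the way through.
 */
static void
copy128_nt(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++)
		_mm_stream_si128(d + i, _mm_load_si128(s + i));
	/* Non-temporal stores are weakly ordered; fence before the
	 * data is handed to anyone else. */
	_mm_sfence();
}
```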