From: Bruce Evans
Date: Fri, 19 Jan 2007 18:14:21 +1100 (EST)
To: Peter Jeremy
Cc: Maxim Sobolev, Ivan Voras, Attilio Rao, freebsd-current@freebsd.org,
    freebsd-arch@freebsd.org
Subject: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs
In-Reply-To: <20070118184650.GB845@turion.vk2pj.dyndns.org>
Message-ID: <20070119172335.O2216@besplex.bde.org>

On Fri, 19
Jan 2007, Peter Jeremy wrote:

> On Thu, 2007-Jan-18 18:03:20 +1100, Bruce Evans wrote:
>> On Wed, 17 Jan 2007, Matthew Dillon wrote:
>>>    Alignment is critical.  If the data is not aligned, don't bother.  128
>>>    bits means 16 byte alignment.
>>
>> The above benchmark output is for aligned data :-).  I don't try hard to
>> optimize or benchmark misaligned cases.
>
> How realistic is this?  Has anyone collected statistics on the size and
> alignment of bzero/bcopy calls?  How much of the time is the size known
> at compile time?

I think perfect alignment is very realistic.  If not, it is an application
bug :), just like for misaligned integer accesses on arches that allow
this.  In the kernel, other parts of the kernel are the application and
it is reasonable to require perfect alignment.  I recently did a dynamic
search for misaligned (but only 32-bit non-aligned) bxx's (maybe only
bzeros) in low-level network code and found only a couple.

For the original i586 FPU optimizations, I gathered statistics for
bcopy/bzero.  IIRC, alignment (64-bit?) was normal, at least for the
large copies of interest, and large bcopys were so uncommon that it was
a complete waste of time to optimize them (at least for my applications).
Large bzeros/copyins/copyouts are more common.

FreeBSD has some optimizations in low-level networking code for bcopys
with a small size that is known at compile time (just use gcc's builtin
memcpy).  These were lost to -ffreestanding and/or gcc's aggressive
optimization of things like printf using the builtin printf.
(-ffreestanding implies -fno-builtin, and no one cared enough about the
loss to turn builtins back on.  If you turn them back on, then they
should be turned on individually as recommended in gcc.info to avoid
conflicts.  This is easy enough for the memcpy builtin but messy if you
want all the old builtins starting with strlen.)  I looked at these lost
optimizations again while trying to optimize the low-level networking
code for packets-per-second.
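As a rough illustration of turning the memcpy builtin back on individually
under -ffreestanding, here is a minimal sketch in the macro style that the
gcc manual suggests.  The struct hdr type and the hdr_copy() helper are
hypothetical names invented for this example, not anything in the FreeBSD
tree, and the assert() is a userland stand-in for whatever checking the
kernel would really use.

```c
/*
 * Sketch only, assuming gcc.  -ffreestanding implies -fno-builtin, so a
 * plain memcpy() call is no longer treated as a builtin.  Mapping the
 * name onto __builtin_memcpy re-enables just this one builtin.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define memcpy(d, s, n) __builtin_memcpy((d), (s), (n))

/* Hypothetical 8-byte header, used only for illustration. */
struct hdr {
	uint32_t src;
	uint32_t dst;
};

/*
 * Copy a header.  The size is a compile-time constant, so the compiler
 * is free to expand the copy inline instead of calling a function.
 */
static inline void
hdr_copy(struct hdr *to, const struct hdr *from)
{
	assert(to != NULL && from != NULL);
	memcpy(to, from, sizeof(*to));
}
```

Whether the inline expansion is any faster than a function call is a
separate question that has to be measured.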
The difficulty of implementing memcpy/bcopy perfectly is shown by gcc's
builtin not being very close to getting it right for small fixed sizes
even with -march=...

I lost interest in this for now when I found that optimizations were
impossible to measure because the packet rate depends mysteriously on
the layout of the text section.  My changes may have given +10%, but
unrelated changes gave +-30%.  The most mysterious one was -20% when a
cvs update added ~500 bytes of object code that is never executed.
Using builtin memcpy didn't have a noticeable effect here.

Bruce