From: Matthew Dillon
Date: Wed, 17 Jan 2007 12:22:34 -0800 (PST)
Message-Id: <200701172022.l0HKMYV8053837@apollo.backplane.com>
To: "Attilio Rao"
Cc: freebsd-current@freebsd.org, Ivan Voras, freebsd-arch@freebsd.org
Subject: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs

The cost of using the FPU can simply be thought of in terms of how many bytes you have to copy for it to become worth using the FPU over a far less
complex integer copy loop. This is really easy to find out, and it is also fairly easy to instrument a sysctl to set the value used in the comparison and run benchmarks to determine at what point using the FP unit becomes the better choice.

* Saving the FP state. The kernel doesn't have to save or restore anything if userland was not using the floating point unit. In fact, the kernel doesn't even need to FNINIT! All the kernel needs to do is CLTS and FNCLEX to make the FP unit usable for media copy instructions, then set CR0_TS when it is finished. Gee, that's nice!

  But if on the other hand userland is using the floating point unit in between every system call, then having the kernel try to use it does require calling fxsave and clearing npxthread, which means serious inefficiencies if userland is using the FP unit heavily. Or, alternatively, it can fxsave AND restore the state when it is done at a total cost of around 70ns plus write bandwidth cruft.

  In fact, I would say that if userland is not using the FP unit, that is npxthread == NULL or npxthread != curthread, you should *DEFINITELY* use the FP unit. Hands down, no question about it.

* First, raw memory bandwidth is governed by RAS cycles. The fewer RAS cycles you have, the higher the bandwidth. This means that the more data you can load into the cpu on the 'read' side of the copy before transitioning to the 'write' side, the better. With XMM you can load 128 *BYTES* a shot (8 128 bit registers). For large copies, nothing beats it.

* Modern cpu hardware uses a 128 bit data path for 128 bit media instructions and can optimize the 128 bit operation all the way through to a cache line or to main memory. It can't be beat. Alignment is critical. If the data is not aligned, don't bother. 128 bits means 16 byte alignment.

* No extraneous memory writes, no uncached extraneous memory reads. If you do any writes to memory other than to the copy destination in your copy loop you screw up the cpu's write fifo and destroy performance.
  Systems are so sensitive to this that it is even better to spend the time linearly mapping large copy spaces into KVM and do a single block copy than to have an inner per-PAGE loop.

* Use of prefetch or use of movntdq instead of movdqa is highly problematic. It is possible to use these to optimize very particular cases but the problem is they tend to nerf all OTHER cases. I've given up trying to use either mechanism. Instead, I prefer copying as large a block as possible to remove these variables from the cpu pipeline entirely.

  The cpu has a write fifo anyway, so you don't need prefetch instructions if you can use instructions to write to memory faster than available L2 cache bandwidth. On some cpus this mandates the use of 64 or 128 bit media instructions, or the cpu can't keep the write FIFO full and starts interleaving reads and writes on the wrong boundaries (creating more RAS cycles, which is what kills copy bandwidth).

* RAS transitions also have to be aligned or you get boundary cases when the memory address transitions a RAS line. This again mandates maximal alignment (even more than 16 bytes, frankly, which is why being able to do 128 byte blocks with XMM registers is so nice). Even though reads and writes are reblocked to the cache line size by the cpu, your inner loop can still transition a RAS boundary in the middle of a large block read if it isn't aligned.

  But at this point the alignment requirements start to get kinda silly. 128 byte alignment requirement? I don't think so. I do a 16-byte alignment check in DragonFly as a pre-req for using XMM and that's it.

But, as I said in the beginning... all you need is just one variable. Copying data below that threshold is faster without the FP unit, copying data above that threshold is faster with the FP unit. Implement it, test it, and see how you fare.
If you are paranoid about having to save the FP state, then only use the FP unit when npxthread == NULL (no save required) or npxthread != curthread (save on behalf of a different thread required, which is ok)... It's that simple.

Pinning is an issue with FreeBSD, one whose effect I cannot comment on.

I don't know about AMD64. You only have 64 bit general registers in 64 bit mode so you may not be able to keep the write pipeline full. But you do have 8 of them so you are roughly equivalent to MMX (but not XMM's 8 128 bit registers).

					-Matt
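
For illustration only, the "one variable" decision described in the message can be sketched in plain userland C. This is not DragonFly or FreeBSD kernel code: `fpu_copy_threshold`, `choose_copy_path`, and the 4096 default are hypothetical names and a placeholder value, and `npxthread` / `curthread` are passed in as opaque pointers standing in for the kernel variables of the same name. It combines the three checks the message mentions: the size threshold, the 16-byte alignment pre-req, and the "paranoid" rule of skipping the FP unit when the current thread owns the FP state.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical tunable (in a kernel this would be exposed via a sysctl).
 * 4096 is a placeholder, not a measured value; the right number has to
 * come from benchmarks on the target hardware.
 */
static size_t fpu_copy_threshold = 4096;

enum copy_path { COPY_INTEGER, COPY_FPU };

/*
 * Decide which copy loop to use. npxthread and curthread are opaque
 * stand-ins for the kernel's "thread that owns the FP state" and
 * "currently running thread"; they are only compared, never dereferenced.
 */
static enum copy_path
choose_copy_path(const void *dst, const void *src, size_t len,
    const void *npxthread, const void *curthread)
{
	/* Below the threshold the plain integer loop is assumed to win. */
	if (len < fpu_copy_threshold)
		return COPY_INTEGER;

	/* 128-bit media instructions want 16-byte aligned operands. */
	if ((((uintptr_t)dst | (uintptr_t)src) & 15) != 0)
		return COPY_INTEGER;

	/*
	 * Paranoid policy: if the current thread owns live FP state,
	 * using the FP unit would force a save, so stay on the integer path.
	 */
	if (npxthread != NULL && npxthread == curthread)
		return COPY_INTEGER;

	/* npxthread == NULL or npxthread != curthread: use the FP unit. */
	return COPY_FPU;
}
```

With a helper like this, benchmarking comes down to varying `fpu_copy_threshold` and timing both paths over a range of copy sizes.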