From: Bruce Evans
Date: Sat, 6 Oct 2012 08:44:17 +1000 (EST)
To: John Baldwin
Cc: Garrett Cooper, Andriy Gapon, freebsd-arch@freebsd.org, Konstantin Belousov, Dag-Erling Smørgrav, Dimitry Andric
Subject: Re: x86 boot code build
Message-ID: <20121006072636.V978@besplex.bde.org>
In-Reply-To: <201210051141.16147.jhb@freebsd.org>
References: <506C385C.3020400@FreeBSD.org> <86a9w1kq94.fsf@ds4.des.no> <20121005133616.GP35915@deviant.kiev.zoral.com.ua> <201210051141.16147.jhb@freebsd.org>

On Fri, 5 Oct 2012, John Baldwin wrote:

> On Friday, October 05, 2012 9:36:16 am Konstantin Belousov wrote:
>> On Fri, Oct 05, 2012 at 03:22:31PM +0200, Dag-Erling Smørgrav wrote:
>>> Konstantin Belousov writes:
>>>> So what ISA additions do you expect to get advantage of by switching
>>>> to pentium-mmx from 486 ?
>>>> As I already said, I am not aware of any.
>>>
>>> The TSC, for one. MMX, and the ability to use MMX registers to copy
>>> data.
>>
>> TSC is used regardless of the compiler flags, we use it if CPU claims
>> that TSC is supported, even in usermode.
>>
>> Compiler never generates MMX copies. More, in kernel, the manual
>> FPU context save/restore is needed around the FPU/MMX register file access.

1. The TSC provides no significant performance advantage for boot code.
   In the kernel it is very difficult to use, and provides few advantages
   for pentium-mmx. It takes about core2 or later for it to be P-state
   invariant. Then it is not quite so difficult to use, and provides some
   advantages.

2. MMX for copying data provides no significant performance advantage for
   boot code. In the kernel, it is difficult to use, and provides few
   advantages for pentium-mmx. MMX registers are only 64 bits wide, and
   the copying speed tends to be limited more by (lack of) caches and
   write buffers than by the registers used.

   SSE registers provide larger advantages by being 128 bits wide. It
   takes about an AthlonXP or later for SSE plus enough extensions (at
   least one movnt* instruction is needed and I think basic SSE doesn't
   have any).

   The best method is very machine- and context-dependent. Someone named
   des removed my hooks for plugging in the best known copying routines at
   runtime. I was happy to see them gone, since they are too complicated
   to use. There would have to be about 100 different versions for each
   of bcopy, bzero, copyin and copyout (memcpy and friends are
   intentionally not optimized, since use of them for large data asks for
   slowness). I only tested about 40 different versions of bcopy and 20
   of bzero.

> I agree with kib. I don't think building i386 releases with > i486 buys
> you much of anything.
> Using MMX in the kernel is of dubious value (have to
> be very careful to use it, and when tested in the past by bde@ for things like
> bcopy() and bzero() it wasn't a clear win IIRC).

Here are results of a current run of old test code:

on core2 (ref10-i386): results only for a data size of 4K (for much smaller
sizes, simple methods are best, and for much larger sizes, all reasonable
methods are limited by the speed of main memory and cache overheads, and
all reasonable methods have the same speed, except ones using movnt* are
faster since they bypass the caches):

% copy0: 12146747898 B/s ( 263445 us) (511794241 tsc) (movsl)

movsl is a good general method, and on this CPU it is almost twice as fast
as all other methods that don't use SSE. (On Athlon64, some of the other
non-SSE methods are competitive).

% copy1: 7120415120 B/s ( 449412 us) (838775735 tsc) (unroll *4)
% copy2: 5773557468 B/s ( 554251 us) (1032266095 tsc) (unroll *4 prefetch)
% copy3: 4452898768 B/s ( 718633 us) (1338746402 tsc) (unroll *16 i586-opt)
% copy4: 6465613041 B/s ( 494926 us) (921710503 tsc) (unroll *16 i586-opt prefetch)
% copy5: 6328337902 B/s ( 505662 us) (942113053 tsc) (unroll *16 i586-opx prefetch)
% copy6: 4838090285 B/s ( 661418 us) (1231845839 tsc) (unroll *8 prefetch 4)
% copy7: 7290755322 B/s ( 438912 us) (817908588 tsc) (unroll 64 fp++)
% copy8: 6463210196 B/s ( 495110 us) (922004965 tsc) (unroll 128 fp i-prefetch)
% copy9: 7264439208 B/s ( 440502 us) (820443267 tsc) (unroll 64 fp reordered)
% copyA: 7298770613 B/s ( 438430 us) (816486286 tsc) (unroll 256 fp reordered++)
% copyB: 7296257704 B/s ( 438581 us) (816792606 tsc) (unroll 512 fp reordered++)
% copyC: 700413769 B/s (4568728 us) (8509304678 tsc) (Terje cksum)
% copyD: 6266866684 B/s ( 510622 us) (951099730 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
% copyE: 6479962740 B/s ( 493830 us) (919570911 tsc) (unroll 64 fp i-prefetch++)

Raw (i586-optimized) kernel bcopy (copy9) is 12.5% faster than the non-raw
version (copyD)
mainly because it is sloppy and doesn't do FPU state switching.

% copyF: 6463432123 B/s ( 495093 us) (922068252 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))

"new" kernel bcopy has some improvements for Pentium1 related to fxch.
These make little difference on core2, since the overhead of fxch is
pipelined almost out of existence on core2.

% copyG: 11128460690 B/s ( 287551 us) (535363248 tsc) (memcpy (movsl))
% copyH: 2494210703 B/s (1282971 us) (2389591890 tsc) (movntps)
% copyI: 2283259781 B/s (1401505 us) (2611152194 tsc) (movntps with prefetchnta)
% copyJ: 2246123156 B/s (1424677 us) (2662460521 tsc) (movntps with block prefetch)
% copyK: 13432566418 B/s ( 238227 us) (443974286 tsc) (movq)
% copyL: 11812171705 B/s ( 270907 us) (504438067 tsc) (movq with prefetchnta)
% copyM: 12430515361 B/s ( 257431 us) (479327961 tsc) (movq with block prefetch)

movq (64 bits through MMX registers) gives the same speed as movsl. But
state switching for MMX would probably cost 12.5% like it does for i586-
optimized kernel bcopy.

% copyQ: 26618974338 B/s ( 120215 us) (223928117 tsc) (movdqa)
% copyR: 21855833459 B/s ( 146414 us) (272801830 tsc) (movdqa with prefetchnta)
% copyS: 22343716179 B/s ( 143217 us) (266771960 tsc) (movdqa with block prefetch)

movdqa (128 bits through SSE registers using an SSE2 instruction) is the
only method tested that is significantly faster than movsl (about twice as
fast). Here all data is in the L1 cache except possibly for the first
iteration (there are several hundred thousand iterations).
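
The B/s and us columns above can be cross-checked with a little arithmetic. The following is only a sketch, not the test program that produced these numbers: it assumes each result line reports total bytes copied divided by elapsed time, takes the 4K block size from the text, and uses made-up helper names (`total_bytes`, `iterations`, `speedup`).

```python
# Hypothetical helpers for sanity-checking the result lines above.
# Assumption: B/s = (block size * iterations) / elapsed time.

BLOCK = 4096  # data size for these runs (4K, per the text)

def total_bytes(bytes_per_sec, usec):
    """Bytes moved in one run: rate multiplied by elapsed time."""
    return bytes_per_sec * usec / 1e6

def iterations(bytes_per_sec, usec, block=BLOCK):
    """Number of passes over the block implied by one result line."""
    return round(total_bytes(bytes_per_sec, usec) / block)

def speedup(rate, baseline):
    """How many times faster 'rate' is than 'baseline'."""
    return rate / baseline

# (B/s, us) pairs copied from the copy0 and copyQ lines.
movsl = (12146747898, 263445)
movdqa = (26618974338, 120215)

print("movsl passes: ", iterations(*movsl))
print("movdqa passes:", iterations(*movdqa))
print("movdqa vs movsl: %.2fx" % speedup(movdqa[0], movsl[0]))
```

Under that assumption both lines imply the same number of passes over the 4K buffer, which is consistent with "several hundred thousand iterations", and the rate ratio matches "about twice as fast".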
% copyT: 6627728760 B/s ( 482820 us) (899276378 tsc) (unroll *8 a64-opt)
% copyU: 6441859201 B/s ( 496751 us) (925378496 tsc) (unroll *8 a64-opt with prefetchnta)
% copyV: 6514737558 B/s ( 491194 us) (914475275 tsc) (unroll *8 a64-opt with block prefetch)
% copyW: 2769764215 B/s (1155333 us) (2151805649 tsc) (movnti)
% copyX: 2519306152 B/s (1270191 us) (2365494292 tsc) (movnti with prefetchnta)
% copyY: 2494284581 B/s (1282933 us) (2389247728 tsc) (movnti with block prefetch)

movnti gives the speed of main memory, which is very slow for ref10-i386
(2.7 GB/s). The source is cached, so the only limit should be for writing
to the target; movnti prevents this being cached. If the data size were
larger than all caches, then movnti would be best and we would hope for a
speed of 2.7/2 GB/s; without movnti, we would only hope for 2.7/3 GB/s.
It is very difficult for copy routines or callers to know whether movnti
should be used to possibly get this speedup by a factor of 1.5 for large
data at the possible cost of a speed-down by a factor of 10 for small data.
Some systems have relatively faster main memory and it is clear that movnti
is less good for them.

% copyZ: 18939281846 B/s ( 168961 us) (314562465 tsc) (i686_memcpy( movdqa))
% copya: 19157661568 B/s ( 167035 us) (311110842 tsc) (~i686_memcpy (movaps))

> Also, for the boot code, the most important thing is size. The text + data +
> stack for /boot/loader has to all fit below 640k (and the first 40k is
> reserved by BTX, so you really only have 600k for that, minus any "low" memory
> consumed by things like PXE ROMs). That is true even on amd64, and won't be
> any better on x86 until we fully support EFI for booting.

Compiling boot code for newer processors would mainly break it for
emergency use on older processors.

Bruce
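
The 2.7/2 versus 2.7/3 estimate for large copies can be written out as a small traffic model. This is a sketch under an assumption that the message does not spell out: that a copy larger than all caches is limited only by main-memory traffic, with two memory transfers per byte when the stores bypass the cache (read source, write target) and three when they do not (read source, fetch the target line into the cache, write it back). The names `MEM_BW_GBS` and `copy_rate` are made up for illustration.

```python
# Assumed model, not a measurement: copy rate = memory bandwidth divided by
# the number of main-memory transfers needed per byte copied.

MEM_BW_GBS = 2.7  # main-memory speed quoted above for ref10-i386, in GB/s

def copy_rate(mem_bw, transfers_per_byte):
    """Expected copy rate when main memory is the only bottleneck."""
    return mem_bw / transfers_per_byte

# movnti: read the source, write the target directly to memory.
nontemporal = copy_rate(MEM_BW_GBS, 2)

# Ordinary stores (assumed write-allocate): read the source, read the
# target line into the cache, then write the target line back.
cached = copy_rate(MEM_BW_GBS, 3)

print("movnti:   %.2f GB/s" % nontemporal)
print("cached:   %.2f GB/s" % cached)
print("speedup:  %.1fx" % (nontemporal / cached))
```

The ratio of the two rates depends only on the transfer counts, not on the bandwidth figure, which is why the same factor of 1.5 would be expected on machines with faster memory if the model holds.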