Date: Sun, 21 Jan 2007 15:03:48 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: David Malone <dwmalone@maths.tcd.ie>
Cc: Attilio Rao <attilio@FreeBSD.org>, freebsd-current@FreeBSD.org, Ivan Voras <ivoras@fer.hr>, freebsd-arch@FreeBSD.org
Subject: Re: Optimized copy&move (was: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs)
Message-ID: <20070121140716.W4007@besplex.bde.org>
In-Reply-To: <20070120215103.GA93101@walton.maths.tcd.ie>
References: <eoji7s$cit$2@sea.gmane.org> <b1fa29170701161425n7bcfe1e5m1b8c671caf3758db@mail.gmail.com> <eojlnb$qje$1@sea.gmane.org> <b1fa29170701161534n1f6c3803tbb8ca60996d200d9@mail.gmail.com> <eojok9$449$1@sea.gmane.org> <20070117134022.V18339@besplex.bde.org> <20070117224812.Q23194@besplex.bde.org> <45AE7BF8.10703@fer.hr> <3bbf2fe10701171315g696bca4fi3bf676b62c06f4d@mail.gmail.com> <20070118094808.F11834@delplex.bde.org> <20070120215103.GA93101@walton.maths.tcd.ie>

On Sat, 20 Jan 2007, David Malone wrote:

> On Thu, Jan 18, 2007 at 11:16:19AM +1100, Bruce Evans wrote:
>> - the FPU routines are faster on Athlons (XP and 64 at least), but these
>>   didn't exist until 2001.  The introduction of these CPUs may have
>>   been the trigger for turning off the FPU routines in -current in 2001.
>>   Until then problems were limited to Pentium-1's since the dynamic
>>   configuration prevented the routines being used on all other machines.
>
> I think a very quirky K6-2 machine that I had let us reproduce the
> problem fairly dependably and may have been part of the reason it
> was finally turned off.

I just looked again at your old (2001) mail about this.  The userland
benchmark was flawed.  It tried 3 methods sequentially without warming
up caches, so all methods did unintended testing of I-cache misses
(including branch target cache misses), and the first method (userland
bzero) warmed up the D-cache for the other 2.

The kernel runtime configuration also fails to either warm or cool the
caches initially.  It assumes P1 cache sizes and depends on a 1MB buffer
being much larger than the caches.  Maybe this was not enough for K6-2.
It is certainly not enough for Athlon64, but I think it would mostly
cause false negatives, so I don't understand why it gave a false
positive for the K6-2.

After fixing the userland benchmark, userland bzero did much better and
your benchmark agreed with mine that FPU methods for bzero are just
pessimizations on A64-AXP.  However, the behaviour for bcopy is quite
different on A64-AXP -- even the old FPU methods are small optimizations
in some cases (on A64, about 25% in the fully-L2 cached case; little
difference for other large copies).

Bruce
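
The benchmark flaw described above (methods timed one after another with no warm-up, so the first method pays the cache misses and warms the D-cache for the rest) can be illustrated with a small userland sketch. This is not the original 2001 benchmark or its fix, which are not shown in this message. The names `zero_fn`, `zero_memset`, `zero_loop`, `now_sec` and `time_zero` are hypothetical, and the sketch assumes a POSIX system that provides `clock_gettime()` with `CLOCK_MONOTONIC`. It only shows the warm-up idea: each method runs a few untimed passes over the same buffer before its timed passes, so every method starts from a comparable cache state.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical interface: every zeroing method under test has this shape. */
typedef void (*zero_fn)(void *buf, size_t len);

/* Method 1: whatever the C library's memset() does. */
static void zero_memset(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/* Method 2: a plain byte loop; volatile keeps the stores from being elided. */
static void zero_loop(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    for (size_t i = 0; i < len; i++)
        p[i] = 0;
}

/* Monotonic wall-clock time in seconds (assumes POSIX clock_gettime). */
static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Time one method.  The warm-up passes are not timed; they pull the
 * method's code and the buffer into the caches so that the timed passes
 * of every method start from a similar state.  Returns the average
 * number of seconds per timed pass.
 */
static double time_zero(zero_fn fn, void *buf, size_t len,
    int warmup, int iters)
{
    for (int i = 0; i < warmup; i++)
        fn(buf, len);

    double t0 = now_sec();
    for (int i = 0; i < iters; i++)
        fn(buf, len);
    return (now_sec() - t0) / iters;
}
```

A caller would invoke `time_zero()` once per method with the same buffer, `warmup` and `iters`, for example `time_zero(zero_memset, buf, len, 4, 100)` followed by `time_zero(zero_loop, buf, len, 4, 100)`. Measuring the cold-cache case would instead need the caches to be evicted before each timed pass, which this sketch does not attempt.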