Date: Thu, 18 Jan 2007 11:16:19 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Attilio Rao <attilio@freebsd.org>
Cc: freebsd-current@freebsd.org, Ivan Voras <ivoras@fer.hr>, freebsd-arch@freebsd.org
Subject: Re: Optimized copy&move (was: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs)
Message-ID: <20070118094808.F11834@delplex.bde.org>
In-Reply-To: <3bbf2fe10701171315g696bca4fi3bf676b62c06f4d@mail.gmail.com>
References: <3bbf2fe10607250813w8ff9e34pc505bf290e71758@mail.gmail.com>
 <b1fa29170701161355lc021b90o35fa5f9acb5749d@mail.gmail.com>
 <eoji7s$cit$2@sea.gmane.org>
 <b1fa29170701161425n7bcfe1e5m1b8c671caf3758db@mail.gmail.com>
 <eojlnb$qje$1@sea.gmane.org>
 <b1fa29170701161534n1f6c3803tbb8ca60996d200d9@mail.gmail.com>
 <eojok9$449$1@sea.gmane.org>
 <20070117134022.V18339@besplex.bde.org>
 <20070117224812.Q23194@besplex.bde.org>
 <45AE7BF8.10703@fer.hr>
 <3bbf2fe10701171315g696bca4fi3bf676b62c06f4d@mail.gmail.com>

On Wed, 17 Jan 2007, Attilio Rao wrote:

> 2007/1/17, Ivan Voras <ivoras@fer.hr>:
>> Bruce Evans wrote:
>>
>> > And MMX/XMM registers are not needed to get movnt on machines with SSE2,
>> > since movnti is part of SSE2.  This reduces the advantages of using
>> > MMX/XMM registers on P4's and A64's in 32-bit mode to the non-nt parts
>> > of the above (fully cached case), which I think are less important than
>> > the nt parts.

One more point on movnt*:
- The i386 (32-bit) kernel already uses movnti in the one place where it
  is certain to be an optimization (for zeroing pages but not for copying
  anything) if and only if movnti is known for sure to be available
  (essentially, if the CPU supports SSE2).  See i386/pmap.c and
  i386/support.s.

>> Hmm, I'm looking at i386/i386/support.s and there are several versions
>> of bcopy and bmove functions, including some that optimize by using FPU
>> registers (large_i586_bcopy_loop), and a version that uses movnti
>> (sse2_pagezero), but I can't find the bit of magic which glues them to
>> bzero() call.

sse2_pagezero() is the SSE2 optimization mentioned above.  It is glued
in pmap.c.

The i586 bcopy functions are my old mistakes which I'm trying to bury
:-).  They are glued in isa/npx.c.  There is a runtime test for their
efficiency there, and config-time configuration by npx flags (see NOTES).
All use of the FPU routines is disabled in all released versions of
FreeBSD later than FreeBSD-4 (the config-time configuration is ignored
and the dynamic test and the glueing are not done; only the actual
routines are compiled, so in theory you could enable them by glueing
them in a kmod).

>> Also, as far as I can tell by the comments, the FPU version works by
>> manually saving context... why is this possible (i.e. won't something
>> preempt it?)

In RELENG_4, the kernel is not preemptible so preemption isn't a problem.
In later versions, preemption is a problem so the FPU routines are
disabled.
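
The "glue" discussed above amounts to choosing a routine once, based on a CPU feature check, and calling through a pointer afterwards. The following is only a userland sketch of that idea, not the code in pmap.c or npx.c. The names select_pagezero, generic_pagezero, nt_pagezero and PAGE_SIZE_BYTES are hypothetical, the feature flag is passed in by the caller instead of being probed, and the non-temporal path uses the compiler intrinsic _mm_stream_si32 (which maps to movnti) where SSE2 is available at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

#define PAGE_SIZE_BYTES 4096   /* assumed page size for this sketch */

typedef void (*pagezero_fn)(void *page);

/* Fallback: ordinary cached stores. */
static void generic_pagezero(void *page)
{
    assert(page != NULL);
    memset(page, 0, PAGE_SIZE_BYTES);
}

#if defined(__SSE2__)
/* Non-temporal variant: 32-bit streaming stores that bypass the cache. */
static void nt_pagezero(void *page)
{
    int *p = page;

    assert(page != NULL);
    for (size_t i = 0; i < PAGE_SIZE_BYTES / sizeof(int); i++)
        _mm_stream_si32(&p[i], 0);
    _mm_sfence();   /* make the streaming stores globally visible */
}
#endif

/*
 * Pick a routine once.  cpu_has_sse2 is supplied by the caller here;
 * a kernel would derive it from its CPU feature probe.
 */
static pagezero_fn select_pagezero(bool cpu_has_sse2)
{
#if defined(__SSE2__)
    if (cpu_has_sse2)
        return nt_pagezero;
#else
    (void)cpu_has_sse2;
#endif
    return generic_pagezero;
}
```

A caller would store the result of select_pagezero() in a function pointer at startup and zero every page through it, so the feature test is paid once and not per call.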

RELENG_4 has a limited amount of preemption for interrupt handlers, and
the FPU routines have a limited amount of recursive saving of contexts
to support this.  No bugs are known in this under RELENG_4.  Under later
versions, the recursive saving doesn't quite work.

RELENG_4 has a limited amount of support for SMP, and the FPU routines
have a limited amount of locking to support this.  No bugs are known in
this under RELENG_4, but I wouldn't trust it without testing.  It has
probably been tested enough under RELENG_4 by now, but it might never
have been tested before 2001 when the FPU routines were turned off in
-current, because there were no machines with the critical combination
of features until relatively recently:
- SMP machines with Pentium-1's were rare and are now rarer
- the FPU routines are slower on P2-P4, K5 and K6 so the dynamic
  configuration should prevent them being used on machines with P2-P4,
  K5 and K6.
- the FPU routines are faster on Athlons (XP and 64 at least), but these
  didn't exist until 2001.  The introduction of these CPUs may have been
  the trigger for turning off the FPU routines in -current in 2001.
  Until then problems were limited to Pentium-1's since the dynamic
  configuration prevented the routines being used on all other machines.

> They are just broken.
> My implementation, which follows DragonFlyBSD patterns, just use a bts
> (which is atomic) in order to set a "lock" and avoid thread migration
> with scheduler pinning. This is enough to solve concurrency problems.

There is a bit more to it than that :-).  The old implementation uses a
sar <mem> instruction for the same purpose.  Neither bts nor sar <mem> is
atomic, but both can be made atomic using a lock prefix.  The old
implementation neglects to do this, so the instruction is only atomic
with respect to interrupts.  If it works at all for the SMP case under
RELENG_4, then it is because Giant locking prevents all types of preemption.
Giant locking certainly prevents process preemption, but it is less clear
that it prevents interrupt handlers running on other CPUs from getting
far enough to clobber the lock.  I think it does.  The unlocked sar just
doesn't work under -current, especially starting much later than 2001
when the kernel became fully preemptible.  (I like to use sar instead of
bts/cmpxchg/whatever since it is more portable.)

Bruce
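
The difference between a flag update that is atomic only with respect to interrupts and one that is atomic across CPUs can be illustrated in portable C11. This is a rough sketch under stated assumptions, not the code from support.s or from the patch under discussion: fpu_lock, fpu_trylock, fpu_unlock and guarded_copy are hypothetical names, the "fast path" is just memcpy standing in for an FPU/SSE copy loop, and thread pinning and context saving are only marked by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single lock guarding the wide-register copy path. */
static atomic_flag fpu_lock = ATOMIC_FLAG_INIT;

/*
 * Atomic test-and-set: returns true only for the one caller that
 * changed the flag from clear to set, even with several CPUs racing.
 */
static bool fpu_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&fpu_lock,
        memory_order_acquire);
}

static void fpu_unlock(void)
{
    atomic_flag_clear_explicit(&fpu_lock, memory_order_release);
}

/*
 * Copy len bytes, using the guarded path when the lock is free and
 * the plain path otherwise.  Returns true if the guarded path ran.
 */
static bool guarded_copy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    if (!fpu_trylock()) {
        memcpy(dst, src, len);      /* lock busy: generic copy */
        return false;
    }
    /*
     * A kernel would pin the thread and save the FPU context here,
     * run the wide-register loop, then restore the context.
     */
    memcpy(dst, src, len);          /* stand-in for the fast loop */
    fpu_unlock();
    return true;
}
```

Falling back to the generic copy when the lock is busy keeps the caller from spinning, at the cost of sometimes missing the fast path.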