Date: Wed, 14 Jul 1999 07:46:13 +1000
From: Peter Jeremy
Subject: Re: LOCK overheads (was Re: "objtrm" problem probably found)
In-reply-to: <199907131555.IAA78738@apollo.backplane.com>
To: dillon@apollo.backplane.com
Cc: freebsd-current@FreeBSD.ORG, green@FreeBSD.ORG

Matthew Dillon wrote:
>:mode 1   17.99 ns/loop nproc=1 lcks=no
>:mode 3  166.33 ns/loop nproc=1 lcks=yes
...
>:This is a K6-2 350.  Locks are pretty expensive on them.
>    Wow, now that *is* expensive!  The K6 must be implementing it in
>    microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode
execution.  As I stated yesterday, I don't believe locked instructions
are executed frequently enough to warrant special handling, and they
are therefore likely to be implemented in whichever way needs the
least chip area.  Since you need to be able to track and mark the
memory references associated with the instruction, the cheapest
implementation (in terms of dedicated chip area) is likely to be
something like: wait until all currently executing instructions are
complete, wait until all pending memory writes are complete (at least
to L1 cache), assert the lock pin and execute the RMW instruction
without allowing any other instructions to commence, deassert the lock
pin.  This is (of course) probably the worst case for execution time
as seen by that CPU - though it's not far off optimal as seen by other
CPUs.
(`Assert lock pin' should also be mapped into a `begin locked memory
reference' using whatever cache coherency protocol is being used.)

I'm not saying that you can't implement a locked RMW sequence a lot
better, but until the chip architects decide that the performance is
an issue, they aren't likely to spend any silicon on it.  The big
IA-32 market is UP systems running games - and locked RMW instructions
don't affect this market.  Intel see the high end of the market (where
SMP and hence locked RMW is more of an issue) moving to Merced.  This
suggests that it's unlikely that the IA-32 will ever acquire a decent
lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core
clocks (or about 15 bus clocks) - even microcode couldn't be that bad.
My other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7,
P-II: 34/8.5, P-III: 34/7.5.  I suspect that these timings are a
combination of inefficient on-chip implementation of the lock prefix
(see above for my reasoning behind this), together with poor off-chip
handling of locked cycles.

Peter
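
For anyone wanting to check the "over 50 core clocks" figure, the
conversion from the quoted ns/loop numbers can be sketched roughly as
below.  This is only an illustration: the helper name lock_cost_clocks
is made up here, and the 100 MHz bus clock for the K6-2 350 is an
assumption (3.5 x 100 MHz), not something stated in the thread.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Cost of the lock prefix in clock cycles: the difference between the
 * locked and unlocked loop times (in ns) multiplied by the clock rate.
 * Since 1 MHz is 1 cycle per 1000 ns, cycles = ns * MHz / 1000.
 */
double lock_cost_clocks(double locked_ns, double unlocked_ns,
                        double clock_mhz)
{
    assert(clock_mhz > 0.0);          /* a clock rate must be positive */
    assert(locked_ns >= unlocked_ns); /* the lock should not speed things up */
    return (locked_ns - unlocked_ns) * clock_mhz / 1000.0;
}
```

With the figures quoted from the K6-2 350,
lock_cost_clocks(166.33, 17.99, 350.0) comes to about 51.9 core clocks,
and lock_cost_clocks(166.33, 17.99, 100.0) to about 14.8 bus clocks at
the assumed 100 MHz bus.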