From owner-freebsd-current  Tue Jul 13 14:48:18 1999
Delivered-To: freebsd-current@freebsd.org
Received: from alcanet.com.au (border.alcanet.com.au [203.62.196.10])
	by hub.freebsd.org (Postfix) with ESMTP
	id D0EBA14DD2; Tue, 13 Jul 1999 14:48:10 -0700 (PDT)
	(envelope-from jeremyp@gsmx07.alcatel.com.au)
Received: by border.alcanet.com.au id <40325>; Wed, 14 Jul 1999 07:28:23 +1000
Date: Wed, 14 Jul 1999 07:46:13 +1000
From: Peter Jeremy <jeremyp@gsmx07.alcatel.com.au>
Subject: Re: LOCK overheads (was Re: "objtrm" problem probably found)
In-reply-to: <199907131555.IAA78738@apollo.backplane.com>
To: dillon@apollo.backplane.com
Cc: freebsd-current@FreeBSD.ORG, green@FreeBSD.ORG
Message-Id: <99Jul14.072823est.40325@border.alcanet.com.au>
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Matthew Dillon <dillon@apollo.backplane.com> wrote:
>:mode 1   17.99 ns/loop nproc=1 lcks=no
>:mode 3  166.33 ns/loop nproc=1 lcks=yes
...
>:This is a K6-2 350. Locks are pretty expensive on them.
>    Wow, now that *is* expensive!  The K6 must be implementing it in
>    microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode
execution.  As I stated yesterday, I don't believe locked instructions
are executed frequently enough to warrant special handling, so they
are likely to be implemented in whichever way needs the least chip
area.

Since you need to be able to track and mark the memory references
associated with the instruction, the cheapest implementation (in terms
of dedicated chip area) is likely to be something like: wait until all
currently executing instructions are complete, wait until all pending
memory writes are complete (at least to L1 cache), assert the lock pin
and execute the RMW instruction without allowing any other instructions
to commence, then deassert the lock pin.  This is (of course) probably
the worst case for execution time as seen by that CPU - though it's not
far off optimal as seen by other CPUs.

(`Assert lock pin' should also be mapped into a `begin locked memory
reference' using whatever cache coherency protocol is being used).

I'm not saying that you can't implement a locked RMW sequence a lot
better, but until the chip architects decide that the performance is
an issue, they aren't likely to spend any silicon on it.  The big
IA-32 market is UP systems running games - and locked RMW instructions
don't affect this market.  Intel see the high-end of the market (where
SMP and hence locked RMW is more of an issue) moving to Merced.  This
suggests that it's unlikely that the IA-32 will ever acquire a decent
lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core
clocks (or about 15 bus clocks) - even microcode couldn't be that bad.
My other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7,
P-II: 34/8.5, P-III: 34/7.5.
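
For anyone wanting to check the K6-2 arithmetic, here is a rough sketch
of the conversion from the quoted ns/loop figures to clocks.  It assumes
a 350 MHz core on a 100 MHz bus (the bus speed isn't stated in this
thread, so treat it as an assumption), and that the whole difference
between the two modes is down to the lock prefix.  The names are just
for illustration.

```python
# Convert the quoted loop timings into the cost of the lock prefix in clocks.
# Assumption: the K6-2 350 runs a 350 MHz core on a 100 MHz bus
# (the bus speed is not stated in the thread).

CORE_MHZ = 350.0
BUS_MHZ = 100.0       # assumed

UNLOCKED_NS = 17.99   # mode 1, lcks=no
LOCKED_NS = 166.33    # mode 3, lcks=yes


def ns_to_clocks(ns, mhz):
    """Number of periods of an `mhz` MHz clock that fit in `ns` nanoseconds."""
    return ns * mhz / 1000.0


# Attribute the whole difference between the two loops to the lock prefix.
lock_ns = LOCKED_NS - UNLOCKED_NS
core_clocks = ns_to_clocks(lock_ns, CORE_MHZ)
bus_clocks = ns_to_clocks(lock_ns, BUS_MHZ)

print(f"lock overhead: {lock_ns:.2f} ns")    # 148.34 ns
print(f"core clocks:   {core_clocks:.1f}")   # 51.9
print(f"bus clocks:    {bus_clocks:.1f}")    # 14.8
```

Under those assumptions the lock prefix comes out at roughly 52 core
clocks or just under 15 bus clocks, which is where the figures above
come from.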

I suspect that these timings are a combination of inefficient on-chip
implementation of the lock prefix (see above for my reasoning behind
this), together with poor off-chip handling of locked cycles.

Peter

