From owner-cvs-sys Fri Dec 6 23:58:43 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id XAA16011 for cvs-sys-outgoing; Fri, 6 Dec 1996 23:58:43 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id XAA16006; Fri, 6 Dec 1996 23:58:20 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.3/8.6.9) id SAA19808; Sat, 7 Dec 1996 18:51:47 +1100
Date: Sat, 7 Dec 1996 18:51:47 +1100
From: Bruce Evans
Message-Id: <199612070751.SAA19808@godzilla.zeta.org.au>
To: bde@zeta.org.au, peter@spinner.DIALix.COM
Subject: Re: cvs commit: src/sys/i386/include endian.h
Cc: cvs-all@freefall.freebsd.org, CVS-committers@freefall.freebsd.org, cvs-sys@freefall.freebsd.org, dyson@freefall.freebsd.org, toor@dyson.iquest.net
Sender: owner-cvs-sys@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

>Just as a thought, is it possible to handle the illegal instruction
>trap on the i386 and emulate the bswap instruction?  Then we could just
>use bswap everywhere and be done with it.  Obviously this would be a

We would have to replace it with code that doesn't trap.  Always trapping
would be too slow.  Unfortunately, bswap is a short instruction (only 2
bytes), so there is no room to patch in a call (5 bytes) unless the
instruction is padded to begin with.

>penalty on the i386 (I wonder how much?), but it'd simplify the

I guess about 50 us.

>environment on the "current" mainstream cpu's.  Perhaps this would
>also be worth doing for invlpg() and other instructions?  It would
>eliminate a runtime overhead for testing cpu_class on >= i486 cpu's

Using function calls would provide maximum flexibility at a small cost.
A function call+ret takes only 1+2 cycles (+ more for cache misses and
BTB misses).  That's not much more than the 2 cycles (+ more ...) for
testing cpu_class.

BTW, I have found a case where non-inline spls cause a reproducible
slowdown - `ping localhost' on an idle P5/133 takes about 3 us longer.
Each ping takes about 16 splnnn()s and 16 splx()s, so the call+ret
overhead doesn't completely account for the slowdown (32 call/ret pairs
at about 3 cycles each come to well under 1 us at 133 MHz).  I guess the
extra function calls and loss of locality cause additional BTB and cache
misses.  16 splnnn()s per second is probably too few for the function
to stay in the L1 cache any better than inline code would.

Bruce
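
To make concrete what replacing bswap with "code that doesn't trap"
would compute, here is a minimal sketch in C.  The function name
byteswap32 is hypothetical and not from the tree; plain shifts and
masks perform the same 32-bit byte reversal on any i386-class CPU, at
the cost of a few extra instructions.

	/*
	 * Hypothetical non-trapping bswap replacement (illustration
	 * only): isolate each byte of the 32-bit word and move it to
	 * the mirror-image position.
	 */
	static __inline unsigned long
	byteswap32(unsigned long x)
	{
		return ((x & 0x000000ffUL) << 24 |
			(x & 0x0000ff00UL) <<  8 |
			(x & 0x00ff0000UL) >>  8 |
			(x & 0xff000000UL) >> 24);
	}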
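
To make the call-vs-test tradeoff concrete, here is a sketch of the two
dispatch styles in C.  All of the names (invlpg_by_test, invlpg_i386,
invlpg_i486, invlpg_fn) are hypothetical, and the CPUCLASS_486 value is
illustrative; this is not the kernel's actual interface.

	/*
	 * Placeholder declarations; the real cpu_class and CPUCLASS_*
	 * values live in the kernel headers.
	 */
	extern int cpu_class;
	#define	CPUCLASS_486	2		/* illustrative value */

	void	invlpg_i386(void *addr);	/* e.g. flush the whole TLB */
	void	invlpg_i486(void *addr);	/* e.g. use the invlpg insn */

	/* Style 1: test cpu_class at every call site (2 cycles + misses). */
	static __inline void
	invlpg_by_test(void *addr)
	{
		if (cpu_class >= CPUCLASS_486)
			invlpg_i486(addr);
		else
			invlpg_i386(addr);
	}

	/*
	 * Style 2: call through a pointer set once at boot
	 * (1+2 cycles for call+ret, plus possible BTB misses).
	 */
	extern void (*invlpg_fn)(void *addr);
	#define	invlpg(addr)	((*invlpg_fn)(addr))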