From owner-freebsd-hackers  Mon Jan 22 10:49:44 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id KAA05239
          for hackers-outgoing; Mon, 22 Jan 1996 10:49:44 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id KAA05232
          for <freebsd-hackers@freefall.freebsd.org>; Mon, 22 Jan 1996 10:49:39 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id LAA15576; Mon, 22 Jan 1996 11:39:47 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199601221839.LAA15576@phaeton.artisoft.com>
Subject: Re: stanford benchmark/usenix
To: davidg@root.com
Date: Mon, 22 Jan 1996 11:39:47 -0700 (MST)
Cc: hasty@rah.star-gate.com, rmallory@wiley.csusb.edu,
        freebsd-hackers@freefall.freebsd.org
In-Reply-To: <199601221021.CAA14236@Root.COM> from "David Greenman" at Jan 22, 96 02:21:28 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@FreeBSD.ORG
Precedence: bulk

[ ... CPU specific bzero/bcopy/other ... ]

> The function vector can then be changed to an optimized function for specific
> CPU types. This would happen at some convenient place before program startup,
> or perhaps in the generic function (which could, perhaps, be a stub whose sole
> purpose is to select the appropriate routine, or fall back to a generic one).
> I really don't want to get into this in more detail right now - I don't have
> the time and in the end it would be easier to just sit down and code it. If
> you think you know how to implement this correctly, then by all means, go for
> it!

We did this as well with the BSD kernel environment emulation under
Windows95 for the file system framework (we have UFS running as a
native FS under Win95 after making some of the changes I've been
suggesting, after isolating the BSD'isms and optimizing performance
based on the non-statistical profiling data).
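
For anyone who wants something concrete to poke at, here is a rough
sketch of the function vector idea from the quoted paragraph, with the
generic entry point acting as a stub that picks a routine on first use.
All of the names (kbcopy, bcopy_vec, detected_cpu, the cpu_class enum)
are made up for illustration, the "optimized" routines are just
stand-ins that call memmove(), and this is not the code we used under
Win95 or anything in the FreeBSD tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU classes; a real kernel would probe this at boot. */
enum cpu_class { CPU_GENERIC, CPU_486, CPU_P5 };

/* bcopy-style argument order: source first, then destination. */
typedef void (*bcopy_fn)(const void *src, void *dst, size_t len);

/* Fallback routine that is correct on any CPU. */
static void bcopy_generic(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/* Stand-ins for tuned routines; real ones would be hand-written asm. */
static void bcopy_486(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

static void bcopy_p5(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

static void bcopy_select(const void *src, void *dst, size_t len);

/* The vector initially points at the selector stub. */
static bcopy_fn bcopy_vec = bcopy_select;

/* Assumed to be filled in by (hypothetical) CPU identification code. */
static enum cpu_class detected_cpu = CPU_GENERIC;

/* First call lands here: rewrite the vector, then do the copy. */
static void bcopy_select(const void *src, void *dst, size_t len)
{
    switch (detected_cpu) {
    case CPU_486: bcopy_vec = bcopy_486;     break;
    case CPU_P5:  bcopy_vec = bcopy_p5;      break;
    default:      bcopy_vec = bcopy_generic; break;
    }
    bcopy_vec(src, dst, len);
}

/* What callers use; costs one indirect call per copy. */
static void kbcopy(const void *src, void *dst, size_t len)
{
    bcopy_vec(src, dst, len);
}
```

After the first kbcopy() the stub is out of the path entirely, so the
steady-state cost is the indirect call and nothing else.  The
alternative of filling in the vector at a fixed point during startup
avoids even the one-time stub, at the price of having to be sure
nothing copies before that point.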

Do you remember Bruce's message regarding reordering the cache line
loads in the P5 optimized bcopy?  He said:

| On my 486DX2/66 with an unknown writing strategy, copy() is about 20%
| faster than memcpy() (*) but can be improved another 20% by changing the
| cache line allocation strategy slightly: replace the load of 28(%edi) by
| a load of 12(%edi) and add a load of 28(%edi) in the middle of the loop.
| The pairing stuff and the nops make little difference.  cache-line
| alignment of the source and target made little difference.
| 
| (*) When memcpy() is run a second time, it is as fast as the fastest
| version as copy()!

I didn't quite follow the reasoning, since it would write the contents
of 12(%edi) into 28(%edi)?!?

I mailed Bruce about this directly, but haven't seen a response yet...
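
In the meantime, here is one possible reading of the change, written
out in C so the loads are easier to see.  This is my guess at the
intent, not Bruce's code and not confirmed by him.  It assumes the
loads from the destination are there only to pull the destination
cache line in before the stores (my understanding is that neither the
486 nor the P5 allocates a line on a write miss, and that the 486 uses
16 byte lines where the P5 uses 32 byte lines), so the loaded value is
thrown away rather than stored anywhere.  The function name and the
restriction to lengths that are a multiple of 32 are mine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes, 32 bytes per iteration, reading one word from each
 * 16 byte half of the destination block before storing into it.
 * Assumes len is a multiple of 32 and the buffers do not overlap.
 */
static void copy_touch_dst(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t nblocks = len / 32;
    size_t i, j;

    for (i = 0; i < nblocks; i++) {
        /* volatile keeps the compiler from dropping the dummy loads */
        volatile const uint32_t *touch = dst;

        (void)touch[3];             /* like a load of 12(%edi) */
        for (j = 0; j < 4; j++)     /* first 16 bytes */
            dst[j] = src[j];

        (void)touch[7];             /* like a load of 28(%edi) */
        for (j = 4; j < 8; j++)     /* second 16 bytes */
            dst[j] = src[j];

        dst += 8;
        src += 8;
    }
}
```

If that reading is right, nothing from 12(%edi) ever ends up in
28(%edi); the two loads just touch both halves of the 32 byte block,
which would matter on a CPU with 16 byte lines and not on one with
32 byte lines.  Whether that accounts for the 20% is something only
measurement on the actual hardware can say.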


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.