From owner-freebsd-hackers  Mon Apr 15 14:06:02 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id OAA06648
          for hackers-outgoing; Mon, 15 Apr 1996 14:06:02 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id OAA06633
          for <hackers@FreeBSD.ORG>; Mon, 15 Apr 1996 14:05:57 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA09539; Mon, 15 Apr 1996 14:01:51 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199604152101.OAA09539@phaeton.artisoft.com>
Subject: Re: Pentium fast copy?
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Date: Mon, 15 Apr 1996 14:01:51 -0700 (MST)
Cc: peter@nmti.com, hackers@FreeBSD.ORG
In-Reply-To: <199604140416.NAA11976@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Apr 14, 96 01:46:21 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

[ ... Lai/Baker Pentium bcopy notes ... ]

> Most of the implementations that have been thrown around here tend to 
> average at about 40M/sec.  There have been some significantly faster under
> certain specialised circumstances, but they tend to perform poorly on older
> processors, or they fall foul of some of the caching policies imposed 
> by some motherboards, or they impose extra overhead elsewhere in the
> system (the most common complaint is that context-switches have to become
> more complex to handle the technique).
> 
> I'd be really interested to see what sort of hardware they're using that
> has 160M/sec of memory bandwidth.  Unless they're running 100% static
> RAM, I suspect they've never actually implemented their code on a practical 
> scale 8(

I suspect that a large part of the performance comes from ensuring that
the source and target addresses have the same quadword alignment (for the
8-byte floating point copy) or the same dword alignment (for the integer
register cache line prefetch method).
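
To make the alignment condition concrete, here is a minimal sketch in
portable C. It is not the Lai/Baker code and not anything from the
FreeBSD tree; the names same_alignment() and copy_aligned8() are made up
for illustration, and a plain 64-bit integer move stands in for the
8-byte floating point load/store. The point is only that the wide loop
can run when source and target share the same offset within an 8-byte
unit; otherwise the sketch falls back to a byte copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: nonzero when dst and src have the same offset
 * within an n-byte unit.  n must be a power of two (4 for dword,
 * 8 for quadword).
 */
static int same_alignment(const void *dst, const void *src, size_t n)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & (n - 1)) == 0;
}

/*
 * Hypothetical copy routine: uses 8-byte moves only when both pointers
 * can be brought to an 8-byte boundary at the same time.  Regions must
 * not overlap.
 */
static void *copy_aligned8(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (same_alignment(d, s, 8)) {
        /* Head: byte copy until both pointers sit on an 8-byte boundary. */
        while (len > 0 && ((uintptr_t)d & 7) != 0) {
            *d++ = *s++;
            len--;
        }
        /* Body: one aligned 8-byte move per iteration. */
        while (len >= 8) {
            uint64_t w;
            memcpy(&w, s, sizeof w);   /* compilers turn this into a load */
            memcpy(d, &w, sizeof w);   /* and this into a store */
            d += 8;
            s += 8;
            len -= 8;
        }
    }

    /* Tail, or the whole buffer when the alignments differ. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

When the alignments differ, every wide access on one side or the other
would be misaligned, which is why the sketch does not attempt the wide
loop in that case.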

The 160M/sec number is exceedingly optimistic.  I would not expect us to
see that performance until we can enable the alignment trap on a P5/P6
and fail unaligned memory accesses, as a decent RISC chip does, without
causing the kernel to panic.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.