From owner-freebsd-hackers  Thu Oct  3 06:57:38 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id GAA28117
          for hackers-outgoing; Thu, 3 Oct 1996 06:57:38 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id GAA28092;
          Thu, 3 Oct 1996 06:57:22 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id OAA26130; Thu, 3 Oct 1996 14:54:58 +0100
Date: Thu, 3 Oct 1996 14:54:56 +0100 (BST)
From: Doug Rabson <dfr@render.com>
To: dyson@freebsd.org
cc: phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr,
        freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead
In-Reply-To: <199610031312.IAA00602@dyson.iquest.net>
Message-ID: <Pine.BSF.3.95.961003144401.10204Q-100000@minnow.render.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Thu, 3 Oct 1996, John S. Dyson wrote:

> > 
> > On the subject of saving memory, I firmly believe that significant
> > performance improvements can be made just by reducing the memory footprint
> > of algorithms.  In our 3D graphics work, we have found that making
> > important data structures fit into cache lines (and using an aligning
> > allocator to make sure that they start on cache line boundaries) can
> > improve performance by as much as 20%.
> > 
> The pmap code is a perfect example of that.  There are times that I have
> "improved" the code, and noted a net slowdown, because it has grown.
> Soon, I intend to chop another 1-2k out of pmap.o.  Smaller is
> definitely better sometimes.

You may find that increasing the size of struct pv_entry to 32 bytes and
arranging get_pv_entry to return new pv_entries on 32-byte boundaries will
improve performance for operations that traverse pmaps which contain a
large number of entries.  Making structures like this fit cleanly into
cache lines reduces the average number of cache misses needed to access a
large quantity of data.
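
To make the padding and alignment idea concrete, here is a minimal
user-space sketch in C11.  The names pv_entry_padded, pv_pool_alloc and
CACHE_LINE are made up for illustration, the field layout is not the real
struct pv_entry, and aligned_alloc() merely stands in for whatever
get_pv_entry would do inside the kernel.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* aligned_alloc, free */

#define CACHE_LINE 32  /* assumed line size: 8 words of 4 bytes */

/* Hypothetical stand-in for struct pv_entry; NOT the kernel's layout. */
struct pv_entry_padded {
    struct pv_entry_padded *pv_next;
    void *pv_pmap;
    uintptr_t pv_va;
    /* Pad the structure out to exactly one cache line. */
    unsigned char pad[CACHE_LINE - 2 * sizeof(void *) - sizeof(uintptr_t)];
};

static_assert(sizeof(struct pv_entry_padded) == CACHE_LINE,
              "entry must fill exactly one cache line");

/*
 * Allocate n entries with the first one on a cache line boundary.
 * Because each entry is exactly CACHE_LINE bytes, every later entry in
 * the block starts on a boundary too.  Call with n > 0.
 */
static struct pv_entry_padded *pv_pool_alloc(size_t n)
{
    /* aligned_alloc wants a size that is a multiple of the alignment;
     * n * CACHE_LINE always is. */
    return aligned_alloc(CACHE_LINE, n * sizeof(struct pv_entry_padded));
}
```

With this layout no entry straddles two lines, so touching one entry costs
at most one cache miss; release the block with free() when done.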

If, in addition, you arrange those functions to access the struct pv_entry
sequentially from start to end, you will benefit from the fact that the
Pentium reads the 8 words of a cache line sequentially after a cache miss
and makes each word available to instructions as soon as it is read, i.e.
you can use the first couple of words in the cache line while the
processor reads the rest.  Looking at pmap_remove_entry(), it seems to do
this already, but you can only benefit from it if the structure starts on
a cache line boundary.
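
As a rough illustration of "access from start to end", here is a small C
sketch of a list walk that reads each node's fields in declaration order.
The struct pv_node type and count_matches() are hypothetical and are not
taken from pmap.c; note also that an optimizing compiler is free to
reorder the loads, so source order is only a hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node; fields are declared in the order they are used. */
struct pv_node {
    struct pv_node *next;  /* lowest address, read first */
    uintptr_t va;          /* read second */
    uint32_t flags;        /* read last */
};

/* Count the nodes whose va matches and which have a bit of mask set. */
static size_t count_matches(const struct pv_node *head,
                            uintptr_t va, uint32_t mask)
{
    size_t hits = 0;
    const struct pv_node *p = head;

    while (p != NULL) {
        /* Read the fields in ascending address order, so the first
         * words of the line are the first ones needed. */
        const struct pv_node *next = p->next;
        uintptr_t v = p->va;
        uint32_t f = p->flags;

        if (v == va && (f & mask) != 0)
            hits++;
        p = next;
    }
    return hits;
}
```

Putting the link pointer first also means the address of the next node is
known as early as possible during the line fill.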

--
Doug Rabson, Microsoft RenderMorphics Ltd.	Mail:  dfr@render.com
						Phone: +44 171 734 3761
						FAX:   +44 171 734 6426