From owner-freebsd-hackers Fri Sep 6 11:10:11 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id LAA17279 for hackers-outgoing; Fri, 6 Sep 1996 11:10:11 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id LAA17226 for ; Fri, 6 Sep 1996 11:09:47 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id LAA11517; Fri, 6 Sep 1996 11:02:06 -0700
From: Terry Lambert
Message-Id: <199609061802.LAA11517@phaeton.artisoft.com>
Subject: Re: FreeBSD vs. Linux 96 (my impressions)
To: koshy@india.hp.com (A JOSEPH KOSHY)
Date: Fri, 6 Sep 1996 11:02:05 -0700 (MST)
Cc: terry@lambert.org, jkh@time.cdrom.com, jehamby@lightside.com, imp@village.org, lada@ws2301.gud.siemens.co.at, dennis@etinc.com, hackers@FreeBSD.org
In-Reply-To: <199609060508.AA079396512@fakir.india.hp.com> from "A JOSEPH KOSHY" at Sep 6, 96 10:08:31 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> tl> This is mostly because the BSD namei() interface is a piece of shit no
> tl> one seems prepared to allow a change to because there are one or two
> tl> CSRG hackers locked in a closet somewhere, and every once in a while
> tl> they shove something out under the door, and God Forbid we lose out
> tl> on the ability to integrate those occasional changes.
>
> On another point, I did some basic kernel profiling while doing some
> assorted operations (make kernel, find | cpio -O /dev/null) etc.
>
> Surprisingly `namei' turned out to be the single biggest contributor to
> time spent in the kernel.

I can understand the find -- a more balanced benchmark would be to run
a DOS/Windows client running the Ziff-Davis NetBench suite (DiskMix)
against a BSD server.  Alternatively, there are several commercial
suites which would cost you several hundred dollars to acquire (or
more than that to rewrite).

The LM/Bench stuff is not much better than the find, since it biases
FS operations toward directory ops -- the single biggest FS usage is
read calls, then writes, then directory ops, then all other ops.

On my own system, profiling shows that the single biggest cost is data
copies, by about a factor of 5:1 over all other sources of delay.  You
can fix this somewhat by picking an optimal bcopy() implementation per
processor in the uiomove() code (the first sketch below shows the
idea).  The uiomove() code is also needlessly complex, in order to
support the "struct fileops" abstraction (deadfs -- unnecessary;
specfs -- replaced by devfs so as not to use fileops; and pipes --
should be implemented in an unexported FS name space).

You can get about 2% by cleaning up the relative root code, at the
cost of having to specify a relative root vnode in all cases, by
inheriting the root at process creation from the fork()ing process.
This only means that you have to set the root for the init process,
something you do for its current directory anyway.

The namei() call tends to copy the path string around, and so is a big
offender; this is correctable with a couple of interface changes.  The
nameifree() change drops it about 10%, for instance, by moving the
alloc/free operations to the same API level and reducing the extra
testing that has to go on everywhere in the error cases.
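To illustrate the per-processor bcopy() selection mentioned above,
here is a minimal user-space sketch, not FreeBSD's actual uiomove()
code; the cpu_class, bcopy_i586, and copy_init names are made up.  The
point is that the copy routine is chosen once at boot, so the hot path
pays a single indirect call and no per-copy branching:

#include <stddef.h>
#include <string.h>

typedef void (*copy_fn_t)(const void *src, void *dst, size_t len);

static void
bcopy_generic(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);		/* portable fallback */
}

static void
bcopy_i586(const void *src, void *dst, size_t len)
{
	/* would be a pipeline-tuned copy on a real Pentium */
	memmove(dst, src, len);
}

static copy_fn_t cpu_bcopy = bcopy_generic;

/* called once at boot, after the CPU has been identified */
void
copy_init(int cpu_class)
{
	if (cpu_class == 5)		/* hypothetical Pentium class */
		cpu_bcopy = bcopy_i586;
}

/* the uiomove()-style hot path: one indirect call, no tests */
void
copy_to_user(const void *kern, void *user, size_t len)
{
	cpu_bcopy(kern, user, len);
}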
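The root-inheritance change is small; a sketch, with an illustrative
fdinfo structure standing in for the real per-process file descriptor
state.  fork() copies one more vnode pointer, and init gets its root
set explicitly the same way its current directory already is:

struct vnode;				/* opaque here */

struct fdinfo {
	struct vnode *fd_cdir;		/* current directory */
	struct vnode *fd_rdir;		/* root directory, now always set */
};

/* at fork(): the child inherits both vnodes from its parent */
void
fd_inherit(struct fdinfo *child, const struct fdinfo *parent)
{
	child->fd_cdir = parent->fd_cdir;
	child->fd_rdir = parent->fd_rdir;	/* the one new assignment */
}

/* at boot: init's root is set explicitly, exactly as its cwd is */
void
fd_init_proc0(struct fdinfo *fd, struct vnode *rootvp)
{
	fd->fd_cdir = rootvp;
	fd->fd_rdir = rootvp;
}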
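The nameifree() point is about symmetry: if the layer that allocates
the path buffer is also the layer that frees it, the lookup internals
never have to free anything on their error paths.  A minimal sketch,
with hypothetical names (nameibuf, namei_alloc, namei_free):

#include <stdlib.h>
#include <string.h>

struct nameibuf {
	char	*nb_path;	/* private copy of the path */
};

static int
lookup(struct nameibuf *nb)
{
	(void)nb;
	return (0);		/* stub: the real walk happens here */
}

int
namei_alloc(struct nameibuf *nb, const char *path)
{
	nb->nb_path = strdup(path);
	return (nb->nb_path == NULL ? -1 : 0);
}

void
namei_free(struct nameibuf *nb)
{
	free(nb->nb_path);
	nb->nb_path = NULL;
}

int
do_lookup(const char *path)
{
	struct nameibuf nb;
	int error;

	if (namei_alloc(&nb, path) != 0)
		return (-1);
	error = lookup(&nb);	/* error or not, cleanup is identical */
	namei_free(&nb);
	return (error);
}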
Changing the path string into a pre-parsed list of path components is
about another 6% win, and you can get another 8% by putting in the
change to not go through with the allocation on a preexisting element.

This complicates the parsing of symbolic links, since it means you
have to loop-unroll the mutual recursion (which is how symbolic links
are currently implemented).  You also have to reduce kernel stack
usage to get there -- another reason for a pre-parsed list that is not
allocated on the stack.  (Sketches of the component list and the
unrolled symlink loop follow below.)

Moving a lot of the flag-based complexity out of VOP_LOOKUP will
flatten the function call graph and save another 8% in the non-failure
case, as well as making the code less subject to misimplementation, by
moving it out of the per-FS VOP_LOOKUP code.  For instance, the
directory name cache code wants to be in the common lookup code
instead of the per-FS lookup code.  You would use a per-FS-instance
(vfsstruct) flag to enable/disable the six or so cache conditions
(create/delete/negative cache, etc.).

The union FS would have to be expanded to include cache information
for its inferior FSs -- basically an issue for the FS layers which fan
out 1:N mappings.

Finally, presorting the function vector list at the time you register
the FS allows you to change the indirect function references for the
VOP_* vnode_if.c calls into macro references, which throws out the
additional stack and function call overhead of simply using the VOP
interface at all (push-call-push-call-ret-pop-ret-pop simply
decomposes to push-call-ret-pop).  This is only about 1% for the
VOP_LOOKUP, but ends up being about 7% overall in the Ziff-Davis
benchmarks.
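To make the pre-parsed component list concrete, a minimal user-space
sketch -- the struct and function names are made up; the real change
would live in namei()/lookup().  The path is scanned for '/' exactly
once, up front, instead of at every directory level:

#include <stdio.h>
#include <string.h>

struct component {
	const char *name;
	size_t      len;
};

/* split "usr/share/misc" into {"usr","share","misc"}; returns count */
size_t
parse_path(const char *path, struct component *comp, size_t max)
{
	size_t n = 0;

	while (*path != '\0' && n < max) {
		while (*path == '/')		/* skip separators */
			path++;
		if (*path == '\0')
			break;
		comp[n].name = path;
		comp[n].len = strcspn(path, "/");
		path += comp[n].len;
		n++;
	}
	return (n);
}

int
main(void)
{
	struct component comp[32];
	size_t i, n = parse_path("usr/share/misc", comp, 32);

	for (i = 0; i < n; i++)
		printf("%.*s\n", (int)comp[i].len, comp[i].name);
	return (0);
}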
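And a sketch of the unrolled symlink recursion: a bounded loop that
re-enters the walk with the link target, instead of lookup() and
namei() calling each other once per link.  resolve_one() here is a
hypothetical stand-in for the per-component work, and the constants
are illustrative:

#include <stddef.h>

#define MAXSYMLINKS	32
#define ELOOP_ERR	(-2)

/* returns 0 done, 1 "hit a symlink, new path in buf", <0 error */
static int
resolve_one(char *buf, size_t bufsize)
{
	(void)buf; (void)bufsize;
	return (0);		/* stand-in: pretend nothing is a link */
}

int
resolve(char *buf, size_t bufsize)
{
	int loops, ret;

	for (loops = 0; loops < MAXSYMLINKS; loops++) {
		ret = resolve_one(buf, bufsize);
		if (ret <= 0)		/* finished, or a real error */
			return (ret);
		/* ret == 1: buf now holds the link target; iterate */
	}
	return (ELOOP_ERR);	/* too many links: fail flat, no unwind */
}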
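Finally, a sketch of the vector-to-macro point (the names and slot
numbers are illustrative, not the actual vnode_if.c layout): once the
vector is presorted into fixed slots at FS registration time, the
wrapper function's extra stack frame can be expanded away at the call
site, which is exactly the push-call-ret-pop collapse above.

struct vop_args;			/* opaque argument block */

typedef int (*vop_t)(struct vop_args *);

struct vnode {
	vop_t	*v_ops;			/* presorted vector, fixed slots */
};

#define VOFF_LOOKUP	0		/* slot fixed at registration */

/* before: a real function, one extra frame per VOP call */
int
vop_lookup_wrapper(struct vnode *vp, struct vop_args *ap)
{
	return (vp->v_ops[VOFF_LOOKUP](ap));
}

/* after: the same dispatch, expanded in line at the call site */
#define VOP_LOOKUP(vp, ap)	((vp)->v_ops[VOFF_LOOKUP](ap))

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.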