From owner-freebsd-hackers  Sun Sep  8 15:07:36 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id PAA28220
          for hackers-outgoing; Sun, 8 Sep 1996 15:07:36 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id PAA28212
          for <hackers@FreeBSD.ORG>; Sun, 8 Sep 1996 15:07:27 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id PAA03751; Sun, 8 Sep 1996 15:05:30 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199609082205.PAA03751@phaeton.artisoft.com>
Subject: Re: namei performance (was Re: FreeBSD vs. Linux 96 (my impressions))
To: michaelh@cet.co.jp (Michael Hancock)
Date: Sun, 8 Sep 1996 15:05:30 -0700 (MST)
Cc: terry@lambert.org, hackers@FreeBSD.ORG
In-Reply-To: <Pine.SV4.3.93.960908181852.8929B-100000@parkplace.cet.co.jp> from "Michael Hancock" at Sep 8, 96 06:47:58 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> > How would it compare to a SYSV lookuppn() which works component by
> > component?  Namei can handle components up to mount boundaries requiring
> > less calls.
> 
> I should clarify this.  Namei allows the file system dependent layer make
> the decision to go component by component or by components up to a mount
> point.  nfs_lookup requires component by component, but presumably
> ufs_lookup can operate on chunks of the pathname up to a mount point.

I understood the reference; I've been into UNIX FS code for about 5 years
now.  8-).

Like I said in my previous posting, I think that cache locality in the
component name cache would more than make up for this.

There is also the issue of multibyte support at the FS level, if you
want to get technical about it.  If the path seperator is literal, then
each FS which wants to eat components will have to have an understanding
of seperator processing, etc.  This has the unfortunate side effect of
locking you into a single path representation geometry.

This fails for VFAT/VFAT32/HPFS/NTFS, all of which have "long names"
and 8.3 names for each file.  For the Mac FS, it fails because of the
32 character "short" Mac file name and the ProDOS name on each file,
and for the fact that the seperator character in the file name and out
is a null, not a '/' (you can hide this difference by translating
between '/' and ':' in the UNIX stored name for Mac clients of the FS).

> Isn't the biggest problem, with the namei way of doing things, the lock
> put on the directory?  A directory will be locked between operations, even
> if the caller blocks.  The directory is locked even for reads which
> serializes access to the directory.

Actually, it's possible to only write-lock the directory block for the
next traversal while you block the process.  This requires a sliding
block window of two blocks.

As long as you guarantee that the block itself will be internally
consistent when you block, this will not prevent reader traversal.

The only real reason for imposing locks at this point is to enable
kernel reentrancy, probably by another processor in the SMP case, but
maybe also for kernel preemptability for kernel multithreading and/or
RT scheduling support.


> It would be interesting to see if using multiple-reader/single writer
> locks would improve performance here.

Actually, it probably would not.  Multiple-reader/single-writer locks
would block reading while a write was in progress.  Since a read of a
directory block is a snapshot of a directory block in a consistent
state copied to user space, then there is no reason to block the reader,
since the readers will always get a copy of a consistent state for the
block.

This type of locking is generally more useful where the operations on
the state do not idempotently move from one consistent state to another;
for instance, in an update of two or more files in a source tree.


The  actual locking methodology that probaly wants to be used is seven
state: no lock, R (read), W (write), X (exclusive), IR (intention read),
IW, and IX.

Use of intention modes means that it is posible to avoid collisions --
it is much better to use collision avoidance rather than collision
detection, since it means you don't need to use exception code to
support the unwinding of complex state in case of a collision.  The
state unwinding issues are also the rationale behind moving to single
entry/single exit, and dividing the kernel functions into the categories:

o	Expects no lock to be held
o	Expects one or more locks to be held
o	Asserts lock (expects lock is not held)
o	Deasserts lock (expects lock is held)


The issues of kernel preemption for RT scheduling, kernel preemption for
kernel multithreading, and kernel reentrancy on a state context for SMP
support, are all topologically identical problems.  The only difference
is that the SMP case must be able to synchronize state between processors.
This means the SMP case is the limiting case, and pursuit of it will buy
you all your other cases (they are subsets of the SMP problem).  So the
current SMP approach is probably the best approach for getting maximum
functionality without needing a redesign for each integration.

In any case, seperating stack state and global state in the lookup and
similar places in the FS is a necessary step in increasing concurrency.
The stack state must be seperated because it includes processor locality,
and prevents use of superscalar systems (there is at least one German
company talking about 1024 processor PPC systems).  It even has a
noticible effect at the Intel processor level; the current practical
limits of 4 Intel processors are because of most operating systems
synchronizing access to the system bus, and in particular, memory.  The
sole exception is Sequent's OS, and they haven't dealt with the FS issues
(this is easy to prove using two shells and two find commands).  Even
then, there is an inherent APIC address limit of 5 bits -- 32 processors --
for Intel systems.

Hopefully we can learn from other peoples successes as well as their
failures.  A good idea is good or bad independent of its originator.
8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.