From owner-freebsd-hackers Sun Sep 8 15:07:36 1996 Return-Path: owner-hackers Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id PAA28220 for hackers-outgoing; Sun, 8 Sep 1996 15:07:36 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id PAA28212 for ; Sun, 8 Sep 1996 15:07:27 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id PAA03751; Sun, 8 Sep 1996 15:05:30 -0700 From: Terry Lambert Message-Id: <199609082205.PAA03751@phaeton.artisoft.com> Subject: Re: namei performance (was Re: FreeBSD vs. Linux 96 (my impressions)) To: michaelh@cet.co.jp (Michael Hancock) Date: Sun, 8 Sep 1996 15:05:30 -0700 (MST) Cc: terry@lambert.org, hackers@FreeBSD.ORG In-Reply-To: from "Michael Hancock" at Sep 8, 96 06:47:58 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-hackers@FreeBSD.ORG X-Loop: FreeBSD.org Precedence: bulk > > How would it compare to a SYSV lookuppn() which works component by > > component? Namei can handle components up to mount boundaries requiring > > less calls. > > I should clarify this. Namei allows the file system dependent layer make > the decision to go component by component or by components up to a mount > point. nfs_lookup requires component by component, but presumably > ufs_lookup can operate on chunks of the pathname up to a mount point. I understood the reference; I've been into UNIX FS code for about 5 years now. 8-). Like I said in my previous posting, I think that cache locality in the component name cache would more than make up for this. There is also the issue of multibyte support at the FS level, if you want to get technical about it. If the path seperator is literal, then each FS which wants to eat components will have to have an understanding of seperator processing, etc. This has the unfortunate side effect of locking you into a single path representation geometry. This fails for VFAT/VFAT32/HPFS/NTFS, all of which have "long names" and 8.3 names for each file. For the Mac FS, it fails because of the 32 character "short" Mac file name and the ProDOS name on each file, and for the fact that the seperator character in the file name and out is a null, not a '/' (you can hide this difference by translating between '/' and ':' in the UNIX stored name for Mac clients of the FS). > Isn't the biggest problem, with the namei way of doing things, the lock > put on the directory? A directory will be locked between operations, even > if the caller blocks. The directory is locked even for reads which > serializes access to the directory. Actually, it's possible to only write-lock the directory block for the next traversal while you block the process. This requires a sliding block window of two blocks. As long as you guarantee that the block itself will be internally consistent when you block, this will not prevent reader traversal. The only real reason for imposing locks at this point is to enable kernel reentrancy, probably by another processor in the SMP case, but maybe also for kernel preemptability for kernel multithreading and/or RT scheduling support. > It would be interesting to see if using multiple-reader/single writer > locks would improve performance here. Actually, it probably would not. Multiple-reader/single-writer locks would block reading while a write was in progress. Since a read of a directory block is a snapshot of a directory block in a consistent state copied to user space, then there is no reason to block the reader, since the readers will always get a copy of a consistent state for the block. This type of locking is generally more useful where the operations on the state do not idempotently move from one consistent state to another; for instance, in an update of two or more files in a source tree. The actual locking methodology that probaly wants to be used is seven state: no lock, R (read), W (write), X (exclusive), IR (intention read), IW, and IX. Use of intention modes means that it is posible to avoid collisions -- it is much better to use collision avoidance rather than collision detection, since it means you don't need to use exception code to support the unwinding of complex state in case of a collision. The state unwinding issues are also the rationale behind moving to single entry/single exit, and dividing the kernel functions into the categories: o Expects no lock to be held o Expects one or more locks to be held o Asserts lock (expects lock is not held) o Deasserts lock (expects lock is held) The issues of kernel preemption for RT scheduling, kernel preemption for kernel multithreading, and kernel reentrancy on a state context for SMP support, are all topologically identical problems. The only difference is that the SMP case must be able to synchronize state between processors. This means the SMP case is the limiting case, and pursuit of it will buy you all your other cases (they are subsets of the SMP problem). So the current SMP approach is probably the best approach for getting maximum functionality without needing a redesign for each integration. In any case, seperating stack state and global state in the lookup and similar places in the FS is a necessary step in increasing concurrency. The stack state must be seperated because it includes processor locality, and prevents use of superscalar systems (there is at least one German company talking about 1024 processor PPC systems). It even has a noticible effect at the Intel processor level; the current practical limits of 4 Intel processors are because of most operating systems synchronizing access to the system bus, and in particular, memory. The sole exception is Sequent's OS, and they haven't dealt with the FS issues (this is easy to prove using two shells and two find commands). Even then, there is an inherent APIC address limit of 5 bits -- 32 processors -- for Intel systems. Hopefully we can learn from other peoples successes as well as their failures. A good idea is good or bad independent of its originator. 8-). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.