From owner-freebsd-fs Mon Dec 6 11:26:52 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 7C49614C93; Mon, 6 Dec 1999 11:26:29 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id OAA10187; Mon, 6 Dec 1999 14:26:20 -0500 (EST) Date: Mon, 6 Dec 1999 13:13:18 -0500 (EST) From: Zhihui Zhang To: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org Subject: ELF & putting inode at the front of a file Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I have modified FFS filesystem code to put the disk inode at the beginning of a file, i.e, the logical block #0 of each file begins with 128 bytes of its disk inode and the rest of it are file data. Everything seems to be working. But I am stuck with an ELF executable file stored in this layout - I can not run it. The kernel uses memory map to read the ELF file and assumes that the file data begins at offset 0. I have looked at the files kern_exec.c and imgact_elf.c trying to adjust the header pointers by an offset of 128 bytes to at least let the kernel recognize that it is an ELF file. But still I got messages like "too few PT_LOAD segments". Obviously, I need to modify the kernel files elsewhere, perhaps those under directory contrib/rtld-elf/*, which I have never read before. My questions are: (1) What consequences will my file layout affect the load and execution of an ELF file? Do I have to adjust the virtual addresses in the ELF object file as well? (2) If I modify any files under contrib/rtld-elf, how to make the modifications take effect. Is that as simple as "make" and followed by "make install". I am new to these kernel stuff. Any help or hints are very appreciated. 
-Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 12:11:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from ns1.yes.no (ns1.yes.no [195.204.136.10]) by hub.freebsd.org (Postfix) with ESMTP id C80C614D0F for ; Mon, 6 Dec 1999 12:11:15 -0800 (PST) (envelope-from eivind@bitbox.follo.net) Received: from bitbox.follo.net (bitbox.follo.net [195.204.143.218]) by ns1.yes.no (8.9.3/8.9.3) with ESMTP id VAA09840 for ; Mon, 6 Dec 1999 21:11:14 +0100 (CET) Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id VAA09903 for fs@freebsd.org; Mon, 6 Dec 1999 21:11:13 +0100 (MET) Date: Mon, 6 Dec 1999 21:11:12 +0100 From: Eivind Eklund To: fs@freebsd.org Subject: NDFREE patches / architecture change Message-ID: <19991206211112.I8056@bitbox.follo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org A patchset is available at http://www.freebsd.org/~eivind/namei-boots-with-NDFREE.patch In fact, it doesn't only boot - it also gets through 'make world' to the point where my testbox runs out of diskspace. NFS is not tested at all yet; I don't have any NFS setup at all. I'd appreciate assistance in that area (which, as I write this, seems to be coming up from Peter Wemm - thanks Peter!) This patch give the following changes to the calling conventions of the VFS: (1) Freeing of the pathname buffer (nd.ni_cnd.cn_pnbuf) is no longer done by the individual filesystems, but rather by the caller. (2) NDFREE() becomes necessary for each NDINIT()/namei() pair, even those without SAVESTART or SAVENAME in the flags, as filesystems that set the SAVENAME flag no longer frees their own pathname buffers. (3) HASBUF is expected to be cleared when you free the path name buffer (NFS already does this, the rest of the code is sloppy about it.) (4) VOP_ABORTOP() dies. 
I am killing this because there is now *no* code in it, and an unused code formality is not going to be adhered to correctly. Thus, I see it as pointless to keep it around, and suggest we instead re-introduce it when/if a use for it shows up. At that point, I believe the old diffs will be as useful as calls left around in the code would have been.

I elected to implement a full NDFREE instead of a more limited free for just the path name buffer. This is implemented as follows (I'm extracting plain code, as I don't expect everybody to read the patches, even though I wish they would ;-)

#define NDF_NO_DVP_RELE        0x00000001
#define NDF_NO_DVP_UNLOCK      0x00000002
#define NDF_NO_DVP_PUT         0x00000003
#define NDF_NO_VP_RELE         0x00000004
#define NDF_NO_VP_UNLOCK       0x00000008
#define NDF_NO_VP_PUT          0x0000000c
#define NDF_NO_STARTDIR_RELE   0x00000010
#define NDF_NO_FREE_PNBUF      0x00000020
#define NDF_ONLY_PNBUF         (~NDF_NO_FREE_PNBUF)

#define NDFREE(ndp, flags) do { \
        struct nameidata *_ndp = (ndp); \
        unsigned int _flags = (flags); \
        \
        if (!(_flags & NDF_NO_FREE_PNBUF) && \
            (_ndp->ni_cnd.cn_flags & HASBUF)) { \
                zfree(namei_zone, _ndp->ni_cnd.cn_pnbuf); \
                _ndp->ni_cnd.cn_flags &= ~HASBUF; \
        } \
        if (!(_flags & NDF_NO_DVP_UNLOCK) && \
            (_ndp->ni_cnd.cn_flags & LOCKPARENT)) \
                VOP_UNLOCK(_ndp->ni_dvp, 0, _ndp->ni_cnd.cn_proc); \
        if (!(_flags & NDF_NO_DVP_RELE) && \
            (_ndp->ni_cnd.cn_flags & (LOCKPARENT|WANTPARENT))) { \
                vrele(_ndp->ni_dvp); \
                _ndp->ni_dvp = NULL; \
        } \
        if (!(_flags & NDF_NO_VP_RELE) && \
            _ndp->ni_vp) { \
                vrele(_ndp->ni_vp); \
                _ndp->ni_vp = NULL; \
        } \
        if (!(_flags & NDF_NO_STARTDIR_RELE) && \
            (_ndp->ni_cnd.cn_flags & SAVESTART)) { \
                vrele(_ndp->ni_startdir); \
                _ndp->ni_startdir = NULL; \
        } \
} while (0)

As you can see, this takes a series of flags to suppress various parts of the free - allowing any of the return fields to be turned into an output for the function using namei(), and hopefully letting us collapse large amounts of boilerplate code that is presently around (or at least
not write more of it.) The NDF_ONLY_PNBUF is intended as a hack to make it easy to convert legacy code. New code should not use it. My next step will be to hit the locking code with improved assertions, as discussed previously. Eivind. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 12:25:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from acl.lanl.gov (acl.lanl.gov [128.165.147.1]) by hub.freebsd.org (Postfix) with ESMTP id 0CF8A14F99; Mon, 6 Dec 1999 12:25:46 -0800 (PST) (envelope-from rminnich@lanl.gov) Received: from mini.acl.lanl.gov (root@mini.acl.lanl.gov [128.165.147.34]) by acl.lanl.gov (8.8.8/8.8.5) with ESMTP id NAA480510; Mon, 6 Dec 1999 13:25:46 -0700 (MST) Received: from localhost (rminnich@localhost) by mini.acl.lanl.gov (8.9.3/8.8.8) with ESMTP id NAA20227; Mon, 6 Dec 1999 13:25:46 -0700 X-Authentication-Warning: mini.acl.lanl.gov: rminnich owned process doing -bs Date: Mon, 6 Dec 1999 13:25:46 -0700 (MST) From: "Ronald G. Minnich" X-Sender: rminnich@mini.acl.lanl.gov To: freebsd-fs@FreeBSD.ORG Cc: freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Zhihui Zhang wrote: > I have modified FFS filesystem code to put the disk inode at the beginning > of a file, i.e, the logical block #0 of each file begins with 128 bytes of > its disk inode and the rest of it are file data. first question I have is, why? 
ron To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 12:47:44 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 36EC215484; Mon, 6 Dec 1999 12:45:28 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id PAA04476; Mon, 6 Dec 1999 15:44:24 -0500 (EST) Date: Mon, 6 Dec 1999 14:31:22 -0500 (EST) From: Zhihui Zhang To: "Ronald G. Minnich" Cc: freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Ronald G. Minnich wrote: > On Mon, 6 Dec 1999, Zhihui Zhang wrote: > > I have modified FFS filesystem code to put the disk inode at the beginning > > of a file, i.e, the logical block #0 of each file begins with 128 bytes of > > its disk inode and the rest of it are file data. > > first question I have is, why? I am doing some research on filesystem. I guess it may be faster to put the disk inode with its file data together so that both can be read into memory in one I/O. 
-Zhihui

From owner-freebsd-fs Mon Dec 6 12:58:28 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 4523C15904; Mon, 6 Dec 1999 12:56:20 -0800 (PST) (envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id MAA72561; Mon, 6 Dec 1999 12:56:14 -0800 (PST) (envelope-from dillon)
Date: Mon, 6 Dec 1999 12:56:14 -0800 (PST)
From: Matthew Dillon
Message-Id: <199912062056.MAA72561@apollo.backplane.com>
To: "Ronald G. Minnich"
Cc: freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: ELF & putting inode at the front of a file
References:
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

:On Mon, 6 Dec 1999, Zhihui Zhang wrote:
:> I have modified FFS filesystem code to put the disk inode at the beginning
:> of a file, i.e, the logical block #0 of each file begins with 128 bytes of
:> its disk inode and the rest of it are file data.
:
:first question I have is, why?
:
:ron

Good god, is he joking? Offsetting the entire file by 128 bytes will break mmap() and make I/O extremely inefficient.

Many filesystems over the years have mixed meta-data in the file data blocks on disk only to remove it later on when it was found to destroy performance. A good example of this is the Amiga's filesystem. The Amiga's old filesystem was eminently recoverable because each data block had a backpointer, but it was so inefficient due to all the copying required that the updated filesystem removed the metadata so disk blocks could be DMA'd directly into the buffer.
-Matt Matthew Dillon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 13:33:40 1999 Delivered-To: freebsd-fs@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 93A641532A; Mon, 6 Dec 1999 13:33:35 -0800 (PST) (envelope-from dillon@apollo.backplane.com) Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id NAA72851; Mon, 6 Dec 1999 13:33:33 -0800 (PST) (envelope-from dillon) Date: Mon, 6 Dec 1999 13:33:33 -0800 (PST) From: Matthew Dillon Message-Id: <199912062133.NAA72851@apollo.backplane.com> To: Zhihui Zhang Cc: "Ronald G. Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file References: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org :On Mon, 6 Dec 1999, Ronald G. Minnich wrote: : :> On Mon, 6 Dec 1999, Zhihui Zhang wrote: :> > I have modified FFS filesystem code to put the disk inode at the beginning :> > of a file, i.e, the logical block #0 of each file begins with 128 bytes of :> > its disk inode and the rest of it are file data. :> :> first question I have is, why? : :I am doing some research on filesystem. I guess it may be faster to put :the disk inode with its file data together so that both can be read into :memory in one I/O. : :-Zhihui Not really. The inode tends to wind up being cached by the system longer then file data, so placing it with the file data will not help -- since it is already probably cached, the system generally doesn't have to read it off the disk more then once anyway, and in a heavily loaded system the system caching is sufficiently detached from the file data processing that it is actually more beneficial to group inodes together (one disk read is able to cache many inodes all in one go). 
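[Illustration of the grouping argument above.] A rough way to put numbers on "one disk read is able to cache many inodes": the sketch below assumes the 128-byte disk inode quoted in this thread and, as an assumption, an 8192-byte filesystem block. It then compares how many block reads a cold-cache stat() sweep over N files in one directory would need if the inodes sit packed together versus one per file-data block. The function names are made up for this note, and the model deliberately ignores caching, read-ahead and seek cost.

```c
#include <assert.h>

enum {
    DINODE_SIZE   = 128,   /* on-disk inode size quoted in this thread */
    FS_BLOCK_SIZE = 8192   /* ASSUMED filesystem block size */
};

/* How many whole disk inodes one block read brings into memory. */
static unsigned long inodes_per_block(void)
{
    return FS_BLOCK_SIZE / DINODE_SIZE;
}

/* Block reads needed to stat nfiles files when their inodes are packed
 * contiguously (best case for the traditional layout). */
static unsigned long reads_grouped(unsigned long nfiles)
{
    unsigned long ipb = inodes_per_block();

    assert(ipb > 0);
    return (nfiles + ipb - 1) / ipb;    /* round up */
}

/* Block reads needed when every inode lives at the front of its own
 * file's block 0: one read per file. */
static unsigned long reads_embedded(unsigned long nfiles)
{
    return nfiles;
}
```

Under these assumptions a sweep over 1000 files costs 16 reads with grouped inodes and 1000 reads with embedded ones, which is the effect being described for stat-heavy commands.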
Another problem is that things like 'ls -la' or 'find' have to stat files and if you put the inode at the beginning of the file you essentially distribute the inodes all over the cylinder group rather then concentrate all the inodes in one place. p.s. I was wrong about it breaking mmap() - in fact offseting the file data on-disk will not break mmap(). But it will produce unaligned disk transfers and potentially extra seeking. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 14: 1:19 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id C6CDF14A05; Mon, 6 Dec 1999 14:01:16 -0800 (PST) (envelope-from julian@whistle.com) Received: from current1.whiste.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with ESMTP id OAA73739; Mon, 6 Dec 1999 14:01:12 -0800 (PST) Date: Mon, 6 Dec 1999 14:01:11 -0800 (PST) From: Julian Elischer To: Zhihui Zhang Cc: "Ronald G. Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org how do you find the inode? On Mon, 6 Dec 1999, Zhihui Zhang wrote: > > On Mon, 6 Dec 1999, Ronald G. Minnich wrote: > > > On Mon, 6 Dec 1999, Zhihui Zhang wrote: > > > I have modified FFS filesystem code to put the disk inode at the beginning > > > of a file, i.e, the logical block #0 of each file begins with 128 bytes of > > > its disk inode and the rest of it are file data. > > > > first question I have is, why? > > I am doing some research on filesystem. I guess it may be faster to put > the disk inode with its file data together so that both can be read into > memory in one I/O. 
> > -Zhihui > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 14: 6:42 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 5EB9D14A05; Mon, 6 Dec 1999 14:06:24 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id RAA24839; Mon, 6 Dec 1999 17:06:16 -0500 (EST) Date: Mon, 6 Dec 1999 15:53:14 -0500 (EST) From: Zhihui Zhang Reply-To: Zhihui Zhang To: Matthew Dillon Cc: "Ronald G. Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: <199912062133.NAA72851@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Matthew Dillon wrote: > :On Mon, 6 Dec 1999, Ronald G. Minnich wrote: > : > :> On Mon, 6 Dec 1999, Zhihui Zhang wrote: > :> > I have modified FFS filesystem code to put the disk inode at the beginning > :> > of a file, i.e, the logical block #0 of each file begins with 128 bytes of > :> > its disk inode and the rest of it are file data. > :> > :> first question I have is, why? > : > :I am doing some research on filesystem. I guess it may be faster to put > :the disk inode with its file data together so that both can be read into > :memory in one I/O. > : > :-Zhihui > > Not really. 
The inode tends to wind up being cached by the system > longer then file data, so placing it with the file data will not > help -- since it is already probably cached, the system generally doesn't > have to read it off the disk more then once anyway, and in a heavily > loaded system the system caching is sufficiently detached from the file > data processing that it is actually more beneficial to group inodes > together (one disk read is able to cache many inodes all in one go). I have read some papers. People have put disk inode with its file data. For small files, they can be read into memory with one I/O. In a distributed filesystem, if a inode block (contains 8192/128 inodes) is shared by multiple clients, it will hurt performance. One such paper may be "A 64-bit, shared disk file system for Linux" available at http://www.globalfilesystem.org/Pages/gfspapers.html. They call it "stuffed dinode". My understanding could be wrong though. > Another problem is that things like 'ls -la' or 'find' have to stat files > and if you put the inode at the beginning of the file you essentially > distribute the inodes all over the cylinder group rather then concentrate > all the inodes in one place. Yes. I have implemented most of the code. I noticed the "ls -al" is slow but "ls" is OK. > p.s. I was wrong about it breaking mmap() - in fact offseting the file > data on-disk will not break mmap(). But it will produce unaligned disk > transfers and potentially extra seeking. Yes. The cp command may use mmap(). I modify the mmap() so that this works. But this mmap() is done by the user, I can intercept it at the mmap() system call. As I said in my original email, the kernel uses memory map internally to load an ELF object file. I have to let the kernel know that there is a disk inode at the beginning of the ELF object file. It is hard for me to identify what part of the code is affected and to what extent. I think there should be a way. 
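[Illustration of the ELF problem described above.] The ELF header sits at byte 0 of the file image, and the offsets stored inside it (for example the program header table offset, and each segment's file offset) are counted from the start of that image. So with a 128-byte inode in front, it is not enough to move the header pointer; every stored offset has to be shifted as well, which is one plausible reading of the "too few PT_LOAD segments" symptom. The sketch below only demonstrates the offset arithmetic on a toy buffer. All names are hypothetical and nothing here is taken from kern_exec.c or imgact_elf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DINODE_SIZE 128   /* inode size quoted in this thread */

/* Return 1 if the ELF magic (0x7f 'E' 'L' 'F') is at byte `off` of buf. */
static int has_elf_magic_at(const unsigned char *buf, size_t len, size_t off)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (off > len || len - off < sizeof(magic))
        return 0;
    return memcmp(buf + off, magic, sizeof(magic)) == 0;
}

/* Translate an offset stored in the ELF image (relative to the first
 * byte of file data) into an offset in a block stream that starts with
 * the embedded inode. */
static size_t data_to_block_offset(size_t data_off)
{
    return data_off + DINODE_SIZE;
}

/* Build a toy "logical block 0": zero bytes standing in for the inode,
 * followed by the ELF magic where the file data begins. */
static void make_toy_block(unsigned char *buf, size_t len)
{
    assert(len >= DINODE_SIZE + 4);
    memset(buf, 0, len);
    memcpy(buf + DINODE_SIZE, "\177ELF", 4);
}
```

A cleaner alternative to patching each consumer would be for the filesystem to hide the inode bytes, so that offset 0 as seen by read and by the VM pager is already the first data byte; whether that fits the modified FFS code is not something this thread settles.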
-Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 14: 8:14 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 4258514A05; Mon, 6 Dec 1999 14:07:55 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id RAA25757; Mon, 6 Dec 1999 17:07:47 -0500 (EST) Date: Mon, 6 Dec 1999 15:54:44 -0500 (EST) From: Zhihui Zhang To: Julian Elischer Cc: "Ronald G. Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Julian Elischer wrote: > how do you find the inode? There is an inode address map to look up. Each entry is four bytes. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 14:13:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id 805BC14A05; Mon, 6 Dec 1999 14:13:42 -0800 (PST) (envelope-from julian@whistle.com) Received: from current1.whiste.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with ESMTP id OAA74101; Mon, 6 Dec 1999 14:13:39 -0800 (PST) Date: Mon, 6 Dec 1999 14:13:38 -0800 (PST) From: Julian Elischer To: Matthew Dillon Cc: Zhihui Zhang , "Ronald G. 
Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: <199912062133.NAA72851@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Matthew Dillon wrote: > :On Mon, 6 Dec 1999, Ronald G. Minnich wrote: > : > :> On Mon, 6 Dec 1999, Zhihui Zhang wrote: > :> > I have modified FFS filesystem code to put the disk inode at the beginning > :> > of a file, i.e, the logical block #0 of each file begins with 128 bytes of > :> > its disk inode and the rest of it are file data. > :> > :> first question I have is, why? > : > :I am doing some research on filesystem. I guess it may be faster to put > :the disk inode with its file data together so that both can be read into > :memory in one I/O. > : > :-Zhihui > > Not really. The inode tends to wind up being cached by the system > longer then file data, so placing it with the file data will not > help -- since it is already probably cached, the system generally doesn't > have to read it off the disk more then once anyway, and in a heavily > loaded system the system caching is sufficiently detached from the file > data processing that it is actually more beneficial to group inodes > together (one disk read is able to cache many inodes all in one go). > > Another problem is that things like 'ls -la' or 'find' have to stat files > and if you put the inode at the beginning of the file you essentially > distribute the inodes all over the cylinder group rather then concentrate > all the inodes in one place. > > p.s. I was wrong about it breaking mmap() - in fact offseting the file > data on-disk will not break mmap(). But it will produce unaligned disk > transfers and potentially extra seeking. At Usenix 98 there was a paper on puting the inode in ht edirectory entry for files with only one link. 
That DID speed a lot of things up.

Putting the inode in "frag -1" is interesting, but the question remains: how do you find the inode? I presume the directory entry needs to have the actual disk block in it.

>
> -Matt

From owner-freebsd-fs Mon Dec 6 14:20:17 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from acl.lanl.gov (acl.lanl.gov [128.165.147.1]) by hub.freebsd.org (Postfix) with ESMTP id 7BD0A14C3F; Mon, 6 Dec 1999 14:20:14 -0800 (PST) (envelope-from rminnich@lanl.gov)
Received: from mini.acl.lanl.gov (root@mini.acl.lanl.gov [128.165.147.34]) by acl.lanl.gov (8.8.8/8.8.5) with ESMTP id PAA500918; Mon, 6 Dec 1999 15:20:13 -0700 (MST)
Received: from localhost (rminnich@localhost) by mini.acl.lanl.gov (8.9.3/8.8.8) with ESMTP id PAA20531; Mon, 6 Dec 1999 15:20:13 -0700
X-Authentication-Warning: mini.acl.lanl.gov: rminnich owned process doing -bs
Date: Mon, 6 Dec 1999 15:20:13 -0700 (MST)
From: "Ronald G. Minnich"
X-Sender: rminnich@mini.acl.lanl.gov
To: Zhihui Zhang
Cc: freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: ELF & putting inode at the front of a file
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Mon, 6 Dec 1999, Zhihui Zhang wrote:

> I am doing some research on filesystem. I guess it may be faster to put
> the disk inode with its file data together so that both can be read into
> memory in one I/O.

I still don't get it. To get the file, you do a lookup. So the inode is in memory. Then you call the handler for the executable. But the inode is in memory at this point .... what am I missing?
ron

From owner-freebsd-fs Mon Dec 6 14:23:49 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 9E28D14F88; Mon, 6 Dec 1999 14:23:34 -0800 (PST) (envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id XAA24765; Mon, 6 Dec 1999 23:23:15 +0100 (CET) (envelope-from phk@critter.freebsd.dk)
To: "Ronald G. Minnich"
Cc: Zhihui Zhang, freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: ELF & putting inode at the front of a file
In-reply-to: Your message of "Mon, 06 Dec 1999 15:20:13 MST."
Date: Mon, 06 Dec 1999 23:23:15 +0100
Message-ID: <24763.944518995@critter.freebsd.dk>
From: Poul-Henning Kamp
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message , "Ronald G. Minnich" writes:
>On Mon, 6 Dec 1999, Zhihui Zhang wrote:
>> I am doing some research on filesystem. I guess it may be faster to put
>> the disk inode with its file data together so that both can be read into
>> memory in one I/O.
>
>I still don't get it. To get the file, you do a lookup. So the inode is in
>memory. The you call the handler for the executable. But the inode is in
>memory at this point .... what am I missing?

The inode is not likely to be in memory for a news spool or similar.

Only very recently used inodes are in memory, actually. They die with the vnode, which may still die too fast.

Putting the inode with the data saves a little less than one disk access on average per file, which for truly random access filesystems is a good thing.

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 14:30: 7 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 4B8B314FEB; Mon, 6 Dec 1999 14:29:54 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id RAA06876; Mon, 6 Dec 1999 17:29:32 -0500 (EST) Date: Mon, 6 Dec 1999 16:16:32 -0500 (EST) From: Zhihui Zhang To: "Ronald G. Minnich" Cc: freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 6 Dec 1999, Ronald G. Minnich wrote: > On Mon, 6 Dec 1999, Zhihui Zhang wrote: > > I am doing some research on filesystem. I guess it may be faster to put > > the disk inode with its file data together so that both can be read into > > memory in one I/O. > > I still don't get it. To get the file, you do a lookup. So the inode is in > memory. The you call the handler for the executable. But the inode is in > memory at this point .... what am I missing? > When you read the disk inode, the first part of the data of its corresponding file is brought into the memory at the same time. 
-Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 6 16:46:41 1999 Delivered-To: freebsd-fs@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 32C461504B; Mon, 6 Dec 1999 16:46:38 -0800 (PST) (envelope-from dillon@apollo.backplane.com) Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id QAA73815; Mon, 6 Dec 1999 16:46:36 -0800 (PST) (envelope-from dillon) Date: Mon, 6 Dec 1999 16:46:36 -0800 (PST) From: Matthew Dillon Message-Id: <199912070046.QAA73815@apollo.backplane.com> To: Zhihui Zhang Cc: "Ronald G. Minnich" , freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: ELF & putting inode at the front of a file References: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org :> distribute the inodes all over the cylinder group rather then concentrate :> all the inodes in one place. : :Yes. I have implemented most of the code. I noticed the "ls -al" is slow :but "ls" is OK. Yes, ls (without any options) is ok because the file type is now being stuffed in the directory entry, allowing ls (without any options) to avoid stat()ing the file. :> p.s. I was wrong about it breaking mmap() - in fact offseting the file :> data on-disk will not break mmap(). But it will produce unaligned disk :> transfers and potentially extra seeking. : :Yes. The cp command may use mmap(). I modify the mmap() so that this :works. But this mmap() is done by the user, I can intercept it at the :mmap() system call. As I said in my original email, the kernel uses :memory map internally to load an ELF object file. I have to let the kernel :know that there is a disk inode at the beginning of the ELF object file. :It is hard for me to identify what part of the code is affected and to :what extent. I think there should be a way. 
: :-Zhihui There's another issue that you should look at - generally functionally different caches work better as separate entities then as a single entity. In this case it is far easier for the system to cache an inode (or a set of inodes) then it is for the system to cache a data block. If you force a system to cache both at the same time when it only needs one type or the other (because one might already be cached), the result is that neither cache is able to run optimally. It might be interesting, as an exercise, to attempt to pre-cache the inode space in the traditional unmodified system when a directory is read and leave them as separate entities and see whether that gives you the same performance boost. -Matt Matthew Dillon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Dec 7 16:41:24 1999 Delivered-To: freebsd-fs@freebsd.org Received: from ns1.yes.no (ns1.yes.no [195.204.136.10]) by hub.freebsd.org (Postfix) with ESMTP id 8C53814BE6 for ; Tue, 7 Dec 1999 16:41:20 -0800 (PST) (envelope-from eivind@bitbox.follo.net) Received: from bitbox.follo.net (bitbox.follo.net [195.204.143.218]) by ns1.yes.no (8.9.3/8.9.3) with ESMTP id BAA00875 for ; Wed, 8 Dec 1999 01:41:17 +0100 (CET) Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id BAA18610 for fs@FreeBSD.org; Wed, 8 Dec 1999 01:41:15 +0100 (MET) Date: Wed, 8 Dec 1999 01:41:15 +0100 From: Eivind Eklund To: fs@FreeBSD.org Subject: Final call for VOP_ISLOCKED objections Message-ID: <19991208014115.L14851@bitbox.follo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I'd like to commit the changes to VOP_ISLOCKED() I've mentioned here before. 
The patches are at
    http://www.freebsd.org/~eivind/new-types-for-lock.patch
and make the following changes:
* Make VOP_ISLOCKED() and lockstatus() take an extra parameter (process), and return a new constant (LK_EXCLOTHER) if the process parameter is supplied and there is an exclusive lock held by somebody else.
* Extend the ASSERT_VOP_LOCKED/UNLOCKED family with a series of calls to do better checking.
* Change the ASSERT_VOP_UNLOCKED semantics to unlocked-by-this-process, which is more in line with how the code uses it.
* Introduce new (presently unused) lock descriptions in vnode_if.src/vnode_if.sh, allowing precise descriptions WRT shared/exclusive locks.

They do *not* change any behaviour unless the undocumented option DEBUG_VFS_LOCKS is enabled.

I would like to commit these changes tomorrow, for the convenience of getting an environment I can debug the locking system under, as I am rushing to get as much 'stuff' as possible tested and into the system for the 15th.

Eivind.

From owner-freebsd-fs Tue Dec 7 21:53:53 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id 7D67614BD4; Tue, 7 Dec 1999 21:53:49 -0800 (PST) (envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id FAA20090; Wed, 8 Dec 1999 05:53:48 GMT
Date: Wed, 8 Dec 1999 14:53:48 +0900 (JST)
From: Michael Hancock
To: Eivind Eklund
Cc: fs@FreeBSD.ORG
Subject: Re: Final call for VOP_ISLOCKED objections
In-Reply-To: <19991208014115.L14851@bitbox.follo.net>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Eivind,

I think the DEBUG_VFS_LOCKS stuff was temporary debugging infrastructure, put in to debug NFS.
When SMP gets more fine grained I'm not sure how useful they will be. i.e. race conditions between checking the assertion and the protected code.

If the changes to vnode_if.src/vnode_if.sh are just comments then it probably isn't a problem.

Regards,

Mike

On Wed, 8 Dec 1999, Eivind Eklund wrote:

> I'd like to commit the changes to VOP_ISLOCKED() I've mentioned here
> before. The patches are at
> http://www.freebsd/org/~eivind/new-types-for-lock.patch
> and
> * Make VOP_ISLOCKED() and lockstatus() take an extra parameter
>   (process), and return a new constant (LK_EXCLOTHER) if the process
>   parameter is supplied and there is an exclusive lock held by
>   somebody else.
> * Extend the ASSERT_VOP_LOCKED/UNLOCKED family with a series of calls
>   to do better checking.
> * Change the ASSERT_VOP_UNLOCKED semantics to
>   unlocked-by-this-process, which is more in line with how the code
>   uses it.
> * Introduce new (presently unused) lock descriptions in
>   vnode_if.src/vnode_if.sh, allowing precise descriptions WRT
>   shared/exclusive locks.
>
> They do *not* change any behaviour unless the undocumented option
> DEBUG_VFS_LOCKS is enabled.
>
> I would like to commit these changes tomorrow, for the convenience
> reason of getting an environment I can debug the locking system under,
> rushing for getting as much 'stuff' as possible tested and into the
> system for the 15th.
>
> Eivind.
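
Archive note: the first bullet of the proposal (an extra process parameter and a new LK_EXCLOTHER return value) can be pictured with a small userspace model. This is only a sketch of the rule as described in the message. The class, the function name model_lockstatus and the numeric constant values are made up for illustration and are not the kernel's definitions.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative values only; the kernel's real constants are not reproduced here.
UNLOCKED, LK_EXCLUSIVE, LK_SHARED, LK_EXCLOTHER = 0, 1, 2, 3


@dataclass
class ModelLock:
    """Hypothetical stand-in for a lock record."""
    exclusive_count: int = 0          # > 0 while the lock is held exclusively
    share_count: int = 0              # number of shared holders
    holder_pid: Optional[int] = None  # meaningful only while held exclusively


def model_lockstatus(lock: ModelLock, pid: Optional[int] = None) -> int:
    """Report the lock state as seen by process `pid` (None = no process supplied)."""
    if lock.exclusive_count:
        # Without a process argument the caller cannot tell holders apart,
        # so the answer stays "exclusive", as before the proposed change.
        if pid is None or lock.holder_pid == pid:
            return LK_EXCLUSIVE
        # New case: a process was supplied and somebody else holds the lock.
        return LK_EXCLOTHER
    if lock.share_count:
        return LK_SHARED
    return UNLOCKED


held = ModelLock(exclusive_count=1, holder_pid=42)
print(model_lockstatus(held, 42))  # the holder asks about its own lock
print(model_lockstatus(held, 7))   # another process asks
print(model_lockstatus(held))      # no process supplied
```

In this model an "unlocked-by-this-process" assertion would only need to reject the LK_EXCLUSIVE answer for the asking process, which is one plausible reading of the third bullet.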

From owner-freebsd-fs Wed Dec 8 3: 3:37 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from ns1.yes.no (ns1.yes.no [195.204.136.10]) by hub.freebsd.org (Postfix) with ESMTP id 5A3BB14C40 for ; Wed, 8 Dec 1999 03:03:17 -0800 (PST) (envelope-from eivind@bitbox.follo.net)
Received: from bitbox.follo.net (bitbox.follo.net [195.204.143.218]) by ns1.yes.no (8.9.3/8.9.3) with ESMTP id MAA08227; Wed, 8 Dec 1999 12:03:16 +0100 (CET)
Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id MAA20742; Wed, 8 Dec 1999 12:03:16 +0100 (MET)
Date: Wed, 8 Dec 1999 12:03:16 +0100
From: Eivind Eklund
To: Michael Hancock
Cc: fs@FreeBSD.ORG
Subject: Re: Final call for VOP_ISLOCKED objections
Message-ID: <19991208120316.Q14851@bitbox.follo.net>
References: <19991208014115.L14851@bitbox.follo.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: ; from michaelh@cet.co.jp on Wed, Dec 08, 1999 at 02:53:48PM +0900
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, Dec 08, 1999 at 02:53:48PM +0900, Michael Hancock wrote:
> I think the DEBUG_VFS_LOCKS stuff was temporary debugging infrastructure,
> put into debug NFS.

The comments seem to indicate they were intended to debug all locks; and this is what I'm using it for, anyway.

> When SMP gets more fine grained I'm not sure how useful they will
> be. i.e. race conditions between checking the assertion and the
> protected code.

This may become a problem for some of the assertions at some point, yes, but to get to that point I think we will need quite a few other code sweeps to fix assumptions that the kernel is single-threaded.
Before then, I hope to have cleaned up the VFS locking protocols (both use and specification) well enough that the assertions won't be crucial any more.

> If the changes to vnode_if.src/vnode_if.sh are just comments then it
> probably isn't a problem.

The changes to vnode_if.src are just comments (about new available lockspecs). The changes to vnode_if.sh are to take new lockspecs (of which none are yet available) and turn them into assertions in the generated VOP code.

Eivind.

From owner-freebsd-fs Thu Dec 9 11:49:12 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from pmr.com (pmr.pmr.com [216.140.144.138]) by hub.freebsd.org (Postfix) with ESMTP id 4CDA115675 for ; Thu, 9 Dec 1999 11:49:07 -0800 (PST) (envelope-from rbg@pmr.com)
Received: from jeeves (jeeves.pmr.com [207.170.114.16]) by pmr.com (8.9.3/8.9.3) with SMTP id NAA92897 for ; Thu, 9 Dec 1999 13:38:56 -0600 (CST) (envelope-from rbg@pmr.com)
Message-ID: <008401bf427f$81ccb6a0$1072aacf@pmr.com>
Reply-To: "Robert Gordon"
From: "Robert Gordon"
To:
Subject: Clustered Read/writes and NFS..
Date: Thu, 9 Dec 1999 13:56:39 -0600
Organization: PMR
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2314.1300
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hello,

I'm attempting to understand if the NFS implementation takes advantage of clustered read/writes. So far I see that if NFS needs a buf, a call could be made (via getnewbuf()) to vfs_bio_awrite(), which could cause a clustered write to free up some buffers... but I don't see that NFS takes advantage of a clustered read/write....

Thanks,
Robert........................
rbg@pmr.com

From owner-freebsd-fs Fri Dec 10 7:10:53 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 3557015356; Fri, 10 Dec 1999 07:10:39 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id KAA02221; Fri, 10 Dec 1999 10:09:55 -0500 (EST)
Date: Fri, 10 Dec 1999 08:56:47 -0500 (EST)
From: Zhihui Zhang
To: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Why VMIO directory is a bad idea?
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

I read some postings on the Linux-Archive complaining about the slowness of lookups in big directories. Some claim that since a directory file is typically small, advanced techniques such as B+ trees (and, I would add, hashing) are not necessary. We can simply pre-allocate the directory file contiguously and achieve good performance. This makes me wonder whether we can read a directory file into memory and keep it there as long as possible to get good performance. I remember there was a discussion of VMIO directories early this year, and only now am I beginning to understand the idea.

(1) If the directory file is less than one page, there will be a waste of memory due to internal fragmentation. Why do not we set a limit, say one page, on when we start VMIO a directory?

(2) If VMIO directory is not desirable for some reasons, how about bump up the usecount of the buffer used by a directory file to let it stay in the queue longer?

(3) Or maybe we can add a parameter to the filesystem, telling it to try to preallocate some contiguous disk space for all directory files.
I guess that the cost per bit on disk is less than the cost per bit in memory.

Can anyone give me an idea on how big a directory could be in some environment?

Any comments or ideas are appreciated.

-Zhihui

From owner-freebsd-fs Fri Dec 10 7:45: 8 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11]) by hub.freebsd.org (Postfix) with SMTP id C76DA14E18; Fri, 10 Dec 1999 07:45:01 -0800 (PST) (envelope-from dwmalone@maths.tcd.ie)
Received: from hamilton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP id ; 10 Dec 1999 15:45:00 +0000 (GMT)
Date: Fri, 10 Dec 1999 15:44:59 +0000
From: David Malone
To: Zhihui Zhang
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: Why VMIO directory is a bad idea?
Message-ID: <19991210154459.A1034@hamilton.maths.tcd.ie>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0pre3i
In-Reply-To:
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Fri, Dec 10, 1999 at 08:56:47AM -0500, Zhihui Zhang wrote:

> Can anyone give me an idea on how big a directory could be in some
> environment?

Our inn's /news/spool/control/cancel directory is almost 300k. If we were a significantly larger news site we probably wouldn't be running inn though.

David.

From owner-freebsd-fs Fri Dec 10 8:19:37 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id C6CD4152E9; Fri, 10 Dec 1999 08:19:32 -0800 (PST) (envelope-from robert@cyrus.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.9.3/8.9.3) with SMTP id LAA34614; Fri, 10 Dec 1999 11:19:15 -0500 (EST) (envelope-from robert@cyrus.watson.org)
Date: Fri, 10 Dec 1999 11:19:15 -0500 (EST)
From: Robert Watson
X-Sender: robert@fledge.watson.org
Reply-To: Robert Watson
To: David Malone
Cc: Zhihui Zhang, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: Why VMIO directory is a bad idea?
In-Reply-To: <19991210154459.A1034@hamilton.maths.tcd.ie>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Fri, 10 Dec 1999, David Malone wrote:

> On Fri, Dec 10, 1999 at 08:56:47AM -0500, Zhihui Zhang wrote:
>
> > Can anyone give me an idea on how big a directory could be in some
> > environment?
>
> Our inn's /news/spool/control/cancel directory is almost 300k. If
> we were a significantly larger news site we probably wouldn't be
> running inn though.
>

I use the CMU cyrus server to manage mail for my system, and also to handle mailing lists archives. I have an archive of the freebsd-current mailing list that contains around 33,000 messages, and with one file per message the directory size is about 530k. The big hit I see is Cyrus's use of dbm to manage index information, not the cost of directory operations. Typically, due to the way IMAP is usually used, no one reads in all the files, etc.
However, performing an ls is non-trivially expensive:

    # time ls -af > /dev/null
    1.018u 0.241s 0:01.25 100.0% 225+4365k 0+0io 0pf+0w
    1.012u 0.288s 0:01.29 100.0% 225+4352k 0+0io 0pf+0w
    1.053u 0.280s 0:01.33 100.0% 223+4273k 0+0io 0pf+0w
    1.076u 0.253s 0:01.32 100.0% 219+4238k 0+0io 0pf+0w
    1.091u 0.235s 0:01.33 99.2% 224+4273k 0+0io 0pf+0w
    # ls -af | wc -l
    33323

This is under 2.2-STABLE, although I hope to push it to a 3.3-STABLE machine in the near future. This machine is currently a 486 dx2 66 w/24 mb of ram. It will become a Pentium sometime soon, with more memory.

It should be observed, for the benefit of critics, that storing a mailbox in this format is far better from a performance perspective than storing all the messages in a single file :-). But it goes through a bunch of inodes (it almost justifies the default inode allocation on large disks :-), and does have drawbacks in terms of directory size. Because messages are hardly ever removed from this directory, my guess is that its use of the directory space is fairly compact and unfragmented.
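
Archive note: the five timings above can be reduced to a per-entry cost with a little arithmetic. The sketch below uses only the elapsed-time column and the wc -l count quoted in the message; the variable names are invented for this note.

```python
# Figures copied from the message above.
elapsed_seconds = [1.25, 1.29, 1.33, 1.32, 1.33]  # elapsed column of `time`
entries = 33323                                    # `ls -af | wc -l`

mean_elapsed = sum(elapsed_seconds) / len(elapsed_seconds)
per_entry_us = mean_elapsed / entries * 1e6        # microseconds per listed entry

print(f"mean elapsed: {mean_elapsed:.3f} s")
print(f"per entry:    {per_entry_us:.1f} us")
```

That works out to roughly 1.3 seconds per listing, or about 39 microseconds per entry on the 486 described. The 0+0io column suggests these runs did no block I/O, so the figure probably reflects CPU time with the directory already cached rather than disk access.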
Robert N M Watson
robert@fledge.watson.org http://www.watson.org/~robert/
PGP key fingerprint: AF B5 5F FF A6 4A 79 37 ED 5F 55 E9 58 04 6A B1
TIS Labs at Network Associates, Safeport Network Services

From owner-freebsd-fs Fri Dec 10 11:11:21 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id BAC2D15A5C; Fri, 10 Dec 1999 11:08:03 -0800 (PST) (envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id LAA59861; Fri, 10 Dec 1999 11:07:46 -0800 (PST) (envelope-from dillon)
Date: Fri, 10 Dec 1999 11:07:46 -0800 (PST)
From: Matthew Dillon
Message-Id: <199912101907.LAA59861@apollo.backplane.com>
To: Zhihui Zhang
Cc: freebsd-hackers@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG
Subject: Re: Why VMIO directory is a bad idea?
References:
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

:(1) If the directory file is less than one page, there will be a waste of
:memory due to internal fragmentation. Why do not we set a limit, say one
:page, on when we start VMIO a directory?

It is true that when we use VMIO to back directories, a minimum of one page of memory is used even for a small directory. However, unlike B_MALLOC space in the buffer cache, the VM page cache is much better suited to figuring out when cached VM pages can be reused. So even though there is waste, the page may still be reused by the system fairly quickly if need be.

When we use B_MALLOC space in the buffer cache to store a small directory 'efficiently', it tends to get reused too quickly due to the small size of the buffer cache, which results in another physical I/O the next time the directory needs to be accessed.
Given the choice between some wastage (which is less than you think) and having to do another physical I/O, it is clear that the advantage is to keep the waste and avoid the physical I/O.

:(2) If VMIO directory is not desirable for some reasons, how about bump up
:the usecount of the buffer used by a directory file to let it stay in the
:queue longer?

This is how the old algorithm worked. It failed utterly to address the problem and in fact led to a considerable amount of complexity and wasted cpu cycles when the buffer cache became unbalanced (due to excessive write loading or directory scanning loading).

:(3) Or maybe we can add a parameter to the filesystem, telling it to try to
:preallocate some contiguous disk space for all directory files. I guess
:that the cost per bit on disk is less than the cost per bit in memory.

I believe the filesystem already does this.

-Matt
Matthew Dillon

:Can anyone give me an idea on how big a directory could be in some
:environment?
:
:Any comments or ideas are appreciated.
:
:-Zhihui

From owner-freebsd-fs Fri Dec 10 12:17: 4 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id CCCB215AC0; Fri, 10 Dec 1999 12:06:03 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id PAA20176; Fri, 10 Dec 1999 15:05:53 -0500 (EST)
Date: Fri, 10 Dec 1999 13:52:46 -0500 (EST)
From: Zhihui Zhang
To: Matthew Dillon
Cc: freebsd-hackers@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG
Subject: Re: Why VMIO directory is a bad idea?
In-Reply-To: <199912101907.LAA59861@apollo.backplane.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> :(3) Or maybe we can add a parameter to the filesystem, telling it to try to
> :preallocate some contiguous disk space for all directory files. I guess
> :that the cost per bit on disk is less than the cost per bit in memory.
>
> I believe the filesystem already does this.
>

The FFS tries to allocate space contiguously for any type of file. It does not PRE-allocate disk space, which will result in wastage of disk space if that space is not used later.

-Zhihui

From owner-freebsd-fs Fri Dec 10 12:17:31 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by hub.freebsd.org (Postfix) with ESMTP id 0EEEE15919; Fri, 10 Dec 1999 12:13:18 -0800 (PST) (envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id MAA60500; Fri, 10 Dec 1999 12:13:13 -0800 (PST) (envelope-from dillon)
Date: Fri, 10 Dec 1999 12:13:13 -0800 (PST)
From: Matthew Dillon
Message-Id: <199912102013.MAA60500@apollo.backplane.com>
To: Zhihui Zhang
Cc: freebsd-hackers@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG
Subject: Re: Why VMIO directory is a bad idea?
References:
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

:
:> :(3) Or maybe we can add a parameter to the filesystem, telling it to try to
:> :preallocate some contiguous disk space for all directory files. I guess
:> :that the cost per bit on disk is less than the cost per bit in memory.
:>
:> I believe the filesystem already does this.
:>
:
:The FFS tries to allocate space contiguously for any type of file. It
:does not PRE-allocate disk space, which will result in wastage of disk space
:if that space is not used later.
:
:-Zhihui

I'm sorry, I misread that ... I thought he had said 'allocate'. It definitely does not preallocate disk space. FFS is designed to avoid fragmentation, so the blocks that it allocates when appending to a file (or directory) tend to be contiguous.

-Matt
Matthew Dillon

From owner-freebsd-fs Sat Dec 11 2:43:12 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from netcore.fi (netcore.fi [193.94.160.1]) by hub.freebsd.org (Postfix) with ESMTP id 8169514FCA for ; Sat, 11 Dec 1999 02:43:08 -0800 (PST) (envelope-from pekkas@netcore.fi)
Received: from localhost (pekkas@localhost) by netcore.fi (8.9.3/8.9.3) with ESMTP id MAA14166 for ; Sat, 11 Dec 1999 12:43:06 +0200
Date: Sat, 11 Dec 1999 12:43:06 +0200 (EET)
From: Pekka Savola
To: freebsd-fs@freebsd.org
Subject: Deleting a directory on ext2fs crashed the system
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hello all,

Deleting one directory on an ext2fs crashed my FreeBSD 3.4-RC system pretty badly:

    panic: bmemfree: removing a buffer when not in queue
    Syncing disks ... [crash]

After that, I couldn't log in with SSH, or log in from the console (the keyboard didn't seem to function apart from ALT-Fx). NAT'ed connections stayed alive, though, and the system was pingable.

Anyone else seen anything like this?

Btw, are there any good ext2 fsck tools? I'm using the ones from Linux with emulation, but there are some unimplemented system calls or such.

HTH,
Pekka Savola

Btw, I'm not subscribing to the list, so if anything comes up, please CC it to me.

From owner-freebsd-fs Sat Dec 11 14:53:58 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from thelab.hub.org (nat196.191.mpoweredpc.net [142.177.196.191]) by hub.freebsd.org (Postfix) with ESMTP id 3CA2014FAD; Sat, 11 Dec 1999 14:53:52 -0800 (PST) (envelope-from scrappy@hub.org)
Received: from localhost (scrappy@localhost) by thelab.hub.org (8.9.3/8.9.1) with ESMTP id SAA41702; Sat, 11 Dec 1999 18:53:56 -0400 (AST) (envelope-from scrappy@hub.org)
X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
Date: Sat, 11 Dec 1999 18:53:55 -0400 (AST)
From: The Hermit Hacker
To: David Malone
Cc: Zhihui Zhang, freebsd-hackers@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG
Subject: Re: Why VMIO directory is a bad idea?
In-Reply-To: <19991210154459.A1034@hamilton.maths.tcd.ie>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Fri, 10 Dec 1999, David Malone wrote:

> On Fri, Dec 10, 1999 at 08:56:47AM -0500, Zhihui Zhang wrote:
>
> > Can anyone give me an idea on how big a directory could be in some
> > environment?
>
> Our inn's /news/spool/control/cancel directory is almost 300k. If
> we were a significantly larger news site we probably wouldn't be
> running inn though.

If you run newer INN's, you could use a CNFS buffer for control, which would make your directory much smaller...

My directories for ~52gig of news:

    drwxr-xr-x 2 news news 512 Dec 10 12:50 buffer5
    drwxr-xr-x 2 news news 512 Dec 10 10:50 buffer4
    drwxr-xr-x 2 news news 512 Dec 10 10:25 buffer2
    drwxr-xr-x 2 news news 512 Dec 8 07:13 buffer1
    drwxr-xr-x 2 news news 512 Dec 8 07:13 buffer3

And they will never change from 512, as each directory contains but one file...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
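
Archive note: the internal-fragmentation question raised earlier in this thread (a VMIO-backed directory occupies at least one page) can be put into numbers for the directory sizes people reported. The sketch below assumes a 4096-byte page, which is not stated anywhere in the thread, and treats "almost 300k" and "about 530k" as exactly 300 KiB and 530 KiB for the sake of the arithmetic. The function names are invented for this note.

```python
PAGE_SIZE = 4096  # assumption: i386-style page size, not given in the thread


def pages_needed(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Whole pages needed to hold nbytes, with a minimum of one page."""
    return max(1, -(-nbytes // page_size))  # ceiling division


def wasted_bytes(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes of the last page left unused (internal fragmentation)."""
    return pages_needed(nbytes, page_size) * page_size - nbytes


# Sizes taken from the thread; the two large ones are rounded approximations.
for label, size in [("512-byte directory", 512),
                    ("~300k directory", 300 * 1024),
                    ("~530k directory", 530 * 1024)]:
    print(f"{label}: {pages_needed(size)} page(s), {wasted_bytes(size)} bytes unused")
```

Under these assumptions the waste is always less than one page per directory, so it is proportionally large only for small directories such as the 512-byte ones listed above (3584 of 4096 bytes unused) and negligible for the large news and mail directories.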