Date: Fri, 13 Aug 1999 20:50:39 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: zzhang@cs.binghamton.edu (Zhihui Zhang)
Cc: tlambert@primenet.com, phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG
Subject: Re: Help with understand file system performance
Message-ID: <199908132050.NAA17293@usr01.primenet.com>
In-Reply-To: <Pine.GSO.3.96.990812202049.1878A-100000@sol.cs.binghamton.edu> from "Zhihui Zhang" at Aug 12, 99 09:02:32 pm
> On Thu, 12 Aug 1999, Terry Lambert wrote:
>
> > The filesystem block allocation table in directories is unique, in
> > that it is generally used as a convenience for locating physical
> > blocks, rather than using the standard filesystem block access
> > mechanisms, when reading or writing directories.
>
> Directory files have the same on-disk structure as regular files.

Yes.  But they are not accessed internally as if they were regular
files.  The only operation which is treated as "regular" is extending
(and, as of 4.4BSD, truncating back) the block allocations in
directories.

The directory manipulation code treats a directory as a series of
blocks, and translates from the "regular file" aspect into BLKATOFF().

> However, they can never have holes, and they can only be grown at the
> end of the file in device block chunks.  No directory entry can cross
> a device block boundary, to guarantee atomic updates.

Right.  There is no such thing as a "sparse block allocation" in a
directory, since BLKATOFF() assumes the existence of a block.

Directory entries are physically prevented from crossing block
boundaries in order to ensure atomic updates.  But this is an
implementation detail, and it is not the only way one could ensure
atomicity, so long as one were willing to reallocate (filesystem, not
physical) blocks or frags in order to do the updates (i.e. you could
arrange for a two stage commit; I did this in my Unicode FFS prototype,
since even though a 256 character name would fit in 512b, there was no
room left over for the metadata).

> However, I do not know why you say the block map (direct and indirect
> blocks) of a directory is only used as a convenience.  I mean there is
> a need to call VOP_BMAP() on a directory file.  The routine
> ffs_blkatoff() calls bread(), which in turn calls VOP_BMAP().  The
> in-core inode does have several fields to facilitate the insertion of
> new directory entries.  But we still need the block map (block
> allocation table).
Directory manipulations access blocks directly.  You've no doubt
noticed that the vast majority of system calls do _not_ require
VOP_BMAP() calls for copyin/out operations on VM objects backed by the
filesystem.  The need to call VOP_BMAP() is an artifact of treating
directories as lists of blocks, rather than treating them as files.

The "convenience" aspect is that they are files, but they are not used
as such; files are used as the underlying abstraction only because it
is convenient.  Directories are not naturally represented as files,
and in fact, trying to make them conform to the normal file behaviour
would break the atomicity guarantee.

> Directory files are also special in that we can not write into them
> with the write() system call as with normal files.  They use a special
> routine to grow, i.e., ufs_direnter().  By the way, we can use the
> read() system call to read directory files as we do with normal files.

The lack of the ability to write was mirrored by a lack of the ability
to read, as well, until this was intentionally changed.  Likewise,
there was no ability to mmap directories (read only, of course), until
that, too, was changed.  These are both optimizations to speed up
certain programs, and are really antithetical to POSIX.

In reality, if you have looked at the "cookie" code for VOP_READDIR()
in NFS, FFS, and at least one other FS, you will see that the need for
cookies is an artifact of the structure of the interface.  An
alternate interface would allow directory block abstraction separate
from the externalization of directory entries.

The structure that is returned by getdents() is actually only
coincidentally (albeit intentionally so) the same as the FFS on-disk
structure.  See the 4.3/4.4 compatibility translation code in
VOP_READDIR() in the FFS implementation.
The upshot of this is that the ability to read or mmap directories for
reading is actually a very bad thing, from an interface perspective,
since it promotes the writing of code that depends on data format
interfaces.  This is similar to the use of the KVM as a data interface.

It is only coincidental, based on implementation (unintentionally so,
this time), that the POSIX access time updates for files and the
access time updates for directories (which POSIX mandates for
getdents() operations) happen to coincide.

If you look at the cookie mess, and the NFS server code wire format
translation mess, I'm sure you will agree.  You only need to ask
yourself "how could NFS handle a VOP_READDIR() that came from an
underlying FS that could pack more entries in a block than could be
represented in a block in the external 'coincidental' format?" to
prove to yourself that this is broken.

> > There are a number of performance penalties for this, especially
> > on large directories, where it is not possible to trigger sequential
> > readahead through use of the getdents() system call sequentially
> > accessing sequential 512b/physical_block_size extents.
>
> I do not understand this.  The read-ahead mechanism should work on any
> file.  I thought the reorganization of directory entries within a
> directory block when you delete an entry is an inefficiency.
>
> Does this issue have anything to do with the VMIO directory issue
> discussed earlier this year?

No.  It has to do with VOP_READDIR() not exhibiting behaviour which
would trigger read-ahead, such as is triggered by READ, WRITE,
GETPAGES, and PUTPAGES.

> > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1).
> >
> > The only case where 1024 bytes of physical disk would be used is at
> > a filesystem block size of 8192 (or greater), which, divided by 8,
> > gives 1024b (or greater).
>
> I did not realize this before.  The maximum ratio is 8.
> So if the filesystem block is 8192, the allocation unit (fragment
> size) can not be 512, because 8192/512 > 8.

Yes.  There are only 8 bits available for representing frag
allocations.

> > This is called an encapsulated two stage commit, in database terms.
> >
> > For inodes, indirect blocks, and directory entry blocks, there is
> > no two stage commit, because there is no indirection of their data
> > contents.
>
> I guess you mean that their data are not managed by any higher level
> metadata which must be updated together.

Yes.  Despite the fact that "higher level" metadata exists, since the
implementation detail is that directories are stored using "files",
the actual implementation does not take advantage of this, either for
triggering read-ahead, or for encapsulated commits of directory
modifications, or for clustering (which could only occur on a restore
from an archive, given the incremental nature of directory entries),
or for any of a dozen other speed enhancements which are applied to
normal files.  This means that directories are, by their nature,
rather slow.

> Thanks for your help.
>
> -Zhihui

Any time.  8-).  It's an interesting discussion to engage in; there
are interesting solutions (not implemented in FreeBSD) to many of the
performance issues that people raise against the FFS.

The last time this issue came up that I remember had to do with
depth-first creation and breadth-first traversal of the ports
directory structure; I actually still maintain that this is a problem
in the creation of the directory (i.e. the organization of the
archive) more than it is a problem with the FS itself (a tool is only
as good as the craftsman using it).  If used properly, there really
aren't a lot of performance problems that you can point to (sort of
like cutting with vs. against the grain in a board).
I am becoming convinced that an intermediate abstraction is really
what is called for, to turn the bottom end into what is, in effect,
nothing more than a flat, numeric namespace on top of a variable
granularity block store.

A nice topic for much research... 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message