Date: Thu, 12 Aug 1999 23:14:05 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: phk@critter.freebsd.dk (Poul-Henning Kamp)
Cc: zzhang@cs.binghamton.edu, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG
Subject: Re: Help with understand file system performance
Message-ID: <199908122314.QAA23506@usr04.primenet.com>
In-Reply-To: <1404.934469109@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 12, 99 04:45:09 pm

Poul-Henning Kamp writes:
> Zhihui Zhang writes:
> >
> >> According to Poul-Henning Kamp:
> >> > Yes.  The minimum directory size is the fragsize of the filesystem,
> >>
> >> I'm afraid it is not the case...
> >>
> >> 216 [13:35] root@tara:/src# ll
> >> total 5
> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment.  Even if your file is 1
> >byte, that file needs 1024 bytes to store.  However, the byte count is
> >still one byte.  In your example, the byte count is 512 bytes.
>
> Yeah, well, the real issue is if the UFS implementation works on
> the 512 bytes size of the fragsize.

Poul's right.

More particularly, there are two concepts here:

1)	File system block size
2)	Directory entry block size

The directory entry block size is a physical disk block.  This is
intentional, for the purposes of atomicity of directory entry block
updates.  In point of fact, the code is incapable of dealing with
anything other than BLKATOFF()-type semantics.

Directories are files.  This is an implementation detail, and the
wording of POSIX specifically distances itself from the concept that
directories and files are the same primitive object.  This is probably
an attempt to allow VMS, NT, and NetWare filesystems to claim POSIX
compliance.

The filesystem block allocation table in directories is unique, in
that it is generally used as a convenience for locating physical
blocks, rather than using the standard filesystem block access
mechanisms, when reading or writing directories.

There are a number of performance penalties for this, especially on
large directories, where it is not possible to trigger sequential
readahead through use of the getdents() system call sequentially
accessing sequential 512b/physical_block_size extents.

There also appears to be a misunderstanding about frags here:

> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment.  Even if your file is 1
> >byte, that file needs 1024 bytes to store.  However, the byte count is
> >still one byte.  In your example, the byte count is 512 bytes.

The frag size is, by default, 1/8 of the filesystem block size.  For
a filesystem block size of 4096, the frag size is 512b, which is the
physical block size on most media (e.g. most everything that you
might have an FFS on, except Japanese magneto-optical and some
Japanese winchester disk drives).

The frag size can be tuned down below this (i.e. 1/4, 1/2, 1).

The only case where 1024 bytes of physical disk would be used is at
a filesystem block size of 8192 (or greater), which, divided by 8,
gives 1024b (or greater).

In this case, the directory entry structure size is... still the
physical device block size, or 512b.

As an exercise for the reader, try implementing a directory entry
block size in excess of 512b (e.g. 1024b, in an attempt to support
both 8.3 names and 256 character Unicode names for files).

The problem you will encounter is that the physical disk only
guarantees atomicity at the block I/O level.  Soft Updates allow
this to work for file contents, but inodes are still 128 bytes
(sub 1 physical device block) and directory entry blocks are still
512b (equal to or sub the physical device block size).

There aren't really structures to allow for an encapsulated update
of these objects to occur, to allow them to exceed the physical
device block size, yet remain atomic.

What happens at the inode data contents level is that new blocks
are allocated, given the new content for the region, verified that
they are written to disk, and then the direct block list in the
inode, or the direct block list of an indirect block pointed to by
the inode or by another indirect block, is updated.

This means that if a crash occurs before the block list is modified,
the old contents remain, in their entirety, and if a crash occurs
after the block list is modified, then, because the data is verified
on disk before the update occurs, the new contents are there, in
their entirety.

This is called an encapsulated two stage commit, in database terms.

For inodes, indirect blocks, and directory entry blocks, there is
no two stage commit, because there is no indirection of their data
contents.

Hope this sets things straight in your mind (not you, Poul, I know
you already understand it 8-)).

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message
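
The encapsulated two stage commit described above for file data can be
sketched as a toy model.  This is only an illustration, not the FFS or
Soft Updates code: ToyDisk, ToyInode, read_file and update_block are
hypothetical names invented here, the "disk" is a Python dict, and a
single block write is simply assumed to be atomic and durable once it
returns.

```python
class ToyDisk:
    """Hypothetical in-memory stand-in for a disk: block number -> bytes."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def alloc(self):
        # Hand out a block number that no block list refers to yet.
        blkno = self.next_free
        self.next_free += 1
        return blkno

    def write(self, blkno, data):
        # Assumption: one block write is atomic and on "disk" on return.
        self.blocks[blkno] = bytes(data)

    def read(self, blkno):
        return self.blocks[blkno]


class ToyInode:
    """Hypothetical inode that holds only a direct block list."""

    def __init__(self):
        self.direct = []


def read_file(disk, inode):
    # File contents are whatever the block list currently points at.
    return b"".join(disk.read(b) for b in inode.direct)


def update_block(disk, inode, index, new_data, crash_before_commit=False):
    # Stage 1: write the new contents into a freshly allocated block.
    new_blk = disk.alloc()
    disk.write(new_blk, new_data)
    if crash_before_commit:
        # Simulated crash: the block list was never modified.
        return
    # Stage 2: commit by repointing the block list entry.
    inode.direct[index] = new_blk


disk = ToyDisk()
inode = ToyInode()
first = disk.alloc()
disk.write(first, b"old contents")
inode.direct.append(first)

# Crash between the two stages: the old contents remain, in their entirety.
update_block(disk, inode, 0, b"new contents", crash_before_commit=True)
after_crash = read_file(disk, inode)

# Both stages complete: the new contents are there, in their entirety.
update_block(disk, inode, 0, b"new contents")
after_commit = read_file(disk, inode)

print(after_crash, after_commit)
```

The indirection through the block list is what makes this work.  An
inode or a directory entry block has no such pointer to swap, so in
this model it would have to be overwritten in place, which is why it
has to fit within a single atomic block write.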