Date:      Thu, 12 Aug 1999 23:14:05 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        phk@critter.freebsd.dk (Poul-Henning Kamp)
Cc:        zzhang@cs.binghamton.edu, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG
Subject:   Re: Help with understand file system performance
Message-ID:  <199908122314.QAA23506@usr04.primenet.com>
In-Reply-To: <1404.934469109@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 12, 99 04:45:09 pm

Poul-Henning Kamp writes:
> Zhihui Zhang writes:
> >
> >> According to Poul-Henning Kamp:
> >> > Yes.  The minimum directory size is the fragsize of the filesystem,
> >> 
> >> I'm afraid it is not the case...
> >> 
> >> 216 [13:35] root@tara:/src# ll
> >> total 5
> >> drwxr-xr-x   2 roberto  staff   512 Sep 26  1998 CVS/
> >>                                 ^^^
> >The fsize is the number of bytes in a fragment.  Even if your file is 1
> >byte, that file needs 1024 bytes to store.  However, the byte count is
> >still one byte.  In your example, the byte count is 512 bytes.
> 
> Yeah, well, the real issue is if the UFS implementation works on
> the 512 bytes size of the fragsize.

Poul's right.  More particularly, there are two concepts here:

1)	File system block size

2)	Directory entry block size


The directory entry block size is a physical disk block.  This is
intentional for the purposes of atomicity of directory entry block
updates.  In point of fact, the code is incapable of dealing with
anything other than BLKATOFF()-type semantics.
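
A minimal sketch of what "directory entry block" means in practice.
It assumes that entries are laid out so that none straddles a 512b
boundary, which is what per-block atomicity requires.  The names
DIR_BLOCK_SIZE, dir_block_start() and entry_fits() are made up for
this illustration and are not the kernel's own identifiers:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed directory entry block size: one physical disk block. */
enum { DIR_BLOCK_SIZE = 512 };

/* Byte offset at which the directory block containing `offset` starts. */
static size_t dir_block_start(size_t offset)
{
    return offset - (offset % DIR_BLOCK_SIZE);
}

/*
 * True if an entry of `reclen` bytes placed at `offset` lies entirely
 * inside one directory block, so that writing it back touches exactly
 * one physical block.
 */
static bool entry_fits(size_t offset, size_t reclen)
{
    if (reclen == 0 || reclen > DIR_BLOCK_SIZE)
        return false;
    return dir_block_start(offset) == dir_block_start(offset + reclen - 1);
}
```

With these definitions, a 12 byte entry at offset 500 fits (it ends
at byte 511), while the same entry at offset 504 would cross into the
next block and is rejected.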


Directories are files.  This is an implementation detail, and the
wording of POSIX specifically distances itself from the concept
that directories and files are the same primitive object.  This is
probably an attempt to allow VMS, NT, and NetWare filesystems to
claim POSIX compliance.

The filesystem block allocation table in directories is unique, in
that it is generally used as a convenience for locating physical
blocks, rather than using the standard filesystem block access
mechanisms, when reading or writing directories.

There are a number of performance penalties for this, especially
on large directories, where it is not possible to trigger sequential
readahead through use of the getdents() system call sequentially
accessing sequential 512b/physical_block_size extents.


There also appears to be a misunderstanding about frags here:

> >> drwxr-xr-x   2 roberto  staff   512 Sep 26  1998 CVS/
> >>                                 ^^^
> >The fsize is the number of bytes in a fragment.  Even if your file is 1
> >byte, that file needs 1024 bytes to store.  However, the byte count is
> >still one byte.  In your example, the byte count is 512 bytes.

The frag size is, by default, 1/8 of the filesystem block size.

For a filesystem block size of 4096, the frag size is 512b, which
is the physical block size on most media (e.g. most everything that
you might have an FFS on, with the exception of Japanese
magneto-optical and some Japanese winchester disk drives).

The number of frags per block can be tuned down below this (i.e.
to a frag size of 1/4, 1/2, or 1 times the filesystem block size).

The only case where 1024 bytes of physical disk would be used is at
a filesystem block size of 8192 (or greater), which, divided by 8,
gives 1024b (or greater).

In this case, the directory entry structure size is... still the
physical device block size, or 512b.
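
The arithmetic above can be sketched in a few lines of C.  This is
only an illustration: PHYS_BLOCK_SIZE, frag_size(), space_used() and
dir_block_size() are names invented here, and space_used() only
models a small file that is stored entirely in frags:

```c
#include <stddef.h>

/* Assumed physical device block size (512b on most media). */
enum { PHYS_BLOCK_SIZE = 512 };

/*
 * Frag size for a given filesystem block size and number of frags
 * per block (8 by default; 4, 2 and 1 are the other choices).
 */
static size_t frag_size(size_t fs_block_size, unsigned frags_per_block)
{
    return fs_block_size / frags_per_block;
}

/*
 * Disk space taken by a small file of `nbytes` bytes: the byte count
 * rounded up to a whole number of frags.  The byte count reported by
 * ls is still `nbytes`.
 */
static size_t space_used(size_t nbytes, size_t fsize)
{
    if (nbytes == 0)
        return 0;
    return ((nbytes + fsize - 1) / fsize) * fsize;
}

/*
 * The directory entry block size does not depend on either tunable;
 * it stays at the physical device block size.
 */
static size_t dir_block_size(size_t fs_block_size, unsigned frags_per_block)
{
    (void)fs_block_size;
    (void)frags_per_block;
    return PHYS_BLOCK_SIZE;
}
```

So frag_size(4096, 8) is 512 and frag_size(8192, 8) is 1024; a 1 byte
file takes 512 bytes of disk in the first case and 1024 in the second,
while dir_block_size() is 512 in both.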


As an exercise for the reader, try implementing a directory entry
block size in excess of 512b (e.g. 1024b, in an attempt to support
both 8.3 names and 256 character Unicode names for files).

The problem you will encounter is that the physical disk only
guarantees atomicity at the block I/O level.


Soft Updates allow this to work for file contents, but inodes are
still 128 bytes (smaller than one physical device block) and
directory entry blocks are still 512b (equal to or smaller than the
physical device block size).  There aren't really structures to allow
for an encapsulated update of these objects to occur, to allow them
to exceed the physical device block size, yet remain atomic.

What happens at the inode data contents level is that new blocks
are allocated, given the new content for the region, and verified
to be written to disk; only then is the direct block list in the
inode, or the direct block list of an indirect block pointed to
by the inode or by another indirect block, updated.

This means that if a crash occurs before the block list is modified,
the old contents remain, in their entirety, and if a crash occurs
after the block list is modified, the new contents are there, in
their entirety, because the data is verified on disk before the
update occurs.

This is called an encapsulated two stage commit, in database terms.
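
A toy, in-memory sketch of that ordering follows.  It is not the
Soft Updates code: struct toy_disk and the toy_*() functions are
invented for this illustration, a single integer stands in for one
slot of the block list, and the "verified on disk" step is assumed
rather than performed:

```c
#include <string.h>

enum { BLKSZ = 512, NBLOCKS = 4 };

/* Toy disk: a few data blocks plus one block-list slot. */
struct toy_disk {
    char blocks[NBLOCKS][BLKSZ];
    int  block_ptr;   /* which block currently holds the contents */
    int  next_free;   /* next unallocated block */
};

static void toy_init(struct toy_disk *d, const char *contents)
{
    memset(d, 0, sizeof *d);
    strncpy(d->blocks[0], contents, BLKSZ - 1);
    d->block_ptr = 0;
    d->next_free = 1;
}

/*
 * Replace the contents.  If crash_before_commit is nonzero, simulate
 * a crash after stage 1 but before stage 2.  Returns 0 when the
 * update was committed, -1 otherwise.
 */
static int toy_update(struct toy_disk *d, const char *contents,
                      int crash_before_commit)
{
    int nb;

    if (d->next_free >= NBLOCKS)
        return -1;
    nb = d->next_free++;

    /* Stage 1: fill a newly allocated block; the old block is untouched. */
    memset(d->blocks[nb], 0, BLKSZ);
    strncpy(d->blocks[nb], contents, BLKSZ - 1);
    /* (a real implementation would confirm the write reached disk here) */

    if (crash_before_commit)
        return -1;

    /* Stage 2: one small store switches the block list to the new block. */
    d->block_ptr = nb;
    return 0;
}

static const char *toy_read(const struct toy_disk *d)
{
    return d->blocks[d->block_ptr];
}
```

After toy_init(&d, "old"), a call to toy_update(&d, "new", 1) leaves
toy_read(&d) returning "old" in full, and toy_update(&d, "new", 0)
leaves it returning "new" in full; there is no state in which a
reader sees a mix of the two.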

For inodes, indirect blocks, and directory entry blocks, there is
no two stage commit, because there is no indirection of their data
contents.


Hope this sets things straight in your mind (not you, Poul, I know
you already understand it 8-)).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message



