Date: Wed, 25 Mar 2009 23:50:31 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jan Mikkelsen <janm@transactionware.com>
Cc: freebsd-fs@freebsd.org, Julian Elischer <julian@elischer.org>
Subject: Re: Trying to understand how aio(4), mount(8) async (and gjournal) relate
Message-ID: <20090325223213.P35996@delplex.bde.org>
In-Reply-To: <49C97A6F.70204@transactionware.com>
References: <200903231733.51671.mel.flynn%2Bfbsd.fs@mailing.thruhere.net>
    <49C7C45B.7040708@elischer.org>
    <20090324224001.D1670@besplex.bde.org>
    <49C97A6F.70204@transactionware.com>
On Wed, 25 Mar 2009, Jan Mikkelsen wrote:

> [Jumping into a conversation on aio, async mounts, etc.]
>
> I have had a few questions for a while that I haven't asked yet; these seems
> like an appropriate time to ask them!
>
> Is it reasonable to open a file with O_FSYNC and then use aio_write() to
> issue multiple writes, and then assume that the data is on disk when the aio
> completes?

I know very little about aio, but looking at the sources seems to show
that O_FSYNC (or mounting with the sync option) just defeats the
asyncness of aio.  aio seems to use only fo_write() for writing, so at
lower (file system) levels, O_FSYNC has the same behaviour as for
write(2) -- it syncs the i/o at the end of the call in the usual case
where fo_write = vn_write.

> Can I get I/O parallelism using this approach?

Apparently not.

> I recall reading (some time
> ago) that FreeBSD doesn't do I/O parallelism on a single file descriptor.  Is
> that true?  Do I need to open the file multiple times in order to get I/O
> parallelism?

The fs part of vn_write() is serialized, now using the exclusive vnode
lock.  The code is essentially:

 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 	error = VOP_WRITE(...);	/* this soon reaches foofs_write() */
 	VOP_UNLOCK(vp);

In the usual case without O_FSYNC, foofs should try to only schedule the
i/o (by writing it to the buffer cache and not waiting), so that the
actual i/o is done in parallel later.  However, foofs might need to do
some physical input in order to tell where to write (e.g., reading
indirect block(s)) or some physical output of metadata needed for
consistency (e.g., writing indirect blocks), and any such i/o is
serialized.  (I think most file systems avoid writing to the inode on
every foofs_write(), though not doing so requires tricks to maintain
consistency.
No tricks seem to be available for indirect blocks, so ffs without soft
updates always writes them synchronously (except in my version where the
async mount option actually works for indirect blocks).)

O_FSYNC should cause almost all writes related to the file to be synced
at the end of foofs_write().  Thus it forces all i/o to be serialized.
Some exceptions to "all":
- at least in ffs, bitmap blocks are not synced.  This is safe since
  fsck can always recover bitmap blocks.
- at least in ffs, directories above the file are not synced by fsync()
  for the file.  This is normally harmless because critical directory
  operations are normally synchronous (or ordered relative to everything
  including related file operations in the case of soft updates), and
  fsync() is not specified to do this (?), but perhaps careful
  applications should fsync() all the directories too, and with the
  async mount option, even the most critical directory operation
  (creation of the file's directory entry) is asynchronous (except bugs
  make it partly synchronous).
- at least in ffs, with the async mount option, fsync() is more broken
  than it should be broken -- it syncs everything except for the most
  critical metadata (the inode) and directories above the file.

> You can see where I'm going with this:  What I'd really like to do is open a
> file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio operations on
> it using chunks that a multiple of the page size with buffers that are
> aligned on page boundaries.  I'd like to know when aio writes are "really"
> complete to maintain various kinds of on-disk structures (eg. b-trees).  I'd
> also like to avoid call fsync(2).

Calling fsync() or aio_waitcomplete() seems to be necessary.  More
global options like the sync mount flag and O_FSYNC don't provide enough
control.  I can't find any aio interfaces to select or poll for
completion.
It seems to have only aio_return() to test for completion, with the
possibly unwanted side effect of doing the completion if possible.  I
don't trust aio_return() to test that _all_ the things that would be
done by the file system for fsync(2) have been done.  aio_waitcomplete()
ensures doing these things by calling the file system (VOP_FSYNC()), but
aio_return() doesn't seem to go near the file system.

BTW, I just remembered that there is no mount option or file flag to
give fully sync metadata.  At least in ffs, all inode-change operations
(chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
mount options and O_FSYNC.  It takes a syscall calling VOP_FSYNC() or an
unrelated inode update to sync the metadata for these operations.

Bruce
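
For illustration, here is a minimal userland sketch of the conservative
approach discussed above: queue the write with aio_write(), wait for the
request to finish, and then call fsync() explicitly on both the file and
its parent directory rather than relying on O_FSYNC or mount options.
It is not taken from the thread.  The function name durable_aio_write()
and its parameters are hypothetical, and it assumes the standard POSIX
aio calls aio_suspend(), aio_error() and aio_return() for waiting on and
checking a single request.

```c
/* Sketch only: durable_aio_write() and its parameters are hypothetical. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue one write with POSIX aio, wait until it is no longer in
 * progress, then fsync() the file and the directory that contains it.
 * Returns 0 on success and -1 on any failure.
 */
int
durable_aio_write(const char *dir, const char *path, const void *buf,
    size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1];
	ssize_t n;
	int fd, dfd, err, rc = -1;

	fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd == -1)
		return (-1);

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;	/* aio_buf is not const-qualified */
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	if (aio_write(&cb) == -1)
		goto out;

	/* Sleep until the request completes; EINTR just loops again. */
	list[0] = &cb;
	while ((err = aio_error(&cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);

	/* aio_return() is called exactly once to collect the result. */
	n = aio_return(&cb);
	if (err != 0 || (size_t)n != len)
		goto out;

	/* Completion alone is not treated as "on disk"; sync explicitly. */
	if (fsync(fd) == -1)
		goto out;

	/* Also sync the parent directory so the directory entry is stable. */
	dfd = open(dir, O_RDONLY);
	if (dfd == -1)
		goto out;
	if (fsync(dfd) == 0)
		rc = 0;
	(void)close(dfd);
out:
	(void)close(fd);
	return (rc);
}
```

A caller might use durable_aio_write("/tmp", "/tmp/example.dat", data,
datalen, 0) and treat a return of 0 as the point at which it is safe to
update dependent on-disk structures.  The sketch waits for each request
before syncing, so it gives no I/O parallelism by itself; queuing several
requests and issuing one fsync() after all of them complete would be the
natural extension.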