Date: Wed, 25 Mar 2009 10:22:23 -0700
From: Julian Elischer <julian@elischer.org>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-fs@freebsd.org
Subject: Re: Trying to understand how aio(4), mount(8) async (and gjournal) relate
Message-ID: <49CA684F.70604@elischer.org>
In-Reply-To: <20090325223213.P35996@delplex.bde.org>
References: <200903231733.51671.mel.flynn%2Bfbsd.fs@mailing.thruhere.net> <49C7C45B.7040708@elischer.org> <20090324224001.D1670@besplex.bde.org> <49C97A6F.70204@transactionware.com> <20090325223213.P35996@delplex.bde.org>
Bruce Evans wrote:
> On Wed, 25 Mar 2009, Jan Mikkelsen wrote:
>
>> [Jumping into a conversation on aio, async mounts, etc.]
>>
>> I have had a few questions for a while that I haven't asked yet; this
>> seems like an appropriate time to ask them!
>>
>> Is it reasonable to open a file with O_FSYNC and then use aio_write()
>> to issue multiple writes, and then assume that the data is on disk
>> when the aio completes?
>
> I know very little about aio, but looking at the sources seems to show that
> O_FSYNC (or mounting with the sync option) just defeats the asyncness of
> aio. aio seems to use only fo_write() for writing, so at lower (file
> system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
> the i/o at the end of the call in the usual case where fo_write = vn_write.
>
>> Can I get I/O parallelism using this approach?
>
> Apparently not.
>
>> I recall reading (some time ago) that FreeBSD doesn't do I/O
>> parallelism on a single file descriptor. Is that true? Do I need to
>> open the file multiple times in order to get I/O parallelism?
>
> The fs part of vn_write() is serialized, now using the exclusive vnode
> lock.
> The code is essentially:
>
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>     error = VOP_WRITE(...);    /* this soon reaches foofs_write() */
>     VOP_UNLOCK(vp);
>
> In the usual case without O_FSYNC, foofs should try to only schedule
> the i/o (by writing it to the buffer cache and not waiting), so that
> the actual i/o is done in parallel later. However, foofs might need
> to do some physical input in order to tell where to write (e.g., reading
> indirect block(s)) or some physical output of metadata needed for
> consistency (e.g., writing indirect blocks), and any such i/o is
> serialized. (I think most file systems avoid writing to the inode on
> every foofs_write(), though not doing so requires tricks to maintain
> consistency.
> No tricks seem to be available for indirect blocks, so
> ffs without soft updates always writes them synchronously (except in
> my version where the async mount option actually works for indirect
> blocks).)
>
> O_FSYNC should cause almost all writes related to the file to be
> synced at the end of foofs_write(). Thus it forces all i/o to be
> serialized.
> Some exceptions to "all":
> - at least in ffs, bitmap blocks are not synced. This is safe since
>   fsck can always recover bitmap blocks.
> - at least in ffs, directories above the file are not synced by fsync()
>   for the file. This is normally harmless because critical directory
>   operations are normally synchronous (or ordered relative to everything
>   including related file operations in the case of soft updates), and
>   fsync() is not specified to do this (?), but perhaps careful
>   applications should fsync() all the directories too, and with the
>   async mount option, even the most critical directory operation
>   (creation of the file's directory entry) is asynchronous (except
>   bugs make it partly synchronous).
> - at least in ffs, with the async mount option, fsync() is more broken than
>   it should be broken -- it syncs everything except for the most critical
>   metadata (the inode) and directories above the file.
>
>> You can see where I'm going with this: What I'd really like to do is
>> open a file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio
>> operations on it using chunks that are a multiple of the page size with
>> buffers that are aligned on page boundaries. I'd like to know when
>> aio writes are "really" complete to maintain various kinds of on-disk
>> structures (eg. b-trees). I'd also like to avoid calling fsync(2).
>
> Calling fsync() or aio_waitcomplete() seems to be necessary. More
> global options like the sync mount flag and O_FSYNC don't provide
> enough control. I can't find any aio interfaces to select or poll for
> completion.

It does have a comprehensive interface with kqueue.
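
Apart from kqueue, POSIX aio itself can be polled with aio_error() (which returns EINPROGRESS until the request finishes) and waited on with aio_suspend(), and aio_fsync() queues a sync behind the writes already issued. Here is a rough sketch of that sequence, not taken from the FreeBSD sources. The helper names aio_wait_one() and aio_write_durable() are made up for illustration, and the kqueue route (SIGEV_KEVENT in the request's aio_sigevent on FreeBSD) is not shown.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for one request to finish, then reap it.
 * Returns the request's result, or -1 with errno set on failure.
 */
ssize_t
aio_wait_one(struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };
	ssize_t n;
	int error;

	assert(cb != NULL);
	for (;;) {
		error = aio_error(cb);		/* poll without reaping */
		if (error != EINPROGRESS)
			break;
		/* Sleep until the request completes; retry if interrupted. */
		if (aio_suspend(list, 1, NULL) == -1 && errno != EINTR)
			return (-1);
	}
	if (error == -1)
		return (-1);			/* aio_error() itself failed */
	n = aio_return(cb);			/* reap; call once per request */
	if (error != 0) {
		errno = error;
		return (-1);
	}
	return (n);
}

/*
 * Hypothetical helper: queue one write, wait for it, then queue an
 * aio_fsync() and wait for that too.  Returns bytes written or -1.
 */
ssize_t
aio_write_durable(int fd, void *buf, size_t len, off_t off)
{
	struct aiocb wr, sy;
	ssize_t n;

	memset(&wr, 0, sizeof(wr));
	wr.aio_fildes = fd;
	wr.aio_buf = buf;
	wr.aio_nbytes = len;
	wr.aio_offset = off;
	if (aio_write(&wr) == -1)
		return (-1);
	n = aio_wait_one(&wr);
	if (n == -1)
		return (-1);

	memset(&sy, 0, sizeof(sy));
	sy.aio_fildes = fd;
	if (aio_fsync(O_SYNC, &sy) == -1)
		return (-1);
	if (aio_wait_one(&sy) == -1)
		return (-1);
	return (n);
}
```

Waiting after every write throws away the parallelism, of course. A real application would presumably queue a batch of aio_write() calls first and issue a single aio_fsync() behind them.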
> It seems to have only aio_return() to test for completion,
> with the possibly unwanted side effect of doing the completion if
> possible. I don't trust aio_return() to test that _all_ the things
> that would be done by the file system for fsync(2) have been done.
> aio_waitcomplete ensures doing these things by calling the file system
> (VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.
>
> BTW, I just remembered that there is no mount option or file flag to
> give fully sync metadata. At least in ffs, all inode-change operations
> (chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
> mount options and O_FSYNC. It takes a syscall calling VOP_FSYNC() or
> an unrelated inode update to sync the metadata for these operations.
>
> Bruce