From: Julian Elischer
Date: Wed, 25 Mar 2009 10:22:23 -0700
To: Bruce Evans
Cc: freebsd-fs@freebsd.org
Subject: Re: Trying to understand how aio(4), mount(8) async (and gjournal) relate
Message-ID: <49CA684F.70604@elischer.org>
In-Reply-To: <20090325223213.P35996@delplex.bde.org>

Bruce Evans wrote:
> On Wed, 25 Mar 2009, Jan Mikkelsen wrote:
>
>> [Jumping into a conversation on aio, async mounts, etc.]
>>
>> I have had a few questions for a while that I haven't asked yet; this
>> seems like an appropriate time to ask them!
>>
>> Is it reasonable to open a file with O_FSYNC and then use aio_write()
>> to issue multiple writes, and then assume that the data is on disk
>> when the aio completes?
>
> I know very little about aio, but looking at the sources seems to show that
> O_FSYNC (or mounting with the sync option) just defeats the asyncness of
> aio.  aio seems to use only fo_write() for writing, so at lower (file
> system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
> the i/o at the end of the call in the usual case where fo_write = vn_write.
>
>> Can I get I/O parallelism using this approach?
>
> Apparently not.
>
>> I recall reading (some time ago) that FreeBSD doesn't do I/O
>> parallelism on a single file descriptor.  Is that true?  Do I need to
>> open the file multiple times in order to get I/O parallelism?
>
> The fs part of vn_write() is serialized, now using the exclusive vnode
> lock.  The code is essentially:
>
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>     error = VOP_WRITE(...);    /* this soon reaches foofs_write() */
>     VOP_UNLOCK(vp);
>
> In the usual case without O_FSYNC, foofs should try to only schedule
> the i/o (by writing it to the buffer cache and not waiting), so that
> the actual i/o is done in parallel later.  However, foofs might need
> to do some physical input in order to tell where to write (e.g., reading
> indirect block(s)) or some physical output of metadata needed for
> consistency (e.g., writing indirect blocks), and any such i/o is
> serialized.  (I think most file systems avoid writing to the inode on
> every foofs_write(), though not doing so requires tricks to maintain
> consistency.  No tricks seem to be available for indirect blocks, so
> ffs without soft updates always writes them synchronously (except in
> my version where the async mount option actually works for indirect
> blocks).)
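
As a rough illustration of the pattern Jan asks about (several aio_write()
calls queued on one descriptor, followed by a check that they finished),
here is a minimal sketch in portable POSIX C.  It is not taken from the
thread.  The function name write_chunks_then_sync() and the constants NREQ
and CHUNK are hypothetical, and the sketch calls a plain fsync() after the
writes rather than relying on O_FSYNC.  It only shows the calling
sequence; it says nothing about whether the kernel performs the writes in
parallel.

```c
#include <sys/types.h>
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define NREQ	4	/* hypothetical number of queued writes */
#define CHUNK	4096	/* hypothetical chunk size in bytes */

/*
 * Queue NREQ writes on a single descriptor, wait for each to finish,
 * then fsync() the file.  Returns 0 on success and -1 on any failure.
 * The buffers are static, so this sketch is not reentrant.
 */
int
write_chunks_then_sync(const char *path)
{
	static char bufs[NREQ][CHUNK];
	struct aiocb cbs[NREQ];
	int fd, i, queued, rc = 0;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* Queue the writes; each request targets its own file offset. */
	for (i = 0; i < NREQ; i++) {
		memset(bufs[i], 'a' + i, CHUNK);
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = fd;
		cbs[i].aio_buf = bufs[i];
		cbs[i].aio_nbytes = CHUNK;
		cbs[i].aio_offset = (off_t)i * CHUNK;
		if (aio_write(&cbs[i]) != 0) {
			rc = -1;
			break;
		}
	}
	queued = i;

	/* Wait for every request that was queued, then collect its status. */
	for (i = 0; i < queued; i++) {
		const struct aiocb *const one[1] = { &cbs[i] };

		while (aio_error(&cbs[i]) == EINPROGRESS)
			(void)aio_suspend(one, 1, NULL);
		if (aio_return(&cbs[i]) != (ssize_t)CHUNK)
			rc = -1;
	}

	/* Completion of the aio requests alone is not treated as durable. */
	if (fsync(fd) != 0)
		rc = -1;
	if (close(fd) != 0)
		rc = -1;
	return (rc);
}
```

A caller would use it as write_chunks_then_sync("/some/file") and treat a
return of -1 as "do not assume anything reached the disk".  On systems
where POSIX AIO lives in a separate library, linking may need -lrt.
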
>
> O_FSYNC should cause almost all writes related to the file to be
> synced at the end of foofs_write().  Thus it forces all i/o to be
> serialized.
> Some exceptions to "all":
> - at least in ffs, bitmap blocks are not synced.  This is safe since
>   fsck can always recover bitmap blocks.
> - at least in ffs, directories above the file are not synced by fsync()
>   for the file.  This is normally harmless because critical directory
>   operations are normally synchronous (or ordered relative to everything
>   including related file operations in the case of soft updates), and
>   fsync() is not specified to do this (?), but perhaps careful
>   applications should fsync() all the directories too, and with the
>   async mount option, even the most critical directory operation
>   (creation of the file's directory entry) is asynchronous (except that
>   bugs make it partly synchronous).
> - at least in ffs, with the async mount option, fsync() is more broken than
>   it should be -- it syncs everything except for the most critical
>   metadata (the inode) and directories above the file.
>
>> You can see where I'm going with this: What I'd really like to do is
>> open a file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio
>> operations on it using chunks that are a multiple of the page size with
>> buffers that are aligned on page boundaries.  I'd like to know when
>> aio writes are "really" complete to maintain various kinds of on-disk
>> structures (e.g. b-trees).  I'd also like to avoid calling fsync(2).
>
> Calling fsync() or aio_waitcomplete() seems to be necessary.  More
> global options like the sync mount flag and O_FSYNC don't provide
> enough control.  I can't find any aio interfaces to select or poll for
> completion.

It does have a comprehensive interface with kqueue.

> It seems to have only aio_return() to test for completion,
> with the possibly unwanted side effect of doing the completion if
> possible.
> I don't trust aio_return() to test that _all_ the things
> that would be done by the file system for fsync(2) have been done.
> aio_waitcomplete ensures doing these things by calling the file system
> (VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.
>
> BTW, I just remembered that there is no mount option or file flag to
> give fully sync metadata.  At least in ffs, all inode-change operations
> (chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
> mount options and O_FSYNC.  It takes a syscall calling VOP_FSYNC() or
> an unrelated inode update to sync the metadata for these operations.
>
> Bruce