From owner-freebsd-fs@FreeBSD.ORG  Wed Mar 25 17:22:09 2009
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 82D9D10658D0
	for <freebsd-fs@freebsd.org>; Wed, 25 Mar 2009 17:22:09 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outW.internet-mail-service.net (outw.internet-mail-service.net
	[216.240.47.246])
	by mx1.freebsd.org (Postfix) with ESMTP id 60D898FC1A
	for <freebsd-fs@freebsd.org>; Wed, 25 Mar 2009 17:22:09 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from idiom.com (mx0.idiom.com [216.240.32.160])
	by out.internet-mail-service.net (Postfix) with ESMTP id B6E8783D41;
	Wed, 25 Mar 2009 10:22:28 -0700 (PDT)
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org (home.elischer.org [216.240.48.38])
	by idiom.com (Postfix) with ESMTP id 9D3F22D6017;
	Wed, 25 Mar 2009 10:22:06 -0700 (PDT)
Message-ID: <49CA684F.70604@elischer.org>
Date: Wed, 25 Mar 2009 10:22:23 -0700
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.21 (Macintosh/20090302)
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
References: <200903231733.51671.mel.flynn+fbsd.fs@mailing.thruhere.net>
	<49C7C45B.7040708@elischer.org>
	<20090324224001.D1670@besplex.bde.org>
	<49C97A6F.70204@transactionware.com>
	<20090325223213.P35996@delplex.bde.org>
In-Reply-To: <20090325223213.P35996@delplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org
Subject: Re: Trying to understand how aio(4), mount(8) async (and gjournal)
 relate
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Mar 2009 17:22:13 -0000

Bruce Evans wrote:
> On Wed, 25 Mar 2009, Jan Mikkelsen wrote:
> 
>> [Jumping into a conversation on aio, async mounts, etc.]
>>
>> I have had a few questions for a while that I haven't asked yet; this 
>> seems like an appropriate time to ask them!
>>
>> Is it reasonable to open a file with O_FSYNC and then use aio_write() 
>> to issue multiple writes, and then assume that the data is on disk 
>> when the aio completes?
> 
> I know very little about aio, but looking at the sources seems to show that
> O_FSYNC (or mounting with the sync option) just defeats the asyncness of
> aio.  aio seems to use only fo_write() for writing, so at lower (file
> system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
> the i/o at the end of the call in the usual case where fo_write = vn_write.
> 
>> Can I get I/O parallelism using this approach?
> 
> Apparently not.
> 
>> I recall reading (some time ago) that FreeBSD doesn't do I/O 
>> parallelism on a single file descriptor.  Is that true?  Do I need to 
>> open the file multiple times in order to get I/O parallelism?
> 
> The fs part of vn_write() is serialized, now using the exclusive vnode lock.
> The code is essentially:
> 
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>     error = VOP_WRITE(...);        /* this soon reaches foofs_write() */
>     VOP_UNLOCK(vp);
> 
> In the usual case without O_FSYNC, foofs should try to only schedule
> the i/o (by writing it to the buffer cache and not waiting), so that
> the actual i/o is done in parallel later.  However, foofs might need
> to do some physical input in order to tell where to write (e.g., reading
> indirect block(s)) or some physical output of metadata needed for
> consistency (e.g., writing indirect blocks), and any such i/o is
> serialized.  (I think most file systems avoid writing to the inode on
> every foofs_write(), though not doing so requires tricks to maintain
> consistency.  No tricks seem to be available for indirect blocks, so
> ffs without soft updates always writes them synchronously (except in
> my version where the async mount option actually works for indirect
> blocks).)
> 
> O_FSYNC should cause almost all writes related to the file to be
> synced at the end of foofs_write().  Thus it forces all i/o to be
> serialized.
> Some exceptions to "all":
> - at least in ffs, bitmap blocks are not synced.  This is safe since
>   fsck can always recover bitmap blocks.
> - at least in ffs, directories above the file are not synced by fsync()
>   for the file.  This is normally harmless because critical directory
>   operations are normally synchronous (or ordered relative to everything
>   including related file operations in the case of soft updates), and
>   fsync() is not specified to do this (?), but perhaps careful
>   applications should fsync() all the directories too, and with the
>   async mount option, even the most critical directory operation
>   (creation of the file's directory entry) is asynchronous (except
>   bugs make it partly synchronous).
> - at least in ffs, with the async mount option, fsync() is more broken than
>   it should be broken -- it syncs everything except for the most critical
>   metadata (the inode) and directories above the file.
> 
>> You can see where I'm going with this:  What I'd really like to do is 
>> open a file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio 
>> operations on it using chunks that are a multiple of the page size with 
>> buffers that are aligned on page boundaries.  I'd like to know when 
>> aio writes are "really" complete to maintain various kinds of on-disk 
>> structures (e.g. b-trees).  I'd also like to avoid calling fsync(2).
> 
> Calling fsync() or aio_waitcomplete() seems to be necessary.  More
> global options like the sync mount flag and O_FSYNC don't provide
> enough control.  I can't find any aio interfaces to select or poll for
> completion.

It does have a comprehensive interface with kqueue: completions can be
delivered as kevents.
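
For what it's worth, here is a rough sketch of what that looks like from
userland.  It is only an illustration under some assumptions:
aio_write_wait() is a made-up helper name, the request is a single
aio_write(), and the trailing fsync() is a conservative extra step
prompted by the doubts above about what "complete" guarantees, not
something aio requires.  On FreeBSD the request is tagged with
SIGEV_KEVENT so the completion arrives as an EVFILT_AIO event on a
kqueue; elsewhere the sketch falls back to plain aio_suspend().

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/event.h>
#include <sys/time.h>
#endif

/*
 * aio_write_wait() is a hypothetical helper, not a library call.
 * It queues one aio_write(), blocks until that request has completed,
 * then calls fsync(fd).  Returns the byte count from aio_return(),
 * or -1 with errno set.
 */
ssize_t
aio_write_wait(int fd, const void *buf, size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1] = { &cb };
	ssize_t n;
	int err;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;

#ifdef __FreeBSD__
	/* Ask the kernel to post an EVFILT_AIO event on this kqueue. */
	int kq = kqueue();
	if (kq == -1)
		return (-1);
	cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
	cb.aio_sigevent.sigev_notify_kqueue = kq;
	cb.aio_sigevent.sigev_value.sival_ptr = &cb;	/* shows up as ev.udata */
#endif

	if (aio_write(&cb) == -1) {
#ifdef __FreeBSD__
		close(kq);
#endif
		return (-1);
	}

#ifdef __FreeBSD__
	struct kevent ev;
	/* Sleep until the completion event arrives; retry if interrupted. */
	while (kevent(kq, NULL, 0, &ev, 1, NULL) == -1 && errno == EINTR)
		continue;
	close(kq);
#endif
	/* Portable wait; a no-op if the request has already finished. */
	while (aio_error(&cb) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);

	err = aio_error(&cb);	/* 0 on success, else the errno value */
	n = aio_return(&cb);	/* reap the request exactly once */
	if (err != 0) {
		errno = err;
		return (-1);
	}
	if (fsync(fd) == -1)	/* conservative: push data and inode out */
		return (-1);
	return (n);
}
```

A caller would do something like aio_write_wait(fd, buf, len, 0) and
compare the result with len.  A real program would keep one kqueue open
and have many requests in flight rather than waiting for each one in
turn, which is the whole point of using aio.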

>  It seems to have only aio_return() to test for completion,
> with the possibly unwanted side effect of doing the completion if
> possible.  I don't trust aio_return() to test that _all_ the things
> that would be done by the file system for fsync(2) have been done.
> aio_waitcomplete ensures doing these things by calling the file system
> (VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.
> 
> BTW, I just remembered that there is no mount option or file flag to
> give fully sync metadata.  At least in ffs, all inode-change operations
> (chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
> mount options and O_FSYNC.  It takes a syscall calling VOP_FSYNC() or
> an unrelated inode update to sync the metadata for these operations.
> 
> Bruce