FreeBSD Mail Archives

Date:      Mon, 7 Feb 2000 12:56:36 -0800
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        hackers@FreeBSD.ORG
Subject:   Re: Syncing a vector of fileoffsets and lengths?
Message-ID:  <20000207125636.G25520@fw.wintelcom.net>
In-Reply-To: <200002071938.LAA50114@apollo.backplane.com>; from dillon@apollo.backplane.com on Mon, Feb 07, 2000 at 11:38:43AM -0800
References:  <20000207114042.E25520@fw.wintelcom.net> <200002071938.LAA50114@apollo.backplane.com>

* Matthew Dillon <dillon@apollo.backplane.com> [000207 12:05] wrote:
> 
> :Is it possible to submit several offsets of a file to be synced
> :rather than calling fsync or mmap'ing over the file and calling
> :msync()?
> :
> :The only way I can think of doing this is queuing write requests
> :backed by a O_FSYNC fd to an aiod.
> :
> :Even then the desired result isn't really achieved, as instead of
> :all the buffers being simultaneously queued for immediate IO
> :the aiod will stall on each buffer.
> :
> :Is there a better way to submit multiple buffers for immediate
> :sync without syncing the entire file?
> 
>     There is no way to do this currently. 
> 
> :It seems that msync with MS_ASYNC would work (a bit kludgy), but
> :it's not implemented according to the manpage.
> :
> :thanks,
> :-- 
> :-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> 
>     The man page is wrong.  It is implemented, but it doesn't guarantee
>     that metadata will be written nor does it guarantee the timing of
>     the data writes.
> 
>     I have long considered this problem.  NFSv3 needs to be able to do
>     ranged fsyncs to handle the commit RPC.  It used to just fsync the
>     whole file (bad).  Now it has a kludge to scan the buffers and
>     write out the appropriate ones, which is better.  However, it has
>     the same problem that msync() has when doing a ranged sync - the
>     metadata is not guaranteed.  (When you do a normal fsync() the
>     metadata is guaranteed, even in the softupdates case).
> 
>     Only minor adjustments are required to brute-force the metadata, and 
>     a couple more adjustments to make it work properly with softupdates.
>     Once I do this we can break the code out into its own system call.
> 
> 	fsync2(fd, options, offset, size) ??? 
> 
> 	degenerate case would sync to the EOF if size == 0.
> 
> 	synchronously fsyncs by default, FSYNC_ASYNC would run it
> 	asynchronously.  
> 
> 	You could async fsync it, then fsync it normally later on to
> 	make sure it has all gone out.
> 
>     We would also have to make a new VOP to do it, VOP_FSYNC2(), which
>     would default to calling VOP_FSYNC() with a 0 offset and 0 size.
> 
>     I've been wanting to do this for a while.  There are a huge number
>     of uses for this sort of system call, including database apps and
>     two of my own projects.  I'm waiting till after the release before
>     starting work on it.
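
As a rough illustration of the interface proposed above, here is a userland
sketch that approximates a ranged sync by mmap'ing over the range and calling
msync(), the approach mentioned in the original question. The names
range_sync() and RANGE_SYNC_ASYNC are hypothetical stand-ins for the proposed
fsync2() and FSYNC_ASYNC, which do not exist. As noted above, metadata is not
guaranteed by this path, and whether data dirtied through write() is flushed
depends on how the kernel unifies the buffer cache and the VM.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag, mirroring the proposed FSYNC_ASYNC option. */
#define RANGE_SYNC_ASYNC 0x1

/*
 * Hypothetical stand-in for fsync2(fd, options, offset, size).
 * size == 0 means "sync from offset to EOF".
 * The descriptor must be open for reading so that it can be mapped.
 */
int
range_sync(int fd, int options, off_t offset, size_t size)
{
	struct stat st;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t start, end;
	size_t len;
	void *addr;
	int rc, saved_errno;

	if (offset < 0 || pagesize <= 0) {
		errno = EINVAL;
		return (-1);
	}
	if (fstat(fd, &st) == -1)
		return (-1);

	/* Degenerate case syncs to EOF; never map past EOF. */
	end = (size == 0) ? st.st_size : offset + (off_t)size;
	if (end > st.st_size)
		end = st.st_size;
	if (offset >= end)
		return (0);		/* nothing in range */

	/* mmap() needs a page-aligned file offset. */
	start = offset - (offset % pagesize);
	len = (size_t)(end - start);

	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, start);
	if (addr == MAP_FAILED)
		return (-1);

	rc = msync(addr, len,
	    (options & RANGE_SYNC_ASYNC) ? MS_ASYNC : MS_SYNC);
	saved_errno = errno;
	(void)munmap(addr, len);
	errno = saved_errno;
	return (rc);
}
```

A caller could issue range_sync(fd, RANGE_SYNC_ASYNC, off, len) for each
range and later call range_sync(fd, 0, off, len), or a plain fsync(), to make
sure everything has gone out.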

I think this is still a step away from what is really needed: scheduling
multiple vectors in different files to be written.

The interface you are proposing is nice but not flexible enough to get
the best performance.

I asked this question because of a problem that PostgreSQL has.
Basically, multiple processes will be updating a file, and they may do
scattered IO to multiple offsets in the file.  At the end of a
transaction they want to sync the data... fsync().  Ow.  This causes
buffers dirtied by multiple processes to be pushed to disk when each
process really only wants its own.  The order doesn't really matter,
just that all of the IO is on stable storage.

I think two kinds of behavior are needed, ordered range fsync and
unordered async fsync.

The ordered range could be taken care of easily by your implementation,
however for maximum effectiveness you'd want to allow for unordered
async fsync and notification.

The simplest way I can think of doing this is keeping a per-process count
of how many buffers were scheduled for async IO and allowing as many
async ops as needed to happen, each incrementing the count.  As each IO
completes it decrements the count and calls wakeup_one once it reaches 0
again.

This would allow multiple sync IOs to be scheduled without stalling
the process and at the same time allowing for notification when the
IO actually completes.  It also allows async fsync to be done
across multiple files.  The waiting syscall would simply sleep on
the count variable in the process structure.
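
The counting scheme described above can be sketched in userland with a mutex
and a condition variable. This is only an analogy under stated assumptions:
struct sync_tracker and its functions are hypothetical names, the counter
lives in an ordinary structure rather than the process structure, and
pthread_cond_signal() stands in for the kernel's wakeup_one.

```c
#include <pthread.h>

/* Hypothetical userland analogue of the per-process counter. */
struct sync_tracker {
	pthread_mutex_t	lock;
	pthread_cond_t	idle;		/* signalled when pending drops to 0 */
	unsigned	pending;	/* async sync IOs still in flight */
};

#define SYNC_TRACKER_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Called when one more buffer is scheduled for async IO. */
void
sync_tracker_scheduled(struct sync_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	t->pending++;
	pthread_mutex_unlock(&t->lock);
}

/* Called from the IO completion path; wakes one waiter at zero. */
void
sync_tracker_completed(struct sync_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	if (t->pending > 0 && --t->pending == 0)
		pthread_cond_signal(&t->idle);	/* stands in for wakeup_one */
	pthread_mutex_unlock(&t->lock);
}

/* The "waiting syscall": sleep until every scheduled IO has completed. */
void
sync_tracker_wait(struct sync_tracker *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->pending != 0)
		pthread_cond_wait(&t->idle, &t->lock);
	pthread_mutex_unlock(&t->lock);
}

/* Snapshot of the in-flight count, mainly for debugging. */
unsigned
sync_tracker_pending(struct sync_tracker *t)
{
	unsigned n;

	pthread_mutex_lock(&t->lock);
	n = t->pending;
	pthread_mutex_unlock(&t->lock);
	return (n);
}
```

Because the count is not tied to any one file, the same tracker can cover
ranges scheduled across several files. It says nothing about ordering, so it
only fits the unordered case.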

I think there are enough fields in struct buf to support the unordered
case, but I'm not sure it will be possible to do this if the application
wants FIFO async fsync.

I can see this happening pretty easily via the buffer interface but
doing it via async msync() through the vm system eludes me. :)
I assume you can get at the struct buf through the vm as that's how
IO is scheduled in general, but I'll need to research it more.

What do you think?

-Alfred

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20000207125636.G25520>
