Date: Wed, 18 May 2016 08:39:08 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Konstantin Belousov <kostikbel@gmail.com> Cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs Message-ID: <20160518081302.X6396@besplex.bde.org> In-Reply-To: <20160517212227.GE89104@kib.kiev.ua> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> <20160518035413.L4357@besplex.bde.org> <20160518052656.R5764@besplex.bde.org> <20160517212227.GE89104@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 18 May 2016, Konstantin Belousov wrote: > On Wed, May 18, 2016 at 05:40:07AM +1000, Bruce Evans wrote: >> Also, ftruncate() seems to be broken. POSIX doesn't seem to require it >> to honor O_SYNC, but POLA requires this. But there is no VOP_TRUNCATE(); >> truncation is done using VOP_SETATTR() and there is no way to pass down >> the O_SYNC flag to it; in practice, ffs just does UFS_TRUNCATE() without >> IO_SYNC. >> >> This makes a difference mainly for async mounts with my fixes to honor >> IO_SYNC in ffs_update(). With async mounts, consistency of the file >> system is not guaranteed but O_SYNC for a file should at least cause >> all of the file data and most of its metdata to be written. Not syncing >> for ftruncate() unnecessarily loses metadata writes. With !async mounts, >> consistency of the file system is partly guaranteed and lost metadata >> writes for ftruncate() shouldn't affect this -- they should just lose >> the ftruncate() atomically. >> >> vfs could do an fsync() after VOP_SETATTR() for the O_SYNC case. This >> reduces the race window. > > vattr already has the va_vaflags field. It is trivial to add flag there > requesting O_SYNC behaviour. Of course, other updates could also > honour VA_SYNC, but this is for later. Like this: > > diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c > index 0a3a88a..1e42a3d 100644 > --- a/sys/kern/vfs_vnops.c > +++ b/sys/kern/vfs_vnops.c > @@ -1314,6 +1314,8 @@ vn_truncate(struct file *fp, off_t length, struct ucred *active_cred, > if (error == 0) { > VATTR_NULL(&vattr); > vattr.va_size = length; > + if ((fp->f_flag & O_FSYNC) != 0) > + vattr.va_vaflags |= VA_SYNC; > error = VOP_SETATTR(vp, &vattr, fp->f_cred); > } > out: > diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h > index e82f6ee..41ec7f7 100644 > --- a/sys/sys/vnode.h > +++ b/sys/sys/vnode.h > @@ -286,6 +286,7 @@ struct vattr { > */ > #define VA_UTIMES_NULL 0x01 /* utimes argument was NULL */ > #define VA_EXCLUSIVE 0x02 /* exclusive create request */ > +#define VA_SYNC 0x04 /* O_SYNC truncation */ > > /* > * Flags for ioflag. (high 16 bits used to ask for read-ahead and > diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c > index c0729f8..83df347 100644 > --- a/sys/ufs/ufs/ufs_vnops.c > +++ b/sys/ufs/ufs/ufs_vnops.c > @@ -625,7 +625,8 @@ ufs_setattr(ap) > */ > return (0); > } > - if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL, > + if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL | > + ((vap->va_vaflags & VA_SYNC) != 0 ? IO_SYNC : 0), > cred)) != 0) > return (error); > } Looks good. O_SYNC is actually spelled O_FSYNC in FreeBSD. You spelled it correctly in the above, but about 2 places in the kernel use the POSIX spelling. It is confusing enough to also have the spellings FFSYNC and IO_SYNC for this flag in different layers. FFSYNC is for fcntl and must equal O_FSYNC since the layers are not clearly separated. IO_SYNC is for a clearly separated layer and is supposed to be translated to, but it has the same value as O_FSYNC so O_FSYNC might work accidentally when not translated. This should probably also be done for truncations with O_TRUNC at open time. There are a couple of these in vfs_syscalls.c. O_TRUNC is used much more than ftruncate() so the extra overhead from this would be more noticeable. I think the implementation is not very good. If open() with O_TRUNC or truncate with O_FSYNC or fsync() fails, then the file contents might be garbage. So it would be better to do large truncations mostly async and only sync at the end. *fs_truncate() could operate like that, but I think it takes the IO_SYNC flag as a directive to do the whole operation synchronously. A non-sync truncation followed by fsync() is likely to work better for ffs and just work for all fs's. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20160518081302.X6396>