Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 24 Sep 2010 10:42:54 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Michael Naef <cal@linu.gs>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: Strange behaviour with sappend flag set on ZFS
Message-ID:  <20100924083610.B714@delplex.bde.org>
In-Reply-To: <201009231938.09548.cal@linu.gs>
References:  <201009231938.09548.cal@linu.gs>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 23 Sep 2010, Michael Naef wrote:

> If I open (2) a file with O_APPEND the fd position pointer is set
> to the start of the file. When I then write to an fd on a ufs FS in
> FB6.4, it is automatically moved to the end and the bytes are
> appended. No problems appear with write to the sappend-flages file.
>
> However if I do the same on FB8.1/ZFS, the write fails as "not
> permitted". When I manually move the position pointer to the end of
> the file, it works as axpected:

> Am I missing something or is this a bug (or feature or watsoever
> ;-) in ZFS?

I know little about zfs, but the bug is obvious: from zfs_write():

% 	/*
% 	 * If immutable or not appending then return EPERM
% 	 */
% 	pflags = zp->z_phys->zp_flags;
% 	if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
% 	    ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
% 	    (uio->uio_loffset < zp->z_phys->zp_size))) {
% 		ZFS_EXIT(zfsvfs);
% 		return (EPERM);
% 	}

This returns EPERM before advancing the pointer.

The comment does less than echo the code.  The code also returns EPERM
if read-only.  The code only returns EPERM for not appending if the file
is append-only.

Using FAPPEND is a layering violation.  That is file descriptor flag
(or is it an open-file flag?  Anyway, it is not the attribute-file
flag) , but here we have an i/o flag.  vfs has carefully translated
from FAPPEND to IO_APPEND.  This is just a spelling error here since
the flags have the same value.

% ...
% 	/*
% 	 * If in append mode, set the io offset pointer to eof.
% 	 */
% 	if (ioflag & IO_APPEND) {

This is almost dead code, since we have already failed if the pointer is
not already at the end.

% 		/*
% 		 * Range lock for a file append:
% 		 * The value for the start of range will be determined by
% 		 * zfs_range_lock() (to guarantee append semantics).

It is unfortunate that vfs doesn't do this, so 10-20 individual file systems
have to do it.  However, if the vfs lock isn't sufficient, then the
advancement must be delayed like it is until the foofs lock is held.

% 		 * If this write will cause the block size to increase,
% 		 * zfs_range_lock() will lock the entire file, so we must
% 		 * later reduce the range after we grow the block size.
% 		 */
% 		rl = zfs_range_lock(zp, 0, n, RL_APPEND);
% 		if (rl->r_len == UINT64_MAX) {
% 			/* overlocked, zp_size can't change */
% 			woff = uio->uio_loffset = zp->z_phys->zp_size;

This case doesn't advance the pointer, so would be broken if we hadn't
already checked ZFS_APPEND.

% 		} else {
% 			woff = uio->uio_loffset = rl->r_off;

This case does advance the pointer, so is broken since we already checked
ZFS_APPEND.

% 		}
% 	} else {

The canonical code for advancing the pointer and checking for append-only
files is much simpler and less broken: from ffs_write():

% 	switch (vp->v_type) {
% 	case VREG:
% 		if (ioflag & IO_APPEND)
% 			uio->uio_offset = ip->i_size;
% 		if ((ip->i_flags & APPEND) && uio->uio_offset != ip->i_size)
% 			return (EPERM);

ffs_write() is also correctly missing the check for immutable files.  The
immutable flag should limit open() (and some other operations not including
write()), just like the file permissions flags.

ffs_write() is also missing the check for read-only mounts.  This is subtle.
I think the fs cannot be mounted read-only if ffs_write() is reached --
even if the file was opened for writing at a time when the mount was
read-write, the file must have been dowgraded to unwriteable of the mount
was successfully downgraded to read-only (takes a forced unmount which
tends to break applications but shouldn't break the file system).

ffs^Wufs_open() also checks the append-only flag (the attribute one)
at open() time, and fails then if O_APPEND is not set.  This is half
good:  The behaviour should only depend on the attributes at the time
of the open, just like it does for the file permissions and the immutable
attributes.  But ffs_write() is broken here -- it uses the append-only
attribute at the time of the write.  This is half bad: unlike the
attributes, the O_APPEND flag is intended to be changed for open files
(despite the O_ in its name, it can be changed by fcntl()), so checking
it at open time is just inconsistent (unless you deny unsetting O_APPEND
if the append-only attribute is set, foofs_write() still has to be
prepared for the combination of append flags that is disallowed by
ffs_open()).

Thus it is a bug to check any append flags at open time.  The append-only
flag at the time of the open should be copied for use at the time of
later writes to the open file descriptor (or is it the open file?), and
the O_APPEND flag should only be checked at the time of later writes.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100924083610.B714>