Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 24 Sep 2010 17:46:04 +0200
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: Strange behaviour with sappend flag set on ZFS
Message-ID:  <0C64D46D-1B34-46E7-A736-52199050B5F5@hostpoint.ch>
In-Reply-To: <20100924122839.P842@delplex.bde.org>
References:  <201009231938.09548.cal@linu.gs> <66757A1E-E445-4AAD-8F57-382D85BFD579@hostpoint.ch> <20100924122839.P842@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help

On 24.09.2010, at 05:16, Bruce Evans wrote:

> On Fri, 24 Sep 2010, Markus Gebert wrote:
>=20
>> On 23.09.2010, at 19:38, Michael Naef wrote:
>>=20
>>> If I open (2) a file with O_APPEND the fd position pointer is set
>>> to the start of the file. When I then write to an fd on a ufs FS in
>>> FB6.4, it is automatically moved to the end and the bytes are
>>> appended. No problems appear with write to the sappend-flages file.
>>>=20
>>> However if I do the same on FB8.1/ZFS, the write fails as "not
>>> permitted".
>>=20
>> To me, it seems that the zfs_write() VOP incorrectly uses FAPPEND =
instead of IO_APPEND when checking the ioflag argument to see if the =
current write is an appending one, here:
>>=20
>> ----
>> 	/*
>> 	 * If immutable or not appending then return EPERM
>> 	 */
>> 	pflags =3D zp->z_phys->zp_flags;
>> 	if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
>> 	    ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
>> 	    (uio->uio_loffset < zp->z_phys->zp_size))) {
>> 		ZFS_EXIT(zfsvfs);
>> 		return (EPERM);
>> 	}
>> ----
>>=20
>> The function comment only mentions IO_APPEND and even later on in =
zfs_write() IO_APPEND is used to check wether the offset should be =
changed to EOF, so FAPPEND really seems to wrong in the above permission =
check.
>=20
> Oops, the file[descriptor] append flags aren't all the same like I =
said in
> a previous reply.  FAPPEND is identical with O_APPEND and is 8, but =
IO_APPEND
> is 2.  Here zfs is testing IO_NODELOCKED =3D 8.

Yes, I checked that before posting and forgot to mention it, sorry for =
that, could have saved you some time :-). Then again if it had been a =
"spelling-only bug", my patch wouldn't have changed any behaviour, but =
instead it did make Michi's test case work properly.


> The use of IO_NODELOCKED is confusing and slightly broken.  It is only
> a flag for vn_rdwr() and a couple of other in-kernel i/o functions.
> These start with the vnode locked and set IO_NODELOCKED to tell =
vn_rdwr()
> not to lock again.  Userland writes call vn_rdwr() without the vnode
> locked and keep IO_NODELOCKED clear to tell vn_rdwr() to lock the
> vnode.  vn_rdwr() doesn't set it so it is garbage after that -- it =
remains
> consistent with the vnode lock state for in-kernel writes, but for =
userland
> writes it is inconsistent with the vnode lock state below the level of
> vn_rdwr().  VOP_WRITE() is called from places other than vn_rdwr(), =
and
> most or all of these places also fail to set IO_NODELOCKED.  They also
> mostly or always know that they are for in-kernel writes, so they also
> don't check IO_NODELOCKED.
>=20
> Thus in the above, the FAPPEND bit is usually clear so the test =
doesn't
> work (the cases where FAPPEND is set where it should be clear, and
> where ZFS_APPENDONLY is also set, are rare since they only occur for
> in-kernel writes of things like core dumps, and you have other =
problems
> if you have many append-only files being written to for things like
> core dumps).  My previous mail was incorrect about this point, and
> understated the ways in which the comment doesn't match the code --
> apart from FAPPEND not being the append-flag, the comment combined
> with the reported behaviour and mu confusion about the equality of the
> flags helped me misread the IO_APPEND test backwards.

Makes sense. This explains why the test case failed so realiably =
although zfs_write() checked the wrong bit.


> There is similar confusion and possibly bugs for IO_UNIT.  Now all
> callers seem to set it, and most VOP_WRITE() functions can't work =
right
> without it.  E.g., they all need to back out of short writes to =
satisfy
> the implementation of sys_generic.c:write(); ffs_write() only backs
> out ionly if IO_UNIT is set; it is always set for userland writes so
> there is no problem with write(), but why not remove it as a flag and
> always do atomic writes to simplify things?  Kernel writers are even
> less prepared than userland writers to recover from short writes.
>=20
>> CURRENT and STABLE-8 seem to be affected to. The following patch =
seems to fix it (at least Michi's test case works fine with it):
>>=20
>> ----
>> diff -ru =
../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c =
./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
>> --- ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c =
  2010-05-19 08:49:52.000000000 +0200
>> +++ ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c      =
  2010-09-23 23:24:43.549846948 +0200
>> @@ -709,7 +709,7 @@
>>        */
>>       pflags =3D zp->z_phys->zp_flags;
>>       if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
>> -           ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
>> +           ((pflags & ZFS_APPENDONLY) && !(ioflag & IO_APPEND) &&
>>           (uio->uio_loffset < zp->z_phys->zp_size))) {
>>               ZFS_EXIT(zfsvfs);
>>               return (EPERM);
>=20
> Now I think the not-just-a-spelling error was the only major bug, and =
this
> fixes it.

Ok. Anyone going to commit the patch?


> There is another minor bug/inconsistency with ffs: in the =
non-IO_APPEND
> case, ffs requires the offset to =3D=3D the file size, while the above
> allows it to be >=3D the file size.


I assume you're talking about the "file has appending-only flag set but =
IO_APPEND is not set" case (I didn't look at the ufs code).

The lseek(2) man page says:

----
     The lseek() system call allows the file offset to be set beyond the =
end
     of the existing end-of-file of the file.  If data is later written =
at
     this point, subsequent reads of the data in the gap return bytes of =
zeros
     (until data is actually written into the gap).
----

Don't know VFS well enough to tell, but why would appending something =
with a gap/hole (of zeroes) before it not be considered appending (in =
the append-only file flag sense)? Unless I missed something (like the =
offset being modified somewhere else in the kernel before the VOP call), =
depending on how one answers this question, either UFS or ZFS is right, =
it seems.


Markus





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0C64D46D-1B34-46E7-A736-52199050B5F5>