Date: Fri, 24 Sep 2010 17:46:04 +0200 From: Markus Gebert <markus.gebert@hostpoint.ch> To: Bruce Evans <brde@optusnet.com.au> Cc: freebsd-fs <freebsd-fs@freebsd.org> Subject: Re: Strange behaviour with sappend flag set on ZFS Message-ID: <0C64D46D-1B34-46E7-A736-52199050B5F5@hostpoint.ch> In-Reply-To: <20100924122839.P842@delplex.bde.org> References: <201009231938.09548.cal@linu.gs> <66757A1E-E445-4AAD-8F57-382D85BFD579@hostpoint.ch> <20100924122839.P842@delplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 24.09.2010, at 05:16, Bruce Evans wrote: > On Fri, 24 Sep 2010, Markus Gebert wrote: >=20 >> On 23.09.2010, at 19:38, Michael Naef wrote: >>=20 >>> If I open (2) a file with O_APPEND the fd position pointer is set >>> to the start of the file. When I then write to an fd on a ufs FS in >>> FB6.4, it is automatically moved to the end and the bytes are >>> appended. No problems appear with write to the sappend-flages file. >>>=20 >>> However if I do the same on FB8.1/ZFS, the write fails as "not >>> permitted". >>=20 >> To me, it seems that the zfs_write() VOP incorrectly uses FAPPEND = instead of IO_APPEND when checking the ioflag argument to see if the = current write is an appending one, here: >>=20 >> ---- >> /* >> * If immutable or not appending then return EPERM >> */ >> pflags =3D zp->z_phys->zp_flags; >> if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) || >> ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) && >> (uio->uio_loffset < zp->z_phys->zp_size))) { >> ZFS_EXIT(zfsvfs); >> return (EPERM); >> } >> ---- >>=20 >> The function comment only mentions IO_APPEND and even later on in = zfs_write() IO_APPEND is used to check wether the offset should be = changed to EOF, so FAPPEND really seems to wrong in the above permission = check. >=20 > Oops, the file[descriptor] append flags aren't all the same like I = said in > a previous reply. FAPPEND is identical with O_APPEND and is 8, but = IO_APPEND > is 2. Here zfs is testing IO_NODELOCKED =3D 8. Yes, I checked that before posting and forgot to mention it, sorry for = that, could have saved you some time :-). Then again if it had been a = "spelling-only bug", my patch wouldn't have changed any behaviour, but = instead it did make Michi's test case work properly. > The use of IO_NODELOCKED is confusing and slightly broken. It is only > a flag for vn_rdwr() and a couple of other in-kernel i/o functions. > These start with the vnode locked and set IO_NODELOCKED to tell = vn_rdwr() > not to lock again. Userland writes call vn_rdwr() without the vnode > locked and keep IO_NODELOCKED clear to tell vn_rdwr() to lock the > vnode. vn_rdwr() doesn't set it so it is garbage after that -- it = remains > consistent with the vnode lock state for in-kernel writes, but for = userland > writes it is inconsistent with the vnode lock state below the level of > vn_rdwr(). VOP_WRITE() is called from places other than vn_rdwr(), = and > most or all of these places also fail to set IO_NODELOCKED. They also > mostly or always know that they are for in-kernel writes, so they also > don't check IO_NODELOCKED. >=20 > Thus in the above, the FAPPEND bit is usually clear so the test = doesn't > work (the cases where FAPPEND is set where it should be clear, and > where ZFS_APPENDONLY is also set, are rare since they only occur for > in-kernel writes of things like core dumps, and you have other = problems > if you have many append-only files being written to for things like > core dumps). My previous mail was incorrect about this point, and > understated the ways in which the comment doesn't match the code -- > apart from FAPPEND not being the append-flag, the comment combined > with the reported behaviour and mu confusion about the equality of the > flags helped me misread the IO_APPEND test backwards. Makes sense. This explains why the test case failed so realiably = although zfs_write() checked the wrong bit. > There is similar confusion and possibly bugs for IO_UNIT. Now all > callers seem to set it, and most VOP_WRITE() functions can't work = right > without it. E.g., they all need to back out of short writes to = satisfy > the implementation of sys_generic.c:write(); ffs_write() only backs > out ionly if IO_UNIT is set; it is always set for userland writes so > there is no problem with write(), but why not remove it as a flag and > always do atomic writes to simplify things? Kernel writers are even > less prepared than userland writers to recover from short writes. >=20 >> CURRENT and STABLE-8 seem to be affected to. The following patch = seems to fix it (at least Michi's test case works fine with it): >>=20 >> ---- >> diff -ru = ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c = ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c >> --- ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c = 2010-05-19 08:49:52.000000000 +0200 >> +++ ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c = 2010-09-23 23:24:43.549846948 +0200 >> @@ -709,7 +709,7 @@ >> */ >> pflags =3D zp->z_phys->zp_flags; >> if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) || >> - ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) && >> + ((pflags & ZFS_APPENDONLY) && !(ioflag & IO_APPEND) && >> (uio->uio_loffset < zp->z_phys->zp_size))) { >> ZFS_EXIT(zfsvfs); >> return (EPERM); >=20 > Now I think the not-just-a-spelling error was the only major bug, and = this > fixes it. Ok. Anyone going to commit the patch? > There is another minor bug/inconsistency with ffs: in the = non-IO_APPEND > case, ffs requires the offset to =3D=3D the file size, while the above > allows it to be >=3D the file size. I assume you're talking about the "file has appending-only flag set but = IO_APPEND is not set" case (I didn't look at the ufs code). The lseek(2) man page says: ---- The lseek() system call allows the file offset to be set beyond the = end of the existing end-of-file of the file. If data is later written = at this point, subsequent reads of the data in the gap return bytes of = zeros (until data is actually written into the gap). ---- Don't know VFS well enough to tell, but why would appending something = with a gap/hole (of zeroes) before it not be considered appending (in = the append-only file flag sense)? Unless I missed something (like the = offset being modified somewhere else in the kernel before the VOP call), = depending on how one answers this question, either UFS or ZFS is right, = it seems. Markus
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0C64D46D-1B34-46E7-A736-52199050B5F5>