From owner-freebsd-fs@FreeBSD.ORG Fri Sep 24 02:47:41 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 846A6106566B for ; Fri, 24 Sep 2010 02:47:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx10.syd.optusnet.com.au (fallbackmx10.syd.optusnet.com.au [211.29.132.251]) by mx1.freebsd.org (Postfix) with ESMTP id B548D8FC08 for ; Fri, 24 Sep 2010 02:47:39 +0000 (UTC) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by fallbackmx10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o8O0h0Ka025709 for ; Fri, 24 Sep 2010 10:43:00 +1000 Received: from c122-107-116-249.carlnfd1.nsw.optusnet.com.au (c122-107-116-249.carlnfd1.nsw.optusnet.com.au [122.107.116.249]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o8O0gs6v025530 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 24 Sep 2010 10:42:57 +1000 Date: Fri, 24 Sep 2010 10:42:54 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Michael Naef In-Reply-To: <201009231938.09548.cal@linu.gs> Message-ID: <20100924083610.B714@delplex.bde.org> References: <201009231938.09548.cal@linu.gs> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs Subject: Re: Strange behaviour with sappend flag set on ZFS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Sep 2010 02:47:41 -0000 On Thu, 23 Sep 2010, Michael Naef wrote: > If I open (2) a file with O_APPEND the fd position pointer is set > to the start of the file. When I then write to an fd on a ufs FS in > FB6.4, it is automatically moved to the end and the bytes are > appended. No problems appear with write to the sappend-flages file. > > However if I do the same on FB8.1/ZFS, the write fails as "not > permitted". When I manually move the position pointer to the end of > the file, it works as axpected: > Am I missing something or is this a bug (or feature or watsoever > ;-) in ZFS? I know little about zfs, but the bug is obvious: from zfs_write(): % /* % * If immutable or not appending then return EPERM % */ % pflags = zp->z_phys->zp_flags; % if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) || % ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) && % (uio->uio_loffset < zp->z_phys->zp_size))) { % ZFS_EXIT(zfsvfs); % return (EPERM); % } This returns EPERM before advancing the pointer. The comment does less than echo the code. The code also returns EPERM if read-only. The code only returns EPERM for not appending if the file is append-only. Using FAPPEND is a layering violation. That is file descriptor flag (or is it an open-file flag? Anyway, it is not the attribute-file flag) , but here we have an i/o flag. vfs has carefully translated from FAPPEND to IO_APPEND. This is just a spelling error here since the flags have the same value. % ... % /* % * If in append mode, set the io offset pointer to eof. % */ % if (ioflag & IO_APPEND) { This is almost dead code, since we have already failed if the pointer is not already at the end. % /* % * Range lock for a file append: % * The value for the start of range will be determined by % * zfs_range_lock() (to guarantee append semantics). It is unfortunate that vfs doesn't do this, so 10-20 individual file systems have to do it. However, if the vfs lock isn't sufficient, then the advancement must be delayed like it is until the foofs lock is held. % * If this write will cause the block size to increase, % * zfs_range_lock() will lock the entire file, so we must % * later reduce the range after we grow the block size. % */ % rl = zfs_range_lock(zp, 0, n, RL_APPEND); % if (rl->r_len == UINT64_MAX) { % /* overlocked, zp_size can't change */ % woff = uio->uio_loffset = zp->z_phys->zp_size; This case doesn't advance the pointer, so would be broken if we hadn't already checked ZFS_APPEND. % } else { % woff = uio->uio_loffset = rl->r_off; This case does advance the pointer, so is broken since we already checked ZFS_APPEND. % } % } else { The canonical code for advancing the pointer and checking for append-only files is much simpler and less broken: from ffs_write(): % switch (vp->v_type) { % case VREG: % if (ioflag & IO_APPEND) % uio->uio_offset = ip->i_size; % if ((ip->i_flags & APPEND) && uio->uio_offset != ip->i_size) % return (EPERM); ffs_write() is also correctly missing the check for immutable files. The immutable flag should limit open() (and some other operations not including write()), just like the file permissions flags. ffs_write() is also missing the check for read-only mounts. This is subtle. I think the fs cannot be mounted read-only if ffs_write() is reached -- even if the file was opened for writing at a time when the mount was read-write, the file must have been dowgraded to unwriteable of the mount was successfully downgraded to read-only (takes a forced unmount which tends to break applications but shouldn't break the file system). ffs^Wufs_open() also checks the append-only flag (the attribute one) at open() time, and fails then if O_APPEND is not set. This is half good: The behaviour should only depend on the attributes at the time of the open, just like it does for the file permissions and the immutable attributes. But ffs_write() is broken here -- it uses the append-only attribute at the time of the write. This is half bad: unlike the attributes, the O_APPEND flag is intended to be changed for open files (despite the O_ in its name, it can be changed by fcntl()), so checking it at open time is just inconsistent (unless you deny unsetting O_APPEND if the append-only attribute is set, foofs_write() still has to be prepared for the combination of append flags that is disallowed by ffs_open()). Thus it is a bug to check any append flags at open time. The append-only flag at the time of the open should be copied for use at the time of later writes to the open file descriptor (or is it the open file?), and the O_APPEND flag should only be checked at the time of later writes. Bruce