From owner-svn-src-all@FreeBSD.ORG Fri Feb 10 06:33:03 2012 Return-Path: Delivered-To: svn-src-all@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CE4CA106566B; Fri, 10 Feb 2012 06:33:03 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id 4EE4B8FC12; Fri, 10 Feb 2012 06:33:02 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q1A6X0YG015616 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 10 Feb 2012 17:33:01 +1100 Date: Fri, 10 Feb 2012 17:32:59 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Kirk McKusick In-Reply-To: <201202092234.q19MYGEJ040871@svn.freebsd.org> Message-ID: <20120210154420.F882@besplex.bde.org> References: <201202092234.q19MYGEJ040871@svn.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org Subject: Re: svn commit: r231313 - head/sys/ufs/ffs X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 10 Feb 2012 06:33:03 -0000 On Thu, 9 Feb 2012, Kirk McKusick wrote: > Log: > Historically when an application wrote an entire block of a file, > the kernel allocated a buffer but did not zero it as it was about > to be completely filled by a uiomove() from the user's buffer. > However, if the uiomove() failed, the old contents of the buffer > could be exposed especially if the file was being mmap'ed. The > fix was to always zero the buffer when it was allocated. > > This change first attempts the uiomove() to the newly allocated > (and dirty) buffer and only zeros it if the uiomove() fails. The > effect is to eliminate the gratuitous zeroing of the buffer in > the usual case where the uiomove() successfully fills it. Lately I have been thinking again of the error handling near here: % if (error) { % if (ioflag & IO_UNIT) { % (void)ffs_truncate(vp, osize, % IO_NORMAL | (ioflag & IO_SYNC), % ap->a_cred, uio->uio_td); % uio->uio_offset -= resid - uio->uio_resid; % uio->uio_resid = resid; % } % } else if (resid > uio->uio_resid && (ioflag & IO_SYNC)) % error = ffs_update(vp, 1); This attempts to back out of the whole write after an error. But it is impossible to back out of a write in the middle of the file (unless you save all the blocks written to, and don't get more i/o errors trying to restore them). The above makes various messes trying, and even breaks some cases where the write is at the end of the file: suppose error != 0 and that some i/o has been done successfully (otherwise there is no problem): - suppose IO_UNIT is not set. Then the above fails to do anything. Even if not backing out is correct in this case, it is necessary to call ffs_update(vp, 1) under the condition where it is called in the non-error case, irrespective of whether the write is at the end of the file. But I think the existence of IO_UNIT is a bug. It is only used here and in VOP_WRITE() for some other file systems not including zfs. The others that use it are ext2fs, msdosfs and xfs (don't forget to change their VOPs when changing ffs). The 2 nfsclients have a comment saying that IO_UNIT is not used because all writes are atomic. IO_UNIT is also passed to VOP_READ(), but it is not used by any read VOPs. IO_UNIT is always set by vn_write(). Thus it is set in almost all cases. And I can't find any cases where it is not set, using grep in kern. Thus its existence does nothing except to make it hard to verify that it works correctly because it always has no effect. - suppose that IO_UNIT is set. - we don't check that ffs_truncate() succeeded, and we even explicitly (void) it, though we should check. We blindly proceed to resetting the i/o count, so that upper layers see the error and don't see the amount written, although the file may have been extended, depending on the details of how the truncation failed. - when the write is not at EOF and doesn't extend beyond the original end, ffs_truncate() usually succeeds and has no visible effect. We blindy proceed to resetting the i/o count, so that upper layers see the error and don't see the amount successfully written, although this amount is nonzero (since I specified that this case writes something before hitting the error). Upper layers also get no indication of the amount of clobbered bytes beyond the end of the successively written bytes. POSIX write semantics work very badly here. If we never return a short write, but always convert it to an error (this needs just a bit more force than the above, and might be the natural result of fixing these bugs), then writers should assume that the file is damaged for the whole region of the file that the write was attempted on, but for short writes, writers should be able to assume no damage. - when the write is not at EOF but extends beyond the original end, there is a combination of the above bugs. Just before here, we turn off the set*id bits if some i/o was done. This is unnecessary if there was an error and the i/o is successfully backed out of in a non-buggy way. But it is safer to do it anyway. It would be even safer to do it after any error. My version has the following (mostly wrong) comments and (mostly right) fixes in this area: % Index: ffs_vnops.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vnops.c,v % retrieving revision 1.130 % diff -u -2 -r1.130 ffs_vnops.c % --- ffs_vnops.c 21 May 2004 12:05:48 -0000 1.130 % +++ ffs_vnops.c 28 Sep 2010 10:45:17 -0000 % @@ -719,4 +722,8 @@ % * we clear the setuid and setgid bits as a precaution against % * tampering. % + * XXX too late, the tamperer may have opened the file while we % + * were writing the data (or before). % + * XXX too early, if (error && ioflag & IO_UNIT) then we will % + * unwrite the data. % */ % if (resid > uio->uio_resid && ap->a_cred && Probably wrong. Exlusive access should prevent problems. % @@ -725,7 +732,12 @@ % DIP(ip, i_mode) = ip->i_mode; % } % + /* XXX this seems to be misplaced. Which resid? */ % if (resid > uio->uio_resid) % VN_KNOTE(vp, NOTE_WRITE | (extended ? NOTE_EXTEND : 0)); This seems to have been fixed in -current, by moving the note taking to vfs, so it is now done after the back-out. % if (error) { % + /* % + * XXX should truncate to the last successfully written % + * data if the uiomove() failed. % + */ % if (ioflag & IO_UNIT) { % (void)UFS_TRUNCATE(vp, osize, Not a good idea -- see above. % @@ -735,6 +747,11 @@ % uio->uio_resid = resid; % } % - } else if (resid > uio->uio_resid && (ioflag & IO_SYNC)) % - error = UFS_UPDATE(vp, 1); % + } % + % + if (uio->uio_resid != resid) { % + ip->i_flag |= IN_CHANGE | IN_UPDATE; % + if (ioflag & IO_SYNC) % + error = UFS_UPDATE(vp, 1); % + } % return (error); % } This fixes 2 bugs: - spurious setting of IN_CHANGE | IN_UPDATE after backing out of the i/o. However, this enlarges the bugs when the backout just broke the i/o count by resetting it -- now the flags are not set despite the i/o having done some combination of successfully changing the file in the middle and clobbering the file in the middle. - 1 of the bugs pointed out above: missing update in the IO_SYNC subcase of the error case. Again this enlarges the bug if the backout just broke the i/o count by resetting it -- now the update is not done despite the i/o being non-null. In FreeBSD-1, I changed the corresponding code to do a bit more: % if (error == EFAULT || error && (ioflag & IO_UNIT)) { % (void) itrunc(ip, osize, ioflag & IO_SYNC); % uio->uio_offset -= resid - uio->uio_resid; % uio->uio_resid = resid; % } % if (!error && (ioflag & IO_SYNC)) % error = iupdat(ip, &time, &time, 1); This was a primitive incomplete fix for failing uio's. Most uio errors are when write() passes an invalid buffer to probe for security holes. The error for this case is EFAULT, and the backout is done unconditionally in this case. Any user could do this probe because, strangely, IO_UNIT was far from having no effect in FreeBSD-1 -- it was used for most but not all kernel writes (mainly for acct, ktrace and core dumps) but not for user write()s! I didn't know about the problem with mmap() at the time, but planned to fix this better someday. Bruce