From owner-svn-src-all@FreeBSD.ORG  Fri Feb 10 06:33:03 2012
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CE4CA106566B;
	Fri, 10 Feb 2012 06:33:03 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au
	[211.29.132.188])
	by mx1.freebsd.org (Postfix) with ESMTP id 4EE4B8FC12;
	Fri, 10 Feb 2012 06:33:02 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q1A6X0YG015616
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 10 Feb 2012 17:33:01 +1100
Date: Fri, 10 Feb 2012 17:32:59 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Kirk McKusick <mckusick@FreeBSD.org>
In-Reply-To: <201202092234.q19MYGEJ040871@svn.freebsd.org>
Message-ID: <20120210154420.F882@besplex.bde.org>
References: <201202092234.q19MYGEJ040871@svn.freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
	src-committers@FreeBSD.org
Subject: Re: svn commit: r231313 - head/sys/ufs/ffs
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
	user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 10 Feb 2012 06:33:03 -0000

On Thu, 9 Feb 2012, Kirk McKusick wrote:

> Log:
>  Historically when an application wrote an entire block of a file,
>  the kernel allocated a buffer but did not zero it as it was about
>  to be completely filled by a uiomove() from the user's buffer.
>  However, if the uiomove() failed, the old contents of the buffer
>  could be exposed especially if the file was being mmap'ed. The
>  fix was to always zero the buffer when it was allocated.
>
>  This change first attempts the uiomove() to the newly allocated
>  (and dirty) buffer and only zeros it if the uiomove() fails. The
>  effect is to eliminate the gratuitous zeroing of the buffer in
>  the usual case where the uiomove() successfully fills it.

Lately I have been thinking again of the error handling near here:

% 	if (error) {
% 		if (ioflag & IO_UNIT) {
% 			(void)ffs_truncate(vp, osize,
% 			    IO_NORMAL | (ioflag & IO_SYNC),
% 			    ap->a_cred, uio->uio_td);
% 			uio->uio_offset -= resid - uio->uio_resid;
% 			uio->uio_resid = resid;
% 		}
% 	} else if (resid > uio->uio_resid && (ioflag & IO_SYNC))
% 		error = ffs_update(vp, 1);

This attempts to back out of the whole write after an error.  But it is
impossible to back out of a write in the middle of the file (unless you
save all the blocks written to, and don't get more i/o errors trying to
restore them).  The above makes various messes trying, and even breaks
some cases where the write is at the end of the file: suppose error != 0
and that some i/o has been done successfully (otherwise there is no
problem):

- suppose IO_UNIT is not set.  Then the above fails to do anything.
   Even if not backing out is correct in this case, it is necessary
   to call ffs_update(vp, 1) under the condition where it is called
   in the non-error case, irrespective of whether the write is at
   the end of the file.

   But I think the existence of IO_UNIT is a bug.  It is only used
   here and in VOP_WRITE() for some other file systems not including
   zfs.  The others that use it are ext2fs, msdosfs and xfs (don't
   forget to change their VOPs when changing ffs).  The 2 nfsclients
   have a comment saying that IO_UNIT is not used because all writes
   are atomic.  IO_UNIT is also passed to VOP_READ(), but it is not
   used by any read VOPs.  IO_UNIT is always set by vn_write().
   Thus it is set in almost all cases.  And I can't find any cases
   where it is not set, using grep in kern.  Thus its existence
   does nothing except to make it hard to verify that it works
   correctly because it always has no effect.

- suppose that IO_UNIT is set.
   - we don't check that ffs_truncate() succeeded, and we even explicitly
     (void) it, though we should check.  We blindly proceed to resetting
     the i/o count, so that upper layers see the error and don't see the
     amount written, although the file may have been extended, depending
     on the details of how the truncation failed.
   - when the write is not at EOF and doesn't extend beyond the original
     end, ffs_truncate() usually succeeds and has no visible effect.  We
     blindy proceed to resetting the i/o count, so that upper layers see
     the error and don't see the amount successfully written, although
     this amount is nonzero (since I specified that this case writes
     something before hitting the error).  Upper layers also get no
     indication of the amount of clobbered bytes beyond the end of the
     successively written bytes.  POSIX write semantics work very badly
     here.  If we never return a short write, but always convert it to
     an error (this needs just a bit more force than the above, and might
     be the natural result of fixing these bugs), then writers should
     assume that the file is damaged for the whole region of the file
     that the write was attempted on, but for short writes, writers
     should be able to assume no damage.
   - when the write is not at EOF but extends beyond the original end,
     there is a combination of the above bugs.

Just before here, we turn off the set*id bits if some i/o was done.
This is unnecessary if there was an error and the i/o is successfully
backed out of in a non-buggy way.  But it is safer to do it anyway.
It would be even safer to do it after any error.

My version has the following (mostly wrong) comments and (mostly right)
fixes in this area:

% Index: ffs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vnops.c,v
% retrieving revision 1.130
% diff -u -2 -r1.130 ffs_vnops.c
% --- ffs_vnops.c	21 May 2004 12:05:48 -0000	1.130
% +++ ffs_vnops.c	28 Sep 2010 10:45:17 -0000
% @@ -719,4 +722,8 @@
%  	 * we clear the setuid and setgid bits as a precaution against
%  	 * tampering.
% +	 * XXX too late, the tamperer may have opened the file while we
% +	 * were writing the data (or before).
% +	 * XXX too early, if (error && ioflag & IO_UNIT) then we will
% +	 * unwrite the data.
%  	 */
%  	if (resid > uio->uio_resid && ap->a_cred &&

Probably wrong.  Exlusive access should prevent problems.

% @@ -725,7 +732,12 @@
%  		DIP(ip, i_mode) = ip->i_mode;
%  	}
% +	/* XXX this seems to be misplaced.  Which resid? */
%  	if (resid > uio->uio_resid)
%  		VN_KNOTE(vp, NOTE_WRITE | (extended ? NOTE_EXTEND : 0));

This seems to have been fixed in -current, by moving the note taking to
vfs, so it is now done after the back-out.

%  	if (error) {
% +		/*
% +		 * XXX should truncate to the last successfully written
% +		 * data if the uiomove() failed.
% +		 */
%  		if (ioflag & IO_UNIT) {
%  			(void)UFS_TRUNCATE(vp, osize,

Not a good idea -- see above.

% @@ -735,6 +747,11 @@
%  			uio->uio_resid = resid;
%  		}
% -	} else if (resid > uio->uio_resid && (ioflag & IO_SYNC))
% -		error = UFS_UPDATE(vp, 1);
% +	}
% +
% +	if (uio->uio_resid != resid) {
% +		ip->i_flag |= IN_CHANGE | IN_UPDATE;
% +		if (ioflag & IO_SYNC)
% +			error = UFS_UPDATE(vp, 1);
% +	}
%  	return (error);
%  }

This fixes 2 bugs:
- spurious setting of IN_CHANGE | IN_UPDATE after backing out of the i/o.
   However, this enlarges the bugs when the backout just broke the i/o
   count by resetting it -- now the flags are not set despite the i/o
   having done some combination of successfully changing the file in
   the middle and clobbering the file in the middle.
- 1 of the bugs pointed out above: missing update in the IO_SYNC subcase
   of the error case.  Again this enlarges the bug if the backout just
   broke the i/o count by resetting it -- now the update is not done
   despite the i/o being non-null.

In FreeBSD-1, I changed the corresponding code to do a bit more:

% 	if (error == EFAULT || error && (ioflag & IO_UNIT)) {
% 		(void) itrunc(ip, osize, ioflag & IO_SYNC);
% 		uio->uio_offset -= resid - uio->uio_resid;
% 		uio->uio_resid = resid;
% 	}
% 	if (!error && (ioflag & IO_SYNC))
% 		error = iupdat(ip, &time, &time, 1);

This was a primitive incomplete fix for failing uio's.  Most uio errors
are when write() passes an invalid buffer to probe for security holes.
The error for this case is EFAULT, and the backout is done unconditionally
in this case.  Any user could do this probe because, strangely, IO_UNIT
was far from having no effect in FreeBSD-1 -- it was used for most but
not all kernel writes (mainly for acct, ktrace and core dumps) but not
for user write()s!  I didn't know about the problem with mmap() at the
time, but planned to fix this better someday.

Bruce