Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 7 Jul 2004 18:23:57 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        KOIE Hidetaka <koie@suri.co.jp>
Cc:        freebsd-gnats-submit@FreeBSD.org
Subject:   Re: kern/68690: write(2) returns wrong vlalue when EFAULT 
Message-ID:  <20040707172014.C3185@gamplex.bde.org>
In-Reply-To: <200407051221.i65CLC3Q057710@www.freebsd.org>
References:  <200407051221.i65CLC3Q057710@www.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 5 Jul 2004, KOIE Hidetaka wrote:

> >Description:
> Invoking write(fd, buf, size), if buf has both validand invalid segment,
> write return -1, but it's file pointer has been advanced.
> The caller mistakes the pointer stays.

This is a longstanding bug.  It is more of a problem for non-seekable
devices since physically written can't be undone, so the best possible
error handling is to return a short write as specified in all versions
of POSIX.

Do you actually see the file pointer advanced?  This may be file system
dependent.  ffs is supposed to back out of the write, and it does so
for me.  Output:

%%%
pos=0
write(20480)->-1 (should be 12288)
write: Bad address
pos=0 (should be 12288)
%%%

This is correct.  We tried to write 20K but the buffer size is only 12K
and the was an i/o error after 12K.  ffs_write() handled the error by
truncating the file to its original size and rerstoring the original
file system, and since the write was at the end of the file this has
the same result as if the write had failed with EFAULT after 0K.  I
think there is also no way for other processes to see the write while
it is in progress, since even if they have mmapped the file they can't
see beyond its original length.

To show an actual bug for ffs, modifiy the example slightly so that
it overwrites existing data.  Then truncating the file has no effect,
and it is just wrong for write() to return -1 after clobbering some
data.

> >Fix:

(1) Here is ffs_write()'s buggy error handling:

% 	for (error = 0; uio->uio_resid > 0;) {
% 		...
% 		if (uio->uio_offset + xfersize > ip->i_size)
% 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);

Not error handling, but general setup.  Hopefully extending the size at the
vm level doesn't allow other other processes to see up to the new EOF yet.

%
%                 /*
% 		 * We must perform a read-before-write if the transfer size
% 		 * does not cover the entire buffer.
%                  */
% 		if (fs->fs_bsize > xfersize)
% 			flags |= BA_CLRBUF;
% 		else
% 			flags &= ~BA_CLRBUF;

Security-related code.  Other processes can see buffer contents via
mmap(), at least after completion of extension of the file at the vfs
level, so we must clear the buffer contents now.  This also avoids
potential leakage of unwritten buffer contents to the same process
after buggy EFAULT handling.

% 		...
% 		/*
% 		 * If the buffer is not valid we have to clear out any
% 		 * garbage data from the pages instantiated for the buffer.
% 		 * If we do not, a failed uiomove() during a write can leave
% 		 * the prior contents of the pages exposed to a userland
% 		 * mmap().  XXX deal with uiomove() errors a better way.
% 		 */
% 		if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize)
% 			vfs_bio_clrbuf(bp);

More buffer clearing, for another case.  It's not so clear that this is
done before other processes can see the buffer.

% 	...
% 	/*
% 	 * If we successfully wrote any data, and we are not the superuser
% 	 * we clear the setuid and setgid bits as a precaution against
% 	 * tampering.
% 	 */
% 	if (resid > uio->uio_resid && ap->a_cred &&
% 	    suser_cred(ap->a_cred, PRISON_ROOT)) {
% 		ip->i_mode &= ~(ISUID | ISGID);
% 		DIP(ip, i_mode) = ip->i_mode;
% 	}

Misplaced code.  We are going to back out of the write in some cases, so
we don't know if we successefully wrote any data here.

% 	...
% 	if (error) {
% 		if (ioflag & IO_UNIT) {
% 			(void)UFS_TRUNCATE(vp, osize,
% 			    IO_NORMAL | (ioflag & IO_SYNC),
% 			    ap->a_cred, uio->uio_td);
% 			uio->uio_offset -= resid - uio->uio_resid;
% 			uio->uio_resid = resid;
% 		}
% 	} ...

Here is where we back out of the write in some cases.
(a) We only back out in the IO_UNIT case, but that is the usual case and
    is always the case for write(2).  (There is another bug suite
    involving IO_UNIT.  Most references to IO_UNIT in the kernel except
    those related to userland i/o don't actually give atomic writes
    since writes are made non-atomic using vfs_rdwr_inchunks() anyway.
    OTOH, there seems to be no reason for ffs to skip the backout in
    the !IO_UNIT case, so there is no need for the IO_UNIT flag to exist.)
(b) We cannot back out if anything was written before the original end of
    the file.  Truncation just increases the mess in that case, since it
    only reduces the file to its original size, but the file position
    (in uio->uio_offset) is reduced by the full amount.

Fixes for ffs_write(): ignore IO_UNIT, and don't rewind the file pointer
below osize.  This gives short for writes that wrote something before
osize and then failed.

Similarly for other file systems.

Here is write(2)'s buggy error handling (in dofilewrite()):

% 	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
% 		if (auio.uio_resid != cnt && (error == ERESTART ||
% 		    error == EINTR || error == EWOULDBLOCK))
% 			error = 0;

(auio.uio_resid != cnt) means that the write was short.  It should be
returned to userland as a short write by setting error to 0 unconditionally.
LOwer layers mostly get this write and return a short write if and only
if they couldn't back out of the write, but they still return a nonzero
error code to indicate that there was a problem.  The userland interface
(write(2) in this case) just doesn't support returning both an error code
and a byte count, so one of them must be discarded, and it must be the
error code since it is useful and POSIX standard to return the byte count.

Since ffs attempts to backs out of failing writes, and the backup is
successful in most cases, this bug mainly affects devices.

Similarly for read(2) and writev(2), etc.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20040707172014.C3185>