From owner-freebsd-bugs@FreeBSD.ORG  Wed Feb 15 14:55:58 2012
Return-Path: <owner-freebsd-bugs@FreeBSD.ORG>
Delivered-To: freebsd-bugs@freebsd.org
Date: Thu, 16 Feb 2012 01:55:54 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Nicolas Bourdaud <nicolas.bourdaud@gmail.com>
In-Reply-To: <4F3BAF7B.2010305@gmail.com>
Message-ID: <20120216005011.E2689@besplex.bde.org>
References: <201202051142.q15Bgrh6041302@red.freebsd.org>
	<20120206050042.E2728@besplex.bde.org> <4F3BAF7B.2010305@gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-bugs@freebsd.org, freebsd-gnats-submit@freebsd.org,
	Bruce Evans <brde@optusnet.com.au>
Subject: Re: kern/164793: 'write' system call violates POSIX standard

On Wed, 15 Feb 2012, Nicolas Bourdaud wrote:

> On 05/02/2012 19:54, Bruce Evans wrote:
>> I think this is actually a bug in POSIX (XSI).  Most programs aren't
>> prepared to deal with short writes, and returning an error like
>> truncate() is specified to is adequate.
>
> I disagree, I think that most programs that check that the write
> succeeded also check that the write was complete. Actually it was

Well, in BSD, programs that don't understand short writes start with
the cp utility in 4.4BSD (it checks for short writes, but then mishandles
them by treating them as errors).  This wasn't fixed in FreeBSD until
1998.
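
For illustration, here is a minimal sketch of a caller that treats a short
write as progress rather than as an error.  The name write_all() is
hypothetical; this is not the code from cp or from any FreeBSD source, only
one common way to write such a loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from cp): write all nbytes of buf to fd.
 * A short write is treated as progress, not as an error.  Returns 0 on
 * success, or -1 with errno set by the write() that failed.
 */
int
write_all(int fd, const void *buf, size_t nbytes)
{
	const char *p = buf;

	assert(buf != NULL || nbytes == 0);
	while (nbytes > 0) {
		ssize_t n = write(fd, p, nbytes);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, nothing written */
			return (-1);	/* real error, e.g. EFBIG or ENOSPC */
		}
		/* Short write: skip what was accepted and retry the rest. */
		p += n;
		nbytes -= (size_t)n;
	}
	return (0);
}
```

With the POSIX behaviour, a write that runs into the file size limit would
first return a short count here, and the retry for the remainder would then
fail with -1/EFBIG, so the caller still sees the error.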

> because my programs were assuming the POSIX behavior that I notice the
> bug. In addition, I think (this must be confirmed) that the bug don't
> affect the version 8.2... So the programs are already facing the POSIX

No, it was in 4.4BSD, and hasn't been changed in FreeBSD since 1994.
8.2 only differs in having the check in all file systems instead of
in vfs.  Perhaps some file systems got it right, but ffs didn't.

> behavior. Moreover the programs that are cross platform (in particular
> ported to Linux) are already facing this behavior.
>
> Whatever is decided, either freebsd should conform to the POSIX
> standard, either the standard should be changed.

It must conform, since it is too late to fix standards.

I forgot about this when I looked at ffs's handling of i/o errors recently.
There are many more bugs.  ffs normally tries to back out of writes
completely after an i/o error, by using ftruncate() to return to the
original file size.  Garbage written to the disk or memory is too hard
to back out of, but ffs avoids security holes by zeroing it in memory (in
case it is mmap()ed) and by making it inaccessible by normal means on
the disk (ftruncate() does this).  When the error is ENOSPC due to a
full disk, this gives the same behaviour as ffs has now for EFBIG for
the file size being too big (due to the maximum size for the file
system, or the rlimit).  POSIX has looser wording for the ENOSPC error.
It says that ENOSPC shall be returned if there "was" no space...  This
can be interpreted as requiring the same things as EFBIG -- that if there
was any space to begin with, ENOSPC is not required to be returned;
presumably the write() should succeed in writing as much as possible since
there is no other reasonable error.

But ffs's behaviour is "correct" here.  The most broken case here is for
an i/o error for a write in the middle of a file.  Then it is not reasonable
to try to back out.  ffs doesn't do the ftruncate() in this case.  But it
still tries to back out.  This results in write() returning -1/EIO.  This
is wrong if something has been successfully written.  On second thoughts,
it is the best possible behaviour.  Everything in the region of the
file covered by the write() may have been clobbered, either by writing
the requested bytes, or by a hardware or software error writing garbage,
or by the intentional zeroing for security.  The only way to tell the
application about this is to say that the whole write failed.  The
application should assume that the entire region has been clobbered,
and take steps to check and limit the extent of the damage, perhaps
by trying to rewrite it all in smaller pieces.
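
The "rewrite it all in smaller pieces" step could look roughly like the
sketch below.  The name rewrite_in_chunks() and its chunk parameter are
assumptions made for illustration; the application is assumed to still
hold the data it tried to write.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical recovery helper: after write() failed with EIO, rewrite
 * the region [off, off + nbytes) from buf in pieces of at most chunk
 * bytes.  Returns the number of leading bytes that were rewritten; if
 * this is less than nbytes, errno describes why the next piece failed
 * (unless pwrite() returned 0, in which case errno is left untouched).
 */
size_t
rewrite_in_chunks(int fd, const void *buf, size_t nbytes, off_t off,
    size_t chunk)
{
	const char *p = buf;
	size_t done = 0;

	assert(chunk > 0);
	while (done < nbytes) {
		size_t want = nbytes - done < chunk ? nbytes - done : chunk;
		ssize_t n = pwrite(fd, p + done, want, off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try this piece again */
			break;		/* damage starts at off + done */
		}
		if (n == 0)
			break;		/* no progress; give up */
		done += (size_t)n;
	}
	return (done);
}
```

The return value bounds the damage: bytes before off + done were
rewritten, and everything from there to the end of the region should still
be treated as clobbered.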

There seem to be more bugs in [f]truncate():
- POSIX requires SIGXFSZ for attempts to exceed the file size rlimit
   in truncate() too, but FreeBSD doesn't even check the rlimit for
   truncate().

Checking the rlimit in vfs makes all this easier to fix.  I think
write() can be fixed in a couple of lines in vfs.  All file systems
call back to vfs to check, though I don't know of any requirement for
other errors to have precedence, so vfs could check up front.  zfs's
write vnop actually calls back to vfs before doing anything else, so
this error already has precedence over all fs-specific errors for zfs.
All other file systems' write vnops do the check a fair way into the
vnop in much the same place as ffs.  No file systems check the limit
for truncate().  The limit checking is commented out in xfs's write
vnop.
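
As a rough user-space model of the up-front check (not FreeBSD's vfs code;
the function name and signature are invented for this sketch), the POSIX
rule is: write as many bytes as fit under the limit, and fail with EFBIG
only when no byte fits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a file size limit check done before the write.
 * Given the starting offset, the requested byte count and the limit
 * (rlimit or maximum file size), return how many bytes the write may
 * transfer.  Returns -1 with errno = EFBIG when there is no room for
 * any byte; a real implementation would also post SIGXFSZ in that case.
 */
int64_t
clamp_write_to_limit(int64_t offset, size_t nbytes, int64_t limit)
{
	assert(offset >= 0 && limit >= 0);
	if (nbytes == 0)
		return (0);		/* nothing requested, nothing to check */
	if (offset >= limit) {
		errno = EFBIG;		/* no room for even one byte */
		return (-1);
	}
	if ((uint64_t)(limit - offset) < nbytes)
		return (limit - offset);	/* short write up to the limit */
	return ((int64_t)nbytes);	/* the whole request fits */
}
```

For example, a 100-byte write at offset 0 with a limit of 40 is clamped to
40 bytes, and the following write at offset 40 gets -1/EFBIG.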

Bruce