From owner-freebsd-arch@FreeBSD.ORG  Sat Aug  4 06:32:35 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id C774A106564A;
	Sat,  4 Aug 2012 06:32:35 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au
	[211.29.132.190])
	by mx1.freebsd.org (Postfix) with ESMTP id 5C84D8FC0A;
	Sat,  4 Aug 2012 06:32:34 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q746WPXr005666
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 4 Aug 2012 16:32:26 +1000
Date: Sat, 4 Aug 2012 16:32:25 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120802170526.GC2676@deviant.kiev.zoral.com.ua>
Message-ID: <20120804154317.C791@besplex.bde.org>
References: <5018992C.8000207@freebsd.org>
	<20120801071934.GJ2676@deviant.kiev.zoral.com.ua>
	<20120801183240.K1291@besplex.bde.org>
	<20120801162836.GO2676@deviant.kiev.zoral.com.ua>
	<20120802040542.G2978@besplex.bde.org>
	<20120802100240.GV2676@deviant.kiev.zoral.com.ua>
	<20120802222245.D2585@besplex.bde.org>
	<20120802170526.GC2676@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, David Xu <davidxu@FreeBSD.org>
Subject: Re: short read/write and error code
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 04 Aug 2012 06:32:36 -0000

On Thu, 2 Aug 2012, Konstantin Belousov wrote:

I plan to reply to this in more detail, but don't have time and want
to slow down this thread.

> ...
> So the bugs with losing data shall be fixed in the filesystems.
> Otherwise well-behaving filesystems which do return errors only when
> it is proper to return error are punished.
> ...

My main point is that this has nothing to do with file systems.  It
is special files, fifos and many other non-regular files that are
the problem.  File systems like ffs already reset uio when they return
an error.  Thus there is (should be) no problem for file systems with
clearing `error' at the dofile*() level if uio shows i/o.  For special
files, it is more usually correct to not reset uio, and then it is
convenient to not clear `error' until the dofile*() level.
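
A minimal userland sketch of the policy argued for here, not the
current code in sys_generic.c.  struct fake_uio and prefer_io() are
hypothetical names; only the residual count of a real uio is modelled.
If the lower layer left uio showing i/o, the count wins and `error' is
cleared; the error is reported only when nothing was transferred.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct uio: only the residual is modelled. */
struct fake_uio {
	size_t	resid;		/* bytes not yet transferred */
};

/*
 * Hypothetical helper for the dofile*() level.  Returns the byte count
 * with *errp cleared if any i/o was done, else -1 with *errp = error.
 * Error values are ordinary errno.h codes such as EIO.
 */
static ssize_t
prefer_io(const struct fake_uio *uio, size_t orig_resid, int error,
    int *errp)
{
	size_t done;

	/* A lower layer must never leave more residual than it was given. */
	assert(uio->resid <= orig_resid);
	done = orig_resid - uio->resid;

	if (error != 0 && done == 0) {
		/* Nothing transferred (or uio was reset): report the error. */
		*errp = error;
		return (-1);
	}
	/* Some i/o was done (or no error): the count takes precedence. */
	*errp = 0;
	return ((ssize_t)done);
}
```

With this rule a file system that resets uio before returning an error
still gets its error reported, since done == 0 in that case.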

I see a problem with this mainly for (seekable r/w) direct access
devices (DASDs).  These are similar to regular files in a critical
respect: suppose you write in the middle of a disk or file and get an
i/o error somewhere.  Then sometimes, even after writing parts
successfully, you (you == the system) have no idea where the error
was.  The best handling is as in ffs: fail the whole i/o, and reset
uio completely.  If you know where the error was, and that it didn't
corrupt the file (say for EFAULT or ENOSPC), then you can do better
and return a short count with no error.  ffs doesn't try to do better.
When you don't do better, the application has the burden of figuring
out where the error was and how much of the file was corrupted.  Even
when the write was at EOF and ffs has prevented corruption of the file
by truncating it to the original EOF, the application still has a
difficult task to determine that the file wasn't corrupted, because
another application may have moved EOF.  Handling EOF perfectly is
simpler for DASDs, for both the system and applications (except now
st_size doesn't tell applications where it is or was).
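
The "fail the whole i/o" handling can be sketched in userland as
follows.  This is an illustration under assumptions, not ffs code:
struct fake_uio, write_all_or_nothing() and the demo device are all
hypothetical, and only the uio accounting is modelled (the sketch does
not undo anything already written, and does not truncate back to the
original EOF).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct uio: offset and residual only. */
struct fake_uio {
	off_t	offset;		/* current position in the file or disk */
	size_t	resid;		/* bytes not yet transferred */
};

/* Hypothetical device: writes extending past this offset fail with EIO. */
static off_t demo_bad_offset = 8;

static int
demo_write_chunk(off_t offset, size_t len)
{
	return (offset + (off_t)len > demo_bad_offset ? EIO : 0);
}

/*
 * Write in chunks; on any error restore uio to its state at entry, so
 * the caller sees an error and no i/o at all.
 */
static int
write_all_or_nothing(struct fake_uio *uio, size_t chunk,
    int (*write_chunk)(off_t, size_t))
{
	const struct fake_uio saved = *uio;
	int error = 0;

	if (chunk == 0)
		return (EINVAL);	/* avoid looping without progress */
	while (uio->resid > 0) {
		size_t n = uio->resid < chunk ? uio->resid : chunk;

		error = write_chunk(uio->offset, n);
		if (error != 0)
			break;
		uio->offset += (off_t)n;
		uio->resid -= n;
	}
	if (error != 0)
		*uio = saved;		/* back out: report no progress */
	return (error);
}
```

Doing better (a short count with no error) would mean skipping the
restore when the position of the error is known and nothing before it
was damaged.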

Reads don't corrupt files, so returning what was read is always good.
ffs expects dofileread() to prefer a short count to an error, and
breaks the atime when dofileread() is broken:

% 	if ((error == 0 || uio->uio_resid != orig_resid) &&

We have done i/o even when error != 0, so we test both...

% 	    (vp->v_mount->mnt_flag & MNT_NOATIME) == 0 &&
% 	    (ip->i_flag & IN_ACCESS) == 0) {
% 		VI_LOCK(vp);
% 		ip->i_flag |= IN_ACCESS;

... and mark the atime for update when we have done i/o (also when we
have done null i/o successfully).

% 		VI_UNLOCK(vp);
% 	}
% 	return (error);

Then we return both, expecting dofileread() to prefer the i/o.  But
dofileread() is broken and prefers the error.  Thus the atime is
marked for update even when the i/o is backed out of.
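
The mismatch can be shown with a small userland model.  The names
marks_atime() and upper_layer_reports_error() are hypothetical, and the
mount and inode flags from the excerpt above are reduced to two
booleans; this is a sketch of the logic only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical mirror of the ffs test quoted above: the atime is marked
 * when the read succeeded or moved any data, unless the mount has
 * MNT_NOATIME (noatime) or IN_ACCESS is already set (access_pending).
 */
static bool
marks_atime(int error, size_t resid, size_t orig_resid, bool noatime,
    bool access_pending)
{
	return ((error == 0 || resid != orig_resid) &&
	    !noatime && !access_pending);
}

/*
 * Hypothetical model of an upper layer that prefers the error: any
 * nonzero error is reported, whether or not data was transferred.
 */
static bool
upper_layer_reports_error(int error)
{
	return (error != 0);
}
```

For a partial read that ends in EIO (say resid 100 of 512), both
functions return true: the atime is marked although the caller is told
that the read failed.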

The above is mostly my code.  I fixed it in ~1992 and knew about the
bugs in sys_generic.c and hoped to fix them someday.  (Some of my
tests check that the atime is not updated on errors.)  In Net/2 and
4.4BSD-Lite*, ffs_write() marks the atime for update unconditionally
at the end.  It only has one other return statement in ffs_write()
(for EFBIG, for the up-front check of fs_maxfilesize, which is still
broken and requires a related fix (POSIX requires short writes up to
the max)).  Thus in 4.4BSD-Lite2, ffs_write() marks the atime for
update on all errors except EFBIG.

Oops, that was too many details.

Bruce