From: Bruce Evans
Date: Sat, 4 Aug 2012 16:32:25 +1000 (EST)
To: Konstantin Belousov
Cc: arch@FreeBSD.org, David Xu
Subject: Re: short read/write and error code

On Thu, 2 Aug 2012, Konstantin Belousov wrote:

I plan to reply to this in more detail, but don't have time and want to
slow down this thread.

> ...
> So the bugs with losing data shall be fixed in the filesystems.
> Otherwise well-behaving filesystems which do return errors only when
> it is proper to return error are punished.
> ...

My main point is that this has nothing to do with file systems. It is
special files, fifos and many other non-regular files that are the
problem. File systems like ffs already reset uio when they return an
error. Thus there is (should be) no problem for file systems with
clearing `error' at the dofile*() level if uio shows i/o. For special
files, it is more usually correct not to reset uio, and then it is
convenient not to clear `error' until the dofile*() level.

I see a problem with this mainly for (seekable r/w) direct access
devices (DASDs). These are similar to regular files in a critical
respect: suppose you write in the middle of a disk or file and get an
i/o error somewhere. Then sometimes, even after writing parts
successfully, you (you == the system) have no idea where the error was.
The best handling is as in ffs: fail the whole i/o, and reset uio
completely. If you know where the error was, and that it didn't corrupt
the file (say for EFAULT or ENOSPC), then you can do better and return a
short count with no error. ffs doesn't try to do better.

When you don't do better, the application has the burden of figuring out
where the error was and how much of the file was corrupted. Even when
the write was at EOF and ffs has prevented corruption of the file by
truncating it to the original EOF, the application still has a difficult
task to determine that the file wasn't corrupted, because another
application may have moved EOF. Handling EOF perfectly is simpler for
DASDs, for both the system and applications (except now st_size doesn't
tell applications where it is or was).

Reads don't corrupt files, so returning what was read is always good.
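A minimal sketch of the policy described above (prefer a short count to
an error whenever uio shows i/o), written as user-space C rather than
kernel code. The struct io_result and finish_io() names are
hypothetical, invented for illustration; they are not the interfaces in
sys_generic.c, and the real dofile*() code has more cases to handle.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for what the generic layer sees after a
 * lower-level read or write routine returns.  Illustrative names only.
 */
struct io_result {
	size_t	orig_resid;	/* bytes requested */
	size_t	resid;		/* bytes still not transferred */
	int	error;		/* 0, or an errno value from the lower layer */
};

/*
 * Prefer a short count to an error: if any bytes were transferred,
 * report the count and drop the error; otherwise report the error.
 * Returns the byte count, or -1 with *errp set to the error.
 */
static ssize_t
finish_io(const struct io_result *r, int *errp)
{
	size_t done;

	assert(r->resid <= r->orig_resid);
	done = r->orig_resid - r->resid;

	if (r->error != 0 && done == 0) {
		*errp = r->error;
		return (-1);
	}
	*errp = 0;
	return ((ssize_t)done);
}
```

Under this sketch, a lower layer that backs out the whole i/o and resets
uio (resid == orig_resid) still has its error reported, while one that
leaves uio showing progress gets a short count with no error.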
ffs expects dofileread() to prefer a short count to an error, and breaks
the atime when dofileread() is broken:

% 	if ((error == 0 || uio->uio_resid != orig_resid) &&

We have done i/o even when error != 0, so we test both...

% 	    (vp->v_mount->mnt_flag & MNT_NOATIME) == 0 &&
% 	    (ip->i_flag & IN_ACCESS) == 0) {
% 		VI_LOCK(vp);
% 		ip->i_flag |= IN_ACCESS;

... and mark the atime for update when we have done i/o (also when we
have done null i/o successfully).

% 		VI_UNLOCK(vp);
% 	}
% 	return (error);

Then we return both, expecting dofileread() to prefer the i/o. But
dofileread() is broken and prefers the error. Thus the atime is marked
for update even when the i/o is backed out of.

The above is mostly my code. I fixed it in ~1992 and knew about the
bugs in sys_generic.c and hoped to fix them someday. (Some of my tests
check that the atime is not updated on errors.) In Net/2 and
4.4BSD-Lite*, ffs_write() marks the atime for update unconditionally at
the end. It only has one other return statement in ffs_write() (for
EFBIG, for the up-front check of fs_maxfilesize, which is still broken
and requires a related fix (POSIX requires short writes up to the max)).
Thus in 4.4BSD-Lite2, ffs_write() marks the atime for update on all
errors except EFBIG.

Oops, that was too many details.

Bruce