From owner-freebsd-standards@FreeBSD.ORG Thu Apr 21 04:56:16 2005 Return-Path: Delivered-To: freebsd-standards@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C426B16A4CE for ; Thu, 21 Apr 2005 04:56:16 +0000 (GMT) Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id F29EA43D45 for ; Thu, 21 Apr 2005 04:56:15 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.0.87])j3L4uAml002046; Thu, 21 Apr 2005 14:56:10 +1000 Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) j3L4u50I025325; Thu, 21 Apr 2005 14:56:07 +1000 Date: Thu, 21 Apr 2005 14:56:05 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Marc Olzheim In-Reply-To: <20050420173859.GA99695@stack.nl> Message-ID: <20050421135507.D88964@delplex.bde.org> References: <20050419133227.GA11612@stack.nl> <20050419151800.GE1157@green.homeunix.org> <20050419160258.GA12287@stack.nl> <20050419160900.GB12287@stack.nl> <20050419204723.GG1157@green.homeunix.org> <20050420142448.GH1157@green.homeunix.org> <16998.36437.809896.936800@khavrinen.csail.mit.edu> <20050420173859.GA99695@stack.nl> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed cc: Garrett Wollman cc: freebsd-standards@FreeBSD.org Subject: Re: NFS client/buffer cache deadlock X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Apr 2005 04:56:16 -0000 On Wed, 20 Apr 2005, Marc Olzheim wrote: > On Wed, Apr 20, 2005 at 01:16:05PM -0400, Garrett Wollman wrote: >> < said: >> >>> Btw.: I'm not sure write(),writev() and pwrite() are allowed to do short >>> writes on regular files... ? >> >> I believe it is the intent of the Standard to prohibit this (a >> paragraph in the rationale says that short writes can only happen if >> O_NONBLOCK is set, but this is clearly wrong because the normative >> text says end-of-medium also results in a short write) but there does >> not appear to be any language which requires atomic behavior for >> descriptors other than pipes and FIFOs. >> >> As a quality-of-implementation matter, for writes to regular files not >> to be atomic would be considered surprising. >> >> -GAWollman > > Could someone from standards comment here ? I believe Garrett is > right... > (thread is on -hackers and -current) I can't see anywhere that POSIX actually requires this. ffs only backs out short writes in some (most) cases. I have the following notes about this: % Index: ffs_vnops.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vnops.c,v % retrieving revision 1.130 % diff -u -2 -r1.130 ffs_vnops.c % --- ffs_vnops.c 21 May 2004 12:05:48 -0000 1.130 % +++ ffs_vnops.c 29 May 2004 13:16:07 -0000 % @@ -719,4 +730,8 @@ % * we clear the setuid and setgid bits as a precaution against % * tampering. % + * XXX too late, the tamperer may have opened the file while we % + * were writing the data (or before). This comment is probably wrong, since locking should prevent concurrent access. % + * XXX too early, if (error && ioflag & IO_UNIT) then we will % + * unwrite the data. (1) % */ % if (resid > uio->uio_resid && ap->a_cred && % @@ -725,7 +740,12 @@ % DIP(ip, i_mode) = ip->i_mode; % } % + /* XXX this seems to be misplaced. Which resid? */ % if (resid > uio->uio_resid) % VN_KNOTE(vp, NOTE_WRITE | (extended ? NOTE_EXTEND : 0)); % if (error) { % + /* % + * XXX should truncate to the last successfully written % + * data if the uiomove() failed. % + */ (2) % if (ioflag & IO_UNIT) { % (void)UFS_TRUNCATE(vp, osize, In FreeBSD-1, I fixed the bug mentioned in comments (1) and (2) by backing out short writes in the error == EFAULT case as well as in the IO_UNIT case (most writes are with IO_UNIT so short ones get backed out without this). Now I think that IO_UNIT shouldn't exist and all writes to regular files should be as atomic as possible, and in particular, short writes should always be backed out in the above. IO_UNIT is set in almost all cases. It is set in vn_write(), so it is normally set for regular files. It is set for all (?) writes generated by the kernel itself. E.g., it is set for core dumps, although that use is bogus since core dumps aren't atomic and even the atomicness that they once had has rotted (*). I thought that IO_UNIT was used in all important cases, but I recently noticed that nfs doesn't seem to use it. Perhaps that is the bug that initiated this thread. Individual file systems barely use IO_UNIT. ffs's only use of it is to give low quality behaviour in the above. Handling of short writes (and reads) is broken for non-regular files too. One of my notes about this is: % Index: sys_generic.c % =================================================================== % RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v % retrieving revision 1.131 % diff -u -2 -r1.131 sys_generic.c % --- sys_generic.c 5 Apr 2004 21:03:35 -0000 1.131 % +++ sys_generic.c 6 Apr 2004 10:17:12 -0000 % @@ -420,4 +415,5 @@ % bwillwrite(); % if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) { % + /* XXX short write botch - should && be ||? */ % if (auio.uio_resid != cnt && (error == ERESTART || % error == EINTR || error == EWOULDBLOCK)) The condition for a short write is (auio.uio_resid != cnt). When there is a short write, the amount that was written must be returned. This requires ignoring `error', but `error' is only ignored if it is ERESTART, EINTR or EWOULDBLOCK. One case that is broken is writing to a tty that gets hung up after writing something. Then a short (non-null) write with error = EIO is returned. This case is unimportant, since writing the data at the level of tty.c is no guarantee that it has gone further than the driver's buffers (possibly several layers of these), let alone that it has been seen by its intended recipient. Short writes for tape drives are probably the most important. This bug would be more obvious if it happened for regular files. Then callers can easily see that the file was extended despite write() bogusly returning -1. In fact, my simple test for this failure (written in 1996) still acts strangely for nfs under my 1 year old version of -current: %%% #include #include #include #define SIZE 0x10000 main() { void *buf; int n; buf = malloc(0x10000); n = write(1, buf, 0x20000); perror(""); fprintf(stderr, "write %d\n", n); fprintf(stderr, "offset %qd\n", lseek(1, 0ll, SEEK_CUR)); } %%% This doesn't show any bugs directly. write() returns -1 and the offset is correct (65536). However, ls shows a file size of 73728 (8192 too long), and trying to read the extra bytes returns 0 and magically fixes up the file size to 65536. (*) On non-atomicity of core dumps: In FreeBSD-1, the vnode was locked throughout a core dump, so writing cores was atomic even if there are several sections, but if one of the writes failed then the previous successful ones were not backed out so the core is garbage after an error, and there is no indication of the error to userland. Now the vnode is intentionally left unlocked and even natural large blocks (sections for ELF) are intentionally written non-atomically using vn_rdwr_inchunks(), so that that the whole file system doesn't tend to hang while writing cores. Error handling after a write failure is no better, and applications can see cores that are garbage because thet are incomplete. Bruce