From owner-freebsd-fs@freebsd.org Sun Jan 17 07:35:56 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 39ECAA85747 for ; Sun, 17 Jan 2016 07:35:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id DC14B1E59 for ; Sun, 17 Jan 2016 07:35:55 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 95D3F1043BBB; Sun, 17 Jan 2016 18:35:44 +1100 (AEDT) Date: Sun, 17 Jan 2016 18:35:43 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: Rick Macklem , FreeBSD Filesystems , Kirk McKusick Subject: Re: panic ffs_truncate3 (maybe fuse being evil) In-Reply-To: <20160117035858.GO3942@kib.kiev.ua> Message-ID: <20160117172549.C11309@besplex.bde.org> References: <1696608910.154845456.1452438117036.JavaMail.zimbra@uoguelph.ca> <1773157955.158922767.1452698181137.JavaMail.zimbra@uoguelph.ca> <1351730674.159022044.1452699617235.JavaMail.zimbra@uoguelph.ca> <20160114092934.GL72455@kib.kiev.ua> <964333498.161527381.1452827658163.JavaMail.zimbra@uoguelph.ca> <20160115095749.GC3942@kib.kiev.ua> <1817287612.162823118.1452909605928.JavaMail.zimbra@uoguelph.ca> <20160116191116.GI3942@kib.kiev.ua> <853868666.163292727.1452986431921.JavaMail.zimbra@uoguelph.ca> <20160117035858.GO3942@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=R4L+YolX c=1 sm=1 tr=0 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=JzwRw_2MAAAA:8 a=kj9zAlcOel0A:10 a=RQzWn0Zg5Lz1cNM0FkkA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Jan 2016 07:35:56 -0000 On Sun, 17 Jan 2016, Konstantin Belousov wrote: > On Sat, Jan 16, 2016 at 06:20:31PM -0500, Rick Macklem wrote: >> Kostik wrote: >>> Was IO_EXT flag passed to the ffs_truncate() invocation where the >>> assert ffs_truncate3 fired ? >>> >> Yes. The only call to UFS_TRUNCATE() in ufs_inactive() specified both >> IO_EXT | IO_NORMAL. > > Please try this. > > diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c > index 381f6f8..ecc3f9b 100644 > --- a/sys/ufs/ffs/ffs_vnops.c > +++ b/sys/ufs/ffs/ffs_vnops.c > @@ -1313,7 +1313,8 @@ ffs_close_ea(struct vnode *vp, int commit, struct ucred *cred, struct thread *td > /* XXX: I'm not happy about truncating to zero size */ > if (ip->i_ea_len < dp->di_extsize) > error = ffs_truncate(vp, 0, IO_EXT, cred); > - error = ffs_extwrite(vp, &luio, IO_EXT | IO_SYNC, cred); > + error = ffs_extwrite(vp, &luio, IO_EXT | IO_SYNC | IO_UNIT, > + cred); > } IO_UNIT shouldn't exist. Its only use is to create bugs by omitting it in callers or not supporting it in callees. These bugs are rare because since vn_write() always sets it. Hopefully callees that ignore it are rare. fuse ignores it but says that it always does it since it is not worth optimizing rare cases. fuse probably doesn't understand that rare means never for normal writes. zfs also checks it for reads, but it is never passed for reads AFAIK (vn_read() never sets it). ffs does some backing out even if IO_UNIT is not set. IO_UNIT seems to have last been much more than a no-op in Feb 1997 in FreeBSD and in 1993 in 4.4BSD. Then it was used to create bugs in most callers by not setting it in the main vn_write() caller. This bug was in 4.4BSD-Lite1 in 1993 but was fixed in 4.4BSD-Lite2 in 1995. FreeBSD imported the 1-line fix in 1997. This fix had some style bugs (initialization in declaration) which were fixed in 1999. The larger style bug of not removing the flag is still there. Grepping for VOP_WRITE in kern shows just 1 caller that is missing the flag, and that caller is broken: vop_stdallocate() passes ioflag 0, and then has the usual buggy error handling for non-atomic and/or short writes. IO_UNIT is needed more for avoiding short writes than for giving atomicity, since most callers have the usual sloppy error handling and sys_generic.c still has its old bugs for short i/o's. Non-buggy error handling might be as simple as backing out of the whole i/o, just like IO_UNIT would do for disk files. The actual handling in vop_stdallocate() is to abort. When combined with uio, this should result in a short write or a successful short write. But upper layers can't be trusted to handle short writes correctly. VOP_ALLOCATE()'s caller kern_posix_fallocate() seems to understand this stuff even less than old write calls in sys_generic.c. The old have the necessary handling for ERESTART, EINTR and EWOULDBLOCK although not for other errors. kern_posix_fallocate() just aborts. This results in short writes being returned as errors with no sign that the write is partially complete. The easiest to exercise error is EFAULT. IO_UNIT avoids the bugs in sys_generic.c by backing out of the whole write, so that EFAULT becomes the correct error. posix_fallocate() needs IO_UNIT even more than write() since it doesn't support partial successes. Setting IO_UNIT for all file types in vn_write() is not so good. Too many file types that can't support it go through vn_write(). It should be an error for them, but is just ignored. Some file types like pipes could reasonably try to support it if it is set. E.g., {PIPE_BUF} gives the normal limit of for atomic writes on pipes, bug it would be reasonable to use IO_UNIT to specify trying to do larger atomic writes and failing if this is not possible. Layering makes it a bit hard to see if IO_UNIT is set. E.g., it must be set in callers of vn_rdwr_inchunks(). core_write() does this for imgact_elf.c. core_write() also passes IO_DIRECT, which I think is a pessimization. I think IO_UNIT is not very important for security here. It is just that the error handling in the caller is so sloppy that the callee should do it, and backing out of all of even a huge write is the best error handling for core dumps. ENOSPACE is quite a likely error for a core dump and discarding a few TB to recover from it is no worse than wasting time writing that much. vn_rdwr_inchunks() is easy to analyze since it has no other callers. It used to be used for aout core dumps but those are broken (unsupported) now. Bruce