From owner-freebsd-fs@freebsd.org  Sun Jan 17 07:35:56 2016
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 39ECAA85747
 for <freebsd-fs@mailman.ysv.freebsd.org>; Sun, 17 Jan 2016 07:35:56 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id DC14B1E59
 for <freebsd-fs@freebsd.org>; Sun, 17 Jan 2016 07:35:55 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au
 (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 95D3F1043BBB;
 Sun, 17 Jan 2016 18:35:44 +1100 (AEDT)
Date: Sun, 17 Jan 2016 18:35:43 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
cc: Rick Macklem <rmacklem@uoguelph.ca>, 
 FreeBSD Filesystems <freebsd-fs@freebsd.org>, 
 Kirk McKusick <mckusick@mckusick.com>
Subject: Re: panic ffs_truncate3 (maybe fuse being evil)
In-Reply-To: <20160117035858.GO3942@kib.kiev.ua>
Message-ID: <20160117172549.C11309@besplex.bde.org>
References: <1696608910.154845456.1452438117036.JavaMail.zimbra@uoguelph.ca>
 <1773157955.158922767.1452698181137.JavaMail.zimbra@uoguelph.ca>
 <1351730674.159022044.1452699617235.JavaMail.zimbra@uoguelph.ca>
 <20160114092934.GL72455@kib.kiev.ua>
 <964333498.161527381.1452827658163.JavaMail.zimbra@uoguelph.ca>
 <20160115095749.GC3942@kib.kiev.ua>
 <1817287612.162823118.1452909605928.JavaMail.zimbra@uoguelph.ca>
 <20160116191116.GI3942@kib.kiev.ua>
 <853868666.163292727.1452986431921.JavaMail.zimbra@uoguelph.ca>
 <20160117035858.GO3942@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=R4L+YolX c=1 sm=1 tr=0
 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=L9H7d07YOLsA:10
 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=JzwRw_2MAAAA:8 a=kj9zAlcOel0A:10
 a=RQzWn0Zg5Lz1cNM0FkkA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Jan 2016 07:35:56 -0000

On Sun, 17 Jan 2016, Konstantin Belousov wrote:

> On Sat, Jan 16, 2016 at 06:20:31PM -0500, Rick Macklem wrote:
>> Kostik wrote:
>>> Was IO_EXT flag passed to the ffs_truncate() invocation where the
>>> assert ffs_truncate3 fired ?
>>>
>> Yes. The only call to UFS_TRUNCATE() in ufs_inactive() specified both
>> IO_EXT | IO_NORMAL.
>
> Please try this.
>
> diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
> index 381f6f8..ecc3f9b 100644
> --- a/sys/ufs/ffs/ffs_vnops.c
> +++ b/sys/ufs/ffs/ffs_vnops.c
> @@ -1313,7 +1313,8 @@ ffs_close_ea(struct vnode *vp, int commit, struct ucred *cred, struct thread *td
> 		/* XXX: I'm not happy about truncating to zero size */
> 		if (ip->i_ea_len < dp->di_extsize)
> 			error = ffs_truncate(vp, 0, IO_EXT, cred);
> -		error = ffs_extwrite(vp, &luio, IO_EXT | IO_SYNC, cred);
> +		error = ffs_extwrite(vp, &luio, IO_EXT | IO_SYNC | IO_UNIT,
> +		    cred);
> 	}

IO_UNIT shouldn't exist.  Its only use is to create bugs by omitting it
in callers or not supporting it in callees.  These bugs are rare because 
since vn_write() always sets it.  Hopefully callees that ignore it are
rare.  fuse ignores it but says that it always does it since it is not
worth optimizing rare cases.  fuse probably doesn't understand that rare
means never for normal writes.  zfs also checks it for reads, but it is
never passed for reads AFAIK (vn_read() never sets it).  ffs does some
backing out even if IO_UNIT is not set.

IO_UNIT seems to have last been much more than a no-op in Feb 1997 in
FreeBSD and in 1993 in 4.4BSD.  Then it was used to create bugs in most
callers by not setting it in the main vn_write() caller.  This bug was
in 4.4BSD-Lite1 in 1993 but was fixed in 4.4BSD-Lite2 in 1995.  FreeBSD
imported the 1-line fix in 1997.  This fix had some style bugs
(initialization in declaration) which were fixed in 1999.  The larger
style bug of not removing the flag is still there.

Grepping for VOP_WRITE in kern shows just 1 caller that is missing the
flag, and that caller is broken: vop_stdallocate() passes ioflag 0,
and then has the usual buggy error handling for non-atomic and/or
short writes.  IO_UNIT is needed more for avoiding short writes than
for giving atomicity, since most callers have the usual sloppy error
handling and sys_generic.c still has its old bugs for short i/o's.
Non-buggy error handling might be as simple as backing out of the
whole i/o, just like IO_UNIT would do for disk files.  The actual
handling in vop_stdallocate() is to abort.  When combined with uio,
this should result in a short write or a successful short write.  But
upper layers can't be trusted to handle short writes correctly.
VOP_ALLOCATE()'s caller kern_posix_fallocate() seems to understand
this stuff even less than old write calls in sys_generic.c.  The
old have the necessary handling for ERESTART, EINTR and EWOULDBLOCK
although not for other errors.  kern_posix_fallocate() just aborts.
This results in short writes being returned as errors with no sign
that the write is partially complete.  The easiest to exercise error
is EFAULT.  IO_UNIT avoids the bugs in sys_generic.c by backing out
of the whole write, so that EFAULT becomes the correct error.
posix_fallocate() needs IO_UNIT even more than write() since it
doesn't support partial successes.

Setting IO_UNIT for all file types in vn_write() is not so good.  Too
many file types that can't support it go through vn_write().  It should
be an error for them, but is just ignored.  Some file types like pipes
could reasonably try to support it if it is set.  E.g., {PIPE_BUF}
gives the normal limit of for atomic writes on pipes, bug it would
be reasonable to use IO_UNIT to specify trying to do larger atomic
writes and failing if this is not possible.

Layering makes it a bit hard to see if IO_UNIT is set.  E.g., it must
be set in callers of vn_rdwr_inchunks().  core_write() does this
for imgact_elf.c.  core_write() also passes IO_DIRECT, which I think
is a pessimization.  I think IO_UNIT is not very important for security
here.  It is just that the error handling in the caller is so sloppy
that the callee should do it, and backing out of all of even a huge
write is the best error handling for core dumps.  ENOSPACE is quite
a likely error for a core dump and discarding a few TB to recover
from it is no worse than wasting time writing that much.

vn_rdwr_inchunks() is easy to analyze since it has no other callers.
It used to be used for aout core dumps but those are broken
(unsupported) now.

Bruce