From owner-freebsd-fs@freebsd.org  Mon Jul 17 08:15:04 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5FC2BCFF645
 for <freebsd-fs@mailman.ysv.freebsd.org>; Mon, 17 Jul 2017 08:15:04 +0000 (UTC)
 (envelope-from longwitz@incore.de)
Received: from dss.incore.de (dss.incore.de [195.145.1.138])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id B1A1D77AD2
 for <freebsd-fs@freebsd.org>; Mon, 17 Jul 2017 08:15:02 +0000 (UTC)
 (envelope-from longwitz@incore.de)
Received: from inetmail.dmz (inetmail.dmz [10.3.0.3])
 by dss.incore.de (Postfix) with ESMTP id 19A3B68583;
 Mon, 17 Jul 2017 10:15:00 +0200 (CEST)
X-Virus-Scanned: amavisd-new at incore.de
Received: from dss.incore.de ([10.3.0.3])
 by inetmail.dmz (inetmail.dmz [10.3.0.3]) (amavisd-new, port 10024)
 with LMTP id ukLUVGdFnew1; Mon, 17 Jul 2017 10:14:58 +0200 (CEST)
Received: from mail.local.incore (fwintern.dmz [10.0.0.253])
 by dss.incore.de (Postfix) with ESMTP id DF18D68586;
 Mon, 17 Jul 2017 10:14:57 +0200 (CEST)
Received: from bsdlo.incore (bsdlo.incore [192.168.0.84])
 by mail.local.incore (Postfix) with ESMTP id B544C508AA;
 Mon, 17 Jul 2017 10:14:57 +0200 (CEST)
Message-ID: <596C7201.8090700@incore.de>
Date: Mon, 17 Jul 2017 10:14:57 +0200
From: Andreas Longwitz <longwitz@incore.de>
User-Agent: Thunderbird 2.0.0.19 (X11/20090113)
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
CC: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
References: <596B8BC7.3030505@incore.de>
 <201707161633.v6GGXtjd035000@chez.mckusick.com>
 <20170717060720.GH1935@kib.kiev.ua>
In-Reply-To: <20170717060720.GH1935@kib.kiev.ua>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 17 Jul 2017 08:15:04 -0000

> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>> I am copying Kostik on this email even though he is on freebsd-fs
>> as I would like him to review my answer below.
>>
>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>> From: Andreas Longwitz <longwitz@incore.de>
>>> To: Kirk McKusick <mckusick@mckusick.com>
>>> CC: freebsd-fs@freebsd.org
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> Hello,
>>>
>>> I like to discuss the following code snippet from function
>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>
>>>     /*
>>>      * We have to expand bread here since it will deadlock looking
>>>      * up the block number for any blocks that are not in the cache.
>>>      */
>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>             brelse(bp);
>>>             return (error);
>>>     }
>>>
>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
>>> getblk() at bp->b_data is always overwritten with the data from the
>>> explicit I/O performed by readblock(). By the way the flag B_CACHE was
>>> always set.
>> Key observation here that B_CACHE is set! That means that the data
>> in the block is valid and does not need to be read again. So the fix
>> is this:
>>
>> Index: sys/ufs/ffs/ffs_snapshot.c
>> ===================================================================
>> --- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
>> +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
>> @@ -1394,7 +1394,7 @@
>>  	 */
>>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>  	bp->b_blkno = fsbtodb(fs, blkno);
>> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>  		brelse(bp);
>>  		return (error);
>>

This patch was my first try, because the comment in the code snippet
points to B_CACHE. Unfortunaly it did not work, because the getblk()
buffer does not always have the correct data, the following "rm
snapfile" crashed always immediately with

  panic: ffs_blkfree_cg: freeing free block.

Therfore I came up with the CLUSTEROK hack.

>> Absent gjournal the extra read was unnecessary but harmless. The
>> buffer exists with B_DELWRI so we will not read it, or it is locked
>> while being written to disk, so we will wait for the write to complete,
>> after which it is safe to read it. With gjournal there is a window of
>> vulnerability between when the buffer is marked B_DELWRI and when it
>> is actually written to disk. This is because gjournal will say the
>> write has completed when it has not yet actually done so. Thus when
>> we read it, we get incorrect results. But until gjournal actually
>> completes the write, it will keep the buffer around (marked B_CACHE),
>> so as long as we check for B_CACHE to avoid the read (as we should
>> anyway to avoid an unnecessary read) we should always get consistent
>> results.
> I tend to disagree with the fix. What if the cached block being
> discarded before we enter this code fragment ?
> 
> I believe the bug is indeed clear after your analysis, but it is in
> gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
> it is all disk) returns data different from what was written to the
> volume.  After the write bio signalled completion, unless other write
> was done after it, reads _must_ return the content of the write.
> 
> The right fix is for gjournal is to perform 'store forwarding', i.e.
> to return data from the journal until journal is flushd to disk.

That makes sense to me and explains why the problem never occurs after a
clean mount, it needs the foregoing "rm snapfile" to trigger the
gjournal bug.

To be complete:
1. I run gjournal above gmirror.
2. I use the patch given in kern/198500.
3. In my loader.conf I have
   # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
   kern.geom.journal.cache.divisor="4"
It is a little bit annoying, but setting kern.geom.journal.cache.limit
in loader.conf does not work because its overwritten in gjournal_init().

-- 
Dr. Andreas Longwitz