From owner-freebsd-fs@freebsd.org Mon Jul 17 08:15:04 2017 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5FC2BCFF645 for ; Mon, 17 Jul 2017 08:15:04 +0000 (UTC) (envelope-from longwitz@incore.de) Received: from dss.incore.de (dss.incore.de [195.145.1.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B1A1D77AD2 for ; Mon, 17 Jul 2017 08:15:02 +0000 (UTC) (envelope-from longwitz@incore.de) Received: from inetmail.dmz (inetmail.dmz [10.3.0.3]) by dss.incore.de (Postfix) with ESMTP id 19A3B68583; Mon, 17 Jul 2017 10:15:00 +0200 (CEST) X-Virus-Scanned: amavisd-new at incore.de Received: from dss.incore.de ([10.3.0.3]) by inetmail.dmz (inetmail.dmz [10.3.0.3]) (amavisd-new, port 10024) with LMTP id ukLUVGdFnew1; Mon, 17 Jul 2017 10:14:58 +0200 (CEST) Received: from mail.local.incore (fwintern.dmz [10.0.0.253]) by dss.incore.de (Postfix) with ESMTP id DF18D68586; Mon, 17 Jul 2017 10:14:57 +0200 (CEST) Received: from bsdlo.incore (bsdlo.incore [192.168.0.84]) by mail.local.incore (Postfix) with ESMTP id B544C508AA; Mon, 17 Jul 2017 10:14:57 +0200 (CEST) Message-ID: <596C7201.8090700@incore.de> Date: Mon, 17 Jul 2017 10:14:57 +0200 From: Andreas Longwitz User-Agent: Thunderbird 2.0.0.19 (X11/20090113) MIME-Version: 1.0 To: Konstantin Belousov CC: Kirk McKusick , freebsd-fs@freebsd.org Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition References: <596B8BC7.3030505@incore.de> <201707161633.v6GGXtjd035000@chez.mckusick.com> <20170717060720.GH1935@kib.kiev.ua> In-Reply-To: <20170717060720.GH1935@kib.kiev.ua> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Jul 2017 08:15:04 -0000 > On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote: >> I am copying Kostik on this email even though he is on freebsd-fs >> as I would like him to review my answer below. >> >>> Date: Sun, 16 Jul 2017 17:52:39 +0200 >>> From: Andreas Longwitz >>> To: Kirk McKusick >>> CC: freebsd-fs@freebsd.org >>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition >>> >>> Hello, >>> >>> I like to discuss the following code snippet from function >>> indiracct_ufs2() in source file ffs_snapshot.c: >>> >>> /* >>> * We have to expand bread here since it will deadlock looking >>> * up the block number for any blocks that are not in the cache. >>> */ >>> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); >>> bp->b_blkno = fsbtodb(fs, blkno); >>> if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && >>> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { >>> brelse(bp); >>> return (error); >>> } >>> >>> In all my tests (all mount types, good or bad result) the flags B_DONE >>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given by >>> getblk() at bp->b_data is always overwritten with the data from the >>> explicit I/O performed by readblock(). By the way the flag B_CACHE was >>> always set. >> Key observation here that B_CACHE is set! That means that the data >> in the block is valid and does not need to be read again. So the fix >> is this: >> >> Index: sys/ufs/ffs/ffs_snapshot.c >> =================================================================== >> --- sys/ufs/ffs/ffs_snapshot.c (revision 321045) >> +++ sys/ufs/ffs/ffs_snapshot.c (working copy) >> @@ -1394,7 +1394,7 @@ >> */ >> bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); >> bp->b_blkno = fsbtodb(fs, blkno); >> - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && >> + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 && >> (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { >> brelse(bp); >> return (error); >> This patch was my first try, because the comment in the code snippet points to B_CACHE. Unfortunaly it did not work, because the getblk() buffer does not always have the correct data, the following "rm snapfile" crashed always immediately with panic: ffs_blkfree_cg: freeing free block. Therfore I came up with the CLUSTEROK hack. >> Absent gjournal the extra read was unnecessary but harmless. The >> buffer exists with B_DELWRI so we will not read it, or it is locked >> while being written to disk, so we will wait for the write to complete, >> after which it is safe to read it. With gjournal there is a window of >> vulnerability between when the buffer is marked B_DELWRI and when it >> is actually written to disk. This is because gjournal will say the >> write has completed when it has not yet actually done so. Thus when >> we read it, we get incorrect results. But until gjournal actually >> completes the write, it will keep the buffer around (marked B_CACHE), >> so as long as we check for B_CACHE to avoid the read (as we should >> anyway to avoid an unnecessary read) we should always get consistent >> results. > I tend to disagree with the fix. What if the cached block being > discarded before we enter this code fragment ? > > I believe the bug is indeed clear after your analysis, but it is in > gjournal. Problem is that the disk (gjournal substrate layer, but for UFS > it is all disk) returns data different from what was written to the > volume. After the write bio signalled completion, unless other write > was done after it, reads _must_ return the content of the write. > > The right fix is for gjournal is to perform 'store forwarding', i.e. > to return data from the journal until journal is flushd to disk. That makes sense to me and explains why the problem never occurs after a clean mount, it needs the foregoing "rm snapfile" to trigger the gjournal bug. To be complete: 1. I run gjournal above gmirror. 2. I use the patch given in kern/198500. 3. In my loader.conf I have # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache kern.geom.journal.cache.divisor="4" It is a little bit annoying, but setting kern.geom.journal.cache.limit in loader.conf does not work because its overwritten in gjournal_init(). -- Dr. Andreas Longwitz