Date: Mon, 17 Jul 2017 10:14:57 +0200
From: Andreas Longwitz <longwitz@incore.de>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID: <596C7201.8090700@incore.de>
In-Reply-To: <20170717060720.GH1935@kib.kiev.ua>
References: <596B8BC7.3030505@incore.de> <201707161633.v6GGXtjd035000@chez.mckusick.com> <20170717060720.GH1935@kib.kiev.ua>
> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>> I am copying Kostik on this email even though he is on freebsd-fs,
>> as I would like him to review my answer below.
>>
>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>> From: Andreas Longwitz <longwitz@incore.de>
>>> To: Kirk McKusick <mckusick@mckusick.com>
>>> CC: freebsd-fs@freebsd.org
>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>
>>> Hello,
>>>
>>> I would like to discuss the following code snippet from function
>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>
>>>         /*
>>>          * We have to expand bread here since it will deadlock looking
>>>          * up the block number for any blocks that are not in the cache.
>>>          */
>>>         bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>         bp->b_blkno = fsbtodb(fs, blkno);
>>>         if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>             (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>                 brelse(bp);
>>>                 return (error);
>>>         }
>>>
>>> In all my tests (all mount types, good or bad result) the flags B_DONE
>>> and B_DELWRI in bp->b_flags were always zero, so the data buffer given
>>> by getblk() at bp->b_data was always overwritten with the data from the
>>> explicit I/O performed by readblock(). By the way, the flag B_CACHE was
>>> always set.
>>
>> Key observation here that B_CACHE is set! That means that the data
>> in the block is valid and does not need to be read again.
>> So the fix is this:
>>
>> Index: sys/ufs/ffs/ffs_snapshot.c
>> ===================================================================
>> --- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
>> +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
>> @@ -1394,7 +1394,7 @@
>>  	 */
>>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>  	bp->b_blkno = fsbtodb(fs, blkno);
>> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>  		brelse(bp);
>>  		return (error);

This patch was my first try, because the comment in the code snippet
points to B_CACHE. Unfortunately it did not work, because the getblk()
buffer does not always hold the correct data; the following "rm snapfile"
always crashed immediately with panic: ffs_blkfree_cg: freeing free block.
Therefore I came up with the CLUSTEROK hack.

>> Absent gjournal the extra read was unnecessary but harmless. The
>> buffer exists with B_DELWRI so we will not read it, or it is locked
>> while being written to disk, so we will wait for the write to complete,
>> after which it is safe to read it. With gjournal there is a window of
>> vulnerability between when the buffer is marked B_DELWRI and when it
>> is actually written to disk. This is because gjournal will say the
>> write has completed when it has not yet actually done so. Thus when
>> we read it, we get incorrect results. But until gjournal actually
>> completes the write, it will keep the buffer around (marked B_CACHE),
>> so as long as we check for B_CACHE to avoid the read (as we should
>> anyway to avoid an unnecessary read) we should always get consistent
>> results.
>
> I tend to disagree with the fix. What if the cached block is
> discarded before we enter this code fragment?
>
> I believe the bug is indeed clear after your analysis, but it is in
> gjournal.
> Problem is that the disk (gjournal substrate layer, but for UFS
> it is all disk) returns data different from what was written to the
> volume. After the write bio signalled completion, unless another write
> was done after it, reads _must_ return the content of the write.
>
> The right fix for gjournal is to perform 'store forwarding', i.e.
> to return data from the journal until the journal is flushed to disk.

That makes sense to me and explains why the problem never occurs after a
clean mount; it needs the foregoing "rm snapfile" to trigger the gjournal
bug. To be complete:

1. I run gjournal above gmirror.
2. I use the patch given in kern/198500.
3. In my loader.conf I have
       # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
       kern.geom.journal.cache.divisor="4"
   It is a little bit annoying, but setting kern.geom.journal.cache.limit
   in loader.conf does not work because it is overwritten in gjournal_init().

--
Dr. Andreas Longwitz