From owner-freebsd-fs@freebsd.org Tue Jul 18 00:40:44 2017
Message-Id: <201707180044.v6I0iKvg040471@chez.mckusick.com>
From: Kirk McKusick
To: Andreas Longwitz
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
Cc: Konstantin Belousov, freebsd-fs@freebsd.org
In-reply-to: <596C7201.8090700@incore.de>
Date: Mon, 17 Jul 2017 17:44:20 -0700
List-Id: Filesystems

> Date: Mon, 17 Jul 2017 10:14:57 +0200
> From: Andreas Longwitz
> To: Konstantin Belousov
> CC: Kirk McKusick, freebsd-fs@freebsd.org
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
>> On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
>>> I am copying Kostik on this email even though he is on freebsd-fs
>>> as I would like him to review my answer below.
>>>
>>>> Date: Sun, 16 Jul 2017 17:52:39 +0200
>>>> From: Andreas Longwitz
>>>> To: Kirk McKusick
>>>> CC: freebsd-fs@freebsd.org
>>>> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>>>>
>>>> Hello,
>>>>
>>>> I would like to discuss the following code snippet from function
>>>> indiracct_ufs2() in source file ffs_snapshot.c:
>>>>
>>>>     /*
>>>>      * We have to expand bread here since it will deadlock looking
>>>>      * up the block number for any blocks that are not in the cache.
>>>>      */
>>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>>>     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>>             brelse(bp);
>>>>             return (error);
>>>>     }
>>>>
>>>> In all my tests (all mount types, good or bad result) the flags
>>>> B_DONE and B_DELWRI in bp->b_flags were always zero, so the data
>>>> buffer given by getblk() at bp->b_data is always overwritten with
>>>> the data from the explicit I/O performed by readblock(). By the
>>>> way, the flag B_CACHE was always set.
>>>
>>> Key observation here that B_CACHE is set! That means that the data
>>> in the block is valid and does not need to be read again. So the fix
>>> is this:
>>>
>>> Index: sys/ufs/ffs/ffs_snapshot.c
>>> ===================================================================
>>> --- sys/ufs/ffs/ffs_snapshot.c  (revision 321045)
>>> +++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
>>> @@ -1394,7 +1394,7 @@
>>>      */
>>>     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>>     bp->b_blkno = fsbtodb(fs, blkno);
>>> -   if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>>> +   if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>>         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>>             brelse(bp);
>>>             return (error);
>
> This patch was my first try, because the comment in the code snippet
> points to B_CACHE.
> Unfortunately it did not work, because the getblk()
> buffer does not always have the correct data; the following "rm
> snapfile" always crashed immediately with
>
>     panic: ffs_blkfree_cg: freeing free block.
>
> Therefore I came up with the CLUSTEROK hack.

Kostik's comment below that this is not a complete solution is correct.
Once gjournal has put the block in its log, it releases the buffer, and
the buffer may fall out of the cache even though the log has not been
written to the backing store. So it may still be necessary to fill in
the contents. But if B_CACHE is set, then the contents should be valid,
as that is exactly what bread() uses to decide whether the buffer needs
to be read (see breadn_flags() in sys/kern/vfs_bio.c). So I really do
not understand how you can end up with a buffer with invalid contents
when B_CACHE is set, because we are open coding bread() here and that
is the criterion it uses to read. If, however, you do the read when
B_CACHE is not set, then as noted you will get the wrong results and
crash. But see also my comments below for the case when B_CACHE is not
set.

>>> Absent gjournal the extra read was unnecessary but harmless. The
>>> buffer exists with B_DELWRI so we will not read it, or it is locked
>>> while being written to disk, so we will wait for the write to
>>> complete, after which it is safe to read it. With gjournal there is
>>> a window of vulnerability between when the buffer is marked B_DELWRI
>>> and when it is actually written to disk. This is because gjournal
>>> will say the write has completed when it has not yet actually done
>>> so. Thus when we read it, we get incorrect results. But until
>>> gjournal actually completes the write, it will keep the buffer
>>> around (marked B_CACHE), so as long as we check for B_CACHE to avoid
>>> the read (as we should anyway to avoid an unnecessary read) we
>>> should always get consistent results.
>>
>> I tend to disagree with the fix. What if the cached block is
>> discarded before we enter this code fragment?
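For reference, the buffer-validity criterion that bread()/breadn_flags()
applies, and that the proposed patch adds to the open-coded version in
indiracct_ufs2(), can be modelled in a few lines of user-space C. This
is only a toy sketch, not kernel code; the flag values and the function
name are illustrative placeholders, not the real definitions from
sys/buf.h:

```c
#include <stdio.h>

/* Illustrative placeholder values; see sys/buf.h for the real ones. */
#define B_CACHE  0x01   /* cached contents are valid */
#define B_DELWRI 0x02   /* delayed write is pending */
#define B_DONE   0x04   /* I/O on the buffer has completed */

/*
 * Model of the decision breadn_flags() makes: issue a physical read
 * only when none of the flags marking the contents as valid is set.
 * The proposed patch makes the open-coded bread() in indiracct_ufs2()
 * apply this same test.
 */
static int
needs_physical_read(int b_flags)
{
	return ((b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0);
}
```

With this test, a buffer that gjournal is still holding (B_CACHE set)
is never overwritten by a read that could return stale data from the
backing store.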
>> I believe the bug is indeed clear after your analysis, but it is in
>> gjournal. The problem is that the disk (the gjournal substrate layer,
>> but for UFS it is all disk) returns data different from what was
>> written to the volume. After the write bio signalled completion,
>> unless another write was done after it, reads _must_ return the
>> content of the write.
>>
>> The right fix for gjournal is to perform 'store forwarding', i.e.
>> to return data from the journal until the journal is flushed to disk.
>
> That makes sense to me and explains why the problem never occurs after
> a clean mount; it needs the foregoing "rm snapfile" to trigger the
> gjournal bug.
>
> To be complete:
> 1. I run gjournal above gmirror.
> 2. I use the patch given in kern/198500.
> 3. In my loader.conf I have
>        # use max. 8GB (1/4 of vm.kmem_size) for gjournal cache
>        kern.geom.journal.cache.divisor="4"
>    It is a little bit annoying, but setting kern.geom.journal.cache.limit
>    in loader.conf does not work because it is overwritten in
>    gjournal_init().
>
> --
> Dr. Andreas Longwitz

The sequence of calls when using bread() is:

    Function           Line  File
    --------           ----  ----
    bread               491  sys/buf.h
    breadn_flags       1814  kern/vfs_bio.c
    bstrategy           397  sys/buf.h
    BO_STRATEGY          86  sys/bufobj.h
    bufstrategy        4535  kern/vfs_bio.c
    ufs_strategy       2290  ufs/ufs/ufs_vnops.c
        (BO_STRATEGY on the filesystem device -> ffs_geom_strategy)
    ffs_geom_strategy  2141  ufs/ffs/ffs_vfsops.c
    g_vfs_strategy      161  geom/geom_vfs.c
    g_io_request       470   geom/geom_io.c

readblock(), by contrast, skips all these steps and calls g_io_request()
directly. From my reading of the gjournal code, I believe that we still
enter gjournal via the g_io_request() call, as it does not appear to
hook itself into any of the VOP_ call structure, but I have not done a
backtrace to confirm this. Assuming that we are still getting into
g_journal_start(), it should then be possible to catch reads whose data
is only in the log and pull the data out as needed.
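The store-forwarding idea can be sketched in user-space C. This is a
toy model built on assumptions of my own (a fixed-size in-memory log;
the names jlog_write/jlog_read and the sizes are invented and do not
exist in gjournal), only to illustrate the read-path behaviour Kostik
describes:

```c
#include <string.h>

#define JBLKSIZE 16	/* toy block size */
#define JLOGMAX  8	/* toy log capacity */

/* A journalled write that has not yet been flushed to backing store. */
struct jlog_entry {
	long	blkno;
	char	data[JBLKSIZE];
};

static struct jlog_entry jlog[JLOGMAX];
static int jlog_len;

/* Record a write in the log; a re-write of a block replaces its entry. */
static void
jlog_write(long blkno, const char *data)
{
	int i;

	for (i = 0; i < jlog_len; i++) {
		if (jlog[i].blkno == blkno) {
			memcpy(jlog[i].data, data, JBLKSIZE);
			return;
		}
	}
	if (jlog_len < JLOGMAX) {
		jlog[jlog_len].blkno = blkno;
		memcpy(jlog[jlog_len].data, data, JBLKSIZE);
		jlog_len++;
	}
}

/*
 * Store forwarding: serve a read from the log when the block has a
 * pending journalled write, so the reader sees the most recently
 * written data even before the log is flushed.  Returns 1 if served
 * from the log, 0 if the caller must read the backing store instead.
 */
static int
jlog_read(long blkno, char *out)
{
	int i;

	for (i = 0; i < jlog_len; i++) {
		if (jlog[i].blkno == blkno) {
			memcpy(out, jlog[i].data, JBLKSIZE);
			return (1);
		}
	}
	return (0);
}
```

A read of a block with a pending journalled write is answered from the
log, so the reader sees the contents of the completed write bio even
though the backing store has not been updated yet; only blocks absent
from the log fall through to the disk.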
Another alternative is to send gjournal a request to flush its log
before starting the removal of a snapshot.

	Kirk McKusick