From: Kirk McKusick <mckusick@mckusick.com>
To: Konstantin Belousov
Cc: Andreas Longwitz, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
In-reply-to: <20170729152444.GB1700@kib.kiev.ua>
Date: Sat, 29 Jul 2017 14:13:30 -0700
Message-Id: <201707292113.v6TLDUaw093863@chez.mckusick.com>

> Date: Sat, 29 Jul 2017 18:24:44 +0300
> From: Konstantin Belousov
> To: Andreas Longwitz
> Cc: Kirk McKusick, freebsd-fs@freebsd.org
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
>
> On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote:
>
>> Hello Kirk and Kostik,
>>
>> you are both right.
>>
>>> So, I really do not understand how you can end up with a buffer with
>>> invalid contents if B_CACHE is set, because we are open coding
>>> bread() here and that is the criterion it uses to decide whether to
>>> read. If, however, you do the read when B_CACHE is not set, then as
>>> noted you will get the wrong results and crash. But see also my
>>> comments below on the case where B_CACHE is not set.
>>
>> The patch
>>
>> --- ffs_snapshot.c.1st	2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_snapshot.c	2017-07-29 10:51:34.682067000 +0200
>> @@ -1403,7 +1403,7 @@
>>  	 */
>>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>  	bp->b_blkno = fsbtodb(fs, blkno);
>> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>  		brelse(bp);
>>  		return (error);
>>
>> is correct: the buffer always has the correct data when B_CACHE is set
>> after calling getblk(), so the calls to readblock() are unnecessary in
>> this case. In my first test, the crash of the following rm was
>> probably induced by other readblock() calls, but I did not know that
>> at the time.
>
> Ok. In fact it is not clear to me why B_DONE or B_DELWRI is tested at
> all there. We only need to know that the buffer content is valid.

Historically, bdwrite() did not set B_CACHE because doing so was not
safe (or true) for NFS. Consider the scenario where B_CACHE is not set:
when the filesystem allocates a new disk block, it calls getblk() to
get a buffer in which to place it. In the usual case the block is not
already in the cache, so the B_CACHE flag is not set. The filesystem
fills in the new contents and then bdwrite()s the buffer, which results
in B_DELWRI being set. If you then find that buffer in a later
getblk(), its only valid contents are those in the buffer itself, so
reading in the disk block to which it will eventually be written would
destroy the only valid copy. Hence the check for B_DELWRI as well as
B_CACHE. The check for B_DONE just saves an unneeded read: it gets set
once a B_DELWRI buffer has been written (B_DELWRI is replaced with
B_DONE). Because the block is still newly created, it has never been
read, so B_CACHE is not set, but its contents are valid and can be
used.
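To spell out the flag semantics against the check in question (the code
below is your patched version from ffs_snapshot.c above; the
annotations are mine):

	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
	bp->b_blkno = fsbtodb(fs, blkno);
	/*
	 * B_DELWRI: the buffer holds dirty data that has not yet been
	 *     written to disk; reading in the disk block here would
	 *     destroy the only valid copy.
	 * B_DONE:   set once the delayed write completes (B_DELWRI is
	 *     replaced by B_DONE); the contents remain valid, so the
	 *     read is merely unnecessary.
	 * B_CACHE:  getblk() found the block with valid contents, so
	 *     the read would be redundant.
	 */
	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
		brelse(bp);
		return (error);
	}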
All that said, NFS was overhauled by alc@ in -r46349 in May 1999, after
which it became safe to set B_CACHE in bdwrite(). Since that change,
the checks for B_DONE | B_DELWRI above are no longer needed. As this is
both a bug fix and a performance improvement in the snapshot code, and
it has now been reviewed and its correctness verified, I propose to
check in this change:

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c	(revision 321679)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -1394,7 +1394,7 @@
 	 */
 	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
 	bp->b_blkno = fsbtodb(fs, blkno);
-	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
+	if ((bp->b_flags & B_CACHE) == 0 &&
 	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
 		brelse(bp);
 		return (error);

Let me know if you have any concerns about it.

>>> So the way to fix the bug is to read the gjournal code and
>>> understand why it sometimes returns wrong data.
>>
>> Yes, gjournal is broken in the handling of its flush_queue. If we
>> have 10 bios in the flush_queue:
>>	1 2 3 4 5 6 7 8 9 10
>> and another 10 bios go to the flush queue before all of the first 10
>> bios have been removed from it, we should end up with something like
>>	6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> but because of the bug we end up with
>>	6 11 12 13 14 15 16 17 18 19 20 7 8 9 10
>> So the sequence of the bios is damaged in the flush queue (and
>> therefore in the journal and on disk!). This error can be triggered
>> by ffs_snapshot() when a block is read with readblock() and gjournal
>> finds this block in the broken flush queue before it falls through to
>> the correct active queue.
>>
>> The following patch solved this problem for me:
>>
>> --- g_journal.c.orig	2017-05-08 14:16:42.000000000 +0200
>> +++ g_journal.c	2017-07-28 21:25:58.891881000 +0200
>> @@ -1261,7 +1261,7 @@
>>  	strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
>>  	    sizeof(hdr.jrh_magic));
>>
>>  	bioq = &sc->sc_active.jj_queue;
>> -	pbp = sc->sc_flush_queue;
>> +	GJQ_LAST(sc->sc_flush_queue, pbp);
>>
>>  	fbp = g_alloc_bio();
>>  	fbp->bio_parent = NULL;
>>
>> with the macro GJQ_LAST defined in g_journal.h:
>>
>> #define GJQ_LAST(head, bp) do {					\
>> 	struct bio *_bp;						\
>> 									\
>> 	if ((head) == NULL) {						\
>> 		(bp) = (head);						\
>> 		break;							\
>> 	}								\
>> 	for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
>> 	(bp) = (_bp);							\
>> } while (0)
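If I am reading the old code correctly, the effect of starting the
splice at the head (pbp = sc->sc_flush_queue) rather than at the tail
can be modeled in a few lines of userland C. This is my own sketch with
a simplified struct bio, not the gjournal code itself; it only assumes
that new bios are chained in one after another starting at pbp:

#include <stdio.h>
#include <stdlib.h>

struct bio {
	int		 bio_id;
	struct bio	*bio_next;
};

/* The walk performed by the proposed GJQ_LAST macro. */
static struct bio *
queue_last(struct bio *head)
{
	struct bio *bp;

	if (head == NULL)
		return (NULL);
	for (bp = head; bp->bio_next != NULL; bp = bp->bio_next)
		;
	return (bp);
}

/* Chain n bios, numbered from first, in after "after" (NULL: empty queue). */
static void
append_batch(struct bio **headp, struct bio *after, int first, int n)
{
	struct bio *bp;
	int i;

	for (i = 0; i < n; i++) {
		bp = calloc(1, sizeof(*bp));
		bp->bio_id = first + i;
		if (after == NULL) {
			bp->bio_next = *headp;
			*headp = bp;
		} else {
			bp->bio_next = after->bio_next;
			after->bio_next = bp;
		}
		after = bp;
	}
}

static void
dump(const char *tag, struct bio *head)
{
	printf("%s:", tag);
	for (; head != NULL; head = head->bio_next)
		printf(" %d", head->bio_id);
	printf("\n");
}

int
main(void)
{
	struct bio *bp, *q1, *q2;
	int i;

	/* Queue bios 1..10; bios 1..5 are then flushed, 6 becomes head. */
	q1 = NULL;
	append_batch(&q1, NULL, 1, 10);
	for (i = 0; i < 5; i++) {
		bp = q1;
		q1 = bp->bio_next;
		free(bp);
	}
	/* Old code: splice the next batch in starting at the HEAD. */
	append_batch(&q1, q1, 11, 10);
	dump("head-splice (old)", q1);

	/* New code: splice in at the TAIL, as GJQ_LAST finds it. */
	q2 = NULL;
	append_batch(&q2, NULL, 6, 5);
	append_batch(&q2, queue_last(q2), 11, 10);
	dump("tail-splice (new)", q2);
	return (0);
}

The head splice prints 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10, which
matches the damaged ordering you observed, while the tail splice
preserves 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20.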
>> Some more annotations about g_journal.c:
>>
>> - The patch given in bug 198500 is necessary to run gjournal with a
>>   correct and adequate cache size.
>> - The cache size for gjournal can be set by two tunables: cache.limit
>>   and cache.divisor. I do not know the best practice when two
>>   variables set the same resource. At the moment cache.limit is
>>   ignored at boot time, because it is overwritten in g_journal_init.
>> - In g_journal_flush() there is a write-only variable "size".
>> - If one process writes a block synchronously and another process
>>   reads the same block just after the write starts, it is possible
>>   that the reader gets the new data before the writer has finished
>>   its write call. This happens when the write sits for some time on
>>   the delayed_queue, from where the read is satisfied, while the
>>   write must wait until the bio moves to the current queue, where
>>   g_io_deliver() is called. If this is not intended, the delayed
>>   queue should not be used to satisfy reads.
>
> I recommend you contact pjd@freebsd.org directly about both the
> gjournal patch and the further observations you have.

I concur that Pawel should review your proposed change, so I will send
your note along to him requesting a review. Assuming he is in
agreement, he will likely check it in himself; or, if he prefers, I
will check it in. Hopefully he will also have some thoughts on your
other comments.

Thank you, Andreas, for taking the time to drill down and figure out
why this was failing.

	Kirk McKusick
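P.S. For anyone following along on the list: the read-after-delayed-
write ordering in Andreas's last annotation can be illustrated with a
schematic userland program. This is my own sketch; the names
delayed_data and write_acked are purely illustrative and have no
counterpart in the gjournal sources.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int delayed_data;	/* datum parked on the "delayed queue" */
static int write_acked;		/* the writer's write call has returned */

static void *
writer(void *arg)
{
	pthread_mutex_lock(&lock);
	delayed_data = 42;	/* bio lands on the delayed queue */
	pthread_mutex_unlock(&lock);
	sleep(1);		/* bio waits to move to the current queue,
				   where g_io_deliver() would be called */
	pthread_mutex_lock(&lock);
	write_acked = 1;	/* only now does the write call return */
	pthread_mutex_unlock(&lock);
	return (NULL);
}

int
main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	usleep(100000);		/* a reader arrives mid-flight */
	pthread_mutex_lock(&lock);
	if (delayed_data == 42 && !write_acked)
		printf("reader saw the new data before the write was "
		    "acknowledged\n");
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return (0);
}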