From owner-freebsd-fs@freebsd.org Sat Jul 29 15:24:51 2017 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A0C08DC5C63 for ; Sat, 29 Jul 2017 15:24:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 493FF6F078 for ; Sat, 29 Jul 2017 15:24:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id v6TFOjBX095142 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Sat, 29 Jul 2017 18:24:45 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua v6TFOjBX095142 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id v6TFOiha095141; Sat, 29 Jul 2017 18:24:44 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 29 Jul 2017 18:24:44 +0300 From: Konstantin Belousov To: Andreas Longwitz Cc: Kirk McKusick , freebsd-fs@freebsd.org Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition Message-ID: <20170729152444.GB1700@kib.kiev.ua> References: <596C7201.8090700@incore.de> <201707180044.v6I0iKvg040471@chez.mckusick.com> <20170718102200.GT1935@kib.kiev.ua> <597C6595.7070404@incore.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <597C6595.7070404@incore.de> User-Agent: Mutt/1.8.3 (2017-05-23) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Jul 2017 15:24:51 -0000 On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote: > hello Kirk and Kostik, > > you are both right. > > > So, I really do not understand how you can end up with a buffer with > > invalid contents if B_CACHE is set because we are open coding bread() > > here and that is the criterion it uses to read. If however, you do > > the read when B_CACHE is not set, then as noted you will get the > > wrong results and crash. But see also my comments below when B_CACHE > > is not set. > > The patch > > -- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200 > +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200 > @@ -1403,7 +1403,7 @@ > */ > bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); > bp->b_blkno = fsbtodb(fs, blkno); > - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && > + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 && > (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { > brelse(bp); > return (error); > > is correct, the buffer has always the correct data, when B_CACHE is set > after calling getblk(). So the calls to readblock() are unnecessary in > this case. In my first test the crash of following rm probably was > induced by other readblock() calls, but I did not know at that time. Ok. In fact it is not clear to me why B_DONE or B_DELWRI is tested at all there. We only need to know that the buffer content is valid. > > > So the way to fix the bug is to read gjournal code and understand why > > does it sometime returns wrong data. > > Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's > in the flush_queue: > 1 2 3 4 5 6 7 8 9 10 > and another 10 bio's goes to the flush queue before all the first 10 > bio's was removed from the flush queue we should have something like > 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20, > but because of the bug we end up with > 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10. > So the sequence of the bio's is damaged in the flush queue (and > therefore in the journal an disk !). This error can be triggered by > ffs_snapshot(), when a block is read with readblock() and gjournal finds > this block in the broken flush queue before he goes to the correct > active queue. > > The following patch solved this problem for me: > > --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200 > +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200 > @@ -1261,7 +1261,7 @@ > strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC, > sizeof(hdr.jrh_magic)); > > bioq = &sc->sc_active.jj_queue; > - pbp = sc->sc_flush_queue; > + GJQ_LAST(sc->sc_flush_queue, pbp); > > fbp = g_alloc_bio(); > fbp->bio_parent = NULL; > > with macro GJQ_LAST defined in g_journal.h: > > #define GJQ_LAST(head, bp) do { \ > struct bio *_bp; \ > \ > if ((head) == NULL) { \ > (bp) = (head); \ > break; \ > } \ > for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\ > (bp) = (_bp); \ > } while (0) > > Some more annotations about g_journal.c: > > - The patch given in bug 198500 is necessary to run gjournal with > correct and adequate cache size. > - The cache size for gjournal can be set by two tunables: cache.limit > and cache.divisor. I do not know the best practice if there are two > variables for setting the same resource. At the moment cache.limit is > ignored at boot time, because it will be overwritten in g_journal_init. > - In g_journal_flush() there is a writeonly variable "size". > - If one process writes synchron a block and onother process reads the > same block just after the write starts, then it is possible that the > reader gets the new block before the writer has finished his write call. > This happens, when the write goes to the delayed_queue for some time > from where the read gets the data. But the write must wait until the bio > goes to the current queue, where g_io_deliver() is called. If this is > not intended, then the delayed queue should not be used for reads. I recomment you to contact pjd@freebsd.org directly about both the gjournal patch and WRT further observations you have.