Date: Sat, 29 Jul 2017 18:24:44 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Andreas Longwitz <longwitz@incore.de> Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition Message-ID: <20170729152444.GB1700@kib.kiev.ua> In-Reply-To: <597C6595.7070404@incore.de> References: <596C7201.8090700@incore.de> <201707180044.v6I0iKvg040471@chez.mckusick.com> <20170718102200.GT1935@kib.kiev.ua> <597C6595.7070404@incore.de>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote: > hello Kirk and Kostik, > > you are both right. > > > So, I really do not understand how you can end up with a buffer with > > invalid contents if B_CACHE is set because we are open coding bread() > > here and that is the criterion it uses to read. If however, you do > > the read when B_CACHE is not set, then as noted you will get the > > wrong results and crash. But see also my comments below when B_CACHE > > is not set. > > The patch > > -- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200 > +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200 > @@ -1403,7 +1403,7 @@ > */ > bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); > bp->b_blkno = fsbtodb(fs, blkno); > - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && > + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 && > (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { > brelse(bp); > return (error); > > is correct, the buffer has always the correct data, when B_CACHE is set > after calling getblk(). So the calls to readblock() are unnecessary in > this case. In my first test the crash of following rm probably was > induced by other readblock() calls, but I did not know at that time. Ok. In fact it is not clear to me why B_DONE or B_DELWRI is tested at all there. We only need to know that the buffer content is valid. > > > So the way to fix the bug is to read gjournal code and understand why > > does it sometime returns wrong data. > > Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's > in the flush_queue: > 1 2 3 4 5 6 7 8 9 10 > and another 10 bio's goes to the flush queue before all the first 10 > bio's was removed from the flush queue we should have something like > 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20, > but because of the bug we end up with > 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10. > So the sequence of the bio's is damaged in the flush queue (and > therefore in the journal an disk !). This error can be triggered by > ffs_snapshot(), when a block is read with readblock() and gjournal finds > this block in the broken flush queue before he goes to the correct > active queue. > > The following patch solved this problem for me: > > --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200 > +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200 > @@ -1261,7 +1261,7 @@ > strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC, > sizeof(hdr.jrh_magic)); > > bioq = &sc->sc_active.jj_queue; > - pbp = sc->sc_flush_queue; > + GJQ_LAST(sc->sc_flush_queue, pbp); > > fbp = g_alloc_bio(); > fbp->bio_parent = NULL; > > with macro GJQ_LAST defined in g_journal.h: > > #define GJQ_LAST(head, bp) do { \ > struct bio *_bp; \ > \ > if ((head) == NULL) { \ > (bp) = (head); \ > break; \ > } \ > for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\ > (bp) = (_bp); \ > } while (0) > > Some more annotations about g_journal.c: > > - The patch given in bug 198500 is necessary to run gjournal with > correct and adequate cache size. > - The cache size for gjournal can be set by two tunables: cache.limit > and cache.divisor. I do not know the best practice if there are two > variables for setting the same resource. At the moment cache.limit is > ignored at boot time, because it will be overwritten in g_journal_init. > - In g_journal_flush() there is a writeonly variable "size". > - If one process writes synchron a block and onother process reads the > same block just after the write starts, then it is possible that the > reader gets the new block before the writer has finished his write call. > This happens, when the write goes to the delayed_queue for some time > from where the read gets the data. But the write must wait until the bio > goes to the current queue, where g_io_deliver() is called. If this is > not intended, then the delayed queue should not be used for reads. I recomment you to contact pjd@freebsd.org directly about both the gjournal patch and WRT further observations you have.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170729152444.GB1700>