Date: Sat, 29 Jul 2017 12:38:13 +0200 From: Andreas Longwitz <longwitz@incore.de> To: Konstantin Belousov <kostikbel@gmail.com> Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition Message-ID: <597C6595.7070404@incore.de> In-Reply-To: <20170718102200.GT1935@kib.kiev.ua> References: <596C7201.8090700@incore.de> <201707180044.v6I0iKvg040471@chez.mckusick.com> <20170718102200.GT1935@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
hello Kirk and Kostik, you are both right. > So, I really do not understand how you can end up with a buffer with > invalid contents if B_CACHE is set because we are open coding bread() > here and that is the criterion it uses to read. If however, you do > the read when B_CACHE is not set, then as noted you will get the > wrong results and crash. But see also my comments below when B_CACHE > is not set. The patch -- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200 +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200 @@ -1403,7 +1403,7 @@ */ bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); bp->b_blkno = fsbtodb(fs, blkno); - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 && (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { brelse(bp); return (error); is correct, the buffer has always the correct data, when B_CACHE is set after calling getblk(). So the calls to readblock() are unnecessary in this case. In my first test the crash of following rm probably was induced by other readblock() calls, but I did not know at that time. > So the way to fix the bug is to read gjournal code and understand why > does it sometime returns wrong data. Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's in the flush_queue: 1 2 3 4 5 6 7 8 9 10 and another 10 bio's goes to the flush queue before all the first 10 bio's was removed from the flush queue we should have something like 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20, but because of the bug we end up with 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10. So the sequence of the bio's is damaged in the flush queue (and therefore in the journal an disk !). This error can be triggered by ffs_snapshot(), when a block is read with readblock() and gjournal finds this block in the broken flush queue before he goes to the correct active queue. The following patch solved this problem for me: --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200 +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200 @@ -1261,7 +1261,7 @@ strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC, sizeof(hdr.jrh_magic)); bioq = &sc->sc_active.jj_queue; - pbp = sc->sc_flush_queue; + GJQ_LAST(sc->sc_flush_queue, pbp); fbp = g_alloc_bio(); fbp->bio_parent = NULL; with macro GJQ_LAST defined in g_journal.h: #define GJQ_LAST(head, bp) do { \ struct bio *_bp; \ \ if ((head) == NULL) { \ (bp) = (head); \ break; \ } \ for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\ (bp) = (_bp); \ } while (0) Some more annotations about g_journal.c: - The patch given in bug 198500 is necessary to run gjournal with correct and adequate cache size. - The cache size for gjournal can be set by two tunables: cache.limit and cache.divisor. I do not know the best practice if there are two variables for setting the same resource. At the moment cache.limit is ignored at boot time, because it will be overwritten in g_journal_init. - In g_journal_flush() there is a writeonly variable "size". - If one process writes synchron a block and onother process reads the same block just after the write starts, then it is possible that the reader gets the new block before the writer has finished his write call. This happens, when the write goes to the delayed_queue for some time from where the read gets the data. But the write must wait until the bio goes to the current queue, where g_io_deliver() is called. If this is not intended, then the delayed queue should not be used for reads. -- Dr. Andreas Longwitz
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?597C6595.7070404>