From owner-freebsd-fs@freebsd.org Sat Jul 29 10:45:04 2017 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A02A5DC12B8 for ; Sat, 29 Jul 2017 10:45:04 +0000 (UTC) (envelope-from longwitz@incore.de) Received: from dss.incore.de (dss.incore.de [195.145.1.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0A37B68170 for ; Sat, 29 Jul 2017 10:45:03 +0000 (UTC) (envelope-from longwitz@incore.de) Received: from inetmail.dmz (inetmail.dmz [10.3.0.3]) by dss.incore.de (Postfix) with ESMTP id A1D9E67911; Sat, 29 Jul 2017 12:38:15 +0200 (CEST) X-Virus-Scanned: amavisd-new at incore.de Received: from dss.incore.de ([10.3.0.3]) by inetmail.dmz (inetmail.dmz [10.3.0.3]) (amavisd-new, port 10024) with LMTP id YX52_7uA3FZF; Sat, 29 Jul 2017 12:38:14 +0200 (CEST) Received: from mail.local.incore (fwintern.dmz [10.0.0.253]) by dss.incore.de (Postfix) with ESMTP id 34A3B6790A; Sat, 29 Jul 2017 12:38:14 +0200 (CEST) Received: from bsdmhs.longwitz (unknown [192.168.99.6]) by mail.local.incore (Postfix) with ESMTP id C5DD8508A1; Sat, 29 Jul 2017 12:38:13 +0200 (CEST) Message-ID: <597C6595.7070404@incore.de> Date: Sat, 29 Jul 2017 12:38:13 +0200 From: Andreas Longwitz User-Agent: Thunderbird 2.0.0.19 (X11/20090113) MIME-Version: 1.0 To: Konstantin Belousov CC: Kirk McKusick , freebsd-fs@freebsd.org Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition References: <596C7201.8090700@incore.de> <201707180044.v6I0iKvg040471@chez.mckusick.com> <20170718102200.GT1935@kib.kiev.ua> In-Reply-To: <20170718102200.GT1935@kib.kiev.ua> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Jul 2017 10:45:04 -0000 hello Kirk and Kostik, you are both right. > So, I really do not understand how you can end up with a buffer with > invalid contents if B_CACHE is set because we are open coding bread() > here and that is the criterion it uses to read. If however, you do > the read when B_CACHE is not set, then as noted you will get the > wrong results and crash. But see also my comments below when B_CACHE > is not set. The patch -- ffs_snapshot.c.1st 2016-06-08 17:25:21.000000000 +0200 +++ ffs_snapshot.c 2017-07-29 10:51:34.682067000 +0200 @@ -1403,7 +1403,7 @@ */ bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); bp->b_blkno = fsbtodb(fs, blkno); - if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && + if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 && (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) { brelse(bp); return (error); is correct, the buffer has always the correct data, when B_CACHE is set after calling getblk(). So the calls to readblock() are unnecessary in this case. In my first test the crash of following rm probably was induced by other readblock() calls, but I did not know at that time. > So the way to fix the bug is to read gjournal code and understand why > does it sometime returns wrong data. Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's in the flush_queue: 1 2 3 4 5 6 7 8 9 10 and another 10 bio's goes to the flush queue before all the first 10 bio's was removed from the flush queue we should have something like 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20, but because of the bug we end up with 6 11 12 13 14 15 16 17 18 19 20 7 8 9 10. So the sequence of the bio's is damaged in the flush queue (and therefore in the journal an disk !). This error can be triggered by ffs_snapshot(), when a block is read with readblock() and gjournal finds this block in the broken flush queue before he goes to the correct active queue. The following patch solved this problem for me: --- g_journal.c.orig 2017-05-08 14:16:42.000000000 +0200 +++ g_journal.c 2017-07-28 21:25:58.891881000 +0200 @@ -1261,7 +1261,7 @@ strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC, sizeof(hdr.jrh_magic)); bioq = &sc->sc_active.jj_queue; - pbp = sc->sc_flush_queue; + GJQ_LAST(sc->sc_flush_queue, pbp); fbp = g_alloc_bio(); fbp->bio_parent = NULL; with macro GJQ_LAST defined in g_journal.h: #define GJQ_LAST(head, bp) do { \ struct bio *_bp; \ \ if ((head) == NULL) { \ (bp) = (head); \ break; \ } \ for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\ (bp) = (_bp); \ } while (0) Some more annotations about g_journal.c: - The patch given in bug 198500 is necessary to run gjournal with correct and adequate cache size. - The cache size for gjournal can be set by two tunables: cache.limit and cache.divisor. I do not know the best practice if there are two variables for setting the same resource. At the moment cache.limit is ignored at boot time, because it will be overwritten in g_journal_init. - In g_journal_flush() there is a writeonly variable "size". - If one process writes synchron a block and onother process reads the same block just after the write starts, then it is possible that the reader gets the new block before the writer has finished his write call. This happens, when the write goes to the delayed_queue for some time from where the read gets the data. But the write must wait until the bio goes to the current queue, where g_io_deliver() is called. If this is not intended, then the delayed queue should not be used for reads. -- Dr. Andreas Longwitz