From owner-freebsd-fs@freebsd.org  Sat Jul 29 10:45:04 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A02A5DC12B8
 for <freebsd-fs@mailman.ysv.freebsd.org>; Sat, 29 Jul 2017 10:45:04 +0000 (UTC)
 (envelope-from longwitz@incore.de)
Received: from dss.incore.de (dss.incore.de [195.145.1.138])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 0A37B68170
 for <freebsd-fs@freebsd.org>; Sat, 29 Jul 2017 10:45:03 +0000 (UTC)
 (envelope-from longwitz@incore.de)
Received: from inetmail.dmz (inetmail.dmz [10.3.0.3])
 by dss.incore.de (Postfix) with ESMTP id A1D9E67911;
 Sat, 29 Jul 2017 12:38:15 +0200 (CEST)
X-Virus-Scanned: amavisd-new at incore.de
Received: from dss.incore.de ([10.3.0.3])
 by inetmail.dmz (inetmail.dmz [10.3.0.3]) (amavisd-new, port 10024)
 with LMTP id YX52_7uA3FZF; Sat, 29 Jul 2017 12:38:14 +0200 (CEST)
Received: from mail.local.incore (fwintern.dmz [10.0.0.253])
 by dss.incore.de (Postfix) with ESMTP id 34A3B6790A;
 Sat, 29 Jul 2017 12:38:14 +0200 (CEST)
Received: from bsdmhs.longwitz (unknown [192.168.99.6])
 by mail.local.incore (Postfix) with ESMTP id C5DD8508A1;
 Sat, 29 Jul 2017 12:38:13 +0200 (CEST)
Message-ID: <597C6595.7070404@incore.de>
Date: Sat, 29 Jul 2017 12:38:13 +0200
From: Andreas Longwitz <longwitz@incore.de>
User-Agent: Thunderbird 2.0.0.19 (X11/20090113)
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
CC: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
References: <596C7201.8090700@incore.de>
 <201707180044.v6I0iKvg040471@chez.mckusick.com>
 <20170718102200.GT1935@kib.kiev.ua>
In-Reply-To: <20170718102200.GT1935@kib.kiev.ua>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Jul 2017 10:45:04 -0000

hello Kirk and Kostik,

you are both right.

> So, I really do not understand how you can end up with a buffer with
> invalid contents if B_CACHE is set because we are open coding bread()
> here and that is the criterion it uses to read. If however, you do
> the read when B_CACHE is not set, then as noted you will get the
> wrong results and crash. But see also my comments below when B_CACHE
> is not set.

The patch

-- ffs_snapshot.c.1st  2016-06-08 17:25:21.000000000 +0200
+++ ffs_snapshot.c      2017-07-29 10:51:34.682067000 +0200
@@ -1403,7 +1403,7 @@
         */
        bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
        bp->b_blkno = fsbtodb(fs, blkno);
-       if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
+       if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
            (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
                brelse(bp);
                return (error);

is correct, the buffer has always the correct data, when B_CACHE is set
after calling getblk(). So the calls to readblock() are unnecessary in
this case. In my first test the crash of following rm probably was
induced by other readblock() calls, but I did not know at that time.

> So the way to fix the bug is to read gjournal code and understand why
> does it sometime returns wrong data.

Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's
in the flush_queue:
         1 2 3 4 5 6 7 8 9 10
and another 10 bio's goes to the flush queue before all the first 10
bio's was removed from the flush queue we should have something like
         6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
but because of the bug we end up with
         6 11 12 13 14  15 16 17 18 19 20 7 8 9 10.
So the sequence of the bio's is damaged in the flush queue (and
therefore in the journal an disk !). This error can be triggered by
ffs_snapshot(), when a block is read with readblock() and gjournal finds
this block in the broken flush queue before he goes to the correct
active queue.

The following patch solved this problem for me:

--- g_journal.c.orig     2017-05-08 14:16:42.000000000 +0200
+++ g_journal.c          2017-07-28 21:25:58.891881000 +0200
@@ -1261,7 +1261,7 @@
        strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
sizeof(hdr.jrh_magic));

        bioq = &sc->sc_active.jj_queue;
-       pbp = sc->sc_flush_queue;
+       GJQ_LAST(sc->sc_flush_queue, pbp);

        fbp = g_alloc_bio();
        fbp->bio_parent = NULL;

with macro GJQ_LAST defined in g_journal.h:

#define GJQ_LAST(head, bp)      do {                             \
  struct bio *_bp;                                               \
                                                                 \
  if ((head) == NULL) {                                          \
       (bp) = (head);                                            \
       break;                                                    \
  }                                                              \
  for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
  (bp) = (_bp);                                                  \
} while (0)

Some more annotations about g_journal.c:

- The patch given in bug 198500 is necessary to run gjournal with
correct and adequate cache size.
- The cache size for gjournal can be set by two tunables: cache.limit
and cache.divisor. I do not know the best practice if there are two
variables for setting the same resource. At the moment cache.limit is
ignored at boot time, because it will be overwritten in g_journal_init.
- In g_journal_flush() there is a writeonly variable "size".
- If one process writes synchron a block and onother process reads the
same block just after the write starts, then it is possible that the
reader gets the new block before the writer has finished his write call.
This happens, when the write goes to the delayed_queue for some time
from where the read gets the data. But the write must wait until the bio
goes to the current queue, where g_io_deliver() is called. If this is
not intended, then the delayed queue should not be used for reads.

-- 
Dr. Andreas Longwitz