From owner-freebsd-fs@freebsd.org  Sat Jul 29 15:24:51 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A0C08DC5C63
 for <freebsd-fs@mailman.ysv.freebsd.org>; Sat, 29 Jul 2017 15:24:51 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 493FF6F078
 for <freebsd-fs@freebsd.org>; Sat, 29 Jul 2017 15:24:51 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kib@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id v6TFOjBX095142
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Sat, 29 Jul 2017 18:24:45 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua v6TFOjBX095142
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id v6TFOiha095141;
 Sat, 29 Jul 2017 18:24:44 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Sat, 29 Jul 2017 18:24:44 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Andreas Longwitz <longwitz@incore.de>
Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID: <20170729152444.GB1700@kib.kiev.ua>
References: <596C7201.8090700@incore.de>
 <201707180044.v6I0iKvg040471@chez.mckusick.com>
 <20170718102200.GT1935@kib.kiev.ua> <597C6595.7070404@incore.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <597C6595.7070404@incore.de>
User-Agent: Mutt/1.8.3 (2017-05-23)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Jul 2017 15:24:51 -0000

On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote:
> hello Kirk and Kostik,
> 
> you are both right.
> 
> > So, I really do not understand how you can end up with a buffer with
> > invalid contents if B_CACHE is set because we are open coding bread()
> > here and that is the criterion it uses to read. If however, you do
> > the read when B_CACHE is not set, then as noted you will get the
> > wrong results and crash. But see also my comments below when B_CACHE
> > is not set.
> 
> The patch
> 
> -- ffs_snapshot.c.1st  2016-06-08 17:25:21.000000000 +0200
> +++ ffs_snapshot.c      2017-07-29 10:51:34.682067000 +0200
> @@ -1403,7 +1403,7 @@
>          */
>         bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>         bp->b_blkno = fsbtodb(fs, blkno);
> -       if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> +       if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>             (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>                 brelse(bp);
>                 return (error);
> 
> is correct, the buffer has always the correct data, when B_CACHE is set
> after calling getblk(). So the calls to readblock() are unnecessary in
> this case. In my first test the crash of following rm probably was
> induced by other readblock() calls, but I did not know at that time.
Ok.  In fact it is not clear to me why B_DONE or B_DELWRI is tested at all
there.  We only need to know that the buffer content is valid.

> 
> > So the way to fix the bug is to read gjournal code and understand why
> > does it sometime returns wrong data.
> 
> Yes, gjournal is broken in handling his flush_queue. If we have 10 bio's
> in the flush_queue:
>          1 2 3 4 5 6 7 8 9 10
> and another 10 bio's goes to the flush queue before all the first 10
> bio's was removed from the flush queue we should have something like
>          6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
> but because of the bug we end up with
>          6 11 12 13 14  15 16 17 18 19 20 7 8 9 10.
> So the sequence of the bio's is damaged in the flush queue (and
> therefore in the journal an disk !). This error can be triggered by
> ffs_snapshot(), when a block is read with readblock() and gjournal finds
> this block in the broken flush queue before he goes to the correct
> active queue.
> 
> The following patch solved this problem for me:
> 
> --- g_journal.c.orig     2017-05-08 14:16:42.000000000 +0200
> +++ g_journal.c          2017-07-28 21:25:58.891881000 +0200
> @@ -1261,7 +1261,7 @@
>         strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
> sizeof(hdr.jrh_magic));
> 
>         bioq = &sc->sc_active.jj_queue;
> -       pbp = sc->sc_flush_queue;
> +       GJQ_LAST(sc->sc_flush_queue, pbp);
> 
>         fbp = g_alloc_bio();
>         fbp->bio_parent = NULL;
> 
> with macro GJQ_LAST defined in g_journal.h:
> 
> #define GJQ_LAST(head, bp)      do {                             \
>   struct bio *_bp;                                               \
>                                                                  \
>   if ((head) == NULL) {                                          \
>        (bp) = (head);                                            \
>        break;                                                    \
>   }                                                              \
>   for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
>   (bp) = (_bp);                                                  \
> } while (0)
> 
> Some more annotations about g_journal.c:
> 
> - The patch given in bug 198500 is necessary to run gjournal with
> correct and adequate cache size.
> - The cache size for gjournal can be set by two tunables: cache.limit
> and cache.divisor. I do not know the best practice if there are two
> variables for setting the same resource. At the moment cache.limit is
> ignored at boot time, because it will be overwritten in g_journal_init.
> - In g_journal_flush() there is a writeonly variable "size".
> - If one process writes synchron a block and onother process reads the
> same block just after the write starts, then it is possible that the
> reader gets the new block before the writer has finished his write call.
> This happens, when the write goes to the delayed_queue for some time
> from where the read gets the data. But the write must wait until the bio
> goes to the current queue, where g_io_deliver() is called. If this is
> not intended, then the delayed queue should not be used for reads.

I recomment you to contact pjd@freebsd.org directly about both the gjournal
patch and WRT further observations you have.