Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 17 Jul 2017 09:07:20 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Kirk McKusick <mckusick@mckusick.com>
Cc:        Andreas Longwitz <longwitz@incore.de>, freebsd-fs@freebsd.org
Subject:   Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID:  <20170717060720.GH1935@kib.kiev.ua>
In-Reply-To: <201707161633.v6GGXtjd035000@chez.mckusick.com>
References:  <596B8BC7.3030505@incore.de> <201707161633.v6GGXtjd035000@chez.mckusick.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Jul 16, 2017 at 09:33:55AM -0700, Kirk McKusick wrote:
> I am copying Kostik on this email even though he is on freebsd-fs
> as I would like him to review my answer below.
> 
> > Date: Sun, 16 Jul 2017 17:52:39 +0200
> > From: Andreas Longwitz <longwitz@incore.de>
> > To: Kirk McKusick <mckusick@mckusick.com>
> > CC: freebsd-fs@freebsd.org
> > Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> > 
> > Hello,
> > 
> > I like to discuss the following code snippet from function
> > indiracct_ufs2() in source file ffs_snapshot.c:
> > 
> >     /*
> >      * We have to expand bread here since it will deadlock looking
> >      * up the block number for any blocks that are not in the cache.
> >      */
> >     bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
> >     bp->b_blkno = fsbtodb(fs, blkno);
> >     if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> >         (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
> >             brelse(bp);
> >             return (error);
> >     }
> > 
> > In all my tests (all mount types, good or bad result) the flags B_DONE
> > and B_DELWRI in bp->b_flags were always zero, so the data buffer given by
> > getblk() at bp->b_data is always overwritten with the data from the
> > explicit I/O performed by readblock(). By the way the flag B_CACHE was
> > always set.
> 
> Key observation here that B_CACHE is set! That means that the data
> in the block is valid and does not need to be read again. So the fix
> is this:
> 
> Index: sys/ufs/ffs/ffs_snapshot.c
> ===================================================================
> --- sys/ufs/ffs/ffs_snapshot.c	(revision 321045)
> +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
> @@ -1394,7 +1394,7 @@
>  	 */
>  	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>  	bp->b_blkno = fsbtodb(fs, blkno);
> -	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
> +	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>  	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>  		brelse(bp);
>  		return (error);
> 
> Absent gjournal the extra read was unnecessary but harmless. The
> buffer exists with B_DELWRI so we will not read it, or it is locked
> while being written to disk, so we will wait for the write to complete,
> after which it is safe to read it. With gjournal there is a window of
> vulnerability between when the buffer is marked B_DELWRI and when it
> is actually written to disk. This is because gjournal will say the
> write has completed when it has not yet actually done so. Thus when
> we read it, we get incorrect results. But until gjournal actually
> completes the write, it will keep the buffer around (marked B_CACHE),
> so as long as we check for B_CACHE to avoid the read (as we should
> anyway to avoid an unnecessary read) we should always get consistent
> results.
I tend to disagree with the fix. What if the cached block being
discarded before we enter this code fragment ?

I believe the bug is indeed clear after your analysis, but it is in
gjournal.  Problem is that the disk (gjournal substrate layer, but for UFS
it is all disk) returns data different from what was written to the
volume.  After the write bio signalled completion, unless other write
was done after it, reads _must_ return the content of the write.

The right fix is for gjournal is to perform 'store forwarding', i.e.
to return data from the journal until journal is flushd to disk.

> 
> > I have saved the buffer given by getblk() and compared this buffer
> > with the data read by readblock(). Without gjournal there is never
> > a difference. Using a gjournaled partition  things are more
> > sophisticated. Taking a snapshot in my example environment needs
> > 678 calls of getblk(). In the "good case" all I/O's performed by
> > readblock() gave the same data as getblk(), in the "bad case" there
> > are some differences between the data in buffer cache seen by
> > getblk() and the data on disk seen by readblock(). Mostly two or
> > three of the 678 blocks are different. In this case the block read
> > by readblock() is always zero, the block read by getblk() has a lot
> > of BLK_NOCOPY values, sometimes 4096 (a full block).
> > 
> > There is one other flag in bp->b_flags given by getblk that is of
> > interest: B_CLUSTEROK. This flag is always in sync with the flag
> > BV_SCANNED in bp->b_vflags. In the above code snippet I can check this
> > flag too before going to disk:
> > 
> >    if (bp->b_flags & (B_DONE | B_DELWRI | B_CLUSTEROK)) == 0 && ...
> > 
> > With this patch the problem has gone in all my tests. The resulting
> > snapblklist written to the end of the snapfile is always the same.
> 
> It is merely coincidence that B_CLUSTEROK is set. But the fact that
> skipping the read fixes your problem leads me to believe that checking
> for B_CACHE will fix your problem.
> 
> > I am not familiar with buffer and memory management of FreeBSD, so I can
> > not explain what is going on. But I see in cgaccount() a lot of
> > BLK_NOCOPY values are set and in expunge_ufs2() these values should be
> > seen again. It seems they are in the buffer cache but sometimes not on
> > disk. The patch given above in my test example reads data from buffer
> > cache instead of from disk 141 times.
> 
> The number of buffer flags have grown over time and their meaning has
> subtlely changed. Even I who have been around since the beginning cannot
> claim to fully comprehend exactly what they mean. So, the fact that you
> find them incomprehensible is unfortunately all too understandable.
> 
> > All sync operations in ffs_snapshot() use the parameter MNT_WAIT, the
> > gjournal switcher runs all the time but syncs the same partition with
> > MNT_NOWAIT (g_journal.c: vfs_msync followed by VFS_SYNC). I do not know
> > if this can lead to a race condition between gjournal and snapshot.
> 
> I went down this line of thinking with your first email and did not come
> up with anything that explained the problem. So I am glad that you have
> found something that I do believe explains what is going on.
> 
> 	Kirk McKusick



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170717060720.GH1935>