From owner-svn-src-all@freebsd.org  Tue Feb  6 00:19:48 2018
Return-Path: <owner-svn-src-all@freebsd.org>
Delivered-To: svn-src-all@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id D425DEDB462;
 Tue,  6 Feb 2018 00:19:47 +0000 (UTC)
 (envelope-from mckusick@FreeBSD.org)
Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org
 [IPv6:2610:1c1:1:606c::19:3])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "mxrelay.nyi.freebsd.org",
 Issuer "Let's Encrypt Authority X3" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 883626962A;
 Tue,  6 Feb 2018 00:19:47 +0000 (UTC)
 (envelope-from mckusick@FreeBSD.org)
Received: from repo.freebsd.org (repo.freebsd.org
 [IPv6:2610:1c1:1:6068::e6a:0])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 6284B101C1;
 Tue,  6 Feb 2018 00:19:47 +0000 (UTC)
 (envelope-from mckusick@FreeBSD.org)
Received: from repo.freebsd.org ([127.0.1.37])
 by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id w160Jlhb076254;
 Tue, 6 Feb 2018 00:19:47 GMT (envelope-from mckusick@FreeBSD.org)
Received: (from mckusick@localhost)
 by repo.freebsd.org (8.15.2/8.15.2/Submit) id w160Jl1W076251;
 Tue, 6 Feb 2018 00:19:47 GMT (envelope-from mckusick@FreeBSD.org)
Message-Id: <201802060019.w160Jl1W076251@repo.freebsd.org>
X-Authentication-Warning: repo.freebsd.org: mckusick set sender to
 mckusick@FreeBSD.org using -f
From: Kirk McKusick <mckusick@FreeBSD.org>
Date: Tue, 6 Feb 2018 00:19:47 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
 svn-src-head@freebsd.org
Subject: svn commit: r328914 - in head/sys: kern ufs/ffs
X-SVN-Group: head
X-SVN-Commit-Author: mckusick
X-SVN-Commit-Paths: in head/sys: kern ufs/ffs
X-SVN-Commit-Revision: 328914
X-SVN-Commit-Repository: base
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all/>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 06 Feb 2018 00:19:48 -0000

Author: mckusick
Date: Tue Feb  6 00:19:46 2018
New Revision: 328914
URL: https://svnweb.freebsd.org/changeset/base/328914

Log:
  Occasional cylinder-group check-hash errors were being reported on
  systems running with a heavy filesystem load. Tracking down this
  bug was elusive because there were actually two problems. Sometimes
  the in-memory check hash was wrong and sometimes the check hash
  computed when doing the read was wrong. The occurrence of either
  error caused a check-hash mismatch to be reported.
  
  The first error was that the check hash in the in-memory cylinder
  group was incorrect. This error was caused by the following
  sequence of events:
  
  - We read a cylinder-group buffer and the check hash is valid.
  - We update its cg_time and cg_old_time which makes the in-memory
    check-hash value invalid but we do not mark the cylinder group dirty.
  - We do not make any other changes to the cylinder group, so we
    never mark it dirty, thus do not write it out, and hence never
    update the incorrect check hash for the in-memory buffer.
  - Later, the buffer gets freed, but the page with the old incorrect
    check hash is still in the VM cache.
  - Later, we read the cylinder group again, and the first page with
    the old check hash is still in the VM cache, but some other pages
    are not, so we have to do a read.
  - The read does not actually get the first page from disk, but rather
    from the VM cache, resulting in the old check hash in the buffer.
  - The value computed after doing the read does not match causing the
    error to be printed.
  
  The fix for this problem is to only set cg_time and cg_old_time as
  the cylinder group is being written to disk. This keeps the in-memory
  check-hash valid unless the cylinder group has had other modifications
  which will require it to be written with a new check hash calculated.
  It also requires that the check hash be recalculated in the in-memory
  cylinder group when it is marked clean after doing a background write.
  
  The second problem was that the check hash computed at the end of the
  read was incorrect because the calculation of the check hash on
  completion of the read was being done too soon.
  
  - When a read completes we had the following sequence:
  
    - bufdone()
    -- b_ckhashcalc (calculates check hash)
    -- bufdone_finish()
    --- vfs_vmio_iodone() (replaces bogus pages with the cached ones)
  
  - When we are reading a buffer where one or more pages are already
    in memory (but not all pages, or we wouldn't be doing the read),
    the I/O is done with bogus_page mapped in for the pages that exist
    in the VM cache. This mapping is done to avoid corrupting the
    cached pages if there is any I/O overrun. The vfs_vmio_iodone()
    function is responsible for replacing the bogus_page(s) with the
    cached ones. But we were calculating the check hash before the
    bogus_page(s) were replaced. Hence, when we were calculating the
    check hash, we were partly reading from bogus_page, which means
    we calculated a bad check hash (e.g., because multiple pages have
    been mapped to bogus_page, so its contents are indeterminate).
  
  The second fix is to move the check-hash calculation from bufdone()
  to bufdone_finish() after the call to vfs_vmio_iodone() so that it
  computes the check hash over the correct set of pages.
  
  With these two changes, the occasional cylinder-group check-hash
  errors are gone.
  
  Submitted by: David Pfitzner <dpfitzner@netflix.com>
  Reviewed by: kib
  Tested by: David Pfitzner

Modified:
  head/sys/kern/vfs_bio.c
  head/sys/ufs/ffs/ffs_alloc.c
  head/sys/ufs/ffs/ffs_vfsops.c

Modified: head/sys/kern/vfs_bio.c
==============================================================================
--- head/sys/kern/vfs_bio.c	Tue Feb  6 00:02:30 2018	(r328913)
+++ head/sys/kern/vfs_bio.c	Tue Feb  6 00:19:46 2018	(r328914)
@@ -4075,10 +4075,6 @@ bufdone(struct buf *bp)
 	runningbufwakeup(bp);
 	if (bp->b_iocmd == BIO_WRITE)
 		dropobj = bp->b_bufobj;
-	else if ((bp->b_flags & B_CKHASH) != 0) {
-		KASSERT(buf_mapped(bp), ("biodone: bp %p not mapped", bp));
-		(*bp->b_ckhashcalc)(bp);
-	}
 	/* call optional completion function if requested */
 	if (bp->b_iodone != NULL) {
 		biodone = bp->b_iodone;
@@ -4114,6 +4110,13 @@ bufdone_finish(struct buf *bp)
 		    !(bp->b_ioflags & BIO_ERROR))
 			bp->b_flags |= B_CACHE;
 		vfs_vmio_iodone(bp);
+	}
+	if ((bp->b_flags & B_CKHASH) != 0) {
+		KASSERT(bp->b_iocmd == BIO_READ,
+		    ("bufdone_finish: b_iocmd %d not BIO_READ", bp->b_iocmd));
+		KASSERT(buf_mapped(bp),
+		    ("bufdone_finish: bp %p not mapped", bp));
+		(*bp->b_ckhashcalc)(bp);
 	}
 
 	/*

Modified: head/sys/ufs/ffs/ffs_alloc.c
==============================================================================
--- head/sys/ufs/ffs/ffs_alloc.c	Tue Feb  6 00:02:30 2018	(r328913)
+++ head/sys/ufs/ffs/ffs_alloc.c	Tue Feb  6 00:19:46 2018	(r328914)
@@ -2667,9 +2667,18 @@ ffs_getcg(fs, devvp, cg, bpp, cgpp)
 	}
 	bp->b_flags &= ~B_CKHASH;
 	bp->b_xflags |= BX_BKGRDWRITE;
+	/*
+	 * If we are using check hashes on the cylinder group then we want
+	 * to limit changing the cylinder group time to when we are actually
+	 * going to write it to disk so that its check hash remains correct
+	 * in memory. If the CK_CYLGRP flag is set the time is updated in
+	 * ffs_bufwrite() as the buffer is queued for writing. Otherwise we
+	 * update the time here as we have done historically.
+	 */
 	if ((fs->fs_metackhash & CK_CYLGRP) != 0)
 		bp->b_xflags |= BX_CYLGRP;
-	cgp->cg_old_time = cgp->cg_time = time_second;
+	else
+		cgp->cg_old_time = cgp->cg_time = time_second;
 	*bpp = bp;
 	*cgpp = cgp;
 	return (0);

Modified: head/sys/ufs/ffs/ffs_vfsops.c
==============================================================================
--- head/sys/ufs/ffs/ffs_vfsops.c	Tue Feb  6 00:02:30 2018	(r328913)
+++ head/sys/ufs/ffs/ffs_vfsops.c	Tue Feb  6 00:19:46 2018	(r328914)
@@ -2082,6 +2082,7 @@ static int
 ffs_bufwrite(struct buf *bp)
 {
 	struct buf *newbp;
+	struct cg *cgp;
 
 	CTR3(KTR_BUF, "bufwrite(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
 	if (bp->b_flags & B_INVAL) {
@@ -2163,8 +2164,16 @@ ffs_bufwrite(struct buf *bp)
 		/*
 		 * Initiate write on the copy, release the original.  The
 		 * BKGRDINPROG flag prevents it from going away until 
-		 * the background write completes.
+		 * the background write completes. We have to recalculate
+		 * its check hash in case the buffer gets freed and then
+		 * reconstituted from the buffer cache during a later read.
 		 */
+		if ((bp->b_xflags & BX_CYLGRP) != 0) {
+			cgp = (struct cg *)bp->b_data;
+			cgp->cg_ckhash = 0;
+			cgp->cg_ckhash =
+			    calculate_crc32c(~0L, bp->b_data, bp->b_bcount);
+		}
 		bqrelse(bp);
 		bp = newbp;
 	} else
@@ -2174,6 +2183,13 @@ ffs_bufwrite(struct buf *bp)
 
 	/* Let the normal bufwrite do the rest for us */
 normal_write:
+	/*
+	 * If we are writing a cylinder group, update its time.
+	 */
+	if ((bp->b_xflags & BX_CYLGRP) != 0) {
+		cgp = (struct cg *)bp->b_data;
+		cgp->cg_old_time = cgp->cg_time = time_second;
+	}
 	return (bufwrite(bp));
 }