From owner-freebsd-fs@freebsd.org  Mon Oct  9 07:09:29 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id D3905E277EF
 for <freebsd-fs@mailman.ysv.freebsd.org>; Mon,  9 Oct 2017 07:09:29 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 57FCC6B3F3
 for <freebsd-fs@FreeBSD.org>; Mon,  9 Oct 2017 07:09:29 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v9979SxO025214
 for <freebsd-fs@FreeBSD.org>; Mon, 9 Oct 2017 07:09:29 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 222734] 11.1-RELEASE kernel panics while importing ZFS pool
Date: Mon, 09 Oct 2017 07:09:28 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 11.1-RELEASE
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: avg@FreeBSD.org
X-Bugzilla-Status: Closed
X-Bugzilla-Resolution: Not A Bug
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-222734-3630-0bwrAhqucJ@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-222734-3630@https.bugs.freebsd.org/bugzilla/>
References: <bug-222734-3630@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 09 Oct 2017 07:09:30 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222734

--- Comment #4 from Andriy Gapon <avg@FreeBSD.org> ---
(In reply to Ben RUBSON from comment #3)

Not sure what's scary... (at least, scarier than before).

But let me try to clarify the problem first.
A bit flip happened in RAM (non-ECC) and some corrupted (meta-)data got
written to disk.  It happened to be a block pointer within an indirect block.
For ZFS the indirect block looked totally valid as its checksum was calculated
after the bit flip.  So, ZFS had no reason to distrust the block pointers in
the block.
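
As a rough illustration of why the checksum could not help here (this is not
ZFS code: CRC32 stands in for the real checksum, and write_block,
verify_block and flip_bit are made-up names), a checksum computed after the
flip simply vouches for the already corrupted bytes:

```python
import zlib

def write_block(payload):
    """Return (bytes stored on disk, checksum recorded for them)."""
    return payload, zlib.crc32(payload)

def verify_block(stored, checksum):
    """Check stored bytes against the checksum recorded at write time."""
    return zlib.crc32(stored) == checksum

def flip_bit(data, bit):
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

original = b"indirect block: pointer to child at offset 4096"

# The flip happens in RAM, before the checksum is calculated.
corrupted_in_ram = flip_bit(original, 5)
stored, checksum = write_block(corrupted_in_ram)

checksum_ok = verify_block(stored, checksum)  # the checksum matches
content_ok = stored == original               # but the content is wrong

# For contrast, a flip that happens after the write is caught.
flipped_on_disk_ok = verify_block(flip_bit(stored, 5), checksum)
```

The checksum only protects data from the moment it is calculated; anything
that went wrong before that point is sealed in as if it were valid.
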
Still, the newer ZFS does some additional validation (sanity checking) of
those block pointers while the older ZFS fully trusted them to be correct.  A
corrupt block pointer would typically result in a crash later on.  And such a
crash is hard(-er) to debug, that's why the extra checks were added.  In some
cases the corruption would be almost benign, so things would appear to be
okay.  In this case, the block pointer was actually a hole block pointer and
the corruption was of the almost benign variety.
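
A minimal sketch of that kind of sanity checking, under loud assumptions: the
BlockPointer fields, the limits and sanity_check are hypothetical and much
simpler than the real on-disk block pointer.  The idea is only that fields
with impossible values can be rejected early, while a flip that lands on
another plausible value slips through:

```python
from dataclasses import dataclass

# Hypothetical limits for illustration; not the real ZFS layout.
NUM_VDEVS = 4
VDEV_SIZE = 1 << 30  # bytes per device
KNOWN_CHECKSUMS = {"fletcher4", "sha256"}

@dataclass
class BlockPointer:
    vdev: int
    offset: int
    checksum_alg: str

def sanity_check(bp):
    """Return a list of problems; an empty list means 'looks plausible'."""
    problems = []
    if not 0 <= bp.vdev < NUM_VDEVS:
        problems.append("vdev %d out of range" % bp.vdev)
    if not 0 <= bp.offset < VDEV_SIZE:
        problems.append("offset %d out of range" % bp.offset)
    if bp.checksum_alg not in KNOWN_CHECKSUMS:
        problems.append("unknown checksum %r" % bp.checksum_alg)
    return problems

good = BlockPointer(vdev=1, offset=4096, checksum_alg="fletcher4")

# Bit 5 of the vdev field flipped: 1 becomes 33, which cannot exist.
bad = BlockPointer(vdev=1 ^ (1 << 5), offset=4096, checksum_alg="fletcher4")

# Bit 13 of the offset flipped: 4096 becomes 12288, wrong but in range.
subtle = BlockPointer(vdev=1, offset=4096 ^ (1 << 13),
                      checksum_alg="fletcher4")
```

Here sanity_check(bad) reports the vdev, while sanity_check(subtle) returns
an empty list, so such a pointer would only cause trouble later on.
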
So, really, the culprit here was faulty RAM.  If your data gets corrupted in
memory, you have corrupted data and there is no way ZFS can help with that.
If your metadata gets corrupted in memory, then ZFS might be able to detect
that and bail out early, or it can fail to detect the problem and crash later
on, or it can even try to read a wrong block, but then the checksum error is
the most likely outcome.
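
The wrong-block case can be sketched as follows.  This is a toy model, not
ZFS code: the dict-based disk, the pointer layout and read_via are
assumptions made for illustration.  It relies only on the fact that a block
pointer records the expected checksum of the block it points to:

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).hexdigest()

# Toy "disk": offset -> block contents.
disk = {
    4096: b"file data block A",
    12288: b"unrelated block B",
}

# The parent records both the address and the checksum of the child.
pointer = {"offset": 4096, "checksum": digest(disk[4096])}

def read_via(ptr):
    """Read the block a pointer refers to and verify it against the pointer."""
    data = disk[ptr["offset"]]
    if digest(data) != ptr["checksum"]:
        raise IOError("checksum error at offset %d" % ptr["offset"])
    return data

# One flipped bit in the address: 4096 becomes 12288, a different valid block.
corrupt_pointer = dict(pointer, offset=pointer["offset"] ^ (1 << 13))
```

read_via(pointer) returns block A, while read_via(corrupt_pointer) raises the
checksum error, because block B does not match the checksum recorded for A.
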
The usual advice applies: use ECC memory and have backups.
Even on a system with ECC memory, some hardware can corrupt memory by writing
to a wrong location via DMA; even on a system with reliable hardware, there
can still be a kernel (driver) bug that corrupts memory contents, etc.

-- 
You are receiving this mail because:
You are the assignee for the bug.