From owner-freebsd-fs@FreeBSD.ORG  Mon Aug 20 12:20:36 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 410A616A417
	for <freebsd-fs@freebsd.org>; Mon, 20 Aug 2007 12:20:36 +0000 (UTC)
	(envelope-from kvs@binarysolutions.dk)
Received: from solow.pil.dk (relay.pil.dk [195.41.47.164])
	by mx1.freebsd.org (Postfix) with ESMTP id 069A913C45E
	for <freebsd-fs@freebsd.org>; Mon, 20 Aug 2007 12:20:35 +0000 (UTC)
	(envelope-from kvs@binarysolutions.dk)
Received: from coruscant.local (naboo.binarysolutions.dk [80.196.17.173])
	by solow.pil.dk (Postfix) with ESMTP id 249E31CC0BE;
	Mon, 20 Aug 2007 14:20:35 +0200 (CEST)
Received: by coruscant.local (Postfix, from userid 502)
	id 728DE5A0B4D; Mon, 20 Aug 2007 14:20:33 +0200 (CEST)
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
References: <m1wsvtkviw.fsf@binarysolutions.dk>
	<20070820112946.GC16977@garage.freebsd.pl>
From: Kenneth Vestergaard Schmidt <kvs@pil.dk>
Date: Mon, 20 Aug 2007 14:20:33 +0200
In-Reply-To: <20070820112946.GC16977@garage.freebsd.pl> (Pawel Jakub Dawidek's
	message of "Mon\, 20 Aug 2007 13\:29\:46 +0200")
Message-ID: <m1ps1iz9bi.fsf@binarysolutions.dk>
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (darwin)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS: 'checksum mismatch' all over the place
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 20 Aug 2007 12:20:36 -0000

Pawel Jakub Dawidek <pjd@FreeBSD.org> writes:
>> The drive-cage was previously used to expose a RAID-5 array, composed of
>> the 12 disks. This worked just fine, connecting to the same machine and
>> controller (i386 IBM xSeries X335, mpt(4) controller).
>
> How do you know it was fine? Did you have something that did
> checksumming? You could try geli with integrity verification feature
> turned on, fill the disks with some random data and then read it back,
> if your controller corrupts the data, geli should tell you this.

I may have to do this. The previous drive was almost filled to the brim
with data, which rsync checked each day, and we didn't see many
re-transfers, but that doesn't necessarily mean anything.

The same controller is used in 50+ other machines, but only connected to
two internal drives. There are no problems in those machines.

Still, the really weird thing is that we're seeing checksum errors in
the same block across many drives. This smells like an issue with the
driver, the controller, or the drive cage, rather than with ZFS or GEOM.

The machine should have been in production, but the array just failed,
and if I can't salvage it, I'll have to start over. I might just as well
try geli with integrity verification before recreating the ZFS array,
then.
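
As a rough illustration of the fill-and-read-back test suggested above,
here is a minimal Python sketch. It is not geli and does not replace
geli's integrity verification; it only writes reproducible pseudo-random
blocks to a target and reports which block indices read back differently.
The block size, the seed, and the function names are assumptions made for
the example, and the target path is whatever scratch file or spare
provider you choose.

```python
import hashlib
import os

BLOCK_SIZE = 128 * 1024  # assumption: 128 KiB per block, chosen for illustration


def block_data(seed: bytes, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Derive the expected contents of block `index` from `seed`.

    SHAKE-256 is used only as a convenient deterministic byte generator,
    so the same seed always reproduces the same data for verification.
    """
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(size)


def fill(path: str, seed: bytes, blocks: int, size: int = BLOCK_SIZE) -> None:
    """Write `blocks` pseudo-random blocks to `path`, then flush to the device."""
    # Open an existing target in place; create it if it is a new scratch file.
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as f:
        for i in range(blocks):
            f.write(block_data(seed, i, size))
        f.flush()
        os.fsync(f.fileno())


def verify(path: str, seed: bytes, blocks: int, size: int = BLOCK_SIZE) -> list:
    """Read the blocks back and return the indices that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(size) != block_data(seed, i, size):
                bad.append(i)
    return bad
```

Running fill() and then verify() with the same seed against each disk in
turn gives one list of bad block indices per disk. If the same indices
show up on several disks, that would be consistent with a fault in the
shared path (driver, controller, or cage) rather than in a single drive.
Note that when the target is a regular file, the read-back may be served
from cache instead of the disk, so a clean result there proves little.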

-- 
Kenneth Schmidt
pil.dk