From owner-freebsd-fs@FreeBSD.ORG Mon Jun 7 10:38:32 2010
Return-Path: 
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1DF5F1065672
	for ; Mon, 7 Jun 2010 10:38:32 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta13.westchester.pa.mail.comcast.net
	(qmta13.westchester.pa.mail.comcast.net [76.96.59.243])
	by mx1.freebsd.org (Postfix) with ESMTP id BFE788FC1D
	for ; Mon, 7 Jun 2010 10:38:31 +0000 (UTC)
Received: from omta17.westchester.pa.mail.comcast.net ([76.96.62.89])
	by qmta13.westchester.pa.mail.comcast.net with comcast
	id SyYS1e0041vXlb85DyeXYP; Mon, 07 Jun 2010 10:38:31 +0000
Received: from koitsu.dyndns.org ([98.248.46.159])
	by omta17.westchester.pa.mail.comcast.net with comcast
	id SyeW1e00A3S48mS3dyeXaE; Mon, 07 Jun 2010 10:38:31 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 66A9F9B418; Mon, 7 Jun 2010 03:38:29 -0700 (PDT)
Date: Mon, 7 Jun 2010 03:38:29 -0700
From: Jeremy Chadwick 
To: Andriy Gapon 
Message-ID: <20100607103829.GA50106@icarus.home.lan>
References: <4C0CAABA.2010506@icyb.net.ua>
	<20100607083428.GA48419@icarus.home.lan>
	<4C0CB3FC.8070001@icyb.net.ua>
	<20100607090850.GA49166@icarus.home.lan>
	<4C0CBBCA.3050304@icyb.net.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4C0CBBCA.3050304@icyb.net.ua>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs i/o error, no driver error
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Mon, 07 Jun 2010 10:38:32 -0000

On Mon, Jun 07, 2010 at 12:28:42PM +0300, Andriy Gapon wrote:
> on 07/06/2010 12:08 Jeremy Chadwick said the following:
> > On Mon, Jun 07, 2010 at 11:55:24AM +0300, Andriy Gapon wrote:
> >> on 07/06/2010 11:34 Jeremy Chadwick said the following:
> >>> On Mon, Jun 07, 2010 at 11:15:54AM +0300, Andriy Gapon wrote:
> >>>> During recent zpool scrub one read error was detected and "128K repaired".
> >>>>
> >>>> In system log I see the following message:
> >>>> ZFS: vdev I/O failure, zpool=tank
> >>>> path=/dev/gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff offset=284456910848
> >>>> size=131072 error=5
> >>>>
> >>>> On the other hand, there are no other errors, nothing from geom, ahci, etc.
> >>>> Why would that happen?  What kind of error could this be?
> >>> I believe this indicates silent data corruption[1], which ZFS can
> >>> auto-correct if the pool is a mirror or raidz (otherwise it can detect
> >>> the problem but not fix it).
> >>
> >> This pool is a mirror.
> >>
> >>> This can happen for a lot of reasons, but tracking down the source is
> >>> often difficult.  Usually it indicates the disk itself has some kind
> >>> of problem (cache going bad, some sector remaps which didn't happen
> >>> or failed, etc.).
> >>
> >> Please note that this is not a CKSUM error, but READ error.
> >
> > Okay, then it indicates reading some data off the disk failed.  ZFS
> > auto-corrected it by reading the data from the other member in the pool
> > (ada0p4).  That's confirmed here:
>
> Yes, right, of course.
> If you read my original post you'll see that my question was: why ZFS saw
> I/O error, but disk/controller/geom/etc driver didn't see it.
> I do not see us moving towards an answer to that.

My understanding is that a "vdev I/O error" indicates some sort of
communication failure with a member in the pool, or with some other layer
within FreeBSD (GEOM, I think, like you said).  I don't think there has to
be a 1:1 ratio between vdev I/O errors and controller/disk errors.

For AHCI and storage controllers, I/O errors are messages returned from
the controller to the OS, or from the disk through the controller to the
OS.
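The READ-vs-CKSUM distinction being discussed shows up directly in the
per-device error columns of "zpool status".  As a minimal sketch (the
pool layout and the non-zero count below are invented for illustration,
not taken from this thread), the leaf devices with READ errors can be
pulled out with awk:

```shell
# Hypothetical `zpool status` output for a two-way mirror.  A READ error
# means the read operation itself failed; a CKSUM error means the read
# succeeded but the returned data failed its checksum.
status='  NAME        STATE     READ WRITE CKSUM
  tank        ONLINE       0     0     0
    mirror-0  ONLINE       0     0     0
      ada0p4  ONLINE       0     0     0
      ada1p4  ONLINE       1     0     0'

# Print each leaf device whose READ column (field 3) is non-zero.
printf '%s\n' "$status" | awk '$1 ~ /^ada/ && $3 > 0 { print $1, "READ errors:", $3 }'
```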
I suppose it's possible ZFS could be throwing an error for something that
isn't actually block/disk-level.  I'm interested to see what this turns
out to be!

I agree that your SMART statistics look fine -- the only test that isn't
working is a manual or automatic offline data collection test, but this
one fails (gets aborted) pretty often when the system is in use.  You can
see that here:

> Offline data collection status:  (0x84)	Offline data collection activity
> 					was suspended by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.

This is the test that "-t offline" induces (not -t short/long).  It takes
a very long time to run, which is why it often gets aborted:

> Total time to complete Offline
> data collection: 		 (11160) seconds.

That's the only thing that looks even remotely of concern with ada1, and
it's not even worth focusing on.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
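[Editorially appended to the archived message: the 11160-second figure
quoted above is worth making concrete.  SMART reports offline data
collection time in raw seconds; the trivial conversion below shows why an
in-use system so often interrupts the test before it finishes.]

```shell
# Convert SMART's "Total time to complete Offline data collection"
# (reported in seconds) into hours and minutes.
secs=11160
hours=$((secs / 3600))           # whole hours
mins=$(( (secs % 3600) / 60 ))   # remaining minutes
echo "${hours}h ${mins}m"        # -> 3h 6m
```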