From owner-freebsd-fs@FreeBSD.ORG Mon Jun 7 10:38:32 2010
Return-Path: 
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1DF5F1065672
	for ; Mon, 7 Jun 2010 10:38:32 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta13.westchester.pa.mail.comcast.net
	(qmta13.westchester.pa.mail.comcast.net [76.96.59.243])
	by mx1.freebsd.org (Postfix) with ESMTP id BFE788FC1D
	for ; Mon, 7 Jun 2010 10:38:31 +0000 (UTC)
Received: from omta17.westchester.pa.mail.comcast.net ([76.96.62.89])
	by qmta13.westchester.pa.mail.comcast.net with comcast
	id SyYS1e0041vXlb85DyeXYP; Mon, 07 Jun 2010 10:38:31 +0000
Received: from koitsu.dyndns.org ([98.248.46.159])
	by omta17.westchester.pa.mail.comcast.net with comcast
	id SyeW1e00A3S48mS3dyeXaE; Mon, 07 Jun 2010 10:38:31 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 66A9F9B418; Mon, 7 Jun 2010 03:38:29 -0700 (PDT)
Date: Mon, 7 Jun 2010 03:38:29 -0700
From: Jeremy Chadwick 
To: Andriy Gapon 
Message-ID: <20100607103829.GA50106@icarus.home.lan>
References: <4C0CAABA.2010506@icyb.net.ua>
	<20100607083428.GA48419@icarus.home.lan>
	<4C0CB3FC.8070001@icyb.net.ua>
	<20100607090850.GA49166@icarus.home.lan>
	<4C0CBBCA.3050304@icyb.net.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4C0CBBCA.3050304@icyb.net.ua>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs i/o error, no driver error
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Mon, 07 Jun 2010 10:38:32 -0000

On Mon, Jun 07, 2010 at 12:28:42PM +0300, Andriy Gapon wrote:
> on 07/06/2010 12:08 Jeremy Chadwick said the following:
> > On Mon, Jun 07, 2010 at 11:55:24AM +0300, Andriy Gapon wrote:
> >> on 07/06/2010 11:34 Jeremy Chadwick said the following:
> >>> On Mon, Jun 07, 2010 at 11:15:54AM +0300, Andriy Gapon wrote:
> >>>> During recent zpool scrub one read error was detected and "128K repaired".
> >>>>
> >>>> In system log I see the following message:
> >>>> ZFS: vdev I/O failure, zpool=tank
> >>>> path=/dev/gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff offset=284456910848
> >>>> size=131072 error=5
> >>>>
> >>>> On the other hand, there are no other errors, nothing from geom, ahci, etc.
> >>>> Why would that happen?  What kind of error could this be?
> >>> I believe this indicates silent data corruption[1], which ZFS can
> >>> auto-correct if the pool is a mirror or raidz (otherwise it can detect
> >>> the problem but not fix it).
> >>
> >> This pool is a mirror.
> >>
> >>> This can happen for a lot of reasons, but tracking down the source is
> >>> often difficult.  Usually it indicates the disk itself has some kind
> >>> of problem (cache going bad, some sector remaps which didn't happen
> >>> or failed, etc.).
> >>
> >> Please note that this is not a CKSUM error, but READ error.
> >
> > Okay, then it indicates reading some data off the disk failed.  ZFS
> > auto-corrected it by reading the data from the other member in the pool
> > (ada0p4).  That's confirmed here:
>
> Yes, right, of course.
> If you read my original post you'll see that my question was: why ZFS saw
> I/O error, but disk/controller/geom/etc driver didn't see it.
> I do not see us moving towards an answer to that.

My understanding is that a "vdev I/O error" indicates some sort of
communication failure with a member in the pool, or with some other layer
within FreeBSD (GEOM, I think, like you said).  I don't think there has to
be a 1:1 ratio between vdev I/O errors and controller/disk errors.

For AHCI and storage controllers, I/O errors are messages returned from
the controller to the OS, or from the disk through the controller to the
OS.
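The READ-vs-CKSUM distinction being discussed shows up directly in the
per-device error columns of "zpool status".  As a minimal sketch (the
pool layout and the non-zero count below are invented for illustration,
not taken from this thread), the leaf devices with READ errors can be
pulled out with awk:

```shell
# Hypothetical `zpool status` output for a two-way mirror.  A READ error
# means the read operation itself failed; a CKSUM error means the read
# succeeded but the returned data failed its checksum.
status='  NAME        STATE     READ WRITE CKSUM
  tank        ONLINE       0     0     0
    mirror-0  ONLINE       0     0     0
      ada0p4  ONLINE       0     0     0
      ada1p4  ONLINE       1     0     0'

# Print each leaf device whose READ column (field 3) is non-zero.
printf '%s\n' "$status" | awk '$1 ~ /^ada/ && $3 > 0 { print $1, "READ errors:", $3 }'
```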
I suppose it's possible ZFS could be throwing an error for something that
isn't actually block/disk-level.  I'm interested to see what this turns
out to be!

I agree that your SMART statistics look fine -- the only test that isn't
working is a manual or automatic offline data collection test, but this
one fails (gets aborted) pretty often when the system is in use.  You can
see that here:

> Offline data collection status:  (0x84)	Offline data collection activity
> 					was suspended by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.

This is the test that "-t offline" induces (not -t short/long).  It takes
a very long time to run, which is why it often gets aborted:

> Total time to complete Offline
> data collection: 		 (11160) seconds.

That's the only thing that looks even remotely of concern with ada1, and
it's not even worth focusing on.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
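[Editorially appended to the archived message: the 11160-second figure
quoted above is worth making concrete.  SMART reports offline data
collection time in raw seconds; the trivial conversion below shows why an
in-use system so often interrupts the test before it finishes.]

```shell
# Convert SMART's "Total time to complete Offline data collection"
# (reported in seconds) into hours and minutes.
secs=11160
hours=$((secs / 3600))           # whole hours
mins=$(( (secs % 3600) / 60 ))   # remaining minutes
echo "${hours}h ${mins}m"        # -> 3h 6m
```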