From owner-freebsd-fs@FreeBSD.ORG Mon Jan 7 13:59:47 2008
Date: Mon, 7 Jan 2008 14:59:26 +0100
From: Bernd Walter
To: Tz-Huan Huang
Cc: freebsd-fs@freebsd.org, Brooks Davis
Subject: Re: ZFS i/o errors - which disk is the problem?
Message-ID: <20080107135925.GF65134@cicely12.cicely.de>
In-Reply-To: <6a7033710801061844m59f8c62dvdd3eea80f6c239c1@mail.gmail.com>
References: <477B16BB.8070104@freebsd.org> <20080102070146.GH49874@cicely12.cicely.de> <477B8440.1020501@freebsd.org> <200801031750.31035.peter.schuller@infidyne.com> <477D16EE.6070804@freebsd.org> <20080103171825.GA28361@lor.one-eyed-alien.net> <6a7033710801061844m59f8c62dvdd3eea80f6c239c1@mail.gmail.com>
Reply-To: ticso@cicely.de
X-Operating-System: FreeBSD cicely12.cicely.de 5.4-STABLE alpha
User-Agent: Mutt/1.5.9i
List-Id: Filesystems

On Mon, Jan 07, 2008 at 10:44:13AM +0800, Tz-Huan Huang wrote:
> 2008/1/4, Brooks Davis:
> >
> > We've definitely seen cases where hardware changes fixed ZFS checksum
> > errors. In one case, a firmware upgrade on the raid controller fixed
> > it. In another case, we'd been connecting to an external array with a
> > SCSI card that didn't have a PCI bracket, and the errors went away when
> > the replacement arrived and was installed. The fact that there were
> > significant errors caught by ZFS was quite disturbing, since we
> > wouldn't have found them with UFS.
>
> Hi,
>
> We have an NFS server using ZFS with a similar problem.
> The box is i386 7.0-PRERELEASE with 3G RAM:
>
> # uname -a
> FreeBSD cml3 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #2:
> Sat Jan 5 14:42:41 CST 2008 root@cml3:/usr/obj/usr/src/sys/CML2 i386
>
> The zfs pool now contains 3 raids:
>
> 2007-11-20.11:49:17 zpool create pool /dev/label/proware263
> 2007-11-20.11:53:31 zfs create pool/project
> ... (zfs create other filesystems) ...
> 2007-11-20.11:54:32 zfs set atime=off pool
> 2007-12-08.22:59:15 zpool add pool /dev/da0
> 2008-01-05.21:20:03 zpool add pool /dev/label/proware262
>
> After a power loss yesterday, zpool status shows
>
> # zpool status -v
>   pool: pool
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed with 231 errors on Mon Jan 7 08:05:35 2008
> config:
>
>         NAME                STATE     READ WRITE CKSUM
>         pool                ONLINE       0     0   516
>           label/proware263  ONLINE       0     0   231
>           da0               ONLINE       0     0   285
>           label/proware262  ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         /system/database/mysql/flickr_geo/flickr_raw_tag.MYI
>         pool/project:<0x0>
>         pool/home/master/96:<0xbf36>
>
> The main problem is that we cannot mount pool/project any more:
>
> # zfs mount pool/project
> cannot mount 'pool/project': Input/output error
> # grep ZFS /var/log/messages
> Jan 7 10:08:35 cml3 root: ZFS: zpool I/O failure, zpool=pool error=86
> (repeated many times)
>
> There is a lot of data in pool/project, probably 3.24T.  zdb shows
>
> # zdb pool
> ...
> Dataset pool/project [ZPL], ID 33, cr_txg 57, 3.24T, 22267231 objects
> ...
>
> (zdb is still running; we can provide the output if helpful)
>
> Is there any way to recover any data from pool/project?

The data was corrupted by the controller and/or the disk subsystem.
You have no redundant source for the broken data, so it is lost.
The only guaranteed way to get it back is from backup.
Maybe older snapshots/clones are still readable - I don't know.
Nevertheless, the data is corrupted, and that is exactly why you want
alternative data sources such as raidz/mirror and, as a last resort,
backup.
You shouldn't have ignored those errors in the first place, because you
are running on faulty hardware.
Without ZFS checksumming the system would just have processed the broken
data with unpredictable results.
If all those errors are fresh, then you likely used a broken RAID
controller below ZFS, which silently lost sync between its disks and
then blew up when the disk state changed.
Unfortunately, many RAID controllers are broken and therefore useless.

-- 
B.Walter                http://www.bwct.de      http://www.fizon.de
bernd@bwct.de           info@bwct.de            support@fizon.de
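
[Archive editor's note: the pool above concatenates three single-device
top-level vdevs, so ZFS can detect corruption via checksums but has no
redundant copy to repair it from - exactly the situation the reply
describes. A minimal sketch of how the same three devices could instead
be arranged with redundancy (device names taken from the thread; this is
illustrative, not a recovery procedure for the damaged pool):

```shell
# With a raidz vdev, ZFS can reconstruct a block that fails its
# checksum from the parity on the other devices (at the cost of
# roughly one device's worth of capacity):
zpool create pool raidz /dev/label/proware263 /dev/da0 /dev/label/proware262

# Alternatively, a two-way mirror trades half the capacity for a
# full second copy of every block:
#   zpool create pool mirror /dev/label/proware263 /dev/label/proware262

# After replacing suspect hardware, a scrub walks all data and
# repairs anything a redundant vdev can reconstruct:
zpool scrub pool
zpool status -v pool
```

With redundancy in place, the CKSUM counters in zpool status would show
errors that ZFS detected *and* silently repaired, instead of permanent
data loss.]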