From: Scott Long <scottl@samsco.org>
Date: Mon, 07 Jan 2008 22:36:00 -0700
To: ticso@cicely.de
Cc: freebsd-fs@freebsd.org, Brooks Davis, Tz-Huan Huang
Subject: Re: ZFS i/o errors - which disk is the problem?
In-Reply-To: <20080107135925.GF65134@cicely12.cicely.de>

Bernd Walter wrote:
> On Mon, Jan 07, 2008 at 10:44:13AM +0800, Tz-Huan Huang wrote:
>> 2008/1/4, Brooks Davis:
>>> We've definitely seen cases where hardware changes fixed ZFS checksum
>>> errors.  In one case, a firmware upgrade on the RAID controller fixed
>>> it.  In another case, we had been connecting to an external array with
>>> a SCSI card that didn't have a PCI bracket, and the errors went away
>>> when the replacement card arrived and was installed.  The fact that
>>> there were significant errors caught by ZFS was quite disturbing,
>>> since we wouldn't have found them with UFS.
>> Hi,
>>
>> We have an NFS server using ZFS with a similar problem.
>> The box is i386 7.0-PRERELEASE with 3 GB RAM:
>>
>> # uname -a
>> FreeBSD cml3 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #2:
>> Sat Jan  5 14:42:41 CST 2008 root@cml3:/usr/obj/usr/src/sys/CML2 i386
>>
>> The ZFS pool now contains 3 RAID devices:
>>
>> 2007-11-20.11:49:17 zpool create pool /dev/label/proware263
>> 2007-11-20.11:53:31 zfs create pool/project
>> ... (zfs create other filesystems) ...
>> 2007-11-20.11:54:32 zfs set atime=off pool
>> 2007-12-08.22:59:15 zpool add pool /dev/da0
>> 2008-01-05.21:20:03 zpool add pool /dev/label/proware262
>>
>> After a power loss yesterday, zpool status shows:
>>
>> # zpool status -v
>>   pool: pool
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: scrub completed with 231 errors on Mon Jan  7 08:05:35 2008
>> config:
>>
>>         NAME                STATE     READ WRITE CKSUM
>>         pool                ONLINE       0     0   516
>>           label/proware263  ONLINE       0     0   231
>>           da0               ONLINE       0     0   285
>>           label/proware262  ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         /system/database/mysql/flickr_geo/flickr_raw_tag.MYI
>>         pool/project:<0x0>
>>         pool/home/master/96:<0xbf36>
>>
>> The main problem is that we cannot mount pool/project any more:
>>
>> # zfs mount pool/project
>> cannot mount 'pool/project': Input/output error
>> # grep ZFS /var/log/messages
>> Jan  7 10:08:35 cml3 root: ZFS: zpool I/O failure, zpool=pool error=86
>> (repeated many times)
>>
>> There is a lot of data in pool/project, probably 3.24T.  zdb shows:
>>
>> # zdb pool
>> ...
>> Dataset pool/project [ZPL], ID 33, cr_txg 57, 3.24T, 22267231 objects
>> ...
>>
>> (zdb is still running now; we can provide the output if helpful)
>>
>> Is there any way to recover any data from pool/project?
>
> The data was corrupted by the controller and/or the disk subsystem.
> You have no other data sources for the broken data, so it is lost.
> The only guaranteed way is to get it back from backup.
> Maybe older snapshots/clones are still readable - I don't know.
> Nevertheless, the data is corrupted, and that is the purpose of
> alternative data sources such as raidz/mirror and, as a last resort,
> backup.
> You shouldn't have ignored those errors in the first place, because you
> are running on faulty hardware.
> Without ZFS checksumming, the system would just process the broken data,
> with unpredictable results.
> If all those errors are fresh, then you likely used a broken RAID
> controller below ZFS, which silently corrupted the array's consistency
> and then blew up when the disk state changed.
> Unfortunately, many RAID controllers are broken and therefore useless.

Huh?  Could you be any more vague?  Which controllers are broken?  Have
you contacted anyone about the breakage?  Can you describe the breakage?
I call bullshit, pure and simple.

Scott
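
For reference, the per-device CKSUM column in the zpool status output
quoted above is what answers the question in the subject line: the
non-zero counters on label/proware263 and da0 mark them as the suspect
devices.  A minimal sketch of the usual sequence, assuming the same pool
and device names as in the thread, is:

  # zpool scrub pool        # re-read every block and verify its checksum
  # zpool status -v pool    # non-zero READ/WRITE/CKSUM counters point at the bad device
  # zpool clear pool        # reset the counters once the hardware is fixed

Because the pool in question was built by striping three single-device
vdevs, ZFS can only detect the corruption; a redundant layout, such as
the raidz sketch below using the same three disks (at the cost of roughly
one disk's worth of capacity for parity), would also let a scrub repair
checksum errors instead of merely counting them:

  # zpool create pool raidz label/proware263 label/proware262 da0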