Date:        Mon, 07 Jan 2008 22:36:00 -0700
From:        Scott Long <scottl@samsco.org>
To:          ticso@cicely.de
Cc:          freebsd-fs@freebsd.org, Brooks Davis <brooks@freebsd.org>, Tz-Huan Huang <tzhuan@csie.org>
Subject:     Re: ZFS i/o errors - which disk is the problem?
Message-ID:  <47830BC0.5060100@samsco.org>
In-Reply-To: <20080107135925.GF65134@cicely12.cicely.de>
References:  <477B16BB.8070104@freebsd.org> <20080102070146.GH49874@cicely12.cicely.de> <477B8440.1020501@freebsd.org> <200801031750.31035.peter.schuller@infidyne.com> <477D16EE.6070804@freebsd.org> <20080103171825.GA28361@lor.one-eyed-alien.net> <6a7033710801061844m59f8c62dvdd3eea80f6c239c1@mail.gmail.com> <20080107135925.GF65134@cicely12.cicely.de>
Bernd Walter wrote:
> On Mon, Jan 07, 2008 at 10:44:13AM +0800, Tz-Huan Huang wrote:
>> 2008/1/4, Brooks Davis <brooks@freebsd.org>:
>>> We've definitely seen cases where hardware changes fixed ZFS checksum
>>> errors.  In one case, a firmware upgrade on the raid controller fixed it.
>>> In another case, we'd been connecting to an external array with a SCSI
>>> card that didn't have a PCI bracket, and the errors went away when the
>>> replacement one arrived and was installed.  The fact that there were
>>> significant errors caught by ZFS was quite disturbing, since we wouldn't
>>> have found them with UFS.
>>
>> Hi,
>>
>> We have an nfs server using zfs with a similar problem.
>> The box is i386 7.0-PRERELEASE with 3G ram:
>>
>> # uname -a
>> FreeBSD cml3 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #2:
>> Sat Jan 5 14:42:41 CST 2008 root@cml3:/usr/obj/usr/src/sys/CML2 i386
>>
>> The zfs pool contains 3 raids now:
>>
>> 2007-11-20.11:49:17 zpool create pool /dev/label/proware263
>> 2007-11-20.11:53:31 zfs create pool/project
>> ... (zfs create other filesystems) ...
>> 2007-11-20.11:54:32 zfs set atime=off pool
>> 2007-12-08.22:59:15 zpool add pool /dev/da0
>> 2008-01-05.21:20:03 zpool add pool /dev/label/proware262
>>
>> After a power loss yesterday, zpool status shows
>>
>> # zpool status -v
>>   pool: pool
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: scrub completed with 231 errors on Mon Jan 7 08:05:35 2008
>> config:
>>
>>         NAME                STATE     READ WRITE CKSUM
>>         pool                ONLINE       0     0   516
>>           label/proware263  ONLINE       0     0   231
>>           da0               ONLINE       0     0   285
>>           label/proware262  ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         /system/database/mysql/flickr_geo/flickr_raw_tag.MYI
>>         pool/project:<0x0>
>>         pool/home/master/96:<0xbf36>
>>
>> The main problem is that we cannot mount pool/project any more:
>>
>> # zfs mount pool/project
>> cannot mount 'pool/project': Input/output error
>> # grep ZFS /var/log/messages
>> Jan 7 10:08:35 cml3 root: ZFS: zpool I/O failure, zpool=pool error=86
>> (repeated many times)
>>
>> There is a lot of data in pool/project, probably 3.24T.  zdb shows
>>
>> # zdb pool
>> ...
>> Dataset pool/project [ZPL], ID 33, cr_txg 57, 3.24T, 22267231 objects
>> ...
>>
>> (zdb is still running now, we can provide the output if helpful)
>>
>> Is there any way to recover any data from pool/project?
>
> The data was corrupted by the controller and/or the disk subsystem.
> You have no other data sources for the broken data, so it is lost.
> The only guaranteed way is to get it back from backup.
> Maybe older snapshots/clones are still readable - I don't know.
> Nevertheless, the data is corrupted, and that is exactly why you want
> alternative data sources such as raidz/mirror and, as a last resort, backup.
> You shouldn't have ignored those errors in the first place, because you
> are running on faulty hardware.
> Without ZFS checksumming the system would just process the broken
> data with unpredictable results.
> If all those errors are fresh, then you likely used a broken RAID
> controller below ZFS, which silently corrupted the array's consistency and
> then blew up when the disk state changed.
> Unfortunately many RAID controllers are broken and therefore useless.

Huh?  Could you be any more vague?  Which controllers are broken?  Have you contacted anyone about the breakage?
Can you describe the breakage?  I call bullshit, pure and simple.

Scott
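
As for the question in the subject line, the per-vdev CKSUM column in the
status output above already narrows the suspects to label/proware263 and
da0 rather than proware262.  A rough follow-up sequence, assuming GEOM
labels are in use and smartmontools has been installed from ports (device
names are taken from the post above, so treat them as illustrative):

# zpool status -v pool     (per-vdev READ/WRITE/CKSUM counters)
# glabel status            (maps label/proware263 etc. back to their daN providers)
# smartctl -a /dev/da0     (SMART data for the raw disk behind the counters)
# zpool clear pool         (resets the error counters once the hardware is sorted out)
# zpool scrub pool         (re-reads and re-verifies every block against its checksum)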