From: Rich
To: rondzierwa@comcast.net
Cc: freebsd-fs@freebsd.org
Date: Thu, 21 Jun 2012 20:23:19 -0400
Subject: Re: ZFS Checksum errors
What we're telling you is that:
- if ZFS reports errors on a scrub, then
- you clear and rescrub, then
- find more errors

your problem is _not_ gone.

- Rich

On Thu, Jun 21, 2012 at 8:19 PM, wrote:
> Guys, I want to thank you all for the attention to my problem, but I think we are
> barking up the wrong bug chasing an ongoing hardware problem.
>
> I have no doubt that the problem was most likely caused by a hardware failure. But
> probably not because of memory or processor (my CPU rev was not affected by the two
> problems you mentioned, and I ran memory pattern tests on this system for days before
> I started using it, and ran pattern tests on the raw raid before putting ZFS on it in order
> to generate baseline performance metrics).
>
> Three days ago I was running a disk pattern generator/checker to determine
> performance metrics on the disk array with ZFS. The test was configured to
> operate with a pair of 1TB files, writing one while checking the previous one.
> During the first file creation, the raid controller began complaining about slot 1
> (removed, reset, replaced, removed, reset, replaced, etc). I stopped the test,
> reseated the connector on the drive, and the complaints stopped. I started the
> pattern checker to look at the fragment of the file that it created (about 200gig)
> and that was when ZFS began complaining about checksum errors. I did a
> zpool status, and the first of the two files (the one on "/raid") had a name, and it
> was the pattern checker file. So, I did an rm on the pattern checker file, and
> ZFS took off producing checksum errors on the console, and I was left with the
> orphan file. I ran the zpool scrub, and the second file turned up. So, thinking
> that there was something foul on the underlying array, I did a verify, and it turned
> up a couple of errors that it fixed on the drive in slot 1.
> So I did the zpool clear and ran the scrub again, with no better results.
>
> Now to the present. Yes, it was undoubtedly caused by a hardware problem.
> But I do not believe that it is an ongoing problem. There were physical disk
> errors while I was trying to create the pattern file, and I now have a corrupted file.
> These things happen, but in a production environment, we have to be able to
> fix the resulting mess without starting over.
>
> I am willing to bet that the checksum errors are related to the pattern checker
> file, listed as the file that has uncorrectable errors, that was being created when
> the disk errors occurred. If I was forced to guess, I would expect that not only
> are there errors in the data, but that some of the block pointers reference space
> that is either in other files, or not within the space of the raid at all. I'm sure we
> have all seen this kind of filesystem corruption before. It used to be as simple
> as running fsck and letting it untangle the bogus file.
>
> The remainder of the array appears to function normally. The system is still in
> production, but in a read-only capacity. There are some 6TB of various media
> and other files, and they all seem to be accessible. It's just these two files that
> are corrupted. So, how do I "fsck" a ZFS volume, remove the bogus files, and
> get on with my otherwise boring, uneventful life??
>
>
> thanks again,
> ron.
>
>
>
> ----- Original Message -----
> From: "Xin LI"
> To: rondzierwa@comcast.net
> Cc: "Steven Hartland", freebsd-fs@freebsd.org
> Sent: Thursday, June 21, 2012 6:52:06 PM
> Subject: Re: ZFS Checksum errors
>
> Hi,
>
> On Thu, Jun 21, 2012 at 2:48 PM, wrote:
>>
>> ok, I ran a verify on the raid, and it completed, so I believe that, from
>> the hardware standpoint, da0 should be a functioning, 12TB disk.
>>
>> I did a zpool clear and re-ran the scrub, and the results were almost
>> identical:
> [...]
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         zfsPool     ONLINE       0     0 6.20K
>>           da0       ONLINE       0     0 12.5K  24K repaired
>
> This is very likely a hardware issue, or a driver issue (less
> likely, since we have done extensive testing on this RAID card and the
> problems are believed to have been fixed years ago).
>
> There are, however, a few errata from AMD that make me quite concerned:
>
> http://support.amd.com/us/Embedded_TechDocs/41322.pdf
>
> Specifically, #264 and #298 seem quite serious. How old is
> your motherboard BIOS? Are you using ECC memory, by the way?
>
>> errors: Permanent errors have been detected in the following files:
>>
>>         zfsPool/raid:<0x9e241>
>>         zfsPool/Build:<0x0>
>> phoenix#
>>
>> Along with the 6,353 I/O errors, there were over 12,000 checksum mismatch
>> errors on the console.
>>
>>
>> The recommendation from ZFS is to restore the file in question. At this
>> point, I would just like to delete the two files.
>> How do I do that?
>>
>> It's these kinds of antics that make me resistant to the thought of allowing
>> ZFS to manage the raid. It seems to be having problems just managing a big
>> file system. I don't want it to correct anything, or restore anything, just
>> let me delete the files that hurt, fix up the free space list so it doesn't
>> point outside the bounds of the disk, and get on with life.
>
> Are you *really* sure that these are files? The second one doesn't
> seem to be a file, but rather some metadata.
>
> If hardware issues have been ruled out, what I would do is copy the data
> over to a different dataset (e.g. Build.new), then validate the data
> copied, then destroy the current Build dataset and rename Build.new to
> Build.
>
>> If it's finding corrupted files that appear to not have a directory entry
>> associated with them (unlinked files), why doesn't it just delete them?
>> fsck asks you if you want to delete unlinked files, why doesn't ZFS do the
>> same, or at least give you the option of deleting bad files when it finds
>> them?
>
> Normally, ZFS does tell you which files are corrupted. Sometimes it
> takes time, since your file might be present in multiple snapshots and
> the current set of utilities only gives you one reference for the
> file's name, so you may need to remove the file (or the snapshot
> containing it), scrub, then remove the newly revealed reference, etc.
>
> Your case seems to be very serious. I really think there is some
> metadata corruption, serious enough that it is already
> beyond repair. ZFS replicates metadata into different locations, but
> that does not prevent it from being corrupted in memory. In these
> situations you will have to use a backup.
>
>> This is causing a lot of down time, and it's making Linux look very
>> attractive in my organization. How do I get this untangled short of
>> reformatting and starting over?
>
> Linux does not have end-to-end data validation comparable to what
> ZFS offers. Use caution if you go that route.
>
>> ron.
>>
>>
>> ________________________________
>> From: "Xin LI"
>> To: rondzierwa@comcast.net
>> Cc: "Steven Hartland", freebsd-fs@freebsd.org
>> Sent: Wednesday, June 20, 2012 6:56:09 PM
>>
>> Subject: Re: ZFS Checksum errors
>>
>> On Wed, Jun 20, 2012 at 1:55 PM, wrote:
>>> Steve.
>>>
>>> Well, it got done, and it found another anonymous file with errors. Any
>>> idea how to get rid of these?
>>
>> Normally you need to "zpool clear zfsPool", and rerun zpool scrub. If
>> you see these numbers growing again, it's likely that there are some
>> other problems with your hardware.
>> By the way, the recommended configuration is
>> to use ZFS to manage the disks, or at least to split your RAID volume into
>> smaller ones, since otherwise the volume is seen as a
>> "single disk" by ZFS, making it impossible to repair data errors
>> unless you add additional redundancy (zfs set copies=2, etc).
>>
>>>
>>> thanks,
>>> ron.
>>>
>>>
>>>
>>> phoenix# zpool status -v zfsPool
>>>   pool: zfsPool
>>>  state: ONLINE
>>> status: One or more devices has experienced an error resulting in data
>>>         corruption. Applications may be affected.
>>> action: Restore the file in question if possible. Otherwise restore the
>>>         entire pool from backup.
>>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>>  scrub: scrub completed after 8h29m with 6276 errors on Wed Jun 20 16:18:01 2012
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         zfsPool     ONLINE       0     0 6.17K
>>>           da0       ONLINE       0     0 13.0K  1.34M repaired
>>>
>>> errors: Permanent errors have been detected in the following files:
>>>
>>>         zfsPool/raid:<0x9e241>
>>>         zfsPool/Build:<0x0>
>>> phoenix#
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Steven Hartland"
>>> To: rondzierwa@comcast.net, freebsd-fs@freebsd.org
>>> Sent: Wednesday, June 20, 2012 1:58:20 PM
>>> Subject: Re: ZFS Checksum errors
>>>
>>> ----- Original Message -----
>>> From:
>>> ..
>>>
>>>> zpool status indicates that a file has errors, but doesn't tell me its
>>>> name:
>>>>
>>>> phoenix# zpool status -v zfsPool
>>>>   pool: zfsPool
>>>>  state: ONLINE
>>>> status: One or more devices has experienced an error resulting in data
>>>>         corruption. Applications may be affected.
>>>> action: Restore the file in question if possible. Otherwise restore the
>>>>         entire pool from backup.
>>>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>>>  scrub: scrub in progress for 5h27m, 18.71% done, 23h42m to go
>>>
>>> Try waiting for the scrub to complete and see if it's more helpful after
>>> that.
>>>
>>> Regards
>>> Steve
>>>
>>> ================================================
>>> This e.mail is private and confidential between Multiplay (UK) Ltd. and
>>> the person or entity to whom it is addressed. In the event of misdirection,
>>> the recipient is prohibited from using, copying, printing or otherwise
>>> disseminating it or any information contained in it.
>>>
>>> In the event of misdirection, illegible or incomplete transmission please
>>> telephone +44 845 868 1337
>>> or return the E.mail to postmaster@multiplay.co.uk.
>>>
>>> _______________________________________________
>>> freebsd-fs@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>>
>>
>>
>> --
>> Xin LI https://www.delphij.net/
>> FreeBSD - The Power to Serve! Live free or die
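
For anyone who wants to script the check described at the top of this message (clear, rescrub, see whether errors come back), here is a minimal Python sketch. It is not part of any ZFS tooling: the names parse_count, cksum_counts and problem_persists are made up for illustration, the sample text is the config table from the post-clear scrub quoted in this thread, and it assumes the K/M suffixes are 1024-based, so the parsed numbers are only approximate.

```python
import re

# Config rows copied from the post-clear rescrub output quoted in this thread.
AFTER_RESCRUB = """\
NAME      STATE   READ WRITE CKSUM
zfsPool   ONLINE     0     0 6.20K
  da0     ONLINE     0     0 12.5K  24K repaired
"""

# Assumption: suffixes are binary multiples (K = 1024). The displayed
# values are rounded, so the parsed counts are approximate.
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def parse_count(text):
    """Turn a counter such as '0' or '6.20K' into an approximate integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognised counter: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _SUFFIX[suffix])


def cksum_counts(status_text):
    """Map each pool/device name in the config table to its CKSUM count."""
    counts = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
        # The header row is skipped because "STATE" is not a device state.
        if len(fields) >= 5 and fields[1] in _STATES:
            counts[fields[0]] = parse_count(fields[4])
    return counts


def problem_persists(status_after_rescrub):
    """True if any checksum errors were counted after a clear and rescrub."""
    return any(n > 0 for n in cksum_counts(status_after_rescrub).values())


if __name__ == "__main__":
    print(cksum_counts(AFTER_RESCRUB))
    print("problem persists:", problem_persists(AFTER_RESCRUB))
```

Because zpool clear resets the error counters, any nonzero CKSUM value after the following scrub means new errors were found, which is exactly the "your problem is _not_ gone" condition.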