From: "Mark Powell"
To: kevin
Cc: FreeBSD Current, Daniel Eriksson
Date: Fri, 20 Mar 2009 11:01:15 +0000 (GMT)
Subject: Apparently spurious ZFS CRC errors (was Re: ZFS data error without reasons)

On Mon, 16 Mar 2009, kevin wrote:

> My laptop is T61.
> RAM is also tested by memtest86+ and return no error.

Same here. Memtest fine.

> "zfs send tank/usr/home/kevin@2009-03-15-16:51:21|zfs receive backup/kevin"
> hangs system and i have to power off the machine.when the system up,i find
> file error in snapshot tank/usr/home/kevin@2009-03-15-16:51:21.when i destroy
> tank/usr/home/kevin@2009-03-15-16:51:21,then reboot system, i find more
> errors.

I've moved a box that has been running FreeBSD 7 with a 7x1TB drive RAIDZ2
array over to 8-CURRENT. I've created the same RAIDZ2 there and am restoring
data from tape to the new array (I wanted to rejig the zfs setup).

All will appear well for a while, i.e. no CRC errors, and I can scrub and
rescrub the data whilst the restore is running without problem. I restored the
entire 3.5TB from tape without error. All data still scrubbed fine. Then
suddenly I got CRC errors on every disk. Repeated scrubs showed up different
numbers of errors. I just couldn't stop them.

So I've started again, this time checking everything and moving drives onto
different controllers to isolate problems. I have a Gigabyte GA-P35-DS4 MB
which has 8xSATA: 6xICH9R & 2xJMB363. It also has a Sil3132 in there, which in
previous incarnations had the odd drive on it. There's been mention of Sil
problems, and even though the ICH9, JMB363 and Sil3132 had been perfect with 7,
I moved the drives off the Sil3132:

1. Rebuilt kernel and world from last night; Thu Mar 19 18:27:18 GMT 2009.
2. 6x1TB drives on ICH9R
3. 2x500GB on JMB363, striped into 1TB
4. / is ufs on USB key
5. created RAIDZ2 again
6. recreated zfs filesystems
7. started restore from tape.

Same again. I can restore data and perform a scrub after each tape (LTO2,
~200GB each) is restored. No errors. Got up to ~350GB, still no errors. Then
the last scrub I've done throws up:

-----
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h51m with 0 errors on Fri Mar 20 10:57:18 2009
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0     0
          raidz2         ONLINE       0     0    23
            stripe/str0  ONLINE       0     0   489  12.3M repaired
            ad14         ONLINE       0     0   786  19.7M repaired
            ad16         ONLINE       0     0   804  20.1M repaired
            ad18         ONLINE       0     0   754  18.8M repaired
            ad20         ONLINE       0     0   771  19.3M repaired
            ad22         ONLINE       0     0   808  20.2M repaired
            ad24         ONLINE       0     0   848  21.2M repaired

errors: No known data errors
-----

So it happens on both controllers, on plain drives and on the stripe. There
just seems to be no way to get rid of these errors once they appear. As I said,
last time I got the whole 3.5TB restored without error and was using it for a
few days without error, constantly scrubbing to check reliability. Then, once
the errors appeared, there was no way to remove them.

As this same hardware worked well with 7 for a long time, and can work
perfectly with 8 for several days until the errors strike, this seems like some
curious 8 problem?

Any help would be appreciated. I'll be happy to provide any further info to
help debug this. I didn't want to make this any longer than it already is.

Cheers.

-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information & Learning Services, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843  Fax: +44 161 295 5888  www.pgp.com for PGP key