From owner-freebsd-current@FreeBSD.ORG  Fri Mar 20 11:01:19 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A2EBC1065670
	for <freebsd-current@freebsd.org>; Fri, 20 Mar 2009 11:01:19 +0000 (UTC)
	(envelope-from M.S.Powell@salford.ac.uk)
Received: from relay0.salford.ac.uk (relay0.salford.ac.uk [146.87.0.10])
	by mx1.freebsd.org (Postfix) with SMTP id 178298FC27
	for <freebsd-current@freebsd.org>; Fri, 20 Mar 2009 11:01:18 +0000 (UTC)
	(envelope-from M.S.Powell@salford.ac.uk)
Received: (qmail 25120 invoked by uid 98); 20 Mar 2009 11:01:17 -0000
Received: from 146.87.255.121 by relay0.salford.ac.uk (envelope-from
	<M.S.Powell@salford.ac.uk>, uid 401) with qmail-scanner-2.01 
	(clamdscan: 0.94.2/9143. spamassassin: 3.2.4.  
	Clear:RC:1(146.87.255.121):. 
	Processed in 0.058034 secs); 20 Mar 2009 11:01:17 -0000
Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121)
	by relay0.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP;
	Fri, 20 Mar 2009 11:01:17 +0000
Received: (qmail 82991 invoked by uid 1002); 20 Mar 2009 11:01:15 -0000
Received: from localhost (sendmail-bs@127.0.0.1)
	by localhost with SMTP; 20 Mar 2009 11:01:15 -0000
Date: Fri, 20 Mar 2009 11:01:15 +0000 (GMT)
From: "Mark Powell" <M.S.Powell@salford.ac.uk>
To: kevin <kevinxlinuz@163.com>
In-Reply-To: <49BE4EC1.90207@163.com>
Message-ID: <20090320102824.W75873@rust.salford.ac.uk>
References: <49BD117B.2080706@163.com>
	<4F9C9299A10AE74E89EA580D14AA10A635E68A@royal64.emp.zapto.org>
	<49BE4EC1.90207@163.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: FreeBSD Current <freebsd-current@freebsd.org>,
	Daniel Eriksson <daniel@toomuchdata.com>
Subject: Apparently spurious ZFS CRC errors (was Re: ZFS data error without
 reasons)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 20 Mar 2009 11:01:20 -0000

On Mon, 16 Mar 2009, kevin wrote:

> My laptop is T61. RAM is also tested by memtest86+ and return no error.

Same here. Memtest fine.

> "zfs send tank/usr/home/kevin@2009-03-15-16:51:21|zfs receive backup/kevin" 
> hangs system and i have to power off the machine.when the system up,i find 
> file error in snapshot tank/usr/home/kevin@2009-03-15-16:51:21.when i destroy 
> tank/usr/home/kevin@2009-03-15-16:51:21,then reboot system, i find more 
> errors.

I've moved a box that has been running FreeBSD 7 with a 7x1TB drive RAIDZ2 
array to 8-CURRENT.
   I've created the same RAIDZ2 with 8-CURRENT and am restoring data from 
tape to the new array (I wanted to rejig the zfs setup). All will appear 
well for a while i.e. no CRC errors, can scrub and rescrub the data whilst 
the data is restoring without problem. I restored the entire 3.5TB from 
tape without error. All data still scrubs fine. Then suddenly I get CRC 
errors on every disk. Repeated scrubs show up different amounts of errors.
   I just couldn't stop them. So I've started again, this time checking 
everything and moving drives onto different controllers to isolate 
problems. I have a Gigabyte GA-P35-DS4 MB which has 8xSATA; 6xICH9R & 
2xJMB363. It also has an Sil3132 in there which in previous incarnations 
had the odd drive on it. There's been mention of Sil problems & even 
though the ICH9, JMB363 and Sil3132 had been perfect with 7, I moved 
drives off it:

1. Rebuilt kernel and world from last night; Thu Mar 19 18:27:18 GMT 2009.
2. 6x1TB drives on ICH9R
3. 2x500GB on JMB363, striped into 1TB
4. / is ufs on USB key
5. created RAIDZ2 again
6. recreated zfs filesystems
7. started restore from tape.

Same again. I can restore data and perform a scrub after each tape (LTO2 
~200GB each) is restored. No errors. Get up to ~350GB, still no errors. 
Then the last scrub I've done throws up:

-----
   pool: pool
  state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
         attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
         using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: scrub completed after 0h51m with 0 errors on Fri Mar 20 10:57:18 2009
config:

         NAME             STATE     READ WRITE CKSUM
         pool             ONLINE       0     0     0
           raidz2         ONLINE       0     0    23
             stripe/str0  ONLINE       0     0   489  12.3M repaired
             ad14         ONLINE       0     0   786  19.7M repaired
             ad16         ONLINE       0     0   804  20.1M repaired
             ad18         ONLINE       0     0   754  18.8M repaired
             ad20         ONLINE       0     0   771  19.3M repaired
             ad22         ONLINE       0     0   808  20.2M repaired
             ad24         ONLINE       0     0   848  21.2M repaired

errors: No known data errors
-----
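
Since repeated scrubs report different error counts, it may help to total 
the per-device figures after each run so that successive scrubs can be 
compared. The following is only a minimal Python sketch, not something that 
was run on the box above. The names parse_config and leaf_totals are made up 
for illustration, and treating every row other than "pool" and "raidz2" as a 
leaf device is an assumption that fits this particular pool layout only.

```python
import re

# Config table copied from the scrub output above.
STATUS = """\
         NAME             STATE     READ WRITE CKSUM
         pool             ONLINE       0     0     0
           raidz2         ONLINE       0     0    23
             stripe/str0  ONLINE       0     0   489  12.3M repaired
             ad14         ONLINE       0     0   786  19.7M repaired
             ad16         ONLINE       0     0   804  20.1M repaired
             ad18         ONLINE       0     0   754  18.8M repaired
             ad20         ONLINE       0     0   771  19.3M repaired
             ad22         ONLINE       0     0   808  20.2M repaired
             ad24         ONLINE       0     0   848  21.2M repaired
"""

# name, state, three integer counters, then an optional "<n>M repaired" note.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(
    r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"(?:\s+([\d.]+)M repaired)?\s*$"
)

# Assumption: these two rows are the pool and the vdev, everything else
# is a leaf device. Adjust for a different layout.
CONTAINERS = {"pool", "raidz2"}


def parse_config(text):
    """Return one dict per device row of a 'zpool status' config table."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, state, rd, wr, ck, repaired = m.groups()
        rows.append({
            "name": name,
            "state": state,
            "read": int(rd),
            "write": int(wr),
            "cksum": int(ck),
            "repaired_mb": float(repaired) if repaired else 0.0,
        })
    return rows


def leaf_totals(rows):
    """Sum CKSUM counts and repaired megabytes over the leaf devices."""
    leaves = [r for r in rows if r["name"] not in CONTAINERS]
    total_cksum = sum(r["cksum"] for r in leaves)
    total_repaired = round(sum(r["repaired_mb"] for r in leaves), 1)
    return len(leaves), total_cksum, total_repaired


if __name__ == "__main__":
    devices, cksum, repaired = leaf_totals(parse_config(STATUS))
    print(devices, "devices,", cksum, "checksum errors,", repaired, "M repaired")
```

For the table above this comes to 5260 checksum errors and 131.6M repaired 
across the seven devices. Saving these totals after each scrub would give a 
simple record of how the counts change from run to run.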

So it happens on both controllers, on plain drives and the stripe. There 
just seems to be no way to get rid of these errors once they appear. As I said, 
last time I got the whole 3.5TB restored without error, was using it for a 
few days without error, constantly scrubbing to check reliability, then 
once the errors appear there's no way to remove them.
   As this same hardware worked well with 7 for a long time, and can work 
perfectly with 8 for several days until the errors strike, this seems like 
some curious 8 problem?
   Any help would be appreciated. I'll be happy to provide any further info 
to help debug this. I didn't want to unnecessarily make this any longer 
than it already is.
   Cheers.

-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information & Learning Services, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843  Fax: +44 161 295 5888  www.pgp.com for PGP key