From: Stephen McKay
Sender: smckay@internode.on.net
To: freebsd-fs@freebsd.org
Cc: Stephen McKay
Date: Wed, 09 Mar 2011 00:25:26 +1000
Subject: Constant minor ZFS corruption
Message-Id: <201103081425.p28EPQtM002115@dungeon.home>

Hi!

At work I've built a few ZFS based FreeBSD boxes, culminating in a decent
sized rack mount server with 12 2TB disks.  Unfortunately, I can't make
this server stable.

Over the last week or so I've repeated a cycle of:

1) copy 1TB of data from an NFS mount into a ZFS filesystem
2) scrub

So far, every one of these cycles has exhibited checksum errors in one or
both stages.

Using smartmontools, I know that none of the disks have reported any
errors.  No errors are reported by the disk drivers either (ahci and mps).
No ECC (MCA) errors are reported.

The problem occurs with 8.2.0 and with 9-current (note: I kept the zfsv15
pool).  I've swapped the memory with a different brand and I'm now running
it at low speed (800MHz).  I disabled hyperthreading and all the other
funky CPU related things I could find in the BIOS.  I've tried the normal
and "new experimental" NFS client (mount -t newnfs ...).  Nothing so far
has had any effect.

At all times I can build "world" with no errors, even if I put in stupidly
high parallel "-j" values and cause severe swapping.  I tried both with
the source on ufs and with it on zfs.  No problems.  So the hardware seems
generally stable.

I wrote a program to generate streams of pseudorandom trash (using
srandom() and random()); a rough sketch of it is below.  I generated a TB
of this onto the ZFS pool and read it back.  No problems.  I even did a
few hundred GB to two files in parallel.  Again, no problems.  So ZFS
itself seems generally sound.

However, copying 1TB of data from NFS to ZFS always corrupts just a few
blocks, as reported by ZFS during the copy or in the subsequent scrub.
These corrupted blocks may be on any disk or disks; they are not limited
to one controller, one subset of disks, or one vdev.  ZFS has always
successfully reconstructed the data, but I'm hoping to use that redundancy
to guard against failing disks, not against whatever gremlin is scrambling
my data on the way in.
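For reference, the trash generator is roughly along these lines (a
simplified sketch, not the exact program; to verify, regenerate the same
stream from the same seed and compare, and limit the output size
externally, e.g. with dd):

    /*
     * Write an endless stream of pseudorandom longs, seeded from the
     * first command line argument, to stdout.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
            unsigned long seed;
            long buf[16384];
            size_t i;

            seed = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1;
            srandom(seed);

            for (;;) {
                    for (i = 0; i < sizeof(buf) / sizeof(buf[0]); i++)
                            buf[i] = random();
                    if (fwrite(buf, sizeof(buf), 1, stdout) != 1)
                            return (1);     /* output closed or full */
            }
    }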
The hardware is:

    Asus P7F-E (includes 6 3Gb/s SATA ports)
    PIKE2008 (8 port SAS card based on LSI2008 chip, supports 6Gb/s)
    Xeon X3440 (2.53GHz 4 core with hyperthreading)
    Chenbro CSPC-41416AB rackmount case
    2x 2GB 1333MHz ECC DDR3 RAM (Corsair)
        (currently using 1x 2GB Kingston ECC RAM)
    2x Seagate ST3500418AS 500GB normal disks, for OS booting
    12x Seagate ST2000DL003 2TB "green" disks (yes, with 4kB sectors)
        (4 disks on the onboard Intel SATA controller using the ahci
         driver, 8 disks on the PIKE using the mps driver)

What experiments do you think I should try?

I note that during the large copies from NFS to ZFS, the "inactive" page
list takes all the spare memory, starving the ARC, which drops to its
minimum size.  During make world and my junk creation tests the ARC
remained full size.  Could there be a bug in the ARC shrinking code?

I also note that -current spits out:

    kernel: log_sysevent: type 19 is not implemented

instead of what 8.2.0 produces:

    root: ZFS: checksum mismatch, zpool=dread path=/dev/gpt/bay14 offset=766747611136 size=4096

I have added some code to cddl/compat/opensolaris/kern/opensolaris_sysevent.c
to print NVLIST elements (type 19) and hope to see the results at the end
of the next run (a rough sketch of that change is at the end of this
message).  BTW, does /etc/devd.conf need tweaking now?  If ZFSv28 produces
different format error messages they may not be logged.  Indeed, I have
added a printf in log_sysevent() because I can't (yet) make devd do what
I want.

Also, -current produces many scary lock order reversals.  Are we still
ignoring these?

Here's the pool layout:

# zpool status
  pool: dread
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Tue Mar  8 15:14:51 2011
        5.66T scanned out of 8.49T at 402M/s, 2h2m to go
        92K repaired, 66.71% done
config:

        NAME           STATE     READ WRITE CKSUM
        dread          ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/bay3   ONLINE       0     0     0
            gpt/bay4   ONLINE       0     0     6  (repairing)
            gpt/bay5   ONLINE       0     0     0
            gpt/bay6   ONLINE       0     0     0
            gpt/bay7   ONLINE       0     0     0
            gpt/bay8   ONLINE       0     0     0
          raidz2-1     ONLINE       0     0     0
            gpt/bay9   ONLINE       0     0     1  (repairing)
            gpt/bay10  ONLINE       0     0     6  (repairing)
            gpt/bay11  ONLINE       0     0     2  (repairing)
            gpt/bay12  ONLINE       0     0     0
            gpt/bay13  ONLINE       0     0     8  (repairing)
            gpt/bay14  ONLINE       0     0     0

errors: No known data errors

Bay3 through 6 are on the onboard controller.  Bay7 through 14 are on the
PIKE card.

Each disk is partitioned alike:

# gpart show ada2
=>        34  3907029101  ada2  GPT  (1.8T)
          34          94        - free -  (47K)
         128         128     1  freebsd-boot  (64K)
         256  3906994176     2  freebsd-zfs  (1.8T)
  3906994432       34703        - free -  (17M)

I used the well known tricks to fool ZFS into using ashift=12 to align for
these lying 4kB sector drives (sketched at the end of this message).

The next run will take NFS out of the equation (substituting SSH as a
transport).  Any ideas on what I could try after that?

Stephen McKay.

PS Anybody got a mirror of http://www.sun.com/msg/ZFS-8000-9P and similar
pages?  Oracle has hidden them all, so it's a bit silly to refer to them
in our ZFS implementation.
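PPS For anyone curious, the nvlist printing I added to
cddl/compat/opensolaris/kern/opensolaris_sysevent.c is roughly along these
lines (a sketch only, not the exact patch; dump_event_nvlist() is just an
illustrative name, and it relies on the nvpair API already used in that
file):

    /*
     * Walk a sysevent nvlist and print each pair's name, plus the value
     * for the common string/uint64 types.
     */
    static void
    dump_event_nvlist(const char *what, nvlist_t *nvl)
    {
            nvpair_t *nvp = NULL;
            char *str;
            uint64_t u64;

            printf("sysevent %s:\n", what);
            while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) {
                    switch (nvpair_type(nvp)) {
                    case DATA_TYPE_STRING:
                            if (nvpair_value_string(nvp, &str) == 0)
                                    printf("  %s=%s\n", nvpair_name(nvp), str);
                            break;
                    case DATA_TYPE_UINT64:
                            if (nvpair_value_uint64(nvp, &u64) == 0)
                                    printf("  %s=%ju\n", nvpair_name(nvp),
                                        (uintmax_t)u64);
                            break;
                    default:
                            printf("  %s (nvpair type %d)\n",
                                nvpair_name(nvp), (int)nvpair_type(nvp));
                            break;
                    }
            }
    }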
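PPPS The "well known tricks" for ashift=12 are the usual gnop approach.
Roughly (illustrative device names, and my exact commands may have
differed; only one disk per top-level vdev needs the .nop wrapper since
ZFS takes the largest advertised sector size in the vdev when it picks
ashift):

    # gnop create -S 4096 /dev/gpt/bay3
    # zpool create dread raidz2 gpt/bay3.nop gpt/bay4 gpt/bay5 \
          gpt/bay6 gpt/bay7 gpt/bay8
    # zpool export dread
    # gnop destroy /dev/gpt/bay3.nop
    # zpool import dread

The second raidz2 vdev is added the same way, and "zdb dread | grep
ashift" should then show 12 for each top-level vdev.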