From: Mike Carlson <carlson39@llnl.gov>
To: freebsd-fs@freebsd.org
Date: Wed, 10 Nov 2010 11:49:29 -0800
Subject: Re: 8.1-RELEASE: ZFS data errors

On 11/10/2010 03:03 AM, Ivan Voras wrote:
> On 11/09/10 18:42, Mike Carlson wrote:
>
>>> write# gstripe label -v -s 16384 data /dev/da2 /dev/da3 /dev/da4
>>> /dev/da5 /dev/da6 /dev/da7 /dev/da8
>>> write# df -h
>>> Filesystem          Size    Used   Avail  Capacity  Mounted on
>>> /dev/da0s1a         1.7T     22G    1.6T      1%    /
>>> devfs               1.0K    1.0K      0B    100%    /dev
>>> /dev/stripe/data    126T    4.0K    116T      0%    /mnt
>>> write# fsck /mnt
>>> fsck: Could not determine filesystem type
>>> write# fsck_ufs /mnt
>>> ** /dev/stripe/data (NO WRITE)
>>> ** Last Mounted on /mnt
>>> ** Phase 1 - Check Blocks and Sizes
>>> Segmentation fault
>>>
>>> So, the data appears to be okay. I wanted to run a fsck just to do it,
>>> but that seg faulted. Otherwise, the data looks good.
>
> Hmm, probably it tried to allocate a gazillion internal structures to
> check it and didn't take no for an answer.
>
>>> Question, why did you recommend using a smaller stripe size? Is that to
>>> ensure a sample 1GB test file gets written across ALL disk members?
>
> Yes, it's the surest way, since MAXPHYS (128 KiB) / 8 disks = 16 KiB.
>
> Well, as far as I'm concerned this probably shows that there isn't
> anything wrong with the hardware or GEOM, though more testing, like
> running a couple of bonnie++ rounds on the UFS on the stripe volume for
> a few hours, would probably be better.
>
> Btw. what bandwidth do you get from this combination (gstripe + UFS)?

The bandwidth for geom_stripe + UFS2 was very nice:

write# mount
/dev/da0s1a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local, multilabel)
filevol002 on /filevol002 (zfs, local)
/dev/stripe/data on /mnt (ufs, local, soft-updates)

Simple dd write:

write# dd if=/dev/zero of=/mnt/zero.dat bs=1m count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 13.503850 secs (388250759 bytes/sec)
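For a read-side number, the same file could be read back with dd after a
remount; this is just a sketch (not a transcript from the box), with the
remount there to drop the cached copy of the test file so the read actually
hits the stripe:

    umount /mnt && mount /dev/stripe/data /mnt
    dd if=/mnt/zero.dat of=/dev/null bs=1m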
Running bonnie++:

write# bonnie++ -u 100 -s24576 -d. -n64
Using uid:100, gid:65533.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
write.llnl.gov  24G   730  99 343750  63 106157  26  1111  86 174698  26 219.2   3
Latency             11492us     149ms     227ms   70274us   66776us     766ms
Version  1.96       ------Sequential Create------ --------Random Create--------
write.llnl.gov      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 64 18681  47 +++++ +++ 99516  97 26297  40 +++++ +++ 113937  96
Latency               310ms     149us     152us   68841us     144us     146us
1.96,1.96,write.llnl.gov,1,1289416723,24G,,730,99,343750,63,106157,26,1111,86,174698,26,219.2,3,64,,,,,18681,47,+++++,+++,99516,97,26297,40,+++++,+++,113937,96,11492us,149ms,227ms,70274us,66776us,766ms,310ms,149us,152us,68841us,144us,146us

The system immediately and mysteriously rebooted after running bonnie++,
though; that doesn't seem like a good sign...

I've got an iozone benchmark comparing gstripe + multipath + UFS against
multipath + ZFS. I can email the gzip'd file to you, as I don't want to
clutter the mailing list with attachments.

Another question, for anyone really: will gmultipath ever have an
'active/active' mode? I'm happy that I have some redundancy for my SAN, but
if it were possible to aggregate the bandwidth of both controllers, that
would be pretty cool as well.

>> Oh, I almost forgot, here is the ZFS version of that gstripe array:
>>
>> write# zpool create test01 /dev/stripe/data
>> write# zpool scrub test01
>> write# zpool status
>>   pool: test01
>>  state: ONLINE
>>  scrub: scrub completed after 0h0m with 0 errors on Tue Nov  9 09:41:34 2010
>> config:
>>
>>         NAME           STATE     READ WRITE CKSUM
>>         test01         ONLINE       0     0     0
>>           stripe/data  ONLINE       0     0     0
>
> "scrub" verifies only written data, not the whole file system space
> (that's why it finishes so fast), so it isn't really doing any load on
> the array, but I agree that it looks more and more like there really is
> an issue in ZFS.

Yeah, I ran the scrub when there was around 20GB of random data in the pool.
On 8.1-RELEASE, that was how I would get ZFS to acknowledge that the pool had
a problem.
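For anyone who wants to try to reproduce it, loading the pool with
incompressible data and then scrubbing was enough to surface the errors here.
Roughly (a sketch, assuming the pool is mounted at the default /test01):

    dd if=/dev/random of=/test01/random.dat bs=1m count=20480   # ~20GB of incompressible data
    zpool scrub test01
    zpool status test01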
I also dug through my logs and saw these:

Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749207552 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749338624 size=131072
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746455040 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746717184 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748290048 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:52 write root: ZFS: zpool I/O failure, zpool=test01 error=86

I'm inclined to believe it is an issue with ZFS.
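One more data point I can try to collect next time it happens: as far as I
know, the -v flag to zpool status lists any files with permanent errors after
a scrub, which should show whether the corruption is landing in the benchmark
files or somewhere else:

    zpool status -v test01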