From: Mike Carlson <carlson39@llnl.gov>
To: freebsd-fs@freebsd.org
Date: Wed, 10 Nov 2010 11:49:29 -0800
Subject: Re: 8.1-RELEASE: ZFS data errors

On 11/10/2010 03:03 AM, Ivan Voras wrote:
> On 11/09/10 18:42, Mike Carlson wrote:
>
>>> write# gstripe label -v -s 16384 data /dev/da2 /dev/da3 /dev/da4
>>> /dev/da5 /dev/da6 /dev/da7 /dev/da8
>>> write# df -h
>>> Filesystem          Size    Used   Avail  Capacity  Mounted on
>>> /dev/da0s1a         1.7T     22G    1.6T      1%    /
>>> devfs               1.0K    1.0K      0B    100%    /dev
>>> /dev/stripe/data    126T    4.0K    116T      0%    /mnt
>>> write# fsck /mnt
>>> fsck: Could not determine filesystem type
>>> write# fsck_ufs /mnt
>>> ** /dev/stripe/data (NO WRITE)
>>> ** Last Mounted on /mnt
>>> ** Phase 1 - Check Blocks and Sizes
>>> Segmentation fault
>>>
>>> So, the data appears to be okay. I wanted to run a fsck just to do it,
>>> but that seg faulted. Otherwise, the data looks good.
>
> Hmm, probably it tried to allocate a gazillion internal structures to
> check it and didn't take no for an answer.
>
>>> Question, why did you recommend using a smaller stripe size? Is that to
>>> ensure a sample 1GB test file gets written across ALL disk members?
>
> Yes, it's the surest way, since MAXPHYS (128 KiB) / 8 disks = 16 KiB.
>
> Well, as far as I'm concerned this probably shows that there isn't
> anything wrong with the hardware or GEOM, though more testing, like
> running a couple of bonnie++ rounds on the UFS on the stripe volume for
> a few hours, would probably be better.
>
> Btw. what bandwidth do you get from this combination (gstripe + UFS)?

The bandwidth for geom_stripe + UFS2 was very nice:

write# mount
/dev/da0s1a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local, multilabel)
filevol002 on /filevol002 (zfs, local)
/dev/stripe/data on /mnt (ufs, local, soft-updates)

Simple dd write:

write# dd if=/dev/zero of=/mnt/zero.dat bs=1m count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 13.503850 secs (388250759 bytes/sec)
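For a read-side number, the same file could be read back with dd after a
remount; this is just a sketch (not a transcript from the box), with the
remount there to drop the cached copy of the test file so the read actually
hits the stripe:

    umount /mnt && mount /dev/stripe/data /mnt
    dd if=/mnt/zero.dat of=/dev/null bs=1m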
Running bonnie++:

write# bonnie++ -u 100 -s24576 -d. -n64
Using uid:100, gid:65533.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
write.llnl.gov  24G   730  99 343750  63 106157  26  1111  86 174698  26 219.2   3
Latency             11492us     149ms     227ms   70274us   66776us     766ms
Version  1.96       ------Sequential Create------ --------Random Create--------
write.llnl.gov      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 64 18681  47 +++++ +++ 99516  97 26297  40 +++++ +++ 113937  96
Latency               310ms     149us     152us   68841us     144us     146us
1.96,1.96,write.llnl.gov,1,1289416723,24G,,730,99,343750,63,106157,26,1111,86,174698,26,219.2,3,64,,,,,18681,47,+++++,+++,99516,97,26297,40,+++++,+++,113937,96,11492us,149ms,227ms,70274us,66776us,766ms,310ms,149us,152us,68841us,144us,146us

The system immediately and mysteriously rebooted after running bonnie++,
though; that doesn't seem like a good sign...

I've got an iozone benchmark comparing gstripe + multipath + UFS against
multipath + ZFS. I can email the gzip'd file to you, as I don't want to
clutter the mailing list with attachments.

Another question, for anyone really: will gmultipath ever have an
'active/active' mode? I'm happy that I have some redundancy for my SAN, but
if it were possible to aggregate the bandwidth of both controllers, that
would be pretty cool as well.

>> Oh, I almost forgot, here is the ZFS version of that gstripe array:
>>
>> write# zpool create test01 /dev/stripe/data
>> write# zpool scrub test01
>> write# zpool status
>>   pool: test01
>>  state: ONLINE
>>  scrub: scrub completed after 0h0m with 0 errors on Tue Nov  9 09:41:34 2010
>> config:
>>
>>         NAME           STATE     READ WRITE CKSUM
>>         test01         ONLINE       0     0     0
>>           stripe/data  ONLINE       0     0     0
>
> "scrub" verifies only written data, not the whole file system space
> (that's why it finishes so fast), so it isn't really doing any load on
> the array, but I agree that it looks more and more like there really is
> an issue in ZFS.

Yeah, I ran the scrub when there was around 20GB of random data in the pool.
On 8.1-RELEASE, that was how I would get ZFS to acknowledge that the pool had
a problem.
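For anyone who wants to try to reproduce it, loading the pool with
incompressible data and then scrubbing was enough to surface the errors here.
Roughly (a sketch, assuming the pool is mounted at the default /test01):

    dd if=/dev/random of=/test01/random.dat bs=1m count=20480   # ~20GB of incompressible data
    zpool scrub test01
    zpool status test01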
I also dug through my logs and saw these:

Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749207552 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749338624 size=131072
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746455040 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746717184 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748290048 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:52 write root: ZFS: zpool I/O failure, zpool=test01 error=86

I'm inclined to believe it is an issue with ZFS.
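One more data point I can try to collect next time it happens: as far as I
know, the -v flag to zpool status lists any files with permanent errors after
a scrub, which should show whether the corruption is landing in the benchmark
files or somewhere else:

    zpool status -v test01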