From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 08:36:15 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 44483885
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 08:36:15 +0000 (UTC)
 (envelope-from grarpamp@gmail.com)
Received: from mail-ve0-f176.google.com (mail-ve0-f176.google.com
 [209.85.128.176]) by mx1.freebsd.org (Postfix) with ESMTP id E6B926EF
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 08:36:14 +0000 (UTC)
Received: by mail-ve0-f176.google.com with SMTP id jz10so820786veb.21
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 00:36:08 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:date:message-id:subject:from:to
 :content-type; bh=q+uUE05NiEyIeIIx+aEuROqd5fZh/rgvROYilMBRroU=;
 b=lE7ff3CBXRhn0XbbkYTq/jKO+30C05ihIKteqoTnUIkUj1UWuLXo1m+S/SD8Z4aSOu
 mltHcAHlJS0RnHfP4tnWl/rk/KIA2Y+rvLRqNkq95ut8pfumrjRxPES1iyWRa7bmVLSe
 KHUSEa595EYDBraTMoHdFe4kaGhUdLMxmFI+ZMipT82+O8GqdI7x7pSrxq/nIoBtUgNF
 3Ko+B7ICn4HvRcSUWIW9Unrf7meuXsbTqspE9SmM5dKaiGS5LDCBoJIDTeOwSN0ugDYR
 i01F3vcvu1a0wLCcirH0WtMk2u9zBKMoNz+X4HZ/NHIs5+Muv6XfgubbHVsatY/OHve8
 XFAQ==
MIME-Version: 1.0
X-Received: by 10.52.67.75 with SMTP id l11mr10033428vdt.29.1359275768423;
 Sun, 27 Jan 2013 00:36:08 -0800 (PST)
Received: by 10.220.219.79 with HTTP; Sun, 27 Jan 2013 00:36:08 -0800 (PST)
Date: Sun, 27 Jan 2013 03:36:08 -0500
Message-ID: <CAD2Ti2-1ROTxQXNA6FzWtcgnMoaAzvfcdh__zH7AVC7zCPsyzw@mail.gmail.com>
Subject: ZFS slackspace, grepping it for data
From: grarpamp <grarpamp@gmail.com>
To: freebsd-fs@freebsd.org
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 08:36:15 -0000

Say there's a 100GB zpool over a single vdev (one drive).
It's got a few datasets carved out of it.
How best to stroll through only the 10GB of slackspace
(aka: df 'Avail') that is present?
I tried making a zvol out of it but only got 10mb of zeros,
which makes sense because zfs isn't managing anything
written there in that empty zvol yet.
I could troll the entire drive, but that's 10x the data and
I don't really want the current 90gb of data in the results.
There is zdb -R, but I don't know the offsets of the slack,
unless they are somehow tied to the pathname hierarchy.
Any ideas?

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 09:03:14 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 16697B43
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 09:03:14 +0000 (UTC)
 (envelope-from nowakpl@platinum.linux.pl)
Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4])
 by mx1.freebsd.org (Postfix) with ESMTP id CB2D27AB
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 09:03:13 +0000 (UTC)
Received: by platinum.linux.pl (Postfix, from userid 87)
 id EF21247E11; Sun, 27 Jan 2013 10:03:05 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl
X-Spam-Level: 
X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL
 autolearn=disabled version=3.3.2
Received: from [10.255.0.2] (unknown [83.151.38.73])
 by platinum.linux.pl (Postfix) with ESMTPA id 5410947DE6
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 10:03:03 +0100 (CET)
Message-ID: <5104ED41.8020800@platinum.linux.pl>
Date: Sun, 27 Jan 2013 10:02:57 +0100
From: Adam Nowacki <nowakpl@platinum.linux.pl>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: freebsd-fs@freebsd.org
Subject: Re: ZFS slackspace, grepping it for data
References: <CAD2Ti2-1ROTxQXNA6FzWtcgnMoaAzvfcdh__zH7AVC7zCPsyzw@mail.gmail.com>
In-Reply-To: <CAD2Ti2-1ROTxQXNA6FzWtcgnMoaAzvfcdh__zH7AVC7zCPsyzw@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 09:03:14 -0000

On 2013-01-27 09:36, grarpamp wrote:
> Say there's a 100GB zpool over a single vdev (one drive).
> It's got a few datasets carved out of it.
> How best to stroll through only the 10GB of slackspace
> (aka: df 'Avail') that is present?
> I tried making a zvol out of it but only got 10mb of zeros,
> which makes sense because zfs isn't managing anything
> written there in that empty zvol yet.
> I could troll the entire drive, but that's 10x the data and
> I don't really want the current 90gb of data in the results.
> There is zdb -R, but I don't know the offsets of the slack,
> unless they are somehow tied to the pathname hierarchy.
> Any ideas?

zdb -mmm pool_name
for on-disk offset add 0x400000 If i remember correctly.

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 10:36:14 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id C2CF3901;
 Sun, 27 Jan 2013 10:36:14 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from acme.spoerlein.net (acme.spoerlein.net
 [IPv6:2a01:4f8:131:23c2::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 4CD9DA06;
 Sun, 27 Jan 2013 10:36:14 +0000 (UTC)
Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1])
 by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0RAaC2Y099978
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO);
 Sun, 27 Jan 2013 11:36:12 +0100 (CET) (envelope-from uqs@FreeBSD.org)
Date: Sun, 27 Jan 2013 11:36:12 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
To: current@FreeBSD.org, fs@FreeBSD.org
Subject: Zpool surgery
Message-ID: <20130127103612.GB38645@acme.spoerlein.net>
Mail-Followup-To: current@FreeBSD.org, fs@FreeBSD.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 10:36:14 -0000

Hey all,

I have a slight problem with transplanting a zpool, maybe this is not
possible the way I like to do it, maybe I need to fuzz some
identifiers...

I want to transplant my old zpool tank from a 1TB drive to a new 2TB
drive, but *not* use dd(1) or any other cloning mechanism, as the pool
was very full very often and is surely severely fragmented.

So, I have tank (the old one), the new one, let's call it tank' and
then there's the archive pool where snapshots from tank are sent to, and
these should now come from tank' in the future.

I have:
tank -> sending snapshots to archive

I want:
tank' -> sending snapshots to archive

Ideally I would want archive to not even know that tank and tank' are
different, so as to not have to send a full snapshot again, but
continue the incremental snapshots.

So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that
worked well, this contained a snapshot A that was also already on
archive. Then I made a final snapshot B on tank, before turning down that
pool and sent it to tank' as well.

Now I have snapshot A on tank, tank' and archive and they are virtually
identical. I have snapshot B on tank and tank' and would like to send
this from tank' to archive, but it complains:

cannot receive incremental stream: most recent snapshot of archive does
not match incremental source

Is there a way to tweak the identity of tank' to be *really* the same as
tank, so that archive can accept that incremental stream? Or should I
use dd(1) after all to transplant tank to tank'? My other option would
be to turn on dedup on archive and send another full stream of tank',
99.9% of which would hopefully be deduped and not consume precious space
on archive.

Any ideas?

Cheers,
Uli


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 13:01:42 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id CF35BFA8
 for <fs@FreeBSD.org>; Sun, 27 Jan 2013 13:01:42 +0000 (UTC)
 (envelope-from nowakpl@platinum.linux.pl)
Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4])
 by mx1.freebsd.org (Postfix) with ESMTP id 94985F4A
 for <fs@FreeBSD.org>; Sun, 27 Jan 2013 13:01:42 +0000 (UTC)
Received: by platinum.linux.pl (Postfix, from userid 87)
 id 6918947E11; Sun, 27 Jan 2013 14:01:40 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl
X-Spam-Level: 
X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL
 autolearn=disabled version=3.3.2
Received: from [10.255.0.2] (unknown [83.151.38.73])
 by platinum.linux.pl (Postfix) with ESMTPA id E537F47DE6
 for <fs@FreeBSD.org>; Sun, 27 Jan 2013 14:01:39 +0100 (CET)
Message-ID: <5105252D.6060502@platinum.linux.pl>
Date: Sun, 27 Jan 2013 14:01:33 +0100
From: Adam Nowacki <nowakpl@platinum.linux.pl>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: fs@FreeBSD.org
Subject: RAID-Z wasted space - asize roundups to nparity +1
Content-Type: text/plain; charset=ISO-8859-2; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 13:01:42 -0000

I've just found something very weird in the ZFS code.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:504 in HEAD

Can someone explain the reason behind this line of code? What it does is 
align on-disk record size to a multiple of number of parity disks + 1 
... this really doesn't make any sense. So far as I can tell those extra 
sectors are just padding - completely unused.

For the array I'm using this results in 4.8% of wasted disk space - 
1.7TB. It's a 12x 3TB disk RAID-Z2.

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 13:48:11 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id A2B037F3
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 13:48:11 +0000 (UTC)
 (envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72])
 by mx1.freebsd.org (Postfix) with ESMTP id 6FF2FE4
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 13:48:10 +0000 (UTC)
Received: from localhost (89-73-195-149.dynamic.chello.pl [89.73.195.149])
 by mail.dawidek.net (Postfix) with ESMTPSA id 12CC2B0A;
 Sun, 27 Jan 2013 14:45:31 +0100 (CET)
Date: Sun, 27 Jan 2013 14:48:46 +0100
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Laurence Gill <laurencesgill@googlemail.com>
Subject: Re: HAST performance overheads?
Message-ID: <20130127134845.GC1346@garage.freebsd.pl>
References: <20130125121044.1afac72e@googlemail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="WplhKdTI2c8ulnbP"
Content-Disposition: inline
In-Reply-To: <20130125121044.1afac72e@googlemail.com>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 13:48:11 -0000


--WplhKdTI2c8ulnbP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 25, 2013 at 12:10:44PM +0000, Laurence Gill wrote:
> If I create ZFS raidz2 on these...
>=20
>  - # zpool create pool raidz2 da0 da1 da2 da3 da4 da5
>=20
> Then run a dd test, a sample output is...
>=20
>  - # dd if=3D/dev/zero of=3Dtest.dat bs=3D1M count=3D1024
>      1073741824 bytes transferred in 7.689634 secs (139634974 bytes/sec)
>=20
>  - # dd if=3D/dev/zero of=3Dtest.dat bs=3D16k count=3D65535
>      1073725440 bytes transferred in 1.909157 secs (562408130 bytes/sec)
>=20
> This is much faster than compared to running hast, I would expect an
> overhead, but not this much.  For example:
>=20
>  - # hastctl create disk0/disk1/disk2/disk3/disk4/disk5
>  - # hastctl role primary all
>  - # zpool create pool raidz2 disk0 disk1 disk2 disk3 disk4 disk5
>=20
> Run a dd test, and the speed is...
>=20
>  - # dd if=3D/dev/zero of=3Dtest.dat bs=3D1M count=3D1024
>      1073741824 bytes transferred in 40.908153 secs (26247624 bytes/sec)
>=20
>  - # dd if=3D/dev/zero of=3Dtest.dat bs=3D16k count=3D65535
>      1073725440 bytes transferred in 42.017997 secs (25553942 bytes/sec)

Let's try to test one step at a time. Can you try to compare sequential
performance of regular disk vs. HAST with no secondary configured?

By no secondary configured I mean 'remote' set to 'none'.

Just do:

	# dd if=3D/dev/zero of=3D/dev/da0 bs=3D1m count=3D10240

then configure HAST and:

	# dd if=3D/dev/zero of=3D/dev/hast/disk0 bs=3D1m count=3D10240

Which FreeBSD version is it?

PS. Your ZFS tests are pretty meaningless, because it is possible that
    everything will end up in memory. I'm sure this is what happens in
    'bs=3D16k count=3D65535' case. Let try raw providers first.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

--WplhKdTI2c8ulnbP
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEFMD0ACgkQForvXbEpPzRMigCglS8ZP9RggVl0MfVk+A25xgd2
29wAnigH5gA4RXxKI/4XLfKT8sW9eoPP
=D2zj
-----END PGP SIGNATURE-----

--WplhKdTI2c8ulnbP--

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 14:00:27 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id AD3958F3;
 Sun, 27 Jan 2013 14:00:27 +0000 (UTC)
 (envelope-from freebsd-listen@fabiankeil.de)
Received: from smtprelay02.ispgateway.de (smtprelay02.ispgateway.de
 [80.67.18.14]) by mx1.freebsd.org (Postfix) with ESMTP id 37874132;
 Sun, 27 Jan 2013 14:00:26 +0000 (UTC)
Received: from [78.35.163.65] (helo=fabiankeil.de)
 by smtprelay02.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128)
 (Exim 4.68) (envelope-from <freebsd-listen@fabiankeil.de>)
 id 1TzSmQ-0004Q3-NA; Sun, 27 Jan 2013 15:00:18 +0100
Date: Sun, 27 Jan 2013 14:56:01 +0100
From: Fabian Keil <freebsd-listen@fabiankeil.de>
To: Ulrich =?UTF-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
Subject: Re: Zpool surgery
Message-ID: <20130127145601.7f650d3c@fabiankeil.de>
In-Reply-To: <20130127103612.GB38645@acme.spoerlein.net>
References: <20130127103612.GB38645@acme.spoerlein.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1;
 boundary="Sig_/FJX6_h0WkAB0cAJZ1ipHtOB"; protocol="application/pgp-signature"
X-Df-Sender: Nzc1MDY3
Cc: current@FreeBSD.org, fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 14:00:27 -0000

--Sig_/FJX6_h0WkAB0cAJZ1ipHtOB
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Ulrich Sp=C3=B6rlein <uqs@FreeBSD.org> wrote:

> I have a slight problem with transplanting a zpool, maybe this is not
> possible the way I like to do it, maybe I need to fuzz some
> identifiers...
>=20
> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
> was very full very often and is surely severely fragmented.
>=20
> So, I have tank (the old one), the new one, let's call it tank' and
> then there's the archive pool where snapshots from tank are sent to, and
> these should now come from tank' in the future.
>=20
> I have:
> tank -> sending snapshots to archive
>=20
> I want:
> tank' -> sending snapshots to archive
>=20
> Ideally I would want archive to not even know that tank and tank' are
> different, so as to not have to send a full snapshot again, but
> continue the incremental snapshots.
>=20
> So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that
> worked well, this contained a snapshot A that was also already on
> archive. Then I made a final snapshot B on tank, before turning down that
> pool and sent it to tank' as well.
>=20
> Now I have snapshot A on tank, tank' and archive and they are virtually
> identical. I have snapshot B on tank and tank' and would like to send
> this from tank' to archive, but it complains:
>=20
> cannot receive incremental stream: most recent snapshot of archive does
> not match incremental source

In general this should work, so I'd suggest that you double check
that you are indeed sending the correct incremental.

> Is there a way to tweak the identity of tank' to be *really* the same as
> tank, so that archive can accept that incremental stream? Or should I
> use dd(1) after all to transplant tank to tank'? My other option would
> be to turn on dedup on archive and send another full stream of tank',
> 99.9% of which would hopefully be deduped and not consume precious space
> on archive.

The pools don't have to be the same.

I wouldn't consider dedup as you'll have to recreate the pool if
it turns out the the dedup performance is pathetic. On a system
that hasn't been created with dedup in mind that seems rather
likely.

> Any ideas?

Your whole procedure seems a bit complicated to me.

Why don't you use "zpool replace"?

Fabian

--Sig_/FJX6_h0WkAB0cAJZ1ipHtOB
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEFMfgACgkQBYqIVf93VJ11YQCgst43rQ0fEPedB1gaEUIocoQS
I/IAni9cEfESXBY5DZOO+mJ44csGHkYN
=nniE
-----END PGP SIGNATURE-----

--Sig_/FJX6_h0WkAB0cAJZ1ipHtOB--

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 14:31:27 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 6528BEC3;
 Sun, 27 Jan 2013 14:31:27 +0000 (UTC)
 (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id B66D4258;
 Sun, 27 Jan 2013 14:31:25 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001877953.msg;
 Sun, 27 Jan 2013 14:31:18 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 14:31:18 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: =?iso-8859-1?Q?Ulrich_Sp=F6rlein?= <uqs@FreeBSD.org>,
 <current@FreeBSD.org>, <fs@FreeBSD.org>
References: <20130127103612.GB38645@acme.spoerlein.net>
Subject: Re: Zpool surgery
Date: Sun, 27 Jan 2013 14:31:56 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=original
Content-Transfer-Encoding: 8bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 14:31:27 -0000

----- Original Message ----- 
From: "Ulrich Sp�rlein" <uqs@FreeBSD.org>

> I have a slight problem with transplanting a zpool, maybe this is not
> possible the way I like to do it, maybe I need to fuzz some
> identifiers...
>
> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
> was very full very often and is surely severely fragmented.
>

Cant you just drop the disk in the original machine, set it as a mirror
then once the mirror process has completed break the mirror and remove
the 1TB disk.

If this is a boot disk don't forget to set the boot block as well.

    Regards
    Steve 


================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 14:54:28 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id E06D1738;
 Sun, 27 Jan 2013 14:54:28 +0000 (UTC)
 (envelope-from utisoft@gmail.com)
Received: from mail-ie0-x22f.google.com (ie-in-x022f.1e100.net
 [IPv6:2607:f8b0:4001:c03::22f])
 by mx1.freebsd.org (Postfix) with ESMTP id 748A5348;
 Sun, 27 Jan 2013 14:54:28 +0000 (UTC)
Received: by mail-ie0-f175.google.com with SMTP id c12so9231ieb.20
 for <multiple recipients>; Sun, 27 Jan 2013 06:54:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=+pKTfnRqzbWS09RCaOG86XD13koGRYnf2xF/rpCzF0A=;
 b=mpaAzVheFC3qWalXxCCrGMuIIFGN6EEd+Tybe9VwGaJ1hfzWDOsp8XL2dNNK25g88b
 0CpPVENQO/qyIW3cBFH4XdhHR2w1CzllCH/IXJv+mDCsZaKjcBacieTHolK/i6b5YvGL
 hB6Wy3OetxbOwrGMMd4sw0NTMnXK4Hw+Mc2yXBflff616eyUGs6La8dew5KjdVrYWmHu
 fKBCbQ2TPEw1Tvte3A3OCUTK9mnl7Rpo/G8bkjHpwXv018hcXA9H2KHGsGhX/4gV9yyk
 3d+DPm/noKY3FluVNWlq1a/x+Vw39eVMU4CHhvd7tPnSdhjd0wj+dhFX33aSfg5EL2+T
 C4vA==
MIME-Version: 1.0
X-Received: by 10.50.214.10 with SMTP id nw10mr3061777igc.15.1359298468005;
 Sun, 27 Jan 2013 06:54:28 -0800 (PST)
Received: by 10.64.16.73 with HTTP; Sun, 27 Jan 2013 06:54:27 -0800 (PST)
Received: by 10.64.16.73 with HTTP; Sun, 27 Jan 2013 06:54:27 -0800 (PST)
In-Reply-To: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
Date: Sun, 27 Jan 2013 14:54:27 +0000
Message-ID: <CADLo83_p5a9znQ96dku_jJUVyBiULFUBx-xJG0Yi_faUQx26KQ@mail.gmail.com>
Subject: Re: Zpool surgery
From: Chris Rees <utisoft@gmail.com>
To: Steven Hartland <killing@multiplay.co.uk>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: current@freebsd.org, fs@freebsd.org,
 =?ISO-8859-1?Q?Ulrich_Sp=F6rlein?= <uqs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 14:54:29 -0000

On 27 Jan 2013 14:31, "Steven Hartland" <killing@multiplay.co.uk> wrote:
>
> ----- Original Message ----- From: "Ulrich Sp=F6rlein" <uqs@FreeBSD.org>
>
>
>> I have a slight problem with transplanting a zpool, maybe this is not
>> possible the way I like to do it, maybe I need to fuzz some
>> identifiers...
>>
>> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
>> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
>> was very full very often and is surely severely fragmented.
>>
>
> Cant you just drop the disk in the original machine, set it as a mirror
> then once the mirror process has completed break the mirror and remove
> the 1TB disk.
>
> If this is a boot disk don't forget to set the boot block as well.

I managed to replace a drive this way without even rebooting.  I believe
it's the same as a zpool replace.

Chris

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 14:56:19 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 250C1A41;
 Sun, 27 Jan 2013 14:56:19 +0000 (UTC)
 (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 976F1382;
 Sun, 27 Jan 2013 14:56:18 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001878220.msg;
 Sun, 27 Jan 2013 14:56:16 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 14:56:16 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Vladislav Prodan" <universite@ukr.net>
References: <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
 <93308.1359297551.14145052969567453184@ffe15.ukr.net>
Subject: Re: Re[2]: AHCI timeout when using ZFS + AIO + NCQ
Date: Sun, 27 Jan 2013 14:56:52 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="Windows-1251";
 reply-type=original
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 14:56:19 -0000


----- Original Message ----- 
From: "Vladislav Prodan" <universite@ukr.net>

>> Is it always the same disk, of so replace it SMART helps identify issues
>> but doesn't tell you 100% there's no problem.
> 
> 
> Now it has fallen off a different HDD - ada0.
> I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2.
> I still have three servers with similar chipsets that have similar problems
> with blade ahci times out.

I notice your disks are connecting at SATA 3.x, which rings bells. We had
a very similar issue on a new Supermicro machine here and after much
testing we proved to our satisfaction that the problem was the HW.

Essentially the combination of SATA 3 speeds the midplane / backplane
degraded the connection between the MB and HDD enough to cause
the disks to randomly drop when under load.

If we connected the disks directly to the MB with SATA cables the
problem went away. In the end we had midplanes changed from an
AHCI pass-through to active LSI controller.

So if you have any sort of midplane / backplane connecting your disks
try connecting them direct to the MB / controller via known SATA 3.x
compliant cables and see if that stops the drops.

Another test you can do is to force the disks to connect at SATA 2.x
this also fixed it in our case, but wasn't something we wanted to
put into production hence the controller swap.

To force SATA 2 speeds you can use the following in /boot/loader.conf
where 'X' is disk identifier e.g. for ada0 X = 0:-
hint.ahcich.X.sata_rev=2

Hope this helps.

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 14:56:52 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id BC70FBB8;
 Sun, 27 Jan 2013 14:56:52 +0000 (UTC)
 (envelope-from universite@ukr.net)
Received: from ffe15.ukr.net (ffe15.ukr.net [195.214.192.50])
 by mx1.freebsd.org (Postfix) with ESMTP id 44A9939A;
 Sun, 27 Jan 2013 14:56:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net;
 s=ffe; 
 h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version;
 bh=BJYHeKR7eBEVXWimtxiFQHBLwTNZ7ccMMCeS+OY67SI=; 
 b=tT/t+Tzvfv2tHCj10pm29lc8jWiEiPQQeDvc4vUpp9k+P5Cy4vxy75XYr2wt1DWHKMPWdItBCoDrWYbm4f2EOTyy+2yAFk5ih22MqegzTJSSGQIcuuvFtVjGAkQfI31ZlDwe6ohRGikCgki6of0NnQWid+6Ypxo3jMWSam2D3F0=;
Received: from mail by ffe15.ukr.net with local ID 1TzTO3-000OUJ-Cy
 ; Sun, 27 Jan 2013 16:39:11 +0200
MIME-Version: 1.0
Content-Disposition: inline
Content-Transfer-Encoding: binary
Content-Type: text/plain; charset="windows-1251"
Subject: Re[2]: AHCI timeout when using ZFS + AIO + NCQ
In-Reply-To: <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
References: <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
To: "Steven Hartland" <killing@multiplay.co.uk>
From: "Vladislav Prodan" <universite@ukr.net>
X-Mailer: freemail.ukr.net 4.0
Message-Id: <93308.1359297551.14145052969567453184@ffe15.ukr.net>
X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:18.0) Gecko/20100101 Firefox/18.0
Date: Sun, 27 Jan 2013 16:39:11 +0200
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 14:56:52 -0000


> Is it always the same disk, of so replace it SMART helps identify issues
> but doesn't tell you 100% there's no problem.


Now it has fallen off a different HDD - ada0.
I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2.
I still have three servers with similar chipsets that have similar problems with blade ahci times out.


> ----- Original Message ----- 
> From: "Vladislav Prodan" <universite@ukr.net>
> To: <fs@freebsd.org>
> Cc: <current@freebsd.org>
> Sent: Thursday, January 24, 2013 12:19 PM
> Subject: AHCI timeout when using ZFS + AIO + NCQ
> 
> 
> >I have the server:
> >
> > FreeBSD 9.1-PRERELEASE #0: Wed Jul 25 01:40:56 EEST 2012
> >
> > Jan 24 12:53:01 vesuvius kernel: atapci0: <JMicron ATA controller> port 
> > 0xc040-0xc047,0xc030-0xc033,0xc020-0xc027,0xc010-0xc013,0xc000-0xc00f mem 0xfe210000-0xfe2101ff irq 51 at device 0.0 on pci3
> > ...
> > Jan 24 12:53:01 vesuvius kernel: ahci0: <ATI IXP700 AHCI SATA controller> port 
> > 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfe307000-0xfe3073ff irq 19 at device 17.0 on pci0
> > Jan 24 12:53:01 vesuvius kernel: ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
> > ...
> > Jan 24 12:53:01 vesuvius kernel: ada2 at ahcich2 bus 0 scbus4 target 0 lun 0
> > Jan 24 12:53:01 vesuvius kernel: ada2: <ST3000DM001-9YN166 CC4C> ATA-8 SATA 3.x device
> > Jan 24 12:53:01 vesuvius kernel: ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
> > Jan 24 12:53:01 vesuvius kernel: ada2: Command Queueing enabled
> > Jan 24 12:53:01 vesuvius kernel: ada2: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
> > Jan 24 12:53:01 vesuvius kernel: ada2: Previously was known as ad12
> > ...
> > I use 4 HDD in RAID10 via ZFS.
> >
> > With a very irregular intervals fall off HDD drives. As a result, the server stops.
> >
> > Jan 24 06:48:06 vesuvius kernel: ahcich2: Timeout on slot 6 port 0
> > Jan 24 06:48:06 vesuvius kernel: ahcich2: is 00000000 cs 00000000 ss 000000c0 rs 000000c0 tfd 40 serr 00000000 cmd 0000e817
> > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 4c 4e 1e 40 68 00 00 01 00 00
> > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
> > Jan 24 06:48:06 vesuvius kernel: (ada2:ahcich2:0:0:0): Retrying command
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
> > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192
> > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> > Jan 24 06:51:11 vesuvius kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 4227133, size: 8192
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: Timeout on slot 8 port 0
> > Jan 24 06:51:11 vesuvius kernel: ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 00 serr 00000000 cmd 0000e817
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
> > Jan 24 06:51:11 vesuvius kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
> > Jan 24 06:51:11 vesuvius kernel: swap_pager: I/O error - pagein failed; blkno 4227133,size 8192, error 6
> > Jan 24 06:51:11 vesuvius kernel: (ada2:(pass2:vm_fault: pager read error, pid 1943 (named)
> > Jan 24 06:51:11 vesuvius kernel: ahcich2:0:ahcich2:0:0:0:0): lost device
> > Jan 24 06:51:11 vesuvius kernel: 0): passdevgonecb: devfs entry is gone
> > Jan 24 06:51:11 vesuvius kernel: pid 1943 (named), uid 53: exited on signal 11
> > ...
> >
> > Helps only restart by pressing Power.
> > Judging by the state of SMART, HDD have no problems. SATA data cable changed.
> >
> >
> > I found a similar problem:
> >
> > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> > PR: amd64/165547: NVIDIA MCP67 AHCI SATA controller timeout
> >
> > -- 
> > Vladislav V. Prodan
> > System & Network Administrator
> > http://support.od.ua
> > +380 67 4584408, +380 99 4060508
> > VVP88-RIPE


-- 
Vladislav V. Prodan            
System & Network Administrator 
http://support.od.ua           
+380 67 4584408, +380 99 4060508
VVP88-RIPE


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 15:29:05 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A3FE55B8;
 Sun, 27 Jan 2013 15:29:05 +0000 (UTC)
 (envelope-from universite@ukr.net)
Received: from ffe11.ukr.net (ffe11.ukr.net [195.214.192.31])
 by mx1.freebsd.org (Postfix) with ESMTP id 5511A734;
 Sun, 27 Jan 2013 15:29:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net;
 s=ffe; 
 h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version;
 bh=KA3pILGNPBJkWcpAOJAYYa+9vxKGxcXhrIZ+SsImMNo=; 
 b=C/fyFj5QdF2pqtrKTZinJdNeCy1WcGK2v1aRtcfK8JlWdZ66n2w1VaslcpA2T3rskP6ez496vmWq14Y/kIdfYMbfj1M6GmyQg5Q431bDrf99VTN7cpJWdeqjLKhHPPTsh2UBMmdj0ASbG+X/Sv8Z+KsSFo4rNlE7HirG30yUXxY=;
Received: from mail by ffe11.ukr.net with local ID 1TzTvB-000Idu-Kk
 ; Sun, 27 Jan 2013 17:13:25 +0200
MIME-Version: 1.0
Content-Disposition: inline
Content-Transfer-Encoding: binary
Content-Type: text/plain; charset="windows-1251"
Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ
In-Reply-To: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
References: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
 <93308.1359297551.14145052969567453184@ffe15.ukr.net>
 <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
To: "Steven Hartland" <killing@multiplay.co.uk>
From: "Vladislav Prodan" <universite@ukr.net>
X-Mailer: freemail.ukr.net 4.0
Message-Id: <70362.1359299605.3196836531757973504@ffe11.ukr.net>
X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:18.0) Gecko/20100101 Firefox/18.0
Date: Sun, 27 Jan 2013 17:13:25 +0200
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 15:29:05 -0000


> ----- Original Message ----- 
> From: "Vladislav Prodan" <universite@ukr.net>
> 
> >> Is it always the same disk, of so replace it SMART helps identify issues
> >> but doesn't tell you 100% there's no problem.
> > 
> > 
> > Now it has fallen off a different HDD - ada0.
> > I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2.
> > I still have three servers with similar chipsets that have similar problems
> > with blade ahci times out.
> 
> I notice your disks are connecting at SATA 3.x, which rings bells. We had
> a very similar issue on a new Supermicro machine here and after much
> testing we proved to our satisfaction that the problem was the HW.


I have a motherboard ASUS M5A97 PRO
http://www.asus.com/Motherboard/M5A97_PRO/#specifications
Has replacement SATA data cables.
Putting hard RAID controller does not guarantee data recovery at his death.
 
> Essentially the combination of SATA 3 speeds the midplane / backplane
> degraded the connection between the MB and HDD enough to cause
> the disks to randomly drop when under load.
> 
> If we connected the disks directly to the MB with SATA cables the
> problem went away. In the end we had midplanes changed from an
> AHCI pass-through to active LSI controller.
> 
> So if you have any sort of midplane / backplane connecting your disks
> try connecting them direct to the MB / controller via known SATA 3.x
> compliant cables and see if that stops the drops.
> 
> Another test you can do is to force the disks to connect at SATA 2.x
> this also fixed it in our case, but wasn't something we wanted to
> put into production hence the controller swap.
> 
> To force SATA 2 speeds you can use the following in /boot/loader.conf
> where 'X' is disk identifier e.g. for ada0 X = 0:-
> hint.ahcich.X.sata_rev=2
> 
> Hope this helps.
> 
> Regards
> Steve
> 

-- 
Vladislav V. Prodan            
System & Network Administrator 
http://support.od.ua           
+380 67 4584408, +380 99 4060508
VVP88-RIPE


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 18:44:04 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id ADC8F485;
 Sun, 27 Jan 2013 18:44:04 +0000 (UTC)
 (envelope-from prvs=1739a0aae4=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 17EA3FC1;
 Sun, 27 Jan 2013 18:44:03 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001881137.msg;
 Sun, 27 Jan 2013 18:44:01 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Sun, 27 Jan 2013 18:44:01 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1739a0aae4=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
Message-ID: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Vladislav Prodan" <universite@ukr.net>
References: <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
 <93308.1359297551.14145052969567453184@ffe15.ukr.net>
 <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
 <70362.1359299605.3196836531757973504@ffe11.ukr.net>
Subject: Re: AHCI timeout when using ZFS + AIO + NCQ
Date: Sun, 27 Jan 2013 18:44:37 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=original
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 18:44:04 -0000


----- Original Message ----- 
From: "Vladislav Prodan" <universite@ukr.net>
To: "Steven Hartland" <killing@multiplay.co.uk>
Cc: <current@freebsd.org>; <fs@freebsd.org>
Sent: Sunday, January 27, 2013 3:13 PM
Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ


> 
> 
>> ----- Original Message ----- 
>> From: "Vladislav Prodan" <universite@ukr.net>
>> 
>> >> Is it always the same disk, of so replace it SMART helps identify issues
>> >> but doesn't tell you 100% there's no problem.
>> > 
>> > 
>> > Now it has fallen off a different HDD - ada0.
>> > I'm 99% sure that MHDD will not find problems in HDD - ada0 and ada2.
>> > I still have three servers with similar chipsets that have similar problems
>> > with blade ahci times out.
>> 
>> I notice your disks are connecting at SATA 3.x, which rings bells. We had
>> a very similar issue on a new Supermicro machine here and after much
>> testing we proved to our satisfaction that the problem was the HW.
> 
> 
> I have a motherboard ASUS M5A97 PRO
> http://www.asus.com/Motherboard/M5A97_PRO/#specifications
> Has replacement SATA data cables.
> Putting hard RAID controller does not guarantee data recovery at his death.

Not sure what that has to do with cable / track lengths via things
like a backplane?

Do you or do you not have a hotswap backplane?

>> Essentially the combination of SATA 3 speeds the midplane / backplane
>> degraded the connection between the MB and HDD enough to cause
>> the disks to randomly drop when under load.
>> 
>> If we connected the disks directly to the MB with SATA cables the
>> problem went away. In the end we had midplanes changed from an
>> AHCI pass-through to active LSI controller.
>> 
>> So if you have any sort of midplane / backplane connecting your disks
>> try connecting them direct to the MB / controller via known SATA 3.x
>> compliant cables and see if that stops the drops.
>> 
>> Another test you can do is to force the disks to connect at SATA 2.x
>> this also fixed it in our case, but wasn't something we wanted to
>> put into production hence the controller swap.
>> 
>> To force SATA 2 speeds you can use the following in /boot/loader.conf
>> where 'X' is disk identifier e.g. for ada0 X = 0:-
>> hint.ahcich.X.sata_rev=2

This is still worth trying as it could still indicate a problem
with your controller, cables or disks.

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 19:02:08 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 5E7EE6BF;
 Sun, 27 Jan 2013 19:02:08 +0000 (UTC)
 (envelope-from universite@ukr.net)
Received: from ffe16.ukr.net (ffe16.ukr.net [195.214.192.51])
 by mx1.freebsd.org (Postfix) with ESMTP id 06DD7B1;
 Sun, 27 Jan 2013 19:02:07 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net;
 s=ffe; 
 h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version;
 bh=sgAv80HRGHZrDFfsP8s1dqUbMJPy4blkmHgI4So/RBw=; 
 b=TBW+YxoNuAsbRBw0DTEacjNpHoqgd2/MiI5S12C8LYbpvjyK0JDiIwlIAYq2BJ/URRD6LK0rdE4ssuq66gE/kbB9N84f3JWlBn3se8FzCrEARihxnjm58tGxIbArgLaf6WJ7lkc5r5VQ7hhjPOUGDRxI1MZTLfJjv7oyuXypDno=;
Received: from mail by ffe16.ukr.net with local ID 1TzXUN-000IPl-KG
 ; Sun, 27 Jan 2013 21:01:59 +0200
MIME-Version: 1.0
Content-Disposition: inline
Content-Transfer-Encoding: binary
Content-Type: text/plain; charset="windows-1251"
Subject: Re[2]: AHCI timeout when using ZFS + AIO + NCQ
In-Reply-To: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk>
References: <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk>
 <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
 <93308.1359297551.14145052969567453184@ffe15.ukr.net>
 <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
 <70362.1359299605.3196836531757973504@ffe11.ukr.net>
To: "Steven Hartland" <killing@multiplay.co.uk>
From: "Vladislav Prodan" <universite@ukr.net>
X-Mailer: freemail.ukr.net 4.0
Message-Id: <70578.1359313319.18126575192049975296@ffe16.ukr.net>
X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:18.0) Gecko/20100101 Firefox/18.0
Date: Sun, 27 Jan 2013 21:01:59 +0200
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 19:02:08 -0000


> >> Essentially the combination of SATA 3 speeds the midplane / backplane
> >> degraded the connection between the MB and HDD enough to cause
> >> the disks to randomly drop when under load.
> >> 
> >> If we connected the disks directly to the MB with SATA cables the
> >> problem went away. In the end we had midplanes changed from an
> >> AHCI pass-through to active LSI controller.
> >> 
> >> So if you have any sort of midplane / backplane connecting your disks
> >> try connecting them direct to the MB / controller via known SATA 3.x
> >> compliant cables and see if that stops the drops.
> >> 
> >> Another test you can do is to force the disks to connect at SATA 2.x
> >> this also fixed it in our case, but wasn't something we wanted to
> >> put into production hence the controller swap.
> >> 
> >> To force SATA 2 speeds you can use the following in /boot/loader.conf
> >> where 'X' is disk identifier e.g. for ada0 X = 0:-
> >> hint.ahcich.X.sata_rev=2
> 
> This is still worth trying as it could still indicate a problem
> with your controller, cables or disks.
> 

Or, simply disable the ahci kernel module and use only ata?


-- 
Vladislav V. Prodan            
System & Network Administrator 
http://support.od.ua           
+380 67 4584408, +380 99 4060508
VVP88-RIPE


From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 19:08:08 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 9D773C4A;
 Sun, 27 Jan 2013 19:08:08 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from acme.spoerlein.net (acme.spoerlein.net
 [IPv6:2a01:4f8:131:23c2::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 2DFC5127;
 Sun, 27 Jan 2013 19:08:08 +0000 (UTC)
Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1])
 by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0RJ86S1009795
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO);
 Sun, 27 Jan 2013 20:08:06 +0100 (CET) (envelope-from uqs@FreeBSD.org)
Date: Sun, 27 Jan 2013 20:08:06 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
To: Fabian Keil <freebsd-listen@fabiankeil.de>
Subject: Re: Zpool surgery
Message-ID: <20130127190806.GQ35868@acme.spoerlein.net>
Mail-Followup-To: Fabian Keil <freebsd-listen@fabiankeil.de>,
 current@FreeBSD.org, fs@FreeBSD.org
References: <20130127103612.GB38645@acme.spoerlein.net>
 <20130127145601.7f650d3c@fabiankeil.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20130127145601.7f650d3c@fabiankeil.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: current@FreeBSD.org, fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 19:08:08 -0000

On Sun, 2013-01-27 at 14:56:01 +0100, Fabian Keil wrote:
> Ulrich Spörlein <uqs@FreeBSD.org> wrote:
> 
> > I have a slight problem with transplanting a zpool, maybe this is not
> > possible the way I like to do it, maybe I need to fuzz some
> > identifiers...
> > 
> > I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> > drive, but *not* use dd(1) or any other cloning mechanism, as the pool
> > was very full very often and is surely severely fragmented.
> > 
> > So, I have tank (the old one), the new one, let's call it tank' and
> > then there's the archive pool where snapshots from tank are sent to, and
> > these should now come from tank' in the future.
> > 
> > I have:
> > tank -> sending snapshots to archive
> > 
> > I want:
> > tank' -> sending snapshots to archive
> > 
> > Ideally I would want archive to not even know that tank and tank' are
> > different, so as to not have to send a full snapshot again, but
> > continue the incremental snapshots.
> > 
> > So I did zfs send -R tank | ssh otherhost "zfs recv -d tank" and that
> > worked well, this contained a snapshot A that was also already on
> > archive. Then I made a final snapshot B on tank, before turning down that
> > pool and sent it to tank' as well.
> > 
> > Now I have snapshot A on tank, tank' and archive and they are virtually
> > identical. I have snapshot B on tank and tank' and would like to send
> > this from tank' to archive, but it complains:
> > 
> > cannot receive incremental stream: most recent snapshot of archive does
> > not match incremental source
> 
> In general this should work, so I'd suggest that you double check
> that you are indeed sending the correct incremental.
> 
> > Is there a way to tweak the identity of tank' to be *really* the same as
> > tank, so that archive can accept that incremental stream? Or should I
> > use dd(1) after all to transplant tank to tank'? My other option would
> > be to turn on dedup on archive and send another full stream of tank',
> > 99.9% of which would hopefully be deduped and not consume precious space
> > on archive.
> 
> The pools don't have to be the same.
> 
> I wouldn't consider dedup as you'll have to recreate the pool if
> it turns out the the dedup performance is pathetic. On a system
> that hasn't been created with dedup in mind that seems rather
> likely.
> 
> > Any ideas?
> 
> Your whole procedure seems a bit complicated to me.
> 
> Why don't you use "zpool replace"?


Ehhh, .... "zpool replace", eh? I have to say I didn't know that option
was available, but also because this is on a newer machine, I needed
some way to do this over the network, so a direct zpool replace is not
that easy.

I dug out an old ATA-to-USB case and will use that to attach the old
tank to the new machine and then have a try at this zpool replace thing.

How will that affect the fragmentation level of the new pool? Will the
resilver do something sensible wrt. keeping files together for better
read-ahead performance?

Cheers,
Uli

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 19:12:13 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 4F7EDEE0;
 Sun, 27 Jan 2013 19:12:13 +0000 (UTC)
 (envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe02.c2i.net [212.247.154.34])
 by mx1.freebsd.org (Postfix) with ESMTP id 2BE98155;
 Sun, 27 Jan 2013 19:12:11 +0000 (UTC)
X-T2-Spam-Status: No, hits=-1.0 required=5.0 tests=ALL_TRUSTED
Received: from [176.74.213.204] (account mc467741@c2i.net HELO
 laptop015.hselasky.homeunix.org)
 by mailfe02.swip.net (CommuniGate Pro SMTP 5.4.4)
 with ESMTPA id 372818926; Sun, 27 Jan 2013 20:12:10 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-current@freebsd.org
Subject: Re: Zpool surgery
Date: Sun, 27 Jan 2013 20:13:24 +0100
User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; )
References: <20130127103612.GB38645@acme.spoerlein.net>
 <20130127145601.7f650d3c@fabiankeil.de>
 <20130127190806.GQ35868@acme.spoerlein.net>
In-Reply-To: <20130127190806.GQ35868@acme.spoerlein.net>
X-Face: ?p&W)c<w+fb'sd#*]i!s!UQ[!|W8%I(I+\}LQ-Z1nB*t(u_>(
 =?iso-8859-1?q?+80hU=3B=27=7B=2E=245K+zq=7BoC6y=7C=0A=09/D=27an*6mw?=>j'f:eBsex\Gi,
 <a&s"J*gPjz0nO, Sucb&N84|zkA2YlM?`/Vp=@
 =?iso-8859-1?q?Bh=7DpyW=3DTQ=0A=09WtqwY=27=5FSy0aa=7E?=(jJCeFpDW
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Message-Id: <201301272013.24445.hselasky@c2i.net>
Cc: fs@freebsd.org, current@freebsd.org,
 Ulrich =?utf-8?q?Sp=C3=B6rlein?= <uqs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 19:12:13 -0000

On Sunday 27 January 2013 20:08:06 Ulrich Sp=C3=B6rlein wrote:
> I dug out an old ATA-to-USB case and will use that to attach the old
> tank to the new machine and then have a try at this zpool replace thing.

If you are using -current you might want this patch first:

http://svnweb.freebsd.org/changeset/base/245995

=2D-HPS

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 20:11:57 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 6B452D00;
 Sun, 27 Jan 2013 20:11:57 +0000 (UTC)
 (envelope-from peter@rulingia.com)
Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au
 [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 031062F8;
 Sun, 27 Jan 2013 20:11:56 +0000 (UTC)
Received: from server.rulingia.com
 (c220-239-246-167.belrs5.nsw.optusnet.com.au [220.239.246.167])
 by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r0RKBltH004480
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Mon, 28 Jan 2013 07:11:47 +1100 (EST)
 (envelope-from peter@rulingia.com)
X-Bogosity: Ham, spamicity=0.000000
Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1])
 by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r0RKBgLt092002
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 28 Jan 2013 07:11:42 +1100 (EST)
 (envelope-from peter@server.rulingia.com)
Received: (from peter@localhost)
 by server.rulingia.com (8.14.5/8.14.5/Submit) id r0RKBffX092001;
 Mon, 28 Jan 2013 07:11:41 +1100 (EST) (envelope-from peter)
Date: Mon, 28 Jan 2013 07:11:40 +1100
From: Peter Jeremy <peter@rulingia.com>
To: Steven Hartland <killing@multiplay.co.uk>
Subject: Re: Zpool surgery
Message-ID: <20130127201140.GD29105@server.rulingia.com>
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="QRj9sO5tAVLaXnSD"
Content-Disposition: inline
In-Reply-To: <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
X-PGP-Key: http://www.rulingia.com/keys/peter.pgp
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: current@freebsd.org, fs@freebsd.org,
 Ulrich =?iso-8859-1?Q?Sp=F6rlein?= <uqs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 20:11:57 -0000


--QRj9sO5tAVLaXnSD
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.co.uk> wr=
ote:
>----- Original Message -----=20
>From: "Ulrich Sp=F6rlein" <uqs@FreeBSD.org>
>> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
>> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
>> was very full very often and is surely severely fragmented.
>
>Cant you just drop the disk in the original machine, set it as a mirror
>then once the mirror process has completed break the mirror and remove
>the 1TB disk.

That will replicate any fragmentation as well.  "zfs send | zfs recv"
is the only (current) way to defragment a ZFS pool.

--=20
Peter Jeremy

--QRj9sO5tAVLaXnSD
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEFifwACgkQ/opHv/APuIeuUACgqCCNXfxYUs6MF9RcFnRvANg3
T+AAnAsdg/RXxe7Y9nCPRFmKWizYzuKB
=Y809
-----END PGP SIGNATURE-----

--QRj9sO5tAVLaXnSD--

From owner-freebsd-fs@FreeBSD.ORG  Sun Jan 27 21:34:22 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 16381190
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 21:34:22 +0000 (UTC)
 (envelope-from grarpamp@gmail.com)
Received: from mail-vb0-f54.google.com (mail-vb0-f54.google.com
 [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id BF7AF817
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 21:34:21 +0000 (UTC)
Received: by mail-vb0-f54.google.com with SMTP id l1so1467342vba.13
 for <freebsd-fs@freebsd.org>; Sun, 27 Jan 2013 13:34:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:content-type;
 bh=rfX5ThjrzcCQ5Nj9fKH9eJ2RK1KR0C5ZkUxSG+k61QA=;
 b=ntvQ6JnnzhGD+l2MF+swYgZr/CpoxS5lQeSzAXkB4hdQHJD+hjJQxWHWFyMX2lVn0r
 h4mA2WGWTnuXtofQWJgypEr7xIwT/XbLRKo9z/EIlvKMJIyBFL6468IpGNHxqk10Datd
 3IhKF52npE6ghu3rR3XofucLNIJ9+fJR72JonFmJHMDRkIwzLjoodJKzIchvYZdknpN+
 A0PYQ0Xe8U/BW3UC8K1lQGGWMr3h538iH4OLED/x95+DtKDB5vMrcEtQBf+EOkhfYFTs
 ymTrlYGzGB9avSjLa8Nd5dpJstXWq99kfW7nCMLcaUX/vvZAZEDwvD9hpmywGPwVTtsB
 fUpg==
MIME-Version: 1.0
X-Received: by 10.220.150.136 with SMTP id y8mr12785591vcv.34.1359322461178;
 Sun, 27 Jan 2013 13:34:21 -0800 (PST)
Received: by 10.220.219.79 with HTTP; Sun, 27 Jan 2013 13:34:20 -0800 (PST)
In-Reply-To: <CAD2Ti2-1ROTxQXNA6FzWtcgnMoaAzvfcdh__zH7AVC7zCPsyzw@mail.gmail.com>
References: <CAD2Ti2-1ROTxQXNA6FzWtcgnMoaAzvfcdh__zH7AVC7zCPsyzw@mail.gmail.com>
Date: Sun, 27 Jan 2013 16:34:20 -0500
Message-ID: <CAD2Ti28WBAgJTu2480Cs88LGmTG+OFXe_XDL4GXxuNuTPrDVaA@mail.gmail.com>
Subject: Re: ZFS slackspace, grepping it for data
From: grarpamp <grarpamp@gmail.com>
To: freebsd-fs@freebsd.org
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 27 Jan 2013 21:34:22 -0000

> zdb -mmm pool_name

Ahh, I saw this later too, thanks. Seems I've got 425k free ranges
to scan among 25k free txg's. This will take a while but
it's still a nice feature. I doubt it was meant for this
purpose though. More likely for debugging zfs structures
and data issues.

> for on-disk offset add 0x400000 If i remember correctly.

I could check for it with a string search near the head of data.
Does that fs to disk offset stay the same throughout the fs?

The minimum range size appears to be 4KiB (245k worth),
with another 75k at 8KiB and 100k more on up to 32KiB.
So not sure yet whether using zdb to collect the slack will
perform any worse than supplying the list to dd, or even
trying to write some C to avoid the shell overhead and further
to read the disk direct.

I occaisionally get failed assertions and core dumps with
various zdb operations. Is there interest in ticketing them?

Assertion failed: (object_count == usedobjs (0x0 == 0x1e33ec)), file
/re8/src/cddl/usr.sbin/zdb/../../../cddl/contrib/opensolaris/cmd/zdb/zdb.c,
line 1649.

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 06:35:45 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id C902ED8C;
 Mon, 28 Jan 2013 06:35:45 +0000 (UTC)
 (envelope-from pyunyh@gmail.com)
Received: from mail-pb0-f47.google.com (mail-pb0-f47.google.com
 [209.85.160.47]) by mx1.freebsd.org (Postfix) with ESMTP id 6E955BF2;
 Mon, 28 Jan 2013 06:35:45 +0000 (UTC)
Received: by mail-pb0-f47.google.com with SMTP id rp8so514298pbb.34
 for <multiple recipients>; Sun, 27 Jan 2013 22:35:39 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=x-received:from:date:to:cc:subject:message-id:reply-to:references
 :mime-version:content-type:content-disposition:in-reply-to
 :user-agent; bh=2RacGay4nKiXjswap+LYj/1R3BklExGECrJVUYi2YcE=;
 b=jfDdnNRfagARNXe8mlBcGP6m59yqAXmAUhck8r3GUBB9aKYCz/AS2sLDQBQLPyTFAi
 RR2Ha8pOTsd4haaQKclcY/u/eIv/r99tR6CS5Dn4jNx3VKS/LM8U3cd9Q/5CKEVGptQ2
 qqQCNGXBz2bcpn408+9hpu7F8CJRK5Ls97wwK7XYofCyTJjQbOhlI59q7udckpA7SmIn
 ktvckWbbowjY7eqpI7NoE7RcOgTPRuZna7igRLmHQVk7qR2L8SuVSMBduKg233e7JuDz
 ihjvOEv7Yu1n0INf2iNCcKKp65xNmO6MGZR2CmWNS3KwUFdJlofE/KOqGVVhL/6L+LDA
 pMBg==
X-Received: by 10.66.84.195 with SMTP id b3mr33785573paz.30.1359354939703;
 Sun, 27 Jan 2013 22:35:39 -0800 (PST)
Received: from pyunyh@gmail.com (lpe4.p59-icn.cdngp.net. [114.111.62.249])
 by mx.google.com with ESMTPS id x6sm6157347paw.0.2013.01.27.22.35.35
 (version=TLSv1 cipher=RC4-SHA bits=128/128);
 Sun, 27 Jan 2013 22:35:38 -0800 (PST)
Received: by pyunyh@gmail.com (sSMTP sendmail emulation);
 Mon, 28 Jan 2013 15:35:31 +0900
From: YongHyeon PYUN <pyunyh@gmail.com>
Date: Mon, 28 Jan 2013 15:35:31 +0900
To: Christian Gusenbauer <c47g@gmx.at>
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Message-ID: <20130128063531.GC1447@michelle.cdnetworks.com>
References: <201301241805.57623.c47g@gmx.at>
 <20130125043043.GA1429@michelle.cdnetworks.com>
 <20130125045048.GB1429@michelle.cdnetworks.com>
 <201301251809.50929.c47g@gmx.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201301251809.50929.c47g@gmx.at>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: pyunyh@gmail.com
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 06:35:45 -0000

On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
> On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
> > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
> > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
> > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
> > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer 
> wrote:
> > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov 
> wrote:
> > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian 
> Gusenbauer wrote:
> > > > > > > > > > > Hi!
> > > > > > > > > > > 
> > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the
> > > > > > > > > > > panic below if I execute the following commands (as
> > > > > > > > > > > single user):
> > > > > > > > > > > 
> > > > > > > > > > > # swapon -a
> > > > > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > > > > # mount -u /
> > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > > > > > 
> > > > > > > > > > > then the system panics almost immediately. I'll attach
> > > > > > > > > > > the stack trace.
> > > > > > > > > > > 
> > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit
> > > > > > > > > > > network, maybe that's the cause for the panic, because
> > > > > > > > > > > the bcopy (see stack frame #15) fails.
> > > > > > > > > > > 
> > > > > > > > > > > Any clues?
> > > > > > > > > > 
> > > > > > > > > > I tried a similar operation with the nfs mount of
> > > > > > > > > > rsize=32768 and mtu 6144, but the machine runs HEAD and em
> > > > > > > > > > instead of age. I was unable to reproduce the panic on the
> > > > > > > > > > copy of the 5GB file from nfs mount.
> > > > > > > > 
> > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just
> > > > > > > > configuring age0 with
> > > > > > > > 
> > > > > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > > > > > 
> > > > > > > > then I can copy all files from the mounted directory without
> > > > > > > > any problems, too. So it's probably age0 related?
> > > > > > > 
> > > > > > > From your backtrace and the buffer printout, I see somewhat
> > > > > > > strange thing. The buffer data address is 0xffffff8171418000,
> > > > > > > while kernel faulted at the attempt to write at
> > > > > > > 0xffffff8171413000, which is is lower then the buffer data
> > > > > > > pointer, at the attempt to bcopy to the buffer.
> > > > > > > 
> > > > > > > The other data suggests that there were no overflow of the data
> > > > > > > from the server response. So it might be that mbuf_len(mp)
> > > > > > > returned negative number ? I am not sure is it possible at all.
> > > > > > > 
> > > > > > > Try this debugging patch, please. You need to add INVARIANTS etc
> > > > > > > to the kernel config.
> > > > > > > 
> > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > b/sys/fs/nfs/nfs_commonsubs.c index efc0786..9a6bda5 100644
> > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > > > > struct uio *uiop, int siz) }
> > > > > > > 
> > > > > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > > > > >  				len = mbuf_len(mp);
> > > > > > > 
> > > > > > > +				KASSERT(len > 0, ("len %d", len));
> > > > > > > 
> > > > > > >  			}
> > > > > > >  			xfer = (left > len) ? len : left;
> > > > > > >  
> > > > > > >  #ifdef notdef
> > > > > > > 
> > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > > > > struct uio *uiop, int siz) uiop->uio_resid -= xfer;
> > > > > > > 
> > > > > > >  		}
> > > > > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > > > > 
> > > > > > > +			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > > > > +			    uiop->uio_iovcnt));
> > > > > > > 
> > > > > > >  			uiop->uio_iovcnt--;
> > > > > > >  			uiop->uio_iov++;
> > > > > > >  		
> > > > > > >  		} else {
> > > > > > > 
> > > > > > > I thought that server have returned too long response, but it
> > > > > > > seems to be not the case from your data. Still, I think the
> > > > > > > patch below might be due.
> > > > > > > 
> > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644
> > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio
> > > > > > > *uiop, struct ucred *cred, NFSM_DISSECT(tl, u_int32_t *,
> > > > > > > NFSX_UNSIGNED);
> > > > > > > 
> > > > > > >  			eof = fxdr_unsigned(int, *tl);
> > > > > > >  		
> > > > > > >  		}
> > > > > > > 
> > > > > > > -		NFSM_STRSIZ(retlen, rsize);
> > > > > > > +		NFSM_STRSIZ(retlen, len);
> > > > > > > 
> > > > > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > > > > >  		if (error)
> > > > > > >  		
> > > > > > >  			goto nfsmout;
> > > > > > 
> > > > > > I applied your patches and now I get a
> > > > > > 
> > > > > > panic: len -4
> > > > > > cpuid = 1
> > > > > > KDB: enter: panic
> > > > > > Dumping 377 out of 6116
> > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> > > > > 
> > > > > This means that the age driver either produced corrupted mbuf chain,
> > > > > or filled wrong negative value into the mbuf len field. I am quite
> > > > > certain that the issue is in the driver.
> > > > > 
> > > > > I added the net@ to Cc:, hopefully you could get help there.
> > > > 
> > > > And I've cc'd Pyun who has written most of this driver and is likely
> > > > the one most familiar with its handling of jumbo frames.
> > > 
> > > Try attached one and let me know how it goes.
> > > Note, I don't have age(4) anymore so it wasn't tested at all.
> > 
> > Sorry, ignore previous patch and use this one(age.diff2) instead.
> 
> Thanks for the patch! I ignored the first and applied only the second one, but 
> unfortunately that did not change anything. I still get the "panic: len -4" 
> :-(.

Ok, I contacted QAC and got a hint for its descriptor usage and I
realized the controller does not work as I initially expected!
When I wrote age(4) for the controller, the hardware was available
only for a couple of weeks so I may have not enough time to test
it.  Sorry about that.
I'll let you know when experimental patch is available. Due to lack
of hardware, it would take more time than it used to be.

Thanks for reporting!

> 
> Ciao,
> Christian.

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 08:58:25 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id EB380A82;
 Mon, 28 Jan 2013 08:58:24 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from acme.spoerlein.net (acme.spoerlein.net
 [IPv6:2a01:4f8:131:23c2::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 786A9256;
 Mon, 28 Jan 2013 08:58:24 +0000 (UTC)
Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1])
 by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0S8wLP2029200
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO);
 Mon, 28 Jan 2013 09:58:22 +0100 (CET) (envelope-from uqs@FreeBSD.org)
Date: Mon, 28 Jan 2013 09:58:20 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
To: Peter Jeremy <peter@rulingia.com>
Subject: Re: Zpool surgery
Message-ID: <20130128085820.GR35868@acme.spoerlein.net>
Mail-Followup-To: Peter Jeremy <peter@rulingia.com>,
 Steven Hartland <killing@multiplay.co.uk>, current@freebsd.org,
 fs@freebsd.org
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20130127201140.GD29105@server.rulingia.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 08:58:25 -0000

On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.co.uk> wrote:
> >----- Original Message ----- 
> >From: "Ulrich Spörlein" <uqs@FreeBSD.org>
> >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> >> drive, but *not* use dd(1) or any other cloning mechanism, as the pool
> >> was very full very often and is surely severely fragmented.
> >
> >Cant you just drop the disk in the original machine, set it as a mirror
> >then once the mirror process has completed break the mirror and remove
> >the 1TB disk.
> 
> That will replicate any fragmentation as well.  "zfs send | zfs recv"
> is the only (current) way to defragment a ZFS pool.

But are you then also supposed to be able send incremental snapshots to
a third pool from the pool that you just cloned?

I did the zpool replace now over night, and it did not remove the old
device yet, as it found cksum errors on the pool:

root@coyote:~# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 873G in 11h33m with 24 errors on Mon Jan 28 09:45:32 2013
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0    27
          replacing-0  ONLINE       0     0    61
            da0.eli    ONLINE       0     0    61
            ada1.eli   ONLINE       0     0    61

errors: Permanent errors have been detected in the following files:

        tank/src@2013-01-17:/.svn/pristine/8e/8ed35772a38e0fec00bc1cbc2f05480f4fd4759b.svn-base
        tank/src@2013-01-17:/.svn/pristine/4f/4febd82f50bd408f958d4412ceea50cef48fe8f7.svn-base
        tank/src@2013-01-17:/sys/dev/mvs/mvs_soc.c
        tank/src@2013-01-17:/secure/usr.bin/openssl/man/pkcs8.1
        tank/src@2013-01-17:/.svn/pristine/ab/ab1efecf2c0a8f67162b2ed760772337017c5a64.svn-base
        tank/src@2013-01-17:/.svn/pristine/90/907580a473b00f09b01815a52251fbdc3e34e8f6.svn-base
        tank/src@2013-01-17:/sys/dev/agp/agpreg.h
        tank/src@2013-01-17:/sys/dev/isci/scil/scic_sds_remote_node_context.h
        tank/src@2013-01-17:/.svn/pristine/a8/a8dfc65edca368c5d2af3d655859f25150795bc5.svn-base
        tank/src@2013-01-17:/contrib/llvm/utils/TableGen/DAGISelMatcher.cpp
        tank/src@2013-01-17:/contrib/tcpdump/print-babel.c
        tank/src@2013-01-17:/.svn/pristine/30/30ef0f53aa09a5185f55f4ecac842dbc13dab8fd.svn-base
        tank/src@2013-01-17:/.svn/pristine/cb/cb32411a6873621a449b24d9127305b2ee6630e9.svn-base
        tank/src@2013-01-17:/.svn/pristine/03/030d211b1e95f703f9a61201eed63efdbb8e41c0.svn-base
        tank/src@2013-01-17:/.svn/pristine/27/27f1181d33434a72308de165c04202b6159d6ac2.svn-base
        tank/src@2013-01-17:/lib/libpam/modules/pam_exec/pam_exec.c
        tank/src@2013-01-17:/contrib/llvm/include/llvm/PassSupport.h
        tank/src@2013-01-17:/.svn/pristine/90/90f818b5f897f26c7b301c1ac2d0ce0d3eaef28d.svn-base
        tank/src@2013-01-17:/sys/vm/vm_pager.c
        tank/src@2013-01-17:/.svn/pristine/5e/5e9331052e8c2e0fa5fd8c74c4edb04058e3b95f.svn-base
        tank/src@2013-01-17:/.svn/pristine/1d/1d5d6e75cfb77e48e4711ddd10148986392c4fae.svn-base
        tank/src@2013-01-17:/.svn/pristine/c5/c55e964c62ed759089c4bf5e49adf6e49eb59108.svn-base
        tank/src@2013-01-17:/crypto/openssl/crypto/cms/cms_lcl.h
        tank/ncvs@2013-01-17:/ports/textproc/uncrustify/distinfo,v

Interestingly, these only seem to affect the snapshot, and I'm now
wondering if that is the problem why the backup pool did not accept the
next incremental snapshot from the new pool.

How does the receiving pool known that it has the correct snapshot to
store an incremental one anyway? Is there a toplevel checksum, like for
git commits? How can I display and compare that?

Cheers,
Uli

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 11:06:43 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 71370913
 for <freebsd-fs@FreeBSD.org>; Mon, 28 Jan 2013 11:06:43 +0000 (UTC)
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id 62FAECCF
 for <freebsd-fs@FreeBSD.org>; Mon, 28 Jan 2013 11:06:43 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0SB6hYb034536
 for <freebsd-fs@FreeBSD.org>; Mon, 28 Jan 2013 11:06:43 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0SB6hLk034534
 for freebsd-fs@FreeBSD.org; Mon, 28 Jan 2013 11:06:43 GMT
 (envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 28 Jan 2013 11:06:43 GMT
Message-Id: <201301281106.r0SB6hLk034534@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
 owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@freebsd.org>
To: freebsd-fs@FreeBSD.org
Subject: Current problem reports assigned to freebsd-fs@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 11:06:43 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/175179  fs         [zfs] ZFS may attach wrong device on move
o kern/175071  fs         [ufs] [panic] softdep_deallocate_dependencies: unrecov
o kern/174372  fs         [zfs] Pagefault appears to be related to ZFS
o kern/174315  fs         [zfs] chflags uchg not supported
o kern/174310  fs         [zfs] root point mounting broken on CURRENT with multi
o kern/174279  fs         [ufs] UFS2-SU+J journal and filesystem corruption
o kern/174060  fs         [ext2fs] Ext2FS system crashes (buffer overflow?)
o kern/173830  fs         [zfs] Brain-dead simple change to ZFS error descriptio
o kern/173718  fs         [zfs] phantom directory in zraid2 pool
f kern/173657  fs         [nfs] strange UID map with nfsuserd
o kern/173363  fs         [zfs] [panic] Panic on 'zpool replace' on readonly poo
o kern/173136  fs         [unionfs] mounting above the NFS read-only share panic
o kern/172348  fs         [unionfs] umount -f of filesystem in use with readonly
o kern/172334  fs         [unionfs] unionfs permits recursive union mounts; caus
o kern/171626  fs         [tmpfs] tmpfs should be noisier when the requested siz
o kern/171415  fs         [zfs] zfs recv fails with "cannot receive incremental 
o kern/170945  fs         [gpt] disk layout not portable between direct connect 
o bin/170778   fs         [zfs] [panic] FreeBSD panics randomly
o kern/170680  fs         [nfs] Multiple NFS Client bug in the FreeBSD 7.4-RELEA
o kern/170497  fs         [xfs][panic] kernel will panic whenever I ls a mounted
o kern/169945  fs         [zfs] [panic] Kernel panic while importing zpool (afte
o kern/169480  fs         [zfs] ZFS stalls on heavy I/O
o kern/169398  fs         [zfs] Can't remove file with permanent error
o kern/169339  fs         panic while " : > /etc/123"
o kern/169319  fs         [zfs] zfs resilver can't complete
o kern/168947  fs         [nfs] [zfs] .zfs/snapshot directory is messed up when 
o kern/168942  fs         [nfs] [hang] nfsd hangs after being restarted (not -HU
o kern/168158  fs         [zfs] incorrect parsing of sharenfs options in zfs (fs
o kern/167979  fs         [ufs] DIOCGDINFO ioctl does not work on 8.2 file syste
o kern/167977  fs         [smbfs] mount_smbfs results are differ when utf-8 or U
o kern/167688  fs         [fusefs] Incorrect signal handling with direct_io
o kern/167685  fs         [zfs] ZFS on USB drive prevents shutdown / reboot
o kern/167612  fs         [portalfs] The portal file system gets stuck inside po
o kern/167272  fs         [zfs] ZFS Disks reordering causes ZFS to pick the wron
o kern/167260  fs         [msdosfs] msdosfs disk was mounted the second time whe
o kern/167109  fs         [zfs] [panic] zfs diff kernel panic Fatal trap 9: gene
o kern/167105  fs         [nfs] mount_nfs can not handle source exports wiht mor
o kern/167067  fs         [zfs] [panic] ZFS panics the server
o kern/167065  fs         [zfs] boot fails when a spare is the boot disk
o kern/167048  fs         [nfs] [patch] RELEASE-9 crash when using ZFS+NULLFS+NF
o kern/166912  fs         [ufs] [panic] Panic after converting Softupdates to jo
o kern/166851  fs         [zfs] [hang] Copying directory from the mounted UFS di
o kern/166477  fs         [nfs] NFS data corruption.
o kern/165950  fs         [ffs] SU+J and fsck problem
o kern/165923  fs         [nfs] Writing to NFS-backed mmapped files fails if flu
o kern/165521  fs         [zfs] [hang] livelock on 1 Gig of RAM with zfs when 31
o kern/165392  fs         Multiple mkdir/rmdir fails with errno 31
o kern/165087  fs         [unionfs] lock violation in unionfs
o kern/164472  fs         [ufs] fsck -B panics on particular data inconsistency
o kern/164370  fs         [zfs] zfs destroy for snapshot fails on i386 and sparc
o kern/164261  fs         [nullfs] [patch] fix panic with NFS served from NULLFS
o kern/164256  fs         [zfs] device entry for volume is not created after zfs
o kern/164184  fs         [ufs] [panic] Kernel panic with ufs_makeinode
o kern/163801  fs         [md] [request] allow mfsBSD legacy installed in 'swap'
o kern/163770  fs         [zfs] [hang] LOR between zfs&syncer + vnlru leading to
o kern/163501  fs         [nfs] NFS exporting a dir and a subdir in that dir to 
o kern/162944  fs         [coda] Coda file system module looks broken in 9.0
o kern/162860  fs         [zfs] Cannot share ZFS filesystem to hosts with a hyph
o kern/162751  fs         [zfs] [panic] kernel panics during file operations
o kern/162591  fs         [nullfs] cross-filesystem nullfs does not work as expe
o kern/162519  fs         [zfs] "zpool import" relies on buggy realpath() behavi
o kern/162362  fs         [snapshots] [panic] ufs with snapshot(s) panics when g
o kern/161968  fs         [zfs] [hang] renaming snapshot with -r including a zvo
o kern/161864  fs         [ufs] removing journaling from UFS partition fails on 
o bin/161807   fs         [patch] add option for explicitly specifying metadata 
o kern/161579  fs         [smbfs] FreeBSD sometimes panics when an smb share is 
o kern/161533  fs         [zfs] [panic] zfs receive panic: system ioctl returnin
o kern/161438  fs         [zfs] [panic] recursed on non-recursive spa_namespace_
o kern/161424  fs         [nullfs] __getcwd() calls fail when used on nullfs mou
o kern/161280  fs         [zfs] Stack overflow in gptzfsboot
o kern/161205  fs         [nfs] [pfsync] [regression] [build] Bug report freebsd
o kern/161169  fs         [zfs] [panic] ZFS causes kernel panic in dbuf_dirty
o kern/161112  fs         [ufs] [lor] filesystem LOR in FreeBSD 9.0-BETA3
o kern/160893  fs         [zfs] [panic] 9.0-BETA2 kernel panic
o kern/160860  fs         [ufs] Random UFS root filesystem corruption with SU+J 
o kern/160801  fs         [zfs] zfsboot on 8.2-RELEASE fails to boot from root-o
o kern/160790  fs         [fusefs] [panic] VPUTX: negative ref count with FUSE
o kern/160777  fs         [zfs] [hang] RAID-Z3 causes fatal hang upon scrub/impo
o kern/160706  fs         [zfs] zfs bootloader fails when a non-root vdev exists
o kern/160591  fs         [zfs] Fail to boot on zfs root with degraded raidz2 [r
o kern/160410  fs         [smbfs] [hang] smbfs hangs when transferring large fil
o kern/160283  fs         [zfs] [patch] 'zfs list' does abort in make_dataset_ha
o kern/159930  fs         [ufs] [panic] kernel core
o kern/159402  fs         [zfs][loader] symlinks cause I/O errors
o kern/159357  fs         [zfs] ZFS MAXNAMELEN macro has confusing name (off-by-
o kern/159356  fs         [zfs] [patch] ZFS NAME_ERR_DISKLIKE check is Solaris-s
o kern/159351  fs         [nfs] [patch] - divide by zero in mountnfs()
o kern/159251  fs         [zfs] [request]: add FLETCHER4 as DEDUP hash option
o kern/159077  fs         [zfs] Can't cd .. with latest zfs version
o kern/159048  fs         [smbfs] smb mount corrupts large files
o kern/159045  fs         [zfs] [hang] ZFS scrub freezes system
o kern/158839  fs         [zfs] ZFS Bootloader Fails if there is a Dead Disk
o kern/158802  fs         amd(8) ICMP storm and unkillable process.
o kern/158231  fs         [nullfs] panic on unmounting nullfs mounted over ufs o
f kern/157929  fs         [nfs] NFS slow read
o kern/157399  fs         [zfs] trouble with: mdconfig force delete && zfs strip
o kern/157179  fs         [zfs] zfs/dbuf.c: panic: solaris assert: arc_buf_remov
o kern/156797  fs         [zfs] [panic] Double panic with FreeBSD 9-CURRENT and 
o kern/156781  fs         [zfs] zfs is losing the snapshot directory,
p kern/156545  fs         [ufs] mv could break UFS on SMP systems
o kern/156193  fs         [ufs] [hang] UFS snapshot hangs && deadlocks processes
o kern/156039  fs         [nullfs] [unionfs] nullfs + unionfs do not compose, re
o kern/155615  fs         [zfs] zfs v28 broken on sparc64 -current
o kern/155587  fs         [zfs] [panic] kernel panic with zfs
p kern/155411  fs         [regression] [8.2-release] [tmpfs]: mount: tmpfs : No 
o kern/155199  fs         [ext2fs] ext3fs mounted as ext2fs gives I/O errors
o bin/155104   fs         [zfs][patch] use /dev prefix by default when importing
o kern/154930  fs         [zfs] cannot delete/unlink file from full volume -> EN
o kern/154828  fs         [msdosfs] Unable to create directories on external USB
o kern/154491  fs         [smbfs] smb_co_lock: recursive lock for object 1
p kern/154228  fs         [md] md getting stuck in wdrain state
o kern/153996  fs         [zfs] zfs root mount error while kernel is not located
o kern/153753  fs         [zfs] ZFS v15 - grammatical error when attempting to u
o kern/153716  fs         [zfs] zpool scrub time remaining is incorrect
o kern/153695  fs         [patch] [zfs] Booting from zpool created on 4k-sector 
o kern/153680  fs         [xfs] 8.1 failing to mount XFS partitions
o kern/153418  fs         [zfs] [panic] Kernel Panic occurred writing to zfs vol
o kern/153351  fs         [zfs] locking directories/files in ZFS
o bin/153258   fs         [patch][zfs] creating ZVOLs requires `refreservation' 
s kern/153173  fs         [zfs] booting from a gzip-compressed dataset doesn't w
o bin/153142   fs         [zfs] ls -l outputs `ls: ./.zfs: Operation not support
o kern/153126  fs         [zfs] vdev failure, zpool=peegel type=vdev.too_small
o kern/152022  fs         [nfs] nfs service hangs with linux client [regression]
o kern/151942  fs         [zfs] panic during ls(1) zfs snapshot directory
o kern/151905  fs         [zfs] page fault under load in /sbin/zfs
o bin/151713   fs         [patch] Bug in growfs(8) with respect to 32-bit overfl
o kern/151648  fs         [zfs] disk wait bug
o kern/151629  fs         [fs] [patch] Skip empty directory entries during name 
o kern/151330  fs         [zfs] will unshare all zfs filesystem after execute a 
o kern/151326  fs         [nfs] nfs exports fail if netgroups contain duplicate 
o kern/151251  fs         [ufs] Can not create files on filesystem with heavy us
o kern/151226  fs         [zfs] can't delete zfs snapshot
o kern/150503  fs         [zfs] ZFS disks are UNAVAIL and corrupted after reboot
o kern/150501  fs         [zfs] ZFS vdev failure vdev.bad_label on amd64
o kern/150390  fs         [zfs] zfs deadlock when arcmsr reports drive faulted
o kern/150336  fs         [nfs] mountd/nfsd became confused; refused to reload n
o kern/149208  fs         mksnap_ffs(8) hang/deadlock
o kern/149173  fs         [patch] [zfs] make OpenSolaris <sys/nvpair.h> installa
o kern/149015  fs         [zfs] [patch] misc fixes for ZFS code to build on Glib
o kern/149014  fs         [zfs] [patch] declarations in ZFS libraries/utilities 
o kern/149013  fs         [zfs] [patch] make ZFS makefiles use the libraries fro
o kern/148504  fs         [zfs] ZFS' zpool does not allow replacing drives to be
o kern/148490  fs         [zfs]: zpool attach - resilver bidirectionally, and re
o kern/148368  fs         [zfs] ZFS hanging forever on 8.1-PRERELEASE
o kern/148138  fs         [zfs] zfs raidz pool commands freeze
o kern/147903  fs         [zfs] [panic] Kernel panics on faulty zfs device
o kern/147881  fs         [zfs] [patch] ZFS "sharenfs" doesn't allow different "
o kern/147420  fs         [ufs] [panic] ufs_dirbad, nullfs, jail panic (corrupt 
o kern/146941  fs         [zfs] [panic] Kernel Double Fault - Happens constantly
o kern/146786  fs         [zfs] zpool import hangs with checksum errors
o kern/146708  fs         [ufs] [panic] Kernel panic in softdep_disk_write_compl
o kern/146528  fs         [zfs] Severe memory leak in ZFS on i386
o kern/146502  fs         [nfs] FreeBSD 8 NFS Client Connection to Server
s kern/145712  fs         [zfs] cannot offline two drives in a raidz2 configurat
o kern/145411  fs         [xfs] [panic] Kernel panics shortly after mounting an 
f bin/145309   fs         bsdlabel: Editing disk label invalidates the whole dev
o kern/145272  fs         [zfs] [panic] Panic during boot when accessing zfs on 
o kern/145246  fs         [ufs] dirhash in 7.3 gratuitously frees hashes when it
o kern/145238  fs         [zfs] [panic] kernel panic on zpool clear tank
o kern/145229  fs         [zfs] Vast differences in ZFS ARC behavior between 8.0
o kern/145189  fs         [nfs] nfsd performs abysmally under load
o kern/144929  fs         [ufs] [lor] vfs_bio.c + ufs_dirhash.c
p kern/144447  fs         [zfs] sharenfs fsunshare() & fsshare_main() non functi
o kern/144416  fs         [panic] Kernel panic on online filesystem optimization
s kern/144415  fs         [zfs] [panic] kernel panics on boot after zfs crash
o kern/144234  fs         [zfs] Cannot boot machine with recent gptzfsboot code 
o kern/143825  fs         [nfs] [panic] Kernel panic on NFS client
o bin/143572   fs         [zfs] zpool(1): [patch] The verbose output from iostat
o kern/143212  fs         [nfs] NFSv4 client strange work ...
o kern/143184  fs         [zfs] [lor] zfs/bufwait LOR
o kern/142878  fs         [zfs] [vfs] lock order reversal
o kern/142597  fs         [ext2fs] ext2fs does not work on filesystems with real
o kern/142489  fs         [zfs] [lor] allproc/zfs LOR
o kern/142466  fs         Update 7.2 -> 8.0 on Raid 1 ends with screwed raid [re
o kern/142306  fs         [zfs] [panic] ZFS drive (from OSX Leopard) causes two 
o kern/142068  fs         [ufs] BSD labels are got deleted spontaneously
o kern/141897  fs         [msdosfs] [panic] Kernel panic. msdofs: file name leng
o kern/141463  fs         [nfs] [panic] Frequent kernel panics after upgrade fro
o kern/141305  fs         [zfs] FreeBSD ZFS+sendfile severe performance issues (
o kern/141091  fs         [patch] [nullfs] fix panics with DIAGNOSTIC enabled
o kern/141086  fs         [nfs] [panic] panic("nfs: bioread, not dir") on FreeBS
o kern/141010  fs         [zfs] "zfs scrub" fails when backed by files in UFS2
o kern/140888  fs         [zfs] boot fail from zfs root while the pool resilveri
o kern/140661  fs         [zfs] [patch] /boot/loader fails to work on a GPT/ZFS-
o kern/140640  fs         [zfs] snapshot crash
o kern/140068  fs         [smbfs] [patch] smbfs does not allow semicolon in file
o kern/139725  fs         [zfs] zdb(1) dumps core on i386 when examining zpool c
o kern/139715  fs         [zfs] vfs.numvnodes leak on busy zfs
p bin/139651   fs         [nfs] mount(8): read-only remount of NFS volume does n
o kern/139407  fs         [smbfs] [panic] smb mount causes system crash if remot
o kern/138662  fs         [panic] ffs_blkfree: freeing free block
o kern/138421  fs         [ufs] [patch] remove UFS label limitations
o kern/138202  fs         mount_msdosfs(1) see only 2Gb
o kern/136968  fs         [ufs] [lor] ufs/bufwait/ufs (open)
o kern/136945  fs         [ufs] [lor] filedesc structure/ufs (poll)
o kern/136944  fs         [ffs] [lor] bufwait/snaplk (fsync)
o kern/136873  fs         [ntfs] Missing directories/files on NTFS volume
o kern/136865  fs         [nfs] [patch] NFS exports atomic and on-the-fly atomic
p kern/136470  fs         [nfs] Cannot mount / in read-only, over NFS
o kern/135546  fs         [zfs] zfs.ko module doesn't ignore zpool.cache filenam
o kern/135469  fs         [ufs] [panic] kernel crash on md operation in ufs_dirb
o kern/135050  fs         [zfs] ZFS clears/hides disk errors on reboot
o kern/134491  fs         [zfs] Hot spares are rather cold...
o kern/133676  fs         [smbfs] [panic] umount -f'ing a vnode-based memory dis
p kern/133174  fs         [msdosfs] [patch] msdosfs must support multibyte inter
o kern/132960  fs         [ufs] [panic] panic:ffs_blkfree: freeing free frag
o kern/132397  fs         reboot causes filesystem corruption (failure to sync b
o kern/132331  fs         [ufs] [lor] LOR ufs and syncer
o kern/132237  fs         [msdosfs] msdosfs has problems to read MSDOS Floppy
o kern/132145  fs         [panic] File System Hard Crashes
o kern/131441  fs         [unionfs] [nullfs] unionfs and/or nullfs not combineab
o kern/131360  fs         [nfs] poor scaling behavior of the NFS server under lo
o kern/131342  fs         [nfs] mounting/unmounting of disks causes NFS to fail
o bin/131341   fs         makefs: error "Bad file descriptor"  on the mount poin
o kern/130920  fs         [msdosfs] cp(1) takes 100% CPU time while copying file
o kern/130210  fs         [nullfs] Error by check nullfs
o kern/129760  fs         [nfs] after 'umount -f' of a stale NFS share FreeBSD l
o kern/129488  fs         [smbfs] Kernel "bug" when using smbfs in smbfs_smb.c: 
o kern/129231  fs         [ufs] [patch] New UFS mount (norandom) option - mostly
o kern/129152  fs         [panic] non-userfriendly panic when trying to mount(8)
o kern/127787  fs         [lor] [ufs] Three LORs: vfslock/devfs/vfslock, ufs/vfs
o bin/127270   fs         fsck_msdosfs(8) may crash if BytesPerSec is zero
o kern/127029  fs         [panic] mount(8): trying to mount a write protected zi
o kern/126287  fs         [ufs] [panic] Kernel panics while mounting an UFS file
o kern/125895  fs         [ffs] [panic] kernel: panic: ffs_blkfree: freeing free
s kern/125738  fs         [zfs] [request] SHA256 acceleration in ZFS
o kern/123939  fs         [msdosfs] corrupts new files
o kern/122380  fs         [ffs] ffs_valloc:dup alloc (Soekris 4801/7.0/USB Flash
o bin/122172   fs         [fs]: amd(8) automount daemon dies on 6.3-STABLE i386,
o bin/121898   fs         [nullfs] pwd(1)/getcwd(2) fails with Permission denied
o bin/121072   fs         [smbfs] mount_smbfs(8) cannot normally convert the cha
o kern/120483  fs         [ntfs] [patch] NTFS filesystem locking changes
o kern/120482  fs         [ntfs] [patch] Sync style changes between NetBSD and F
o kern/118912  fs         [2tb] disk sizing/geometry problem with large array
o kern/118713  fs         [minidump] [patch] Display media size required for a k
o kern/118318  fs         [nfs] NFS server hangs under special circumstances
o bin/118249   fs         [ufs] mv(1): moving a directory changes its mtime
o kern/118126  fs         [nfs] [patch] Poor NFS server write performance
o kern/118107  fs         [ntfs] [panic] Kernel panic when accessing a file at N
o kern/117954  fs         [ufs] dirhash on very large directories blocks the mac
o bin/117315   fs         [smbfs] mount_smbfs(8) and related options can't mount
o kern/117158  fs         [zfs] zpool scrub causes panic if geli vdevs detach on
o bin/116980   fs         [msdosfs] [patch] mount_msdosfs(8) resets some flags f
o conf/116931  fs         lack of fsck_cd9660 prevents mounting iso images with 
o kern/116583  fs         [ffs] [hang] System freezes for short time when using 
o bin/115361   fs         [zfs] mount(8) gets into a state where it won't set/un
o kern/114955  fs         [cd9660] [patch] [request] support for mask,dirmask,ui
o kern/114847  fs         [ntfs] [patch] [request] dirmask support for NTFS ala 
o kern/114676  fs         [ufs] snapshot creation panics: snapacct_ufs2: bad blo
o bin/114468   fs         [patch] [request] add -d option to umount(8) to detach
o kern/113852  fs         [smbfs] smbfs does not properly implement DFS referral
o bin/113838   fs         [patch] [request] mount(8): add support for relative p
o bin/113049   fs         [patch] [request] make quot(8) use getopt(3) and show 
o kern/112658  fs         [smbfs] [patch] smbfs and caching problems (resolves b
o kern/111843  fs         [msdosfs] Long Names of files are incorrectly created 
o kern/111782  fs         [ufs] dump(8) fails horribly for large filesystems
s bin/111146   fs         [2tb] fsck(8) fails on 6T filesystem
o bin/107829   fs         [2TB] fdisk(8): invalid boundary checking in fdisk / w
o kern/106107  fs         [ufs] left-over fsck_snapshot after unfinished backgro
o kern/104406  fs         [ufs] Processes get stuck in "ufs" state under persist
o kern/104133  fs         [ext2fs] EXT2FS module corrupts EXT2/3 filesystems
o kern/103035  fs         [ntfs] Directories in NTFS mounted disc images appear 
o kern/101324  fs         [smbfs] smbfs sometimes not case sensitive when it's s
o kern/99290   fs         [ntfs] mount_ntfs ignorant of cluster sizes
s bin/97498    fs         [request] newfs(8) has no option to clear the first 12
o kern/97377   fs         [ntfs] [patch] syntax cleanup for ntfs_ihash.c
o kern/95222   fs         [cd9660] File sections on ISO9660 level 3 CDs ignored
o kern/94849   fs         [ufs] rename on UFS filesystem is not atomic
o bin/94810    fs         fsck(8) incorrectly reports 'file system marked clean'
o kern/94769   fs         [ufs] Multiple file deletions on multi-snapshotted fil
o kern/94733   fs         [smbfs] smbfs may cause double unlock
o kern/93942   fs         [vfs] [patch] panic: ufs_dirbad: bad dir (patch from D
o kern/92272   fs         [ffs] [hang] Filling a filesystem while creating a sna
o kern/91134   fs         [smbfs] [patch] Preserve access and modification time 
a kern/90815   fs         [smbfs] [patch] SMBFS with character conversions somet
o kern/88657   fs         [smbfs] windows client hang when browsing a samba shar
o kern/88555   fs         [panic] ffs_blkfree: freeing free frag on AMD 64
o bin/87966    fs         [patch] newfs(8): introduce -A flag for newfs to enabl
o kern/87859   fs         [smbfs] System reboot while umount smbfs.
o kern/86587   fs         [msdosfs] rm -r /PATH fails with lots of small files
o bin/85494    fs         fsck_ffs: unchecked use of cg_inosused macro etc.
o kern/80088   fs         [smbfs] Incorrect file time setting on NTFS mounted vi
o bin/74779    fs         Background-fsck checks one filesystem twice and omits 
o kern/73484   fs         [ntfs] Kernel panic when doing `ls` from the client si
o bin/73019    fs         [ufs] fsck_ufs(8) cannot alloc 607016868 bytes for ino
o kern/71774   fs         [ntfs] NTFS cannot "see" files on a WinXP filesystem
o bin/70600    fs         fsck(8) throws files away when it can't grow lost+foun
o kern/68978   fs         [panic] [ufs] crashes with failing hard disk, loose po
o kern/65920   fs         [nwfs] Mounted Netware filesystem behaves strange
o kern/65901   fs         [smbfs] [patch] smbfs fails fsx write/truncate-down/tr
o kern/61503   fs         [smbfs] mount_smbfs does not work as non-root
o kern/55617   fs         [smbfs] Accessing an nsmb-mounted drive via a smb expo
o kern/51685   fs         [hang] Unbounded inode allocation causes kernel to loc
o kern/36566   fs         [smbfs] System reboot with dead smb mount and umount
o bin/27687    fs         fsck(8) wrapper is not properly passing options to fsc
o kern/18874   fs         [2TB] 32bit NFS servers export wrong negative values t

296 problems total.


From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 12:00:10 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2DB5D23F;
 Mon, 28 Jan 2013 12:00:10 +0000 (UTC)
 (envelope-from laurencesgill@googlemail.com)
Received: from mail-we0-x22b.google.com (we-in-x022b.1e100.net
 [IPv6:2a00:1450:400c:c03::22b])
 by mx1.freebsd.org (Postfix) with ESMTP id 987C3292;
 Mon, 28 Jan 2013 12:00:09 +0000 (UTC)
Received: by mail-we0-f171.google.com with SMTP id u54so1398103wey.2
 for <multiple recipients>; Mon, 28 Jan 2013 04:00:08 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=googlemail.com; s=20120113;
 h=x-received:date:from:to:cc:subject:message-id:in-reply-to
 :references:x-mailer:mime-version:content-type
 :content-transfer-encoding;
 bh=DWoix26UxXcqQk/DIlK7sgYcybDwgf+SEv1R/pUwWB4=;
 b=b3/gXcZEXPWAOOL5m1X2Ah1aYdOolv5QYWyEBfvYvLYtgflsH4rcm4IklTF3tIryrR
 S1v1iMpvZZ8jz/D7SQ4A5Om9cXTqs3zKbJmW7bL/LNAqcNS5RTiUFE/vvyI+MxtOfOFL
 Pbujq9GshVUwudwMCNr+FAdP3qx2pnmqsKMEdS9x90sCIkSq9DP7OZwK4L73+/JeENsr
 oOve6nKyoSW0NzJwuDc7ONC6t9fooMakEk+9Shk6QLVaXD4Nn6R/QMl3I7FKlxR+65To
 UL26CX4/tkMX/cfaLuEfbrc4d05V/Bg6djTIecZGRvhLMgDGVZ/SieFgzzd5n5gAhtPw
 5PeA==
X-Received: by 10.180.80.170 with SMTP id s10mr9158900wix.27.1359374408538;
 Mon, 28 Jan 2013 04:00:08 -0800 (PST)
Received: from localhost (gateway.ash.thebunker.net. [213.129.64.4])
 by mx.google.com with ESMTPS id ge2sm5859532wib.4.2013.01.28.04.00.08
 (version=TLSv1.2 cipher=RC4-SHA bits=128/128);
 Mon, 28 Jan 2013 04:00:08 -0800 (PST)
Date: Mon, 28 Jan 2013 12:00:55 +0000
From: Laurence Gill <laurencesgill@googlemail.com>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: HAST performance overheads?
Message-ID: <20130128120055.6ca7c734@googlemail.com>
In-Reply-To: <20130127134845.GC1346@garage.freebsd.pl>
References: <20130125121044.1afac72e@googlemail.com>
 <20130127134845.GC1346@garage.freebsd.pl>
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.12; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: base64
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 12:00:10 -0000

LS0tLS1CRUdJTiBQR1AgU0lHTkVEIE1FU1NBR0UtLS0tLQ0KSGFzaDogU0hBMQ0KDQpPbiBTdW4s
IDI3IEphbiAyMDEzIDE0OjQ4OjQ2ICswMTAwDQpQYXdlbCBKYWt1YiBEYXdpZGVrIDxwamRARnJl
ZUJTRC5vcmc+IHdyb3RlOg0KDQo+IE9uIEZyaSwgSmFuIDI1LCAyMDEzIGF0IDEyOjEwOjQ0UE0g
KzAwMDAsIExhdXJlbmNlIEdpbGwgd3JvdGU6DQo+ID4gSWYgSSBjcmVhdGUgWkZTIHJhaWR6MiBv
biB0aGVzZS4uLg0KPiA+IA0KPiA+ICAtICMgenBvb2wgY3JlYXRlIHBvb2wgcmFpZHoyIGRhMCBk
YTEgZGEyIGRhMyBkYTQgZGE1DQo+ID4gDQo+ID4gVGhlbiBydW4gYSBkZCB0ZXN0LCBhIHNhbXBs
ZSBvdXRwdXQgaXMuLi4NCj4gPiANCj4gPiAgLSAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRh
dCBicz0xTSBjb3VudD0xMDI0DQo+ID4gICAgICAxMDczNzQxODI0IGJ5dGVzIHRyYW5zZmVycmVk
IGluIDcuNjg5NjM0IHNlY3MgKDEzOTYzNDk3NA0KPiA+IGJ5dGVzL3NlYykNCj4gPiANCj4gPiAg
LSAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRhdCBicz0xNmsgY291bnQ9NjU1MzUNCj4gPiAg
ICAgIDEwNzM3MjU0NDAgYnl0ZXMgdHJhbnNmZXJyZWQgaW4gMS45MDkxNTcgc2VjcyAoNTYyNDA4
MTMwDQo+ID4gYnl0ZXMvc2VjKQ0KPiA+IA0KPiA+IFRoaXMgaXMgbXVjaCBmYXN0ZXIgdGhhbiBj
b21wYXJlZCB0byBydW5uaW5nIGhhc3QsIEkgd291bGQgZXhwZWN0IGFuDQo+ID4gb3ZlcmhlYWQs
IGJ1dCBub3QgdGhpcyBtdWNoLiAgRm9yIGV4YW1wbGU6DQo+ID4gDQo+ID4gIC0gIyBoYXN0Y3Rs
IGNyZWF0ZSBkaXNrMC9kaXNrMS9kaXNrMi9kaXNrMy9kaXNrNC9kaXNrNQ0KPiA+ICAtICMgaGFz
dGN0bCByb2xlIHByaW1hcnkgYWxsDQo+ID4gIC0gIyB6cG9vbCBjcmVhdGUgcG9vbCByYWlkejIg
ZGlzazAgZGlzazEgZGlzazIgZGlzazMgZGlzazQgZGlzazUNCj4gPiANCj4gPiBSdW4gYSBkZCB0
ZXN0LCBhbmQgdGhlIHNwZWVkIGlzLi4uDQo+ID4gDQo+ID4gIC0gIyBkZCBpZj0vZGV2L3plcm8g
b2Y9dGVzdC5kYXQgYnM9MU0gY291bnQ9MTAyNA0KPiA+ICAgICAgMTA3Mzc0MTgyNCBieXRlcyB0
cmFuc2ZlcnJlZCBpbiA0MC45MDgxNTMgc2VjcyAoMjYyNDc2MjQNCj4gPiBieXRlcy9zZWMpDQo+
ID4gDQo+ID4gIC0gIyBkZCBpZj0vZGV2L3plcm8gb2Y9dGVzdC5kYXQgYnM9MTZrIGNvdW50PTY1
NTM1DQo+ID4gICAgICAxMDczNzI1NDQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDQyLjAxNzk5NyBz
ZWNzICgyNTU1Mzk0Mg0KPiA+IGJ5dGVzL3NlYykNCj4gDQo+IExldCdzIHRyeSB0byB0ZXN0IG9u
ZSBzdGVwIGF0IGEgdGltZS4gQ2FuIHlvdSB0cnkgdG8gY29tcGFyZQ0KPiBzZXF1ZW50aWFsIHBl
cmZvcm1hbmNlIG9mIHJlZ3VsYXIgZGlzayB2cy4gSEFTVCB3aXRoIG5vIHNlY29uZGFyeQ0KPiBj
b25maWd1cmVkPw0KPiANCj4gQnkgbm8gc2Vjb25kYXJ5IGNvbmZpZ3VyZWQgSSBtZWFuICdyZW1v
dGUnIHNldCB0byAnbm9uZScuDQo+IA0KPiBKdXN0IGRvOg0KPiANCj4gCSMgZGQgaWY9L2Rldi96
ZXJvIG9mPS9kZXYvZGEwIGJzPTFtIGNvdW50PTEwMjQwDQo+IA0KPiB0aGVuIGNvbmZpZ3VyZSBI
QVNUIGFuZDoNCj4gDQo+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2hhc3QvZGlzazAgYnM9
MW0gY291bnQ9MTAyNDANCj4gDQo+IFdoaWNoIEZyZWVCU0QgdmVyc2lvbiBpcyBpdD8NCj4gDQo+
IFBTLiBZb3VyIFpGUyB0ZXN0cyBhcmUgcHJldHR5IG1lYW5pbmdsZXNzLCBiZWNhdXNlIGl0IGlz
IHBvc3NpYmxlIHRoYXQNCj4gICAgIGV2ZXJ5dGhpbmcgd2lsbCBlbmQgdXAgaW4gbWVtb3J5LiBJ
J20gc3VyZSB0aGlzIGlzIHdoYXQgaGFwcGVucyBpbg0KPiAgICAgJ2JzPTE2ayBjb3VudD02NTUz
NScgY2FzZS4gTGV0IHRyeSByYXcgcHJvdmlkZXJzIGZpcnN0Lg0KPiANCg0KVGhhbmtzIGZvciB0
aGUgcmVwbHkuICBJJ20gdXNpbmcgRnJlZUJTRCA5LjEtUkVMRUFTRS4gSGVyZSBhcmUgdGhlDQpy
ZXN1bHRzOg0KDQogIyBkZCBpZj0vZGV2L3plcm8gb2Y9L2Rldi9kYTAgYnM9MW0gY291bnQ9MTAy
NDANCiAxMDczNzQxODI0MCBieXRlcyB0cmFuc2ZlcnJlZCBpbiA3NTUuMTQ0NjQ0IHNlY3MgKDE0
MjE5MDIyIGJ5dGVzL3NlYykNCg0KICMgZGQgaWY9L2Rldi96ZXJvIG9mPS9kZXYvaGFzdC9kaXNr
MCBicz0xbSBjb3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDg0
NC4xNjc2MDIgc2VjcyAoMTI3MTk1MzQgYnl0ZXMvc2VjKQ0KDQoNCldoaWNoIGluZGljYXRlcyBh
IHZlcnkgc21hbGwgb3ZlcmhlYWQsIGhtbW0uLi4NCg0KDQotIC0tIA0KTGF1cmVuY2UgR2lsbA0K
DQpmOiAwODcyMSAxNTcgNjY1DQpza3lwZTogbGF1cmVuY2VnZw0KZTogbGF1cmVuY2VzZ2lsbEBn
b29nbGVtYWlsLmNvbQ0KUEdQIG9uIEtleSBTZXJ2ZXJzDQotLS0tLUJFR0lOIFBHUCBTSUdOQVRV
UkUtLS0tLQ0KVmVyc2lvbjogR251UEcgdjIuMC4xOSAoR05VL0xpbnV4KQ0KDQppRVlFQVJFQ0FB
WUZBbEVHYUg0QUNna1F5Z1Z0OFNxMFBmOFFhUUNmWDQvU0FHbndZWGZDeEorRkZuRTFPaVJ2DQpS
M01BbjIyYnhqaFhuQ081QXFzeDc0R3hxNVplbVVqWA0KPTdkZ1INCi0tLS0tRU5EIFBHUCBTSUdO
QVRVUkUtLS0tLQ0K

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 15:44:56 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id E83479B7
 for <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 15:44:56 +0000 (UTC)
 (envelope-from c47g@gmx.at)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.18])
 by mx1.freebsd.org (Postfix) with ESMTP id 96A85680
 for <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 15:44:56 +0000 (UTC)
Received: from mailout-de.gmx.net ([10.1.76.12]) by mrigmx.server.lan
 (mrigmx002) with ESMTP (Nemesis) id 0MTdbK-1UQLx92daS-00QQGP for
 <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 16:44:55 +0100
Received: (qmail invoked by alias); 28 Jan 2013 15:44:55 -0000
Received: from cm56-168-232.liwest.at (EHLO bones.gusis.at) [86.56.168.232]
 by mail.gmx.net (mp012) with SMTP; 28 Jan 2013 16:44:55 +0100
X-Authenticated: #9978462
X-Provags-ID: V01U2FsdGVkX1+RNOEagoaR8Aj+CRm8gJ6VegtnFdM0y+SAncGTcB
 2AVQOib+LqwOIQ
From: Christian Gusenbauer <c47g@gmx.at>
To: pyunyh@gmail.com
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Date: Mon, 28 Jan 2013 16:46:43 +0100
User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; )
References: <201301241805.57623.c47g@gmx.at> <201301251809.50929.c47g@gmx.at>
 <20130128063531.GC1447@michelle.cdnetworks.com>
In-Reply-To: <20130128063531.GC1447@michelle.cdnetworks.com>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201301281646.43551.c47g@gmx.at>
X-Y-GMX-Trusted: 0
Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 15:44:57 -0000

On Monday 28 January 2013 07:35:31 YongHyeon PYUN wrote:
> On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
> > On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
> > > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
> > > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
> > > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer 
wrote:
> > > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian
> > > > > > > > Gusenbauer
> > 
> > wrote:
> > > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov 
wrote:
> > > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin
> > > > > > > > > > Belousov
> > 
> > wrote:
> > > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian
> > 
> > Gusenbauer wrote:
> > > > > > > > > > > > Hi!
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get
> > > > > > > > > > > > the panic below if I execute the following commands
> > > > > > > > > > > > (as single user):
> > > > > > > > > > > > 
> > > > > > > > > > > > # swapon -a
> > > > > > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > > > > > # mount -u /
> > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > > > > > > 
> > > > > > > > > > > > then the system panics almost immediately. I'll
> > > > > > > > > > > > attach the stack trace.
> > > > > > > > > > > > 
> > > > > > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a
> > > > > > > > > > > > 1Gbit network, maybe that's the cause for the panic,
> > > > > > > > > > > > because the bcopy (see stack frame #15) fails.
> > > > > > > > > > > > 
> > > > > > > > > > > > Any clues?
> > > > > > > > > > > 
> > > > > > > > > > > I tried a similar operation with the nfs mount of
> > > > > > > > > > > rsize=32768 and mtu 6144, but the machine runs HEAD and
> > > > > > > > > > > em instead of age. I was unable to reproduce the panic
> > > > > > > > > > > on the copy of the 5GB file from nfs mount.
> > > > > > > > > 
> > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, so
> > > > > > > > > just configuring age0 with
> > > > > > > > > 
> > > > > > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > > > > > > 
> > > > > > > > > then I can copy all files from the mounted directory
> > > > > > > > > without any problems, too. So it's probably age0 related?
> > > > > > > > 
> > > > > > > > From your backtrace and the buffer printout, I see somewhat
> > > > > > > > strange thing. The buffer data address is 0xffffff8171418000,
> > > > > > > > while kernel faulted at the attempt to write at
> > > > > > > > 0xffffff8171413000, which is is lower then the buffer data
> > > > > > > > pointer, at the attempt to bcopy to the buffer.
> > > > > > > > 
> > > > > > > > The other data suggests that there were no overflow of the
> > > > > > > > data from the server response. So it might be that
> > > > > > > > mbuf_len(mp) returned negative number ? I am not sure is it
> > > > > > > > possible at all.
> > > > > > > > 
> > > > > > > > Try this debugging patch, please. You need to add INVARIANTS
> > > > > > > > etc to the kernel config.
> > > > > > > > 
> > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > b/sys/fs/nfs/nfs_commonsubs.c index efc0786..9a6bda5 100644
> > > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > > > > > struct uio *uiop, int siz) }
> > > > > > > > 
> > > > > > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > > > > > >  				len = mbuf_len(mp);
> > > > > > > > 
> > > > > > > > +				KASSERT(len > 0, ("len %d", len));
> > > > > > > > 
> > > > > > > >  			}
> > > > > > > >  			xfer = (left > len) ? len : left;
> > > > > > > >  
> > > > > > > >  #ifdef notdef
> > > > > > > > 
> > > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > > > > > struct uio *uiop, int siz) uiop->uio_resid -= xfer;
> > > > > > > > 
> > > > > > > >  		}
> > > > > > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > > > > > 
> > > > > > > > +			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > > > > > +			    uiop->uio_iovcnt));
> > > > > > > > 
> > > > > > > >  			uiop->uio_iovcnt--;
> > > > > > > >  			uiop->uio_iov++;
> > > > > > > >  		
> > > > > > > >  		} else {
> > > > > > > > 
> > > > > > > > I thought that server have returned too long response, but it
> > > > > > > > seems to be not the case from your data. Still, I think the
> > > > > > > > patch below might be due.
> > > > > > > > 
> > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907
> > > > > > > > 100644 --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio
> > > > > > > > *uiop, struct ucred *cred, NFSM_DISSECT(tl, u_int32_t *,
> > > > > > > > NFSX_UNSIGNED);
> > > > > > > > 
> > > > > > > >  			eof = fxdr_unsigned(int, *tl);
> > > > > > > >  		
> > > > > > > >  		}
> > > > > > > > 
> > > > > > > > -		NFSM_STRSIZ(retlen, rsize);
> > > > > > > > +		NFSM_STRSIZ(retlen, len);
> > > > > > > > 
> > > > > > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > > > > > >  		if (error)
> > > > > > > >  		
> > > > > > > >  			goto nfsmout;
> > > > > > > 
> > > > > > > I applied your patches and now I get a
> > > > > > > 
> > > > > > > panic: len -4
> > > > > > > cpuid = 1
> > > > > > > KDB: enter: panic
> > > > > > > Dumping 377 out of 6116
> > > > > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> > > > > > 
> > > > > > This means that the age driver either produced corrupted mbuf
> > > > > > chain, or filled wrong negative value into the mbuf len field. I
> > > > > > am quite certain that the issue is in the driver.
> > > > > > 
> > > > > > I added the net@ to Cc:, hopefully you could get help there.
> > > > > 
> > > > > And I've cc'd Pyun who has written most of this driver and is
> > > > > likely the one most familiar with its handling of jumbo frames.
> > > > 
> > > > Try attached one and let me know how it goes.
> > > > Note, I don't have age(4) anymore so it wasn't tested at all.
> > > 
> > > Sorry, ignore previous patch and use this one(age.diff2) instead.
> > 
> > Thanks for the patch! I ignored the first and applied only the second
> > one, but unfortunately that did not change anything. I still get the
> > "panic: len -4"
> > 
> > :-(.
> 
> Ok, I contacted QAC and got a hint for its descriptor usage and I
> realized the controller does not work as I initially expected!
> When I wrote age(4) for the controller, the hardware was available
> only for a couple of weeks so I may have not enough time to test
> it.  Sorry about that.
> I'll let you know when experimental patch is available. Due to lack
> of hardware, it would take more time than it used to be.
> 
> Thanks for reporting!

Thanks for investing your time! I'm looking forward to test your next 
patch(es) :-)!

Ciao,
Christian.

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 16:11:48 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 78D5CD06;
 Mon, 28 Jan 2013 16:11:48 +0000 (UTC)
 (envelope-from laurencesgill@googlemail.com)
Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com
 [209.85.212.179])
 by mx1.freebsd.org (Postfix) with ESMTP id BFD5C8D9;
 Mon, 28 Jan 2013 16:11:47 +0000 (UTC)
Received: by mail-wi0-f179.google.com with SMTP id o1so1587895wic.12
 for <multiple recipients>; Mon, 28 Jan 2013 08:11:41 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=googlemail.com; s=20120113;
 h=x-received:date:from:to:cc:subject:message-id:in-reply-to
 :references:x-mailer:mime-version:content-type
 :content-transfer-encoding;
 bh=rg9H/7I/0w+W7kHM8c5+sIRDAUeRSogB+FSwAWsitUs=;
 b=1KWjJpJXQGe93+JPOWBoEPmjYV5OF5ctAz1j8cfrcJbE4vrH5xJp/3K39rGOQHRVyJ
 /oMBZC8u8fDQexWst0lvfLibcrNI6lkUhlaFnyTiShfXPgDgOEUMFn9WW7mX2cekP0mU
 RifRbRoNI1JhSfvEQKJnPn0T9yog1ph3MxbVRBkrFyjt3hfpWbE2RUs7+9tf0p0EzFE5
 HhdskdjoSkZf6RZKVYlIx5i/U6xz+tj8Wc4tFayu+8/lo5LlJNfV6OTbSZ/jpXUcmEgy
 cTO68Ok8J8GWt8T192p61FjC+xJjrbXxr5KQQQt6vQ56YY2dmywUfWM5diKNmQi26zem
 gC3Q==
X-Received: by 10.180.81.39 with SMTP id w7mr10873810wix.15.1359389501290;
 Mon, 28 Jan 2013 08:11:41 -0800 (PST)
Received: from localhost (gateway.ash.thebunker.net. [213.129.64.4])
 by mx.google.com with ESMTPS id bd7sm14112933wib.8.2013.01.28.08.11.40
 (version=TLSv1.2 cipher=RC4-SHA bits=128/128);
 Mon, 28 Jan 2013 08:11:41 -0800 (PST)
Date: Mon, 28 Jan 2013 16:12:28 +0000
From: Laurence Gill <laurencesgill@googlemail.com>
To: freebsd-fs@freebsd.org
Subject: Re: HAST performance overheads?
Message-ID: <20130128161228.477ce174@googlemail.com>
In-Reply-To: <20130128120055.6ca7c734@googlemail.com>
References: <20130125121044.1afac72e@googlemail.com>
 <20130127134845.GC1346@garage.freebsd.pl>
 <20130128120055.6ca7c734@googlemail.com>
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.12; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: base64
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 16:11:48 -0000

LS0tLS1CRUdJTiBQR1AgU0lHTkVEIE1FU1NBR0UtLS0tLQ0KSGFzaDogU0hBMQ0KDQpPbiBNb24s
IDI4IEphbiAyMDEzIDEyOjAwOjU1ICswMDAwDQpMYXVyZW5jZSBHaWxsIDxsYXVyZW5jZXNnaWxs
QGdvb2dsZW1haWwuY29tPiB3cm90ZToNCj4gT24gU3VuLCAyNyBKYW4gMjAxMyAxNDo0ODo0NiAr
MDEwMA0KPiBQYXdlbCBKYWt1YiBEYXdpZGVrIDxwamRARnJlZUJTRC5vcmc+IHdyb3RlOg0KPiA+
IA0KPiA+IExldCdzIHRyeSB0byB0ZXN0IG9uZSBzdGVwIGF0IGEgdGltZS4gQ2FuIHlvdSB0cnkg
dG8gY29tcGFyZQ0KPiA+IHNlcXVlbnRpYWwgcGVyZm9ybWFuY2Ugb2YgcmVndWxhciBkaXNrIHZz
LiBIQVNUIHdpdGggbm8gc2Vjb25kYXJ5DQo+ID4gY29uZmlndXJlZD8NCj4gPiANCj4gPiBCeSBu
byBzZWNvbmRhcnkgY29uZmlndXJlZCBJIG1lYW4gJ3JlbW90ZScgc2V0IHRvICdub25lJy4NCj4g
PiANCj4gPiBKdXN0IGRvOg0KPiA+IA0KPiA+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2Rh
MCBicz0xbSBjb3VudD0xMDI0MA0KPiA+IA0KPiA+IHRoZW4gY29uZmlndXJlIEhBU1QgYW5kOg0K
PiA+IA0KPiA+IAkjIGRkIGlmPS9kZXYvemVybyBvZj0vZGV2L2hhc3QvZGlzazAgYnM9MW0gY291
bnQ9MTAyNDANCj4gPiANCj4gPiBXaGljaCBGcmVlQlNEIHZlcnNpb24gaXMgaXQ/DQo+ID4gDQo+
ID4gUFMuIFlvdXIgWkZTIHRlc3RzIGFyZSBwcmV0dHkgbWVhbmluZ2xlc3MsIGJlY2F1c2UgaXQg
aXMgcG9zc2libGUNCj4gPiB0aGF0IGV2ZXJ5dGhpbmcgd2lsbCBlbmQgdXAgaW4gbWVtb3J5LiBJ
J20gc3VyZSB0aGlzIGlzIHdoYXQNCj4gPiBoYXBwZW5zIGluICdicz0xNmsgY291bnQ9NjU1MzUn
IGNhc2UuIExldCB0cnkgcmF3IHByb3ZpZGVycyBmaXJzdC4NCj4gPiANCj4gDQo+IFRoYW5rcyBm
b3IgdGhlIHJlcGx5LiAgSSdtIHVzaW5nIEZyZWVCU0QgOS4xLVJFTEVBU0UuIEhlcmUgYXJlIHRo
ZQ0KPiByZXN1bHRzOg0KPiANCj4gICMgZGQgaWY9L2Rldi96ZXJvIG9mPS9kZXYvZGEwIGJzPTFt
IGNvdW50PTEwMjQwDQo+ICAxMDczNzQxODI0MCBieXRlcyB0cmFuc2ZlcnJlZCBpbiA3NTUuMTQ0
NjQ0IHNlY3MgKDE0MjE5MDIyIGJ5dGVzL3NlYykNCj4gDQo+ICAjIGRkIGlmPS9kZXYvemVybyBv
Zj0vZGV2L2hhc3QvZGlzazAgYnM9MW0gY291bnQ9MTAyNDANCj4gIDEwNzM3NDE4MjQwIGJ5dGVz
IHRyYW5zZmVycmVkIGluIDg0NC4xNjc2MDIgc2VjcyAoMTI3MTk1MzQgYnl0ZXMvc2VjKQ0KPiAN
Cj4gDQo+IFdoaWNoIGluZGljYXRlcyBhIHZlcnkgc21hbGwgb3ZlcmhlYWQsIGhtbW0uLi4NCj4g
DQoNCkZ1cnRoZXIgdG8gdGhpcywgc3RpY2tpbmcgd2l0aCB0aGUgMSBkaXNrIGZvciB0ZXN0aW5n
LCBJIHNlZSB0aGUNCmZvbGxvd2luZzoNCg0KIC0gVUZTIG9uIGRhMA0KICMgZGQgaWY9L2Rldi96
ZXJvIG9mPXRlc3QuZGF0IGJzPTFtIGNvdW50PTEwMjQwDQogMTA3Mzc0MTgyNDAgYnl0ZXMgdHJh
bnNmZXJyZWQgaW4gNzYuMTEyODczIHNlY3MgKDE0MTA3MjMwMiBieXRlcy9zZWMpDQoNCiAtIFVG
UyBvbiBoYXN0L2Rpc2swDQogIyAgZGQgaWY9L2Rldi96ZXJvIG9mPXRlc3QuZGF0ICBicz0xbSBj
b3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDg1NS43MjA5ODUg
c2VjcyAoMTI1NDc4MDMgYnl0ZXMvc2VjKQ0KDQpXaGljaCBpcyByb3VnaGx5IHRoZSBzYW1lIGFz
IHVzaW5nIHRoZSByYXcgaGFzdCBwcm92aWRlci4NCg0KDQogLSB6ZnMgb24gZGEwDQogIyBkZCBp
Zj0vZGV2L3plcm8gb2Y9dGVzdC5kYXQgYnM9MW0gY291bnQ9MTAyNDANCiAxMDczNzQxODI0MCBi
eXRlcyB0cmFuc2ZlcnJlZCBpbiAxMTQuMzM4OTAwIHNlY3MgKDkzOTA4NzA3IGJ5dGVzL3NlYykN
Cg0KIC0gemZzIG9uIGhhc3QvZGlzazANCiAjIGRkIGlmPS9kZXYvemVybyBvZj10ZXN0LmRhdCBi
cz0xbSBjb3VudD0xMDI0MA0KIDEwNzM3NDE4MjQwIGJ5dGVzIHRyYW5zZmVycmVkIGluIDEyODcu
MDg4NDE2IHNlY3MgKDgzNDI0MDkgYnl0ZXMvc2VjKQ0KDQpXaGljaCBzZWVtcyBzbG93ZXIgdGhh
biB0aGUgcmF3IHByb3ZpZGVyIGJ5IGFwcHJveCA0TUIvcy4NCg0KU28gSSdtIHN0aWxsIHRyeWlu
ZyB0byB3b3JrIG91dCB3aHkgdGhlIGV4dHJhICJkcm9wIiB3aGVuIHVzaW5nIFpGUyBvbg0KaGFz
dC4uLg0KDQoNCg0KLSAtLSANCkxhdXJlbmNlIEdpbGwNCg0KZjogMDg3MjEgMTU3IDY2NQ0Kc2t5
cGU6IGxhdXJlbmNlZ2cNCmU6IGxhdXJlbmNlc2dpbGxAZ29vZ2xlbWFpbC5jb20NClBHUCBvbiBL
ZXkgU2VydmVycw0KLS0tLS1CRUdJTiBQR1AgU0lHTkFUVVJFLS0tLS0NClZlcnNpb246IEdudVBH
IHYyLjAuMTkgKEdOVS9MaW51eCkNCg0KaUVZRUFSRUNBQVlGQWxFR28zUUFDZ2tReWdWdDhTcTBQ
ZjhLM1FDZlZBK25vZklnUkhNL2dZaUF6aXM2VEY1Kw0KVnZZQW4ya0VPVnRHeVNSMGVadGVnR3J2
VWFwNUJWaHgNCj05ZkN2DQotLS0tLUVORCBQR1AgU0lHTkFUVVJFLS0tLS0NCg==

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 18:06:12 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 0245A7CD;
 Mon, 28 Jan 2013 18:06:12 +0000 (UTC)
 (envelope-from hag@linnaean.org)
Received: from perdition.linnaean.org (perdition.linnaean.org
 [IPv6:2001:470:8917:1::1])
 by mx1.freebsd.org (Postfix) with ESMTP id CF5B6FA3;
 Mon, 28 Jan 2013 18:06:11 +0000 (UTC)
Received: by perdition.linnaean.org (Postfix, from userid 31013)
 id EC36E884; Mon, 28 Jan 2013 13:06:10 -0500 (EST)
From: Daniel Hagerty <hag@linnaean.org>
To: Ulrich =?utf-8?Q?Sp=C3=B6rlein?= <uqs@FreeBSD.org>
Subject: Re: Zpool surgery
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
 <20130128085820.GR35868@acme.spoerlein.net>
Sender: Daniel Hagerty <hag@perdition.linnaean.org>
Date: Mon, 28 Jan 2013 13:06:10 -0500
In-Reply-To: <20130128085820.GR35868@acme.spoerlein.net> (Ulrich Sp's message
 of "Mon, 28 Jan 2013 09:58:20 +0100")
Message-ID: <c2ilibd41yl.fsf@perdition.linnaean.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: Daniel Hagerty <hag@linnaean.org>
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 18:06:12 -0000

Ulrich Sp=C3=B6rlein <uqs@FreeBSD.org> writes:

> But are you then also supposed to be able send incremental snapshots to
> a third pool from the pool that you just cloned?

    I can't speak to your problems, but I did recently do what you seem
to be doing, without incident.  That is, I had a pool and an archive.  I
copied datasets from pool to a new pool', and pool' could send to the
archive as if it were the original pool.

    Two possible differences in what I do that leap to mind:

1. I only send select snapshots to archive; the synchronization
snapshots are not among them.
2. I use receive -F.

> How does the receiving pool known that it has the correct snapshot to
> store an incremental one anyway? Is there a toplevel checksum, like for
> git commits? How can I display and compare that?

    I don't know for sure, but I'd hazard a guess that:

$ zfs get -p guid pool/home@daily-2013-01-28
NAME                        PROPERTY  VALUE  SOURCE
pool/home@daily-2013-01-28  guid      259258190084829958  -

    plays a part.

    Good luck!

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 20:04:51 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0F7E7138;
 Mon, 28 Jan 2013 20:04:51 +0000 (UTC)
 (envelope-from freebsd-listen@fabiankeil.de)
Received: from smtprelay03.ispgateway.de (smtprelay03.ispgateway.de
 [80.67.29.28]) by mx1.freebsd.org (Postfix) with ESMTP id 7D2D584A;
 Mon, 28 Jan 2013 20:04:50 +0000 (UTC)
Received: from [78.35.168.72] (helo=fabiankeil.de)
 by smtprelay03.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128)
 (Exim 4.68) (envelope-from <freebsd-listen@fabiankeil.de>)
 id 1TzuwH-0001SW-MR; Mon, 28 Jan 2013 21:04:21 +0100
Date: Mon, 28 Jan 2013 20:58:02 +0100
From: Fabian Keil <freebsd-listen@fabiankeil.de>
To: Ulrich =?UTF-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
Subject: Re: Zpool surgery
Message-ID: <20130128205802.1ffab53e@fabiankeil.de>
In-Reply-To: <20130128085820.GR35868@acme.spoerlein.net>
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
 <20130128085820.GR35868@acme.spoerlein.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1;
 boundary="Sig_/NoSyaoazf+aPmp=rJ5E9Umz"; protocol="application/pgp-signature"
X-Df-Sender: Nzc1MDY3
Cc: current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 20:04:51 -0000

--Sig_/NoSyaoazf+aPmp=rJ5E9Umz
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Ulrich Sp=C3=B6rlein <uqs@FreeBSD.org> wrote:

> On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> > On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.co.uk=
> wrote:
> > >----- Original Message -----=20
> > >From: "Ulrich Sp=C3=B6rlein" <uqs@FreeBSD.org>
> > >> I want to transplant my old zpool tank from a 1TB drive to a new 2TB
> > >> drive, but *not* use dd(1) or any other cloning mechanism, as the po=
ol
> > >> was very full very often and is surely severely fragmented.
> > >
> > >Cant you just drop the disk in the original machine, set it as a mirror
> > >then once the mirror process has completed break the mirror and remove
> > >the 1TB disk.
> >=20
> > That will replicate any fragmentation as well.  "zfs send | zfs recv"
> > is the only (current) way to defragment a ZFS pool.

It's not obvious to me why "zpool replace" (or doing it manually)
would replicate the fragmentation.

> But are you then also supposed to be able send incremental snapshots to
> a third pool from the pool that you just cloned?

Yes.

> I did the zpool replace now over night, and it did not remove the old
> device yet, as it found cksum errors on the pool:
>=20
> root@coyote:~# zpool status -v
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: resilvered 873G in 11h33m with 24 errors on Mon Jan 28 09:45:32 2=
013
> config:
>=20
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0    27
>           replacing-0  ONLINE       0     0    61
>             da0.eli    ONLINE       0     0    61
>             ada1.eli   ONLINE       0     0    61
>=20
> errors: Permanent errors have been detected in the following files:
>=20
>         tank/src@2013-01-17:/.svn/pristine/8e/8ed35772a38e0fec00bc1cbc2f0=
5480f4fd4759b.svn-base
[...]
>         tank/ncvs@2013-01-17:/ports/textproc/uncrustify/distinfo,v
>=20
> Interestingly, these only seem to affect the snapshot, and I'm now
> wondering if that is the problem why the backup pool did not accept the
> next incremental snapshot from the new pool.

I doubt that. My expectation would be that it only prevents
the "zfs send" to finish successfully.

BTW, you could try reading the files to be sure that the checksum
problems are permanent and not just temporary USB issues.

> How does the receiving pool known that it has the correct snapshot to
> store an incremental one anyway? Is there a toplevel checksum, like for
> git commits? How can I display and compare that?

Try zstreamdump:

fk@r500 ~ $sudo zfs send -i @2013-01-24_20:48 tank/etc@2013-01-26_21:14 | z=
streamdump | head -11
BEGIN record
	hdrtype =3D 1
	features =3D 4
	magic =3D 2f5bacbac
	creation_time =3D 5104392a
	type =3D 2
	flags =3D 0x0
	toguid =3D a1eb3cfe794e675c
	fromguid =3D 77fb8881b19cb41f
	toname =3D tank/etc@2013-01-26_21:14
END checksum =3D 1047a3f2dceb/67c999f5e40ecf9/442237514c1120ed/efd508ab5203=
c91c

fk@r500 ~ $sudo zfs send lexmark/backup/r500/tank/etc@2013-01-24_20:48 | zs=
treamdump | head -11
BEGIN record
	hdrtype =3D 1
	features =3D 4
	magic =3D 2f5bacbac
	creation_time =3D 51018ff4
	type =3D 2
	flags =3D 0x0
	toguid =3D 77fb8881b19cb41f
	fromguid =3D 0
	toname =3D lexmark/backup/r500/tank/etc@2013-01-24_20:48
END checksum =3D 1c262b5ffe935/78d8a68e0eb0c8e7/eb1dde3bd923d153/9e08291036=
49ae22

Fabian

--Sig_/NoSyaoazf+aPmp=rJ5E9Umz
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEG2FAACgkQBYqIVf93VJ28pwCdElXCUi5LtiuQDigCoscMjT3q
bXAAn0MaWH2Uuj3tqtaoWIKXeMBeW76D
=w0Za
-----END PGP SIGNATURE-----

--Sig_/NoSyaoazf+aPmp=rJ5E9Umz--

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 21:44:28 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 4B5FCD8B;
 Mon, 28 Jan 2013 21:44:28 +0000 (UTC)
 (envelope-from dan@dan.emsphone.com)
Received: from email2.allantgroup.com (email2.emsphone.com [199.67.51.116])
 by mx1.freebsd.org (Postfix) with ESMTP id D7B1EE60;
 Mon, 28 Jan 2013 21:44:27 +0000 (UTC)
Received: from dan.emsphone.com (dan.emsphone.com [172.17.17.101])
 by email2.allantgroup.com (8.14.5/8.14.5) with ESMTP id r0SLfD5F000686
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 28 Jan 2013 15:41:13 -0600 (CST)
 (envelope-from dan@dan.emsphone.com)
Received: from dan.emsphone.com (smmsp@localhost [127.0.0.1])
 by dan.emsphone.com (8.14.6/8.14.6) with ESMTP id r0SLfCYg060240
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 28 Jan 2013 15:41:12 -0600 (CST)
 (envelope-from dan@dan.emsphone.com)
Received: (from dan@localhost)
 by dan.emsphone.com (8.14.6/8.14.6/Submit) id r0SLfBWT060239;
 Mon, 28 Jan 2013 15:41:11 -0600 (CST) (envelope-from dan)
Date: Mon, 28 Jan 2013 15:41:11 -0600
From: Dan Nelson <dnelson@allantgroup.com>
To: Fabian Keil <freebsd-listen@fabiankeil.de>
Subject: Re: Zpool surgery
Message-ID: <20130128214111.GA14888@dan.emsphone.com>
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
 <20130128085820.GR35868@acme.spoerlein.net>
 <20130128205802.1ffab53e@fabiankeil.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20130128205802.1ffab53e@fabiankeil.de>
X-OS: FreeBSD 9.1-STABLE
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Virus-Scanned: clamav-milter 0.97.6 at email2.allantgroup.com
X-Virus-Status: Clean
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (email2.allantgroup.com [172.17.19.78]); Mon, 28 Jan 2013 15:41:13 -0600 (CST)
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 RP_MATCHES_RCVD autolearn=ham version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on
 email2.allantgroup.com
X-Scanned-By: MIMEDefang 2.73
Cc: current@freebsd.org, fs@freebsd.org,
 Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 21:44:28 -0000

In the last episode (Jan 28), Fabian Keil said:
> Ulrich Spörlein <uqs@FreeBSD.org> wrote:
> > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.co.uk> wrote:
> > > >----- Original Message ----- 
> > > >From: "Ulrich Spörlein" <uqs@FreeBSD.org>
> > > >> I want to transplant my old zpool tank from a 1TB drive to a new
> > > >> 2TB drive, but *not* use dd(1) or any other cloning mechanism, as
> > > >> the pool was very full very often and is surely severely
> > > >> fragmented.
> > > >
> > > >Cant you just drop the disk in the original machine, set it as a
> > > >mirror then once the mirror process has completed break the mirror
> > > >and remove the 1TB disk.
> > > 
> > > That will replicate any fragmentation as well.  "zfs send | zfs recv"
> > > is the only (current) way to defragment a ZFS pool.
> 
> It's not obvious to me why "zpool replace" (or doing it manually)
> would replicate the fragmentation.

"zpool replace" essentially adds your new disk as a mirror to the parent
vdev, then deletes the original disk when the resilver is done.  Since
mirrors are block-identical copies of each other, the new disk will contain
an exact copy of the original disk, followed by 1TB of freespace.

-- 
	Dan Nelson
	dnelson@allantgroup.com

From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 21:55:55 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 5781C314
 for <fs@freebsd.org>; Mon, 28 Jan 2013 21:55:55 +0000 (UTC)
 (envelope-from matthew.ahrens@delphix.com)
Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com
 [209.85.212.179]) by mx1.freebsd.org (Postfix) with ESMTP id E760BF38
 for <fs@freebsd.org>; Mon, 28 Jan 2013 21:55:54 +0000 (UTC)
Received: by mail-wi0-f179.google.com with SMTP id o1so1888499wic.12
 for <fs@freebsd.org>; Mon, 28 Jan 2013 13:55:53 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=y8OpT1j5CjrT1VzbLMRWCOk/mby9FAJJsjTdnEF+mso=;
 b=GetgPVQ1HdVIyAKc2Y0cIvhFjd1SYtTXxCxpjfZ+5TGSDfrrNJATnur35toaZ+m/IR
 igVfI1WHVvOtqJxLCtGyWb9Sy21rr1R1ZkSo7vtPcIp6NqdzfBbV79DV0knGbdhsYCGz
 Q/e55tVaGJmdrMxqO08g4DFU8GnYDIIYR6+dI=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:x-gm-message-state;
 bh=y8OpT1j5CjrT1VzbLMRWCOk/mby9FAJJsjTdnEF+mso=;
 b=CGj0/cZH3l5LPaMnsMi6V4myPuTYyjBBY9ZLHRKGS+j8nmDyH0xYzIkQY1DJAahz6/
 KTnp9kArchpdngvUfqs02yj5RRNBygYtZi7Hh1GcYke8C95xOmAQBEudkKXle8MJCFSE
 fRAZXLlsDGbNb3d9ypD5Qel3lUBEa4gtasdD4J0/8MOnh926JN6VMDxcDMRs7lK31Zep
 SSOFCy8qeZ7oF9mIOtFGxh1Un8z/qvp88GXtQ5LniRuCVnuOSSOoAN7F4rG7WtRC17Jh
 /yciM1RWtWmGa12TBTszk6v4J6PL55Oc1bzOexGrHt5kFksAGhc51Cf8Q2hAcbMx2bSe
 IDNw==
MIME-Version: 1.0
X-Received: by 10.194.123.105 with SMTP id lz9mr23895914wjb.43.1359410153419; 
 Mon, 28 Jan 2013 13:55:53 -0800 (PST)
Received: by 10.194.32.168 with HTTP; Mon, 28 Jan 2013 13:55:53 -0800 (PST)
In-Reply-To: <5105252D.6060502@platinum.linux.pl>
References: <5105252D.6060502@platinum.linux.pl>
Date: Mon, 28 Jan 2013 13:55:53 -0800
Message-ID: <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
From: Matthew Ahrens <mahrens@delphix.com>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
X-Gm-Message-State: ALoCoQmFjZxrlyGjrcnEgskfc5TytCpM4alOcN+vLYpof6HbkJ02aF5UI2FElmGFk0H7/wXAi3Z+
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 21:55:55 -0000

This is so that we won't end up with small, unallocatable segments.  E.g.
if you are using RAIDZ2, the smallest usable segment would be 3 sectors (1
sector data + 2 sectors parity).  If we left a 1 or 2 sector free segment,
it would be unusable and you'd be able to get into strange accounting
situations where you have free space but can't write because you're "out of
space".

The amount of waste due to this can be minimized by using larger blocksizes
(e.g. the default recordsize of 128k and files larger than 128k), and by
using smaller sector sizes (e.g. 512b sector disks rather than 4k sector
disks).  In your case these techniques would limit the waste to 0.6%.

--matt

On Sun, Jan 27, 2013 at 5:01 AM, Adam Nowacki <nowakpl@platinum.linux.pl>wrote:

> I've just found something very weird in the ZFS code.
>
> sys/cddl/contrib/opensolaris/**uts/common/fs/zfs/vdev_raidz.**c:504 in
> HEAD
>
> Can someone explain the reason behind this line of code? What it does is
> align on-disk record size to a multiple of number of parity disks + 1 ...
> this really doesn't make any sense. So far as I can tell those extra
> sectors are just padding - completely unused.
>
> For the array I'm using this results in 4.8% of wasted disk space - 1.7TB.
> It's a 12x 3TB disk RAID-Z2.
> ______________________________**_________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/**mailman/listinfo/freebsd-fs<http://lists.freebsd.org/mailman/listinfo/freebsd-fs>
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@**freebsd.org<freebsd-fs-unsubscribe@freebsd.org>
> "
>

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 00:21:42 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id DE47BB90
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 00:21:42 +0000 (UTC)
 (envelope-from dumbbell@FreeBSD.org)
Received: from mail.made4.biz (unknown [IPv6:2001:41d0:1:7018::1:3])
 by mx1.freebsd.org (Postfix) with ESMTP id A4B1A995
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 00:21:42 +0000 (UTC)
Received: from [2a01:e35:8b20:ae00:290:f5ff:fe9d:b78c]
 (helo=magellan.dumbbell.fr)
 by mail.made4.biz with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256)
 (Exim 4.80.1 (FreeBSD)) (envelope-from <dumbbell@FreeBSD.org>)
 id 1TzyxI-000IZ1-MC; Tue, 29 Jan 2013 01:21:41 +0100
Message-ID: <5107160F.9000008@FreeBSD.org>
Date: Tue, 29 Jan 2013 01:21:35 +0100
From: =?ISO-8859-1?Q?Jean-S=E9bastien_P=E9dron?= <dumbbell@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Will DeVries <william.devries@gmail.com>
Subject: Re: Read-only port of NetBSD's UDF filesystem.
References: <CAEF54-TS-cQVcF5KxqJR=hSHGOF5+gbXcNYqYDfDhMHqa1froA@mail.gmail.com>
In-Reply-To: <CAEF54-TS-cQVcF5KxqJR=hSHGOF5+gbXcNYqYDfDhMHqa1froA@mail.gmail.com>
X-Enigmail-Version: 1.4.6
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature";
 boundary="------------enig32A5BA3EC9C1CE5C726A29F3"
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 00:21:42 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig32A5BA3EC9C1CE5C726A29F3
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On 21.01.2013 00:30, Will DeVries wrote:
> I have been working on a read-only port of NetBSD's UDF file
> system implementation, which I now believe to be complete except for an=
y
> bug related fixes that may arise.  This file system supports UDF versio=
ns
> through 2.60 on CDs, DVDs and Blu-rays.
>=20
> While it could use more testing, it seems to be stable and working well=
,
> and now seems like a good time to publish it for review.  At the very
> least, I can judge interest and get advice on aspects that perhaps need=

> more work.

Hi Will!

I just tested your port and it's working for me! I was able to mount a
Blu-Ray disc and play the movie using VLC.

However, it seems limited to 3 MB/s, which prevents a smooth read of the
movie. Running dd(1) confirms that. I didn't investigate further for now
and fear I won't have the time to do it in the short term...

Have you tested the speed on NetBSD?

If you have any ideas, I'll gladly test them!

Thanks for your work!

--=20
Jean-S=E9bastien P=E9dron


--------------enig32A5BA3EC9C1CE5C726A29F3
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlEHFhQACgkQa+xGJsFYOlOGvQCeOOMne4aACOYkv9kv5G+6XuIk
r00An2ZLjGeC/Ck3O5IMVM6KnPQx9+eP
=R+SK
-----END PGP SIGNATURE-----

--------------enig32A5BA3EC9C1CE5C726A29F3--

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 03:21:20 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 28860CFF;
 Tue, 29 Jan 2013 03:21:20 +0000 (UTC)
 (envelope-from wollman@hergotha.csail.mit.edu)
Received: from hergotha.csail.mit.edu
 (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 4E4701B7;
 Tue, 29 Jan 2013 03:21:19 +0000 (UTC)
Received: from hergotha.csail.mit.edu (localhost [127.0.0.1])
 by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id r0T3LHOh080812;
 Mon, 28 Jan 2013 22:21:17 -0500 (EST)
 (envelope-from wollman@hergotha.csail.mit.edu)
Received: (from wollman@localhost)
 by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id r0T3LHvB080809;
 Mon, 28 Jan 2013 22:21:17 -0500 (EST) (envelope-from wollman)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <20743.16429.97668.569869@hergotha.csail.mit.edu>
Date: Mon, 28 Jan 2013 22:21:17 -0500
From: Garrett Wollman <wollman@freebsd.org>
To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Subject: ZFS deadlock on rrl->rr_ -- look familiar to anyone?
X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (hergotha.csail.mit.edu [127.0.0.1]); Mon, 28 Jan 2013 22:21:17 -0500 (EST)
X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED
 autolearn=disabled version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on
 hergotha.csail.mit.edu
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 03:21:20 -0000

I just had a big fileserver deadlock in an odd way.  I was
investigating a user's problem, and decided for various reasons to
restart mountd.  It had been complaining like this:

Jan 28 21:06:43 nfs-prod-1 mountd[1108]: can't delete exports for /usr/local/.zfs/snapshot/monthly-2013-01: Invalid argument 

for a while, which is odd because /usr/local was never exported.  When
I restarted mountd, it hung waiting on rrl->rr_, but the system may
already have been deadlocked at that point.  procstat reported:

87678 104365 mountd           -                mi_switch sleepq_wait _cv_wait rrw_enter zfs_root lookup namei vfs_donmount sys_nmount amd64_syscall Xfast_syscall 

I was able to run shutdown, and the rc scripts eventually hung in
sync(1) and timed out.  The kernel then hung trying to do the same
thing, but I was able to break into the debugger.  The debugger
interrupted an idle thread, which was not particularly helpful, but I
was able to quickly gather the following information before I had to
reset the machine to restore normal service.

Locked vnodes


0xfffffe00536383c0: 0xfffffe00536383c0: tag syncer, type VNON
tag syncer, type VNON
    usecount 1, writecount 0, refcount 2 mountedhere 0
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags (VI(0x200))
    flags (VI(0x200))
        lock type syncer: EXCL by thread 0xfffffe00348cc470 (pid 22)
lock type syncer: EXCL by thread 0xfffffe00348cc470 (pid 22)

db> ps
  pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
87996     1 87994 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87976     1 87726     0  D+      rrl->rr_ 0xfffffe0048ff8108 sync
87707     1 87705 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87700     1 87698 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87678     1 87657     0  D+      rrl->rr_ 0xfffffe0048ff8108 mountd
87531     1 87529 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87387     1 87385 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87380     1 87378 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87103     1 87101 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
87096     1 87094 65534  D       rrl->rr_ 0xfffffe0048ff8108 df
85193     1 85192     0  D       zio->io_ 0xfffffe10d3e75320 zfs
   24     0     0     0  DL      sdflush  0xffffffff80e50878 [softdepflush]
   23     0     0     0  DL      vlruwt   0xfffffe0048c0a940 [vnlru]
   22     0     0     0  DL      rrl->rr_ 0xfffffe0048ff8108 [syncer]
   21     0     0     0  DL      psleep   0xffffffff80e3c048 [bufdaemon]
   20     0     0     0  DL      pgzero   0xffffffff80e5a81c [pagezero]
   19     0     0     0  DL      psleep   0xffffffff80e599e8 [vmdaemon]
   18     0     0     0  DL      psleep   0xffffffff80e599ac [pagedaemon]
   17     0     0     0  DL      gkt:wait 0xffffffff80de6c0c [g_mp_kt]
   16     0     0     0  DL      ipmireq  0xfffffe00347400b8 [ipmi0: kcs]
    9     0     0     0  DL      ccb_scan 0xffffffff80dc1360 [xpt_thrd]
    8     0     0     0  DL      waiting_ 0xffffffff80e41e80 [sctp_iterator]
    7     0     0     0  DL      (threaded)                  [zfskern]
101355                   D       tx->tx_s 0xfffffe0050342e10 [txg_thread_enter]
101354                   D       tx->tx_q 0xfffffe0050342e30 [txg_thread_enter]
100989                   D       tx->tx_s 0xfffffe004fd27a10 [txg_thread_enter]
100988                   D       tx->tx_q 0xfffffe004fd27a30 [txg_thread_enter]
100593                   D       tx->tx_s 0xfffffe004a8c0a10 [txg_thread_enter]
100592                   D       tx->tx_q 0xfffffe004a8c0a30 [txg_thread_enter]
100216                   D       l2arc_fe 0xffffffff81228bc0 [l2arc_feed_thread]
100215                   D       arc_recl 0xffffffff81218d20 [arc_reclaim_thread]
   15     0     0     0  DL      (threaded)                  [usb]
[32 uninteresting and identical threads deleted]
    6     0     0     0  DL      mps_scan 0xfffffe00276816a8 [mps_scan2]
    5     0     0     0  DL      mps_scan 0xfffffe0027612ca8 [mps_scan1]
    4     0     0     0  DL      mps_scan 0xfffffe00274ef4a8 [mps_scan0]
   14     0     0     0  DL      -        0xffffffff80ded764 [yarrow]
    3     0     0     0  DL      crypto_r 0xffffffff80e4e0a0 [crypto returns]
    2     0     0     0  DL      crypto_w 0xffffffff80e4e060 [crypto]
   13     0     0     0  DL      (threaded)                  [geom]
100055                   D       -        0xffffffff80de6b90 [g_down]
100054                   D       -        0xffffffff80de6b88 [g_up]
100053                   D       -        0xffffffff80de6b78 [g_event]
   12     0     0     0  WL      (threaded)                  [intr]
100189                   I                                   [irq1: atkbd0]
100188                   I                                   [swi0: uart uart]
100187                   I                                   [irq19: atapci1]
100186                   I                                   [irq18: atapci0+]
100169                   I                                   [irq294: igb1:link]
100167                   I                                   [irq293: igb1:que 7]
100165                   I                                   [irq292: igb1:que 6]
100163                   I                                   [irq291: igb1:que 5]
100161                   I                                   [irq290: igb1:que 4]
100159                   I                                   [irq289: igb1:que 3]
100157                   I                                   [irq288: igb1:que 2]
100155                   I                                   [irq287: igb1:que 1]
100153                   I                                   [irq286: igb1:que 0]
100152                   I                                   [irq285: igb0:link]
100150                   I                                   [irq284: igb0:que 7]
100148                   I                                   [irq283: igb0:que 6]
100146                   I                                   [irq282: igb0:que 5]
100144                   I                                   [irq281: igb0:que 4]
100142                   I                                   [irq280: igb0:que 3]
100140                   I                                   [irq279: igb0:que 2]
100138                   I                                   [irq278: igb0:que 1]
100136                   I                                   [irq277: igb0:que 0]
100131                   I                                   [irq20: hpet0 ehci0]
100126                   I                                   [irq21: uhci2 uhci5]
100121                   I                                   [irq22: uhci1 uhci4]
100116                   I                                   [irq23: uhci0 uhci3+]
100115                   I                                   [irq276: mps2]
100112                   I                                   [irq275: mps1]
100108                   I                                   [irq274: ix1:link]
100106                   I                                   [irq273: ix1:que 7]
100104                   I                                   [irq272: ix1:que 6]
100102                   I                                   [irq271: ix1:que 5]
100100                   I                                   [irq270: ix1:que 4]
100098                   I                                   [irq269: ix1:que 3]
100096                   I                                   [irq268: ix1:que 2]
100094                   I                                   [irq267: ix1:que 1]
100092                   I                                   [irq266: ix1:que 0]
100090                   I                                   [irq265: ix0:link]
100088                   I                                   [irq264: ix0:que 7]
100086                   I                                   [irq263: ix0:que 6]
100084                   I                                   [irq262: ix0:que 5]
100082                   I                                   [irq261: ix0:que 4]
100080                   I                                   [irq260: ix0:que 3]
100078                   I                                   [irq259: ix0:que 2]
100076                   I                                   [irq258: ix0:que 1]
100074                   I                                   [irq257: ix0:que 0]
100073                   I                                   [irq256: mps0]
100065                   I                                   [swi2: cambio]
100064                   I                                   [swi6: task queue]
100063                   I                                   [swi6: Giant taskq]
100060                   I                                   [swi5: +]
[24 identical [swi4: clock] threads deleted]
100028                   I                                   [swi1: netisr 0]
100027                   I                                   [swi3: vm]
   11     0     0     0  RL      (threaded)                  [idle]
[24 identical idle threads deleted]
    1     0     1     0  DLs     rrl->rr_ 0xfffffe0048ff8108 [init]
   10     0     0     0  DL      audit_wo 0xffffffff80e4f7f0 [audit]
    0     0     0     0  DLs     (threaded)                  [kernel]
420220                   D       -        0xfffffe07bf578380 [zil_clean]
[66 similar zil_clean threads deleted]
101353                   D       -        0xfffffe004a481c80 [zfs_vn_rele_taskq]
101352                   D       -        0xfffffe005324fa80 [zio_ioctl_intr]
101351                   D       -        0xfffffe005324fb00 [zio_ioctl_issue]
101350                   D       -        0xfffffe005324fb80 [zio_claim_intr]
101349                   D       -        0xfffffe005324fc00 [zio_claim_issue]
101348                   D       -        0xfffffe005324fc80 [zio_free_intr]
101347                   D       -        0xfffffe005324fd00 [zio_free_issue_99]
[99 similar zio_free_issue_* threads deleted]
101247                   D       -        0xfffffe005324fd80 [zio_write_intr_high]
101246                   D       -        0xfffffe005324fd80 [zio_write_intr_high]
101245                   D       -        0xfffffe005324fd80 [zio_write_intr_high]
101244                   D       -        0xfffffe005324fd80 [zio_write_intr_high]
101243                   D       -        0xfffffe005324fd80 [zio_write_intr_high]
101242                   D       -        0xfffffe005324fe00 [zio_write_intr_7]
101241                   D       -        0xfffffe005324fe00 [zio_write_intr_6]
101240                   D       -        0xfffffe005324fe00 [zio_write_intr_5]
101239                   D       -        0xfffffe005324fe00 [zio_write_intr_4]
101238                   D       -        0xfffffe005324fe00 [zio_write_intr_3]
101237                   D       -        0xfffffe005324fe00 [zio_write_intr_2]
101236                   D       -        0xfffffe005324fe00 [zio_write_intr_1]
101235                   D       -        0xfffffe005324fe00 [zio_write_intr_0]
101234                   D       -        0xfffffe0053250000 [zio_write_issue_hig]
101233                   D       -        0xfffffe0053250000 [zio_write_issue_hig]
101232                   D       -        0xfffffe0053250000 [zio_write_issue_hig]
101231                   D       -        0xfffffe0053250000 [zio_write_issue_hig]
101230                   D       -        0xfffffe0053250000 [zio_write_issue_hig]
101229                   D       -        0xfffffe0053250080 [zio_write_issue_23]
101228                   D       -        0xfffffe0053250080 [zio_write_issue_22]
101227                   D       -        0xfffffe0053250080 [zio_write_issue_21]
101226                   D       -        0xfffffe0053250080 [zio_write_issue_20]
101225                   D       -        0xfffffe0053250080 [zio_write_issue_19]
101224                   D       -        0xfffffe0053250080 [zio_write_issue_18]
101223                   D       -        0xfffffe0053250080 [zio_write_issue_17]
101222                   D       -        0xfffffe0053250080 [zio_write_issue_16]
101221                   D       -        0xfffffe0053250080 [zio_write_issue_15]
101220                   D       -        0xfffffe0053250080 [zio_write_issue_14]
101219                   D       -        0xfffffe0053250080 [zio_write_issue_13]
101218                   D       -        0xfffffe0053250080 [zio_write_issue_12]
101217                   D       -        0xfffffe0053250080 [zio_write_issue_11]
101216                   D       -        0xfffffe0053250080 [zio_write_issue_10]
101215                   D       -        0xfffffe0053250080 [zio_write_issue_9]
101214                   D       -        0xfffffe0053250080 [zio_write_issue_8]
101213                   D       -        0xfffffe0053250080 [zio_write_issue_7]
101212                   D       -        0xfffffe0053250080 [zio_write_issue_6]
101211                   D       -        0xfffffe0053250080 [zio_write_issue_5]
101210                   D       -        0xfffffe0053250080 [zio_write_issue_4]
101209                   D       -        0xfffffe0053250080 [zio_write_issue_3]
101208                   D       -        0xfffffe0053250080 [zio_write_issue_2]
101207                   D       -        0xfffffe0053250080 [zio_write_issue_1]
101206                   D       -        0xfffffe0053250080 [zio_write_issue_0]
101205                   D       -        0xfffffe0053250100 [zio_read_intr_23]
101204                   D       -        0xfffffe0053250100 [zio_read_intr_22]
101203                   D       -        0xfffffe0053250100 [zio_read_intr_21]
101202                   D       -        0xfffffe0053250100 [zio_read_intr_20]
101201                   D       -        0xfffffe0053250100 [zio_read_intr_19]
101200                   D       -        0xfffffe0053250100 [zio_read_intr_18]
101199                   D       -        0xfffffe0053250100 [zio_read_intr_17]
101198                   D       -        0xfffffe0053250100 [zio_read_intr_16]
101197                   D       -        0xfffffe0053250100 [zio_read_intr_15]
101196                   D       -        0xfffffe0053250100 [zio_read_intr_14]
101195                   D       -        0xfffffe0053250100 [zio_read_intr_13]
101194                   D       -        0xfffffe0053250100 [zio_read_intr_12]
101193                   D       -        0xfffffe0053250100 [zio_read_intr_11]
101192                   D       -        0xfffffe0053250100 [zio_read_intr_10]
101191                   D       -        0xfffffe0053250100 [zio_read_intr_9]
101190                   D       -        0xfffffe0053250100 [zio_read_intr_8]
101189                   D       -        0xfffffe0053250100 [zio_read_intr_7]
101188                   D       -        0xfffffe0053250100 [zio_read_intr_6]
101187                   D       -        0xfffffe0053250100 [zio_read_intr_5]
101186                   D       -        0xfffffe0053250100 [zio_read_intr_4]
101185                   D       -        0xfffffe0053250100 [zio_read_intr_3]
101184                   D       -        0xfffffe0053250100 [zio_read_intr_2]
101183                   D       -        0xfffffe0053250100 [zio_read_intr_1]
101182                   D       -        0xfffffe0053250100 [zio_read_intr_0]
101181                   D       -        0xfffffe0053250180 [zio_read_issue_7]
101180                   D       -        0xfffffe0053250180 [zio_read_issue_6]
101179                   D       -        0xfffffe0053250180 [zio_read_issue_5]
101178                   D       -        0xfffffe0053250180 [zio_read_issue_4]
101177                   D       -        0xfffffe0053250180 [zio_read_issue_3]
101176                   D       -        0xfffffe0053250180 [zio_read_issue_2]
101175                   D       -        0xfffffe0053250180 [zio_read_issue_1]
101174                   D       -        0xfffffe0053250180 [zio_read_issue_0]
101173                   D       -        0xfffffe0053250200 [zio_null_intr]
101172                   D       -        0xfffffe0053250280 [zio_null_issue]
100987                   D       -        0xfffffe0048cc9500 [zfs_vn_rele_taskq]
100986                   D       -        0xfffffe0048c72280 [zio_ioctl_intr]
100985                   D       -        0xfffffe0048c71a00 [zio_ioctl_issue]
100984                   D       -        0xfffffe0048dd0d00 [zio_claim_intr]
100983                   D       -        0xfffffe0048dd0680 [zio_claim_issue]
100982                   D       -        0xfffffe004a949080 [zio_free_intr]
100981                   D       -        0xfffffe0048b77d80 [zio_free_issue_99]
[99 more zio_free_issue_* threads deleted]
100881                   D       -        0xfffffe004a94a480 [zio_write_intr_high]
100880                   D       -        0xfffffe004a94a480 [zio_write_intr_high]
100879                   D       -        0xfffffe004a94a480 [zio_write_intr_high]
100878                   D       -        0xfffffe004a94a480 [zio_write_intr_high]
100877                   D       -        0xfffffe004a94a480 [zio_write_intr_high]
100876                   D       -        0xfffffe0048dd1180 [zio_write_intr_7]
100875                   D       -        0xfffffe0048dd1180 [zio_write_intr_6]
100874                   D       -        0xfffffe0048dd1180 [zio_write_intr_5]
100873                   D       -        0xfffffe0048dd1180 [zio_write_intr_4]
100872                   D       -        0xfffffe0048dd1180 [zio_write_intr_3]
100871                   D       -        0xfffffe0048dd1180 [zio_write_intr_2]
100870                   D       -        0xfffffe0048dd1180 [zio_write_intr_1]
100869                   D       -        0xfffffe0048dd1180 [zio_write_intr_0]
100868                   D       -        0xfffffe0048dd1100 [zio_write_issue_hig]
100867                   D       -        0xfffffe0048dd1100 [zio_write_issue_hig]
100866                   D       -        0xfffffe0048dd1100 [zio_write_issue_hig]
100865                   D       -        0xfffffe0048dd1100 [zio_write_issue_hig]
100864                   D       -        0xfffffe0048dd1100 [zio_write_issue_hig]
100863                   D       -        0xfffffe0048dd1080 [zio_write_issue_23]
100862                   D       -        0xfffffe0048dd1080 [zio_write_issue_22]
100861                   D       -        0xfffffe0048dd1080 [zio_write_issue_21]
100860                   D       -        0xfffffe0048dd1080 [zio_write_issue_20]
100859                   D       -        0xfffffe0048dd1080 [zio_write_issue_19]
100858                   D       -        0xfffffe0048dd1080 [zio_write_issue_18]
100857                   D       -        0xfffffe0048dd1080 [zio_write_issue_17]
100856                   D       -        0xfffffe0048dd1080 [zio_write_issue_16]
100855                   D       -        0xfffffe0048dd1080 [zio_write_issue_15]
100854                   D       -        0xfffffe0048dd1080 [zio_write_issue_14]
100853                   D       -        0xfffffe0048dd1080 [zio_write_issue_13]
100852                   D       -        0xfffffe0048dd1080 [zio_write_issue_12]
100851                   D       -        0xfffffe0048dd1080 [zio_write_issue_11]
100850                   D       -        0xfffffe0048dd1080 [zio_write_issue_10]
100849                   D       -        0xfffffe0048dd1080 [zio_write_issue_9]
100848                   D       -        0xfffffe0048dd1080 [zio_write_issue_8]
100847                   D       -        0xfffffe0048dd1080 [zio_write_issue_7]
100846                   D       -        0xfffffe0048dd1080 [zio_write_issue_6]
100845                   D       -        0xfffffe0048dd1080 [zio_write_issue_5]
100844                   D       -        0xfffffe0048dd1080 [zio_write_issue_4]
100843                   D       -        0xfffffe0048dd1080 [zio_write_issue_3]
100842                   D       -        0xfffffe0048dd1080 [zio_write_issue_2]
100841                   D       -        0xfffffe0048dd1080 [zio_write_issue_1]
100840                   D       -        0xfffffe0048dd1080 [zio_write_issue_0]
100839                   D       -        0xfffffe0048dd1000 [zio_read_intr_23]
100838                   D       -        0xfffffe0048dd1000 [zio_read_intr_22]
100837                   D       -        0xfffffe0048dd1000 [zio_read_intr_21]
100836                   D       -        0xfffffe0048dd1000 [zio_read_intr_20]
100835                   D       -        0xfffffe0048dd1000 [zio_read_intr_19]
100834                   D       -        0xfffffe0048dd1000 [zio_read_intr_18]
100833                   D       -        0xfffffe0048dd1000 [zio_read_intr_17]
100832                   D       -        0xfffffe0048dd1000 [zio_read_intr_16]
100831                   D       -        0xfffffe0048dd1000 [zio_read_intr_15]
100830                   D       -        0xfffffe0048dd1000 [zio_read_intr_14]
100829                   D       -        0xfffffe0048dd1000 [zio_read_intr_13]
100828                   D       -        0xfffffe0048dd1000 [zio_read_intr_12]
100827                   D       -        0xfffffe0048dd1000 [zio_read_intr_11]
100826                   D       -        0xfffffe0048dd1000 [zio_read_intr_10]
100825                   D       -        0xfffffe0048dd1000 [zio_read_intr_9]
100824                   D       -        0xfffffe0048dd1000 [zio_read_intr_8]
100823                   D       -        0xfffffe0048dd1000 [zio_read_intr_7]
100822                   D       -        0xfffffe0048dd1000 [zio_read_intr_6]
100821                   D       -        0xfffffe0048dd1000 [zio_read_intr_5]
100820                   D       -        0xfffffe0048dd1000 [zio_read_intr_4]
100819                   D       -        0xfffffe0048dd1000 [zio_read_intr_3]
100818                   D       -        0xfffffe0048dd1000 [zio_read_intr_2]
100817                   D       -        0xfffffe0048dd1000 [zio_read_intr_1]
100816                   D       -        0xfffffe0048dd1000 [zio_read_intr_0]
100815                   D       -        0xfffffe0048dd0e00 [zio_read_issue_7]
100814                   D       -        0xfffffe0048dd0e00 [zio_read_issue_6]
100813                   D       -        0xfffffe0048dd0e00 [zio_read_issue_5]
100812                   D       -        0xfffffe0048dd0e00 [zio_read_issue_4]
100811                   D       -        0xfffffe0048dd0e00 [zio_read_issue_3]
100810                   D       -        0xfffffe0048dd0e00 [zio_read_issue_2]
100809                   D       -        0xfffffe0048dd0e00 [zio_read_issue_1]
100808                   D       -        0xfffffe0048dd0e00 [zio_read_issue_0]
100807                   D       -        0xfffffe0048dd0600 [zio_null_intr]
100806                   D       -        0xfffffe0048dd0180 [zio_null_issue]
100594                   D       -        0xfffffe004a3bcc80 [zil_clean]
100591                   D       -        0xfffffe0048c65100 [zfs_vn_rele_taskq]
100590                   D       -        0xfffffe0048d5c280 [zio_ioctl_intr]
100589                   D       -        0xfffffe0048d5c300 [zio_ioctl_issue]
100588                   D       -        0xfffffe0048d5c380 [zio_claim_intr]
100587                   D       -        0xfffffe0048d5c400 [zio_claim_issue]
100586                   D       -        0xfffffe0048d5c480 [zio_free_intr]
100585                   D       -        0xfffffe0048d5c500 [zio_free_issue_99]
[99 more zio_free_issue_* threads deleted]
100485                   D       -        0xfffffe0048d5c580 [zio_write_intr_high]
100484                   D       -        0xfffffe0048d5c580 [zio_write_intr_high]
100483                   D       -        0xfffffe0048d5c580 [zio_write_intr_high]
100482                   D       -        0xfffffe0048d5c580 [zio_write_intr_high]
100481                   D       -        0xfffffe0048d5c580 [zio_write_intr_high]
100480                   D       -        0xfffffe0048d5c600 [zio_write_intr_7]
100479                   D       -        0xfffffe0048d5c600 [zio_write_intr_6]
100478                   D       -        0xfffffe0048d5c600 [zio_write_intr_5]
100477                   D       -        0xfffffe0048d5c600 [zio_write_intr_4]
100476                   D       -        0xfffffe0048d5c600 [zio_write_intr_3]
100475                   D       -        0xfffffe0048d5c600 [zio_write_intr_2]
100474                   D       -        0xfffffe0048d5c600 [zio_write_intr_1]
100473                   D       -        0xfffffe0048d5c600 [zio_write_intr_0]
100472                   D       -        0xfffffe0048d5c680 [zio_write_issue_hig]
100471                   D       -        0xfffffe0048d5c680 [zio_write_issue_hig]
100470                   D       -        0xfffffe0048d5c680 [zio_write_issue_hig]
100469                   D       -        0xfffffe0048d5c680 [zio_write_issue_hig]
100468                   D       -        0xfffffe0048d5c680 [zio_write_issue_hig]
100467                   D       -        0xfffffe0048d5c700 [zio_write_issue_23]
100466                   D       -        0xfffffe0048d5c700 [zio_write_issue_22]
100465                   D       -        0xfffffe0048d5c700 [zio_write_issue_21]
100464                   D       -        0xfffffe0048d5c700 [zio_write_issue_20]
100463                   D       -        0xfffffe0048d5c700 [zio_write_issue_19]
100462                   D       -        0xfffffe0048d5c700 [zio_write_issue_18]
100461                   D       -        0xfffffe0048d5c700 [zio_write_issue_17]
100460                   D       -        0xfffffe0048d5c700 [zio_write_issue_16]
100459                   D       -        0xfffffe0048d5c700 [zio_write_issue_15]
100458                   D       -        0xfffffe0048d5c700 [zio_write_issue_14]
100457                   D       -        0xfffffe0048d5c700 [zio_write_issue_13]
100456                   D       -        0xfffffe0048d5c700 [zio_write_issue_12]
100455                   D       -        0xfffffe0048d5c700 [zio_write_issue_11]
100454                   D       -        0xfffffe0048d5c700 [zio_write_issue_10]
100453                   D       -        0xfffffe0048d5c700 [zio_write_issue_9]
100452                   D       -        0xfffffe0048d5c700 [zio_write_issue_8]
100451                   D       -        0xfffffe0048d5c700 [zio_write_issue_7]
100450                   D       -        0xfffffe0048d5c700 [zio_write_issue_6]
100449                   D       -        0xfffffe0048d5c700 [zio_write_issue_5]
100448                   D       -        0xfffffe0048d5c700 [zio_write_issue_4]
100447                   D       -        0xfffffe0048d5c700 [zio_write_issue_3]
100446                   D       -        0xfffffe0048d5c700 [zio_write_issue_2]
100445                   D       -        0xfffffe0048d5c700 [zio_write_issue_1]
100444                   D       -        0xfffffe0048d5c700 [zio_write_issue_0]
100443                   D       -        0xfffffe0048d5c780 [zio_read_intr_23]
100442                   D       -        0xfffffe0048d5c780 [zio_read_intr_22]
100441                   D       -        0xfffffe0048d5c780 [zio_read_intr_21]
100440                   D       -        0xfffffe0048d5c780 [zio_read_intr_20]
100439                   D       -        0xfffffe0048d5c780 [zio_read_intr_19]
100438                   D       -        0xfffffe0048d5c780 [zio_read_intr_18]
100437                   D       -        0xfffffe0048d5c780 [zio_read_intr_17]
100436                   D       -        0xfffffe0048d5c780 [zio_read_intr_16]
100435                   D       -        0xfffffe0048d5c780 [zio_read_intr_15]
100434                   D       -        0xfffffe0048d5c780 [zio_read_intr_14]
100433                   D       -        0xfffffe0048d5c780 [zio_read_intr_13]
100432                   D       -        0xfffffe0048d5c780 [zio_read_intr_12]
100431                   D       -        0xfffffe0048d5c780 [zio_read_intr_11]
100430                   D       -        0xfffffe0048d5c780 [zio_read_intr_10]
100429                   D       -        0xfffffe0048d5c780 [zio_read_intr_9]
100428                   D       -        0xfffffe0048d5c780 [zio_read_intr_8]
100427                   D       -        0xfffffe0048d5c780 [zio_read_intr_7]
100426                   D       -        0xfffffe0048d5c780 [zio_read_intr_6]
100425                   D       -        0xfffffe0048d5c780 [zio_read_intr_5]
100424                   D       -        0xfffffe0048d5c780 [zio_read_intr_4]
100423                   D       -        0xfffffe0048d5c780 [zio_read_intr_3]
100422                   D       -        0xfffffe0048d5c780 [zio_read_intr_2]
100421                   D       -        0xfffffe0048d5c780 [zio_read_intr_1]
100420                   D       -        0xfffffe0048d5c780 [zio_read_intr_0]
100419                   D       -        0xfffffe0048d5c800 [zio_read_issue_7]
100418                   D       -        0xfffffe0048d5c800 [zio_read_issue_6]
100417                   D       -        0xfffffe0048d5c800 [zio_read_issue_5]
100416                   D       -        0xfffffe0048d5c800 [zio_read_issue_4]
100415                   D       -        0xfffffe0048d5c800 [zio_read_issue_3]
100414                   D       -        0xfffffe0048d5c800 [zio_read_issue_2]
100413                   D       -        0xfffffe0048d5c800 [zio_read_issue_1]
100412                   D       -        0xfffffe0048d5c800 [zio_read_issue_0]
100411                   D       -        0xfffffe0048d5c880 [zio_null_intr]
100410                   D       -        0xfffffe0048d5c900 [zio_null_issue]
100214                   D       -        0xfffffe00348bbc00 [system_taskq_23]
[23 more system_taskq_* threads deleted]
100190                   D       -        0xfffffe00348bbc80 [mca taskq]
100168                   D       -        0xfffffe0034092b00 [igb1 que]
100166                   D       -        0xfffffe0034092c80 [igb1 que]
100164                   D       -        0xfffffe0034092e00 [igb1 que]
100162                   D       -        0xfffffe0034092180 [igb1 que]
100160                   D       -        0xfffffe0034092300 [igb1 que]
100158                   D       -        0xfffffe0034092480 [igb1 que]
100156                   D       -        0xfffffe003408b300 [igb1 que]
100154                   D       -        0xfffffe003408b480 [igb1 que]
100151                   D       -        0xfffffe003405b500 [igb0 que]
100149                   D       -        0xfffffe003405b680 [igb0 que]
100147                   D       -        0xfffffe0034054580 [igb0 que]
100145                   D       -        0xfffffe003404d400 [igb0 que]
100143                   D       -        0xfffffe003404d580 [igb0 que]
100141                   D       -        0xfffffe003404d700 [igb0 que]
100139                   D       -        0xfffffe003404d880 [igb0 que]
100137                   D       -        0xfffffe003404da00 [igb0 que]
100113                   D       -        0xfffffe0027807300 [mps2 taskq]
100110                   D       -        0xfffffe0027697a80 [mps1 taskq]
100109                   D       -        0xfffffe002768a700 [ix1 linkq]
100107                   D       -        0xfffffe002768a800 [ix1 que]
100105                   D       -        0xfffffe002768a980 [ix1 que]
100103                   D       -        0xfffffe002768ab00 [ix1 que]
100101                   D       -        0xfffffe002768ac80 [ix1 que]
100099                   D       -        0xfffffe0027680b80 [ix1 que]
100097                   D       -        0xfffffe0027680d00 [ix1 que]
100095                   D       -        0xfffffe0027623680 [ix1 que]
100093                   D       -        0xfffffe0027623380 [ix1 que]
100091                   D       -        0xfffffe0027600480 [ix0 linkq]
100089                   D       -        0xfffffe0027600580 [ix0 que]
100087                   D       -        0xfffffe0027600700 [ix0 que]
100085                   D       -        0xfffffe0027527300 [ix0 que]
100083                   D       -        0xfffffe0027527480 [ix0 que]
100081                   D       -        0xfffffe0027527600 [ix0 que]
100079                   D       -        0xfffffe0027527780 [ix0 que]
100077                   D       -        0xfffffe0027527900 [ix0 que]
100075                   D       -        0xfffffe0027527a80 [ix0 que]
100071                   D       -        0xfffffe00274ffe00 [mps0 taskq]
100070                   D       -        0xfffffe00273fdb00 [kqueue taskq]
100069                   D       -        0xfffffe00273fdb80 [ffs_trim taskq]
100068                   D       -        0xfffffe00273fdc00 [acpi_task_2]
100067                   D       -        0xfffffe00273fdc00 [acpi_task_1]
100066                   D       -        0xfffffe00273fdc00 [acpi_task_0]
100062                   D       -        0xfffffe002743b280 [aiod_bio taskq]
100061                   D       zfsvfs-> 0xfffffe0048ff8138 [thread taskq]
100056                   D       -        0xfffffe002732d600 [firmware taskq]
100000                   D       sched    0xffffffff80de6d80 [swapper]

The stuck df(1) processes running as nobody were undoubtedly started
by munin-node, and seem to be related to my user's symptom (munin
graphs show no response for about half an hour after the user's
problem starts).  It may not be a *true* deadlock, because over the
past few days, munin has been showing this problem at about the same
time of day, but the system always comes back (without a reboot) in
a little over half an hour.

Does anyone recognize this?  If it happens again, which threads' stack
traces would be useful in diagnosing this?

-GAWollman


From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 08:20:54 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 020B76DB;
 Tue, 29 Jan 2013 08:20:54 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
 by mx1.freebsd.org (Postfix) with ESMTP id DD910F10;
 Tue, 29 Jan 2013 08:20:52 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
 [212.40.38.100])
 by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA10855;
 Tue, 29 Jan 2013 10:20:50 +0200 (EET) (envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
 by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
 id 1U06Qz-000M5B-TM; Tue, 29 Jan 2013 10:20:49 +0200
Message-ID: <51078660.8000004@FreeBSD.org>
Date: Tue, 29 Jan 2013 10:20:48 +0200
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130121 Thunderbird/17.0.2
MIME-Version: 1.0
To: Garrett Wollman <wollman@FreeBSD.org>
Subject: Re: ZFS deadlock on rrl->rr_ -- look familiar to anyone?
References: <20743.16429.97668.569869@hergotha.csail.mit.edu>
In-Reply-To: <20743.16429.97668.569869@hergotha.csail.mit.edu>
X-Enigmail-Version: 1.4.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 08:20:54 -0000

on 29/01/2013 05:21 Garrett Wollman said the following:
> When
> I restarted mountd, it hung waiting on rrl->rr_, but the system may
> already have been deadlocked at that point.  procstat reported:
> 
> 87678 104365 mountd           -                mi_switch sleepq_wait _cv_wait rrw_enter zfs_root lookup namei vfs_donmount sys_nmount amd64_syscall Xfast_syscall 
...
> If it happens again

procstat -kk -a

-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 10:51:50 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 734F87B8
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:51:50 +0000 (UTC)
 (envelope-from nowakpl@platinum.linux.pl)
Received: from platinum.linux.pl (platinum.edu.pl [81.161.192.4])
 by mx1.freebsd.org (Postfix) with ESMTP id 1D428F25
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:51:49 +0000 (UTC)
Received: by platinum.linux.pl (Postfix, from userid 87)
 id E089147E16; Tue, 29 Jan 2013 11:51:41 +0100 (CET)
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on platinum.linux.pl
X-Spam-Level: 
X-Spam-Status: No, score=-1.4 required=3.0 tests=ALL_TRUSTED,AWL
 autolearn=disabled version=3.3.2
Received: from [10.255.0.2] (unknown [83.151.38.73])
 by platinum.linux.pl (Postfix) with ESMTPA id 4297647E0F;
 Tue, 29 Jan 2013 11:51:38 +0100 (CET)
Message-ID: <5107A9B7.5030803@platinum.linux.pl>
Date: Tue, 29 Jan 2013 11:51:35 +0100
From: Adam Nowacki <nowakpl@platinum.linux.pl>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: Matthew Ahrens <mahrens@delphix.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
In-Reply-To: <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 10:51:50 -0000

On 2013-01-28 22:55, Matthew Ahrens wrote:
> This is so that we won't end up with small, unallocatable segments.
>   E.g. if you are using RAIDZ2, the smallest usable segment would be 3
> sectors (1 sector data + 2 sectors parity).  If we left a 1 or 2 sector
> free segment, it would be unusable and you'd be able to get into strange
> accounting situations where you have free space but can't write because
> you're "out of space".

Sounds reasonable.

> The amount of waste due to this can be minimized by using larger
> blocksizes (e.g. the default recordsize of 128k and files larger than
> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
> than 4k sector disks).  In your case these techniques would limit the
> waste to 0.6%.

This brings another issue - recordsize capped at 128KiB. We are using 
the pool for off-line storage of large files (from 50MB to 20GB). Files 
are stored and read sequentially as a whole. With 12 disks in RAID-Z2, 
4KiB sectors, 128KiB record size and the padding above 9.4% of disk 
space goes completely unused - one whole disk.

Increasing recordsize cap seems trivial enough. On-disk structures and 
kernel code support it already - a single of code had to be changed 
(#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. 
This of course breaks compatibility with any other system without this 
modification. With Suns cooperation this could be handled in safe and 
compatible manner via pool version upgrade. Recordsize of 128KiB would 
remain the default but anyone could increase it with zfs set.

Pool appears to work just fine with 15TB copied so far from another 
pool. Wasted disk space drops down to 0.7%. Sequential read speed 
increased from ~400MB/s to ~600MB/s. Writes stay about the same at ~300MB/s.

So far however I was not able to boot from that pool. gptzfsboot 
required a heap size increase and appears to work. zfsloader crashes and 
I've become lost in the code.

I've also identified another problem with ZFS wasting disk space. When 
compression is off allocations are always a multiple of record size. 
With the default recordsize of 128KiB a 129KiB file would use 256KiB of 
disk space (+ parity and other inefficiencies mentioned above). This may 
be there to help with fragmentation but then it would be good to have a 
setting to turn it off - even if by means of a no-op compression that 
would count zeroes backwards and return short psize.

>
> --matt
>
> On Sun, Jan 27, 2013 at 5:01 AM, Adam Nowacki <nowakpl@platinum.linux.pl
> <mailto:nowakpl@platinum.linux.pl>> wrote:
>
>     I've just found something very weird in the ZFS code.
>
>     sys/cddl/contrib/opensolaris/__uts/common/fs/zfs/vdev_raidz.__c:504
>     in HEAD
>
>     Can someone explain the reason behind this line of code? What it
>     does is align on-disk record size to a multiple of number of parity
>     disks + 1 ... this really doesn't make any sense. So far as I can
>     tell those extra sectors are just padding - completely unused.
>
>     For the array I'm using this results in 4.8% of wasted disk space -
>     1.7TB. It's a 12x 3TB disk RAID-Z2.
>     _________________________________________________
>     freebsd-fs@freebsd.org <mailto:freebsd-fs@freebsd.org> mailing list
>     http://lists.freebsd.org/__mailman/listinfo/freebsd-fs
>     <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>
>     To unsubscribe, send any mail to
>     "freebsd-fs-unsubscribe@__freebsd.org
>     <mailto:freebsd-fs-unsubscribe@freebsd.org>"
>
>


From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 10:58:06 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id E4422AC2
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:58:06 +0000 (UTC)
 (envelope-from prvs=1741a054e2=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 87E82F8C
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:58:06 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001906002.msg
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:57:58 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Tue, 29 Jan 2013 10:57:58 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1741a054e2=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
X-MDaemon-Deliver-To: fs@freebsd.org
Message-ID: <2FD375DC62B24754B8945BF0A1E26B78@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Adam Nowacki" <nowakpl@platinum.linux.pl>,
 "Matthew Ahrens" <mahrens@delphix.com>
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
 <5107A9B7.5030803@platinum.linux.pl>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
Date: Tue, 29 Jan 2013 10:58:40 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=response
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 10:58:07 -0000


----- Original Message ----- 
From: "Adam Nowacki" <nowakpl@platinum.linux.pl>


> On 2013-01-28 22:55, Matthew Ahrens wrote:
>> This is so that we won't end up with small, unallocatable segments.
>>   E.g. if you are using RAIDZ2, the smallest usable segment would be 3
>> sectors (1 sector data + 2 sectors parity).  If we left a 1 or 2 sector
>> free segment, it would be unusable and you'd be able to get into strange
>> accounting situations where you have free space but can't write because
>> you're "out of space".
> 
> Sounds reasonable.
> 
>> The amount of waste due to this can be minimized by using larger
>> blocksizes (e.g. the default recordsize of 128k and files larger than
>> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
>> than 4k sector disks).  In your case these techniques would limit the
>> waste to 0.6%.
> 
> This brings another issue - recordsize capped at 128KiB. We are using 
> the pool for off-line storage of large files (from 50MB to 20GB). Files 
> are stored and read sequentially as a whole. With 12 disks in RAID-Z2, 
> 4KiB sectors, 128KiB record size and the padding above 9.4% of disk 
> space goes completely unused - one whole disk.

This is something thats being worked on upstream, its not as trivial
as it first looks unfortuantely.

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.


From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 11:06:29 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id A9419D4A
 for <fs@freebsd.org>; Tue, 29 Jan 2013 11:06:29 +0000 (UTC)
 (envelope-from olivier@gid0.org)
Received: from mail-ea0-f170.google.com (mail-ea0-f170.google.com
 [209.85.215.170]) by mx1.freebsd.org (Postfix) with ESMTP id 485BF68
 for <fs@freebsd.org>; Tue, 29 Jan 2013 11:06:28 +0000 (UTC)
Received: by mail-ea0-f170.google.com with SMTP id a11so133773eaa.15
 for <fs@freebsd.org>; Tue, 29 Jan 2013 03:06:27 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:x-gm-message-state;
 bh=lJY/gvBy7AioQdkXF6YAjmup0aVH8VnDEEeCuAKC6PQ=;
 b=Y81RnZeVOUTSflQDeHTHE3fxUJxvKmuzvHxFjVO9d1DHSR23Gt3/FAz3aIvbjib9Rp
 UxZ4e8ZYDFDKmuoKa2eMP/3xD1nDmgUl/khRBr3UtMJRsNopFqxtOHpEy3oeR9O83JTk
 vYL07KI9psaGG15Y9V4lzNXYlPW4JEL5fm+VAQRs2hsvYmgs9EgB7Kp3bT6g93572Mcj
 ueVMuS3hdON9bqlauT4c/jKNYkVrrohGCY0UgLx9/8M6pwUmNXp/fgOv28ymtoj8CeP8
 CcZM/uPmL24qAIr+oZdzL0OGUwZgwSpzrrJtIxecVajP8iH/QFO7RS32TiIszIHgN/E8
 eGDQ==
MIME-Version: 1.0
X-Received: by 10.14.220.1 with SMTP id n1mr2369333eep.16.1359457587713; Tue,
 29 Jan 2013 03:06:27 -0800 (PST)
Received: by 10.14.189.5 with HTTP; Tue, 29 Jan 2013 03:06:27 -0800 (PST)
In-Reply-To: <5107A9B7.5030803@platinum.linux.pl>
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
 <5107A9B7.5030803@platinum.linux.pl>
Date: Tue, 29 Jan 2013 12:06:27 +0100
Message-ID: <CABzXLYMOK0ZDeDw95se1LZShaCowH1k7CZV7vkHdbkmwbZ9eDQ@mail.gmail.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
From: Olivier Smedts <olivier@gid0.org>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
Content-Type: text/plain; charset=ISO-8859-1
X-Gm-Message-State: ALoCoQmgNjtoHUGCvXPsiAl/x+UqiYPo50O4eIWShX8ViX1E9EDjjdqcqRvXRpwDBeJXJEdWRleI
Cc: Matthew Ahrens <mahrens@delphix.com>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 11:06:29 -0000

2013/1/29 Adam Nowacki <nowakpl@platinum.linux.pl>:
> This brings another issue - recordsize capped at 128KiB. We are using the
> pool for off-line storage of large files (from 50MB to 20GB). Files are
> stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB
> sectors, 128KiB record size and the padding above 9.4% of disk space goes
> completely unused - one whole disk.
>
> Increasing recordsize cap seems trivial enough. On-disk structures and
> kernel code support it already - a single of code had to be changed (#define
> SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of
> course breaks compatibility with any other system without this modification.
> With Suns cooperation this could be handled in safe and compatible manner
> via pool version upgrade. Recordsize of 128KiB would remain the default but
> anyone could increase it with zfs set.

One MB blocksize is already implemented by Oracle with zpool version 32.

-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: olivier@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "Il y a seulement 10 sortes de gens dans le monde :
  ceux qui comprennent le binaire,
  et ceux qui ne le comprennent pas."

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 11:18:54 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 95D167BD
 for <fs@freebsd.org>; Tue, 29 Jan 2013 11:18:54 +0000 (UTC)
 (envelope-from prvs=1741a054e2=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 381FB208
 for <fs@freebsd.org>; Tue, 29 Jan 2013 11:18:53 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001906339.msg
 for <fs@freebsd.org>; Tue, 29 Jan 2013 11:18:53 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Tue, 29 Jan 2013 11:18:53 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1741a054e2=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
X-MDaemon-Deliver-To: fs@freebsd.org
Message-ID: <32655B893F594E9BB0CBDD88C186E27E@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Olivier Smedts" <olivier@gid0.org>,
 "Adam Nowacki" <nowakpl@platinum.linux.pl>
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
 <5107A9B7.5030803@platinum.linux.pl>
 <CABzXLYMOK0ZDeDw95se1LZShaCowH1k7CZV7vkHdbkmwbZ9eDQ@mail.gmail.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
Date: Tue, 29 Jan 2013 11:19:31 -0000
MIME-Version: 1.0
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Matthew Ahrens <mahrens@delphix.com>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 11:18:54 -0000

----- Original Message -----=20
From: "Olivier Smedts" <olivier@gid0.org>


> 2013/1/29 Adam Nowacki <nowakpl@platinum.linux.pl>:
>> This brings another issue - recordsize capped at 128KiB. We are using the
>> pool for off-line storage of large files (from 50MB to 20GB). Files are
>> stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB
>> sectors, 128KiB record size and the padding above 9.4% of disk space goes
>> completely unused - one whole disk.
>>
>> Increasing recordsize cap seems trivial enough. On-disk structures and
>> kernel code support it already - a single of code had to be changed (#define
>> SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes. This of
>> course breaks compatibility with any other system without this modification.
>> With Suns cooperation this could be handled in safe and compatible manner
>> via pool version upgrade. Recordsize of 128KiB would remain the default but
>> anyone could increase it with zfs set.
>=20
> One MB blocksize is already implemented by Oracle with zpool version 32.

Oracle is not the upstream, since they went closed source, illumos is our new
upstream.

It you want to follow the discussion see the thread titled "128K max blocksize in
zfs" on developer@lists.illumos.org.

    Regards
    Steve

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it.=20

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.
From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 14:58:23 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id A350E5D2;
 Tue, 29 Jan 2013 14:58:23 +0000 (UTC)
 (envelope-from freebsd-listen@fabiankeil.de)
Received: from smtprelay05.ispgateway.de (smtprelay05.ispgateway.de
 [80.67.31.98]) by mx1.freebsd.org (Postfix) with ESMTP id 3746586;
 Tue, 29 Jan 2013 14:58:23 +0000 (UTC)
Received: from [78.35.166.2] (helo=fabiankeil.de)
 by smtprelay05.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128)
 (Exim 4.68) (envelope-from <freebsd-listen@fabiankeil.de>)
 id 1U0Cdb-0002ju-CX; Tue, 29 Jan 2013 15:58:15 +0100
Date: Tue, 29 Jan 2013 15:52:50 +0100
From: Fabian Keil <freebsd-listen@fabiankeil.de>
To: Dan Nelson <dnelson@allantgroup.com>
Subject: Re: Zpool surgery
Message-ID: <20130129155250.29d8f764@fabiankeil.de>
In-Reply-To: <20130128214111.GA14888@dan.emsphone.com>
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
 <20130128085820.GR35868@acme.spoerlein.net>
 <20130128205802.1ffab53e@fabiankeil.de>
 <20130128214111.GA14888@dan.emsphone.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1;
 boundary="Sig_/WXg2ahZC0rmXbVAQa_iy9g/"; protocol="application/pgp-signature"
X-Df-Sender: Nzc1MDY3
Cc: current@freebsd.org, fs@freebsd.org,
 Ulrich =?UTF-8?B?U3DDtnJsZWlu?= <uqs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 14:58:23 -0000

--Sig_/WXg2ahZC0rmXbVAQa_iy9g/
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Dan Nelson <dnelson@allantgroup.com> wrote:

> In the last episode (Jan 28), Fabian Keil said:
> > Ulrich Sp=C3=B6rlein <uqs@FreeBSD.org> wrote:
> > > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> > > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.c=
o.uk> wrote:
> > > > >----- Original Message -----=20
> > > > >From: "Ulrich Sp=C3=B6rlein" <uqs@FreeBSD.org>
> > > > >> I want to transplant my old zpool tank from a 1TB drive to a new
> > > > >> 2TB drive, but *not* use dd(1) or any other cloning mechanism, as
> > > > >> the pool was very full very often and is surely severely
> > > > >> fragmented.
> > > > >
> > > > >Cant you just drop the disk in the original machine, set it as a
> > > > >mirror then once the mirror process has completed break the mirror
> > > > >and remove the 1TB disk.
> > > >=20
> > > > That will replicate any fragmentation as well.  "zfs send | zfs rec=
v"
> > > > is the only (current) way to defragment a ZFS pool.
> >=20
> > It's not obvious to me why "zpool replace" (or doing it manually)
> > would replicate the fragmentation.
>=20
> "zpool replace" essentially adds your new disk as a mirror to the parent
> vdev, then deletes the original disk when the resilver is done.  Since
> mirrors are block-identical copies of each other, the new disk will conta=
in
> an exact copy of the original disk, followed by 1TB of freespace.

Thanks for the explanation.

I was under the impression that zfs mirrors worked at a higher
level than traditional mirrors like gmirror but there seems to
be indeed less magic than I expected.

Fabian

--Sig_/WXg2ahZC0rmXbVAQa_iy9g/
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEH4kgACgkQBYqIVf93VJ1Z4ACgsP2gJkFDDqwImnab1rnKF5Xu
gc8AoJuwpBMZrXVyX8ZSboeS6co0PHOk
=8PGU
-----END PGP SIGNATURE-----

--Sig_/WXg2ahZC0rmXbVAQa_iy9g/--

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 15:12:10 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id CF585933;
 Tue, 29 Jan 2013 15:12:10 +0000 (UTC)
 (envelope-from gergely.czuczy@harmless.hu)
Received: from marvin.harmless.hu (marvin.harmless.hu [195.56.55.204])
 by mx1.freebsd.org (Postfix) with ESMTP id 6D4B012D;
 Tue, 29 Jan 2013 15:12:10 +0000 (UTC)
Received: from gprs4f7a62e4.pool.t-umts.hu ([79.122.98.228] helo=unknown)
 by marvin.harmless.hu with esmtpsa (TLSv1:AES128-SHA:128)
 (Exim 4.75 (FreeBSD)) (envelope-from <gergely.czuczy@harmless.hu>)
 id 1U0CUZ-000HPm-On; Tue, 29 Jan 2013 15:48:55 +0100
Date: Tue, 29 Jan 2013 15:48:52 +0100
From: Gergely CZUCZY <gergely.czuczy@harmless.hu>
To: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Subject: Re: slowdown of zfs (tx->tx)
Message-ID: <20130129154852.000021f1@unknown>
In-Reply-To: <20130117093259.GA83951@mid.pc5.i.0x5.de>
References: <20130114195148.GA20540@mid.pc5.i.0x5.de>
 <CAFqOu6jwJ4qhbOovN_NhzusdQJvrbvUC3g93sziR=Uw99SGenw@mail.gmail.com>
 <20130114214652.GA76779@mid.pc5.i.0x5.de>
 <CAFqOu6jKX-Ks6C1RK5GwZ51ZVUSnGSe7S99_EfK+fwLPjAFFYw@mail.gmail.com>
 <20130115224556.GA41774@mid.pc5.i.0x5.de>
 <CAFqOu6jJnWdbikPmE1-UML5i_x7meF+iyY=9WBRyv2j7AeOaSg@mail.gmail.com>
 <50F67551.5020704@FreeBSD.org>
 <20130116095009.GA36867@mid.pc5.i.0x5.de>
 <FD780217EA4548F187715AF1AAF2B91A@multiplay.co.uk>
 <50F69788.2040506@FreeBSD.org>
 <20130117093259.GA83951@mid.pc5.i.0x5.de>
Organization: Harmless Digital
X-Mailer: Claws Mail 3.7.6 (GTK+ 2.16.0; i586-pc-mingw32msvc)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs <freebsd-fs@FreeBSD.org>, Andriy Gapon <avg@FreeBSD.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 15:12:10 -0000

Hello,

Once think we've noticed on our systems, might be unrelated, but still.
After heavy usage of dedup, our ZFS pools tended to slow down
drastically. The solution was to deallocate and reallocate
dedup-enabled filesystems (copying or send/recieving data back and
forth).

Just an idea, might be unrelated in your case.

Best regards,
Gergely

On Thu, 17 Jan 2013 10:32:59 +0100
Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org> wrote:

> * Andriy Gapon <avg@FreeBSD.org> [2013-01-16 14:05 +0200]:
> > on 16/01/2013 12:14 Steven Hartland said the following:
> > > You only have ~11% free so yer it is pretty full ;-)
> > 
> > just in case, Steve is not kidding.
> > 
> > Those free hundreds of gigabytes could be spread over the terabytes
> > and could be quite fragmented if the pool has a history of adding
> > and removing lots of files. ZFS could be spending quite a lot of
> > time in that case when it looks for some free space and tries to
> > minimize further fragmentation.
> > 
> > Empirical/anecdotal safe limit on pool utilization is said to be
> > about 70-80%.
> > 
> > You can test if this guess is true by doing the following:
> > kgdb -w
> > (kgdb) set metaslab_min_alloc_size=4096
> > 
> > If performance noticeably improves after that, then this is your
> > problem indeed.
> 
> I tried this, but I didn't notice any difference in performance.
> 
> Next I'll try the update Artem suggested.
> 
> Thanks
> 
> Nicolas


From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 15:45:12 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0EC92475
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 15:45:12 +0000 (UTC)
 (envelope-from romain@blogreen.org)
Received: from marvin.blogreen.org (unknown [IPv6:2001:470:1f12:b9c::2])
 by mx1.freebsd.org (Postfix) with ESMTP id B822231C
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 15:45:11 +0000 (UTC)
Received: by marvin.blogreen.org (Postfix, from userid 1001)
 id 57AE31B1C1; Tue, 29 Jan 2013 16:45:07 +0100 (CET)
Date: Tue, 29 Jan 2013 16:45:07 +0100
From: Romain =?iso-8859-1?Q?Tarti=E8re?= <romain@FreeBSD.org>
To: freebsd-fs@freebsd.org
Subject: Re: ZFS deduplication
Message-ID: <20130129154507.GA53833@blogreen.org>
References: <20130123143728.GA84218@blogreen.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="pWyiEgJYm5f9v55/"
Content-Disposition: inline
In-Reply-To: <20130123143728.GA84218@blogreen.org>
X-PGP-Key: http://romain.blogreen.org/pubkey.asc
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 15:45:12 -0000


--pWyiEgJYm5f9v55/
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Jan 23, 2013 at 03:37:29PM +0100, Romain Tarti=E8re wrote:
> However, `zpool list` reports an inconsistent deduplication value (it
> used to be ~1.4 AFAICR):
>=20
> > zpool list data
> > NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> > data   460G   101G   359G    21%  1386985.39x  ONLINE  -

Looks like this was some kind of corruption: the system crashed (it has
crashed a few times over the last few months but I could not get details
because of a failing serial port on the machine and the X server was
running) and was then unable to reboot: just after importing the zpool,
the kernel panicked in ddt_phys_decref() trying to dereference a NULL
pointer (because of the serial port I don't have a text backtrace,
however I took a few shots just in case).

I replaced the disks of the pool with new ones, reinstalled FreeBSD and
restored from backup.

I keep the old disks untouched in case some FreeBSD developer involved
in ZFS is interested about this corruption and needs a real-life
corrupted filesystem for analysis.  Please let me know by private mail.

Thanks,
Romain

--=20
Romain Tarti=E8re <romain@FreeBSD.org>  http://people.FreeBSD.org/~romain/
pgp: 8234 9A78 E7C0 B807 0B59  80FF BA4D 1D95 5112 336F (ID: 0x5112336F)
(plain text =3Dnon-HTML=3D PGP/GPG encrypted/signed e-mail much appreciated)

--pWyiEgJYm5f9v55/
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQGcBAEBAgAGBQJRB+6DAAoJELpNHZVREjNvuiwMAI9c94X5rN00GgKkPbPdF9KX
WlOv67UEXrdN6GfsjKXEnC9BTPOCGjs3ZCw3esllQqyJEoUL+nzZVyxU9IrLdVPA
v9AHX523u8d1SqpsGZKnhCd+JWWMuOa6CXK5GgoVtSajxZXurt1CSpXnyRw6yxQZ
PDM8oPXWMbtT+mx+AJseZKyAF2TSDDkYCoKVx1NaaZFAWwVZw4cFpBZHcvGrzvHh
SqLsiYnSAJ5watETrowrNZrpOK75EKOVDaPCpTesvY+Yhzu2gKMeOIlyNqYodssT
j9ezEKjYNYFkFMXxvUS9QP0BtIOUq/O4rd42Bu6HwzX8WCoe9Dyj1iIr8X+/A65o
DP4zVSIW6u3y4haEd0ZmhZNUid4S1vsBixiYSKc8B69uqkzmDgYrruCoTFOKzlFB
h5BmB1JpGa2tml8Kq18+3+KEmEAgyY+Qy6AF4rdXvpfizUibATKg1Lvh94uZSq+O
2cZZRvLu4nfcT7AuA+/BvFdns9Cwz1ampM4RHGEfdQ==
=7M8k
-----END PGP SIGNATURE-----

--pWyiEgJYm5f9v55/--

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 16:44:53 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 85C4D8C3;
 Tue, 29 Jan 2013 16:44:53 +0000 (UTC)
 (envelope-from mavbsd@gmail.com)
Received: from mail-bk0-f53.google.com (mail-bk0-f53.google.com
 [209.85.214.53]) by mx1.freebsd.org (Postfix) with ESMTP id EAA1D888;
 Tue, 29 Jan 2013 16:44:52 +0000 (UTC)
Received: by mail-bk0-f53.google.com with SMTP id j10so394834bkw.26
 for <multiple recipients>; Tue, 29 Jan 2013 08:44:51 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=x-received:sender:message-id:date:from:user-agent:mime-version:to
 :cc:subject:references:in-reply-to:content-type
 :content-transfer-encoding;
 bh=jiTPKyMmtazJoleQdasofoHQVMUTWO2oBZAxEyP661o=;
 b=SPFkVH2aw57Y0QTUA0U9yVcGAYvdXWg3WNlmq9dWJdkxm1U2Fto9KDeS7F9+Vz95+m
 2feVB5zUVRJGJw62WCSqz3f5FzezkaWHBxZIGnVBdSAPHKa5I2Ob0f34LHaqEUpjHiQ8
 WQBEEARUnBwjYah7aT+qomXFywfVVZHQIrdItOapLYGI7tK3IsqN1d7Eaiq3Tgb66CR+
 cUrlZmMFfaLOkU+wuoiPpB+jQ5IZlljcNWp2CmVGp1DfZQLOjR3CBbM0Ku9vbK1VT9WX
 obL/Vbex4EFw8dIsdJmn0689TiX8MCoOGO5Naou19RVng/z9oA/dzTtYba3HgUCuF/JF
 kNyA==
X-Received: by 10.204.150.134 with SMTP id y6mr112305bkv.15.1359477891771;
 Tue, 29 Jan 2013 08:44:51 -0800 (PST)
Received: from mavbook.mavhome.dp.ua ([91.198.175.1])
 by mx.google.com with ESMTPS id gy3sm4878528bkc.16.2013.01.29.08.44.49
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Jan 2013 08:44:50 -0800 (PST)
Sender: Alexander Motin <mavbsd@gmail.com>
Message-ID: <5107FC7E.8070108@FreeBSD.org>
Date: Tue, 29 Jan 2013 18:44:46 +0200
From: Alexander Motin <mav@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130125 Thunderbird/17.0.2
MIME-Version: 1.0
To: Jeremy Chadwick <jdc@koitsu.org>
Subject: Re: disk "flipped" - a known problem?
References: <20130121221617.GA23909@icarus.home.lan>
 <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan>
 <20130125211232.GA3037@icarus.home.lan>
In-Reply-To: <20130125211232.GA3037@icarus.home.lan>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org, avg@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 16:44:53 -0000

On 25.01.2013 23:12, Jeremy Chadwick wrote:
> Now about cam_periph_alloc -- I wanted to provide proof that I have seen
> this message before / proving Andriy isn't crazy.  :-)  This is from
> when I was messing about with this bad disk the day I received it:
> 
> Jan 18 19:54:57 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
> Jan 18 19:54:57 icarus kernel: ada5: <WDC WD1500ADFD-00NLR4 21.07QR4> ATA-7 SATA 1.x device
> Jan 18 19:54:57 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes)
> Jan 18 19:54:57 icarus kernel: ada5: Command Queueing enabled
> Jan 18 19:54:57 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C)
> Jan 18 19:54:57 icarus kernel: ada5: Previously was known as ad14
> Jan 18 19:54:57 icarus kernel: cam_periph_alloc: attempt to re-allocate valid device pass5 rejected flags 0x18 refcount 1
> Jan 18 19:54:57 icarus kernel: passasync: Unable to attach new device due to status 0x6: CCB request was invalid
> Jan 18 19:54:57 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 created.
> Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Force array start due to timeout.
> Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Disk ada5 state changed from NONE to ACTIVE.
> Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Subdisk RAID 0+1 279.47G:3-ada5 state changed from NONE to REBUILD.
> Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Array started.
> Jan 18 19:55:27 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from STARTING to BROKEN.
> Jan 18 19:55:39 icarus kernel: GEOM_RAID: NVIDIA-6: Volume RAID 0+1 279.47G state changed from BROKEN to STOPPED.
> Jan 18 19:55:49 icarus kernel: GEOM_RAID: NVIDIA-6: Array NVIDIA-6 destroyed.
> 
> So why didn't I see this message today?  On January 20th I rebuild
> world/kernel after removing GEOM_RAID from my kernel config.  The reason
> I removed GEOM_RAID is that, as you can see, that bad disk** was
> previously in a system (not my own) with an nVidia SATA chipset with
> their RAID option ROM enabled (my system is Intel, hence "array timeout"
> since there's no nVidia option ROM, I believe).

Array timeout means that within defined timeout GEOM RAID failed to
detect all of array components. GEOM RAID doesn't depend on option ROM
presence to access the data. Array was finally marked as BROKEN because
it was one disk of RAID0+1's four.

> I got sick and tired of having to "fight" with the kernel.  The last two
> messages were a result of me doing "graid stop ada5".  And of course "dd
> if=/dev/zero of=/dev/ada5 bs=64k" will cause GEOM to re-taste, causing
> the RAID metadata to get re-read, "NVIDIA-7" created, rinse lather
> repeat.  But there's already a thread on this:
> 
> http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016292.html
> 
> Just easier for me to remove the option, that's all.

I would personally prefer to erase unwanted stale metadata with `graid
delete NVIDIA-7`. It erases only one sector where metadata stored and
doesn't corrupt any other data on the disk.

-- 
Alexander Motin

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 17:05:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 69B6F260
 for <fs@freebsd.org>; Tue, 29 Jan 2013 17:05:34 +0000 (UTC)
 (envelope-from fjwcash@gmail.com)
Received: from mail-qe0-f41.google.com (mail-qe0-f41.google.com
 [209.85.128.41]) by mx1.freebsd.org (Postfix) with ESMTP id 3016B974
 for <fs@freebsd.org>; Tue, 29 Jan 2013 17:05:33 +0000 (UTC)
Received: by mail-qe0-f41.google.com with SMTP id 7so285570qeb.14
 for <fs@freebsd.org>; Tue, 29 Jan 2013 09:05:33 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=piWKtDuiAP8LtdFRC0cHlCzkmupVT9dc7P0zadttVnk=;
 b=fhArPGdkLTZ5zix6lZB72ZATc7Bo77DRCRMJSpFK4R8RdQyIxp32Z7QdsyZbCTJSu0
 vBdCJ5axcDXvY0pyiGSdzm4ZkRjLWg5e4IHF4UsJ6XI6YY/81rx//aH1QJ7oIal5p6QX
 o33JJ3WyYQixBL3fLsmlhxvNBAVXCbE0spQr7z/Xjmd+M6SWA1UArTTc2GDG1AXlz0DP
 gQESB+EDHUrRAX2tjDngl33rJpviK9T5SwKWzX60tQs74H++QFqUxytwnTr2W/A47bPz
 CGeofD5BsUE2nbiFb6rTK0Mg9Q6PffU/B9LY94k2YnMEXcOm0aOQ5sCAPQ0mH9nRWZL1
 iELw==
MIME-Version: 1.0
X-Received: by 10.224.177.10 with SMTP id bg10mr1846578qab.78.1359479133312;
 Tue, 29 Jan 2013 09:05:33 -0800 (PST)
Received: by 10.49.106.233 with HTTP; Tue, 29 Jan 2013 09:05:33 -0800 (PST)
Received: by 10.49.106.233 with HTTP; Tue, 29 Jan 2013 09:05:33 -0800 (PST)
In-Reply-To: <5107A9B7.5030803@platinum.linux.pl>
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
 <5107A9B7.5030803@platinum.linux.pl>
Date: Tue, 29 Jan 2013 09:05:33 -0800
Message-ID: <CAOjFWZ4W6yASf-z2vZ1vGi0qKi6E2LY7GmrOJ0uaGVau00sz5w@mail.gmail.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
From: Freddie Cash <fjwcash@gmail.com>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Matthew Ahrens <mahrens@delphix.com>, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 17:05:34 -0000

On Jan 29, 2013 2:52 AM, "Adam Nowacki" <nowakpl@platinum.linux.pl> wrote:
>
> On 2013-01-28 22:55, Matthew Ahrens wrote:
>>
>> This is so that we won't end up with small, unallocatable segments.
>>   E.g. if you are using RAIDZ2, the smallest usable segment would be 3
>> sectors (1 sector data + 2 sectors parity).  If we left a 1 or 2 sector
>> free segment, it would be unusable and you'd be able to get into strange
>> accounting situations where you have free space but can't write because
>> you're "out of space".
>
>
> Sounds reasonable.
>
>
>> The amount of waste due to this can be minimized by using larger
>> blocksizes (e.g. the default recordsize of 128k and files larger than
>> 128k), and by using smaller sector sizes (e.g. 512b sector disks rather
>> than 4k sector disks).  In your case these techniques would limit the
>> waste to 0.6%.
>
>
> This brings another issue - recordsize capped at 128KiB. We are using the
pool for off-line storage of large files (from 50MB to 20GB). Files are
stored and read sequentially as a whole. With 12 disks in RAID-Z2, 4KiB
sectors, 128KiB record size and the padding above 9.4% of disk space goes
completely unused - one whole disk.
>
> Increasing recordsize cap seems trivial enough. On-disk structures and
kernel code support it already - a single of code had to be changed
(#define SPA_MAXBLOCKSHIFT - from 17 to 20) to support 1MiB recordsizes.
This of course breaks compatibility with any other system without this
modification. With Suns cooperation this could be handled in safe and
compatible manner via pool version upgrade. Recordsize of 128KiB would
remain the default but anyone could increase it with zfs set.

There's work upstream (Illumos, I believe, maybe Delphix?) to add support
for recordings above 128 KB. It'll be added ad a feature flag, so only
compatible with open-source ZFS.

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 18:14:40 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 0EDB7287
 for <fs@freebsd.org>; Tue, 29 Jan 2013 18:14:40 +0000 (UTC)
 (envelope-from matthew.ahrens@delphix.com)
Received: from mail-la0-x235.google.com (mail-la0-x235.google.com
 [IPv6:2a00:1450:4010:c03::235])
 by mx1.freebsd.org (Postfix) with ESMTP id 89A16E49
 for <fs@freebsd.org>; Tue, 29 Jan 2013 18:14:39 +0000 (UTC)
Received: by mail-la0-f53.google.com with SMTP id fr10so533902lab.12
 for <fs@freebsd.org>; Tue, 29 Jan 2013 10:14:38 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=ysQRgwYXD0w0xg518IqGyFdrPwrB5w6+A+jKT8g4WyU=;
 b=NLTFL0iOEbRABik60N5vh/eqmaXioQQzFDgovUPfrUkpF6MBvSSQR+fGFk4xkVZiyd
 ML94sFJF1cA9do8TZevBfgyFzxu5+y8bs0rOs3ryHh2pfFGLsYK4ur0L55NlPLKPlgXq
 0SEmbMlsn69KMB70sX/5UrYxAB/AhidwROc88=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:x-gm-message-state;
 bh=ysQRgwYXD0w0xg518IqGyFdrPwrB5w6+A+jKT8g4WyU=;
 b=QDqMyIzY9q6V0wt/9I6qFCOBIjvmqdtFvuhzvNsdzQaNjzvHGjEeQA9PRMIDoJ1kWW
 iQ8ya2vp/R8YcD4zh7Pw40DPcst/JuVlwKNuWbneQAiJfgb8eussFKrQjvSSxT4y9qBM
 I7oktKVg/HOty3UU36oEz6fjgKFpv1FzpZR5rFaVkz0SDPDnhCwzQQlQzD9AUxIhumIq
 J1gtKSSNWguqPSt9/429maLNSnW3hxqc1kvYeKt+YzPgp/Ld/cU4wto6p6x30BPDs+oO
 0bxI5bxzbIXF3lQh701KPQNooo2125KGgN2b1edzU9e9utSK7WkN4dTi719ssN0dWZok
 4IOg==
MIME-Version: 1.0
X-Received: by 10.152.144.202 with SMTP id so10mr1976797lab.9.1359483278471;
 Tue, 29 Jan 2013 10:14:38 -0800 (PST)
Received: by 10.114.68.109 with HTTP; Tue, 29 Jan 2013 10:14:38 -0800 (PST)
In-Reply-To: <5107A9B7.5030803@platinum.linux.pl>
References: <5105252D.6060502@platinum.linux.pl>
 <CAJjvXiEQSqnKYP75crTkgVqLKSk92q9UTikFtdyPHmF6shJFbg@mail.gmail.com>
 <5107A9B7.5030803@platinum.linux.pl>
Date: Tue, 29 Jan 2013 10:14:38 -0800
Message-ID: <CAJjvXiHNZ-5R+gg5T=d_jgm_NT3SKqmHfmsfD_4nPmfrxdgdfQ@mail.gmail.com>
Subject: Re: RAID-Z wasted space - asize roundups to nparity +1
From: Matthew Ahrens <mahrens@delphix.com>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
X-Gm-Message-State: ALoCoQn3bsZdbLdhB6W7iNDz3krWVpHcLic/HP8dTdlpiNiyeSTVvA/l5kIN0m4ZOMkX7RUU/6ft
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 18:14:40 -0000

On Tue, Jan 29, 2013 at 2:51 AM, Adam Nowacki <nowakpl@platinum.linux.pl>wrote:

> I've also identified another problem with ZFS wasting disk space. When
> compression is off allocations are always a multiple of record size. With
> the default recordsize of 128KiB a 129KiB file would use 256KiB of disk
> space (+ parity and other inefficiencies mentioned above). This may be
> there to help with fragmentation but then it would be good to have a
> setting to turn it off - even if by means of a no-op compression that would
> count zeroes backwards and return short psize.
>

The most straightforward way to do this would be, as you alluded, to always
compress the last block of the file, even if no compression has been
selected.  For maximum speed, we could use the already-implemented zle
(zero-length encoding) algorithm.

--matt

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 19:00:02 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@smarthost.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 0F8B9D99
 for <freebsd-fs@smarthost.ysv.freebsd.org>;
 Tue, 29 Jan 2013 19:00:02 +0000 (UTC)
 (envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by mx1.freebsd.org (Postfix) with ESMTP id E70046D
 for <freebsd-fs@smarthost.ysv.freebsd.org>;
 Tue, 29 Jan 2013 19:00:01 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
 by freefall.freebsd.org (8.14.6/8.14.6) with ESMTP id r0TJ01IU093310
 for <freebsd-fs@freefall.freebsd.org>; Tue, 29 Jan 2013 19:00:01 GMT
 (envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
 by freefall.freebsd.org (8.14.6/8.14.6/Submit) id r0TJ01Vt093309;
 Tue, 29 Jan 2013 19:00:01 GMT (envelope-from gnats)
Date: Tue, 29 Jan 2013 19:00:01 GMT
Message-Id: <201301291900.r0TJ01Vt093309@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
Cc: 
From: Jeremy Chadwick <jdc@koitsu.org>
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: Jeremy Chadwick <jdc@koitsu.org>
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 19:00:02 -0000

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Jeremy Chadwick <jdc@koitsu.org>
To: Harry Coin <hgcoin@gmail.com>
Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Tue, 29 Jan 2013 10:50:28 -0800

 Re 1,2: that transfer speed (183MBytes/second) sounds much better/much
 more accurate for what's going on.  The speed-limiting factors were
 certainly a small blocksize (512 bytes) used by dd, and using
 /dev/random rather than /dev/zero.  I realise you're probably expecting
 to see something like 480MBytes/second (4 drives * 120MB/sec), but
 that's probably not going to happen on that model of system and with
 that CPU.
 
 For example, on my Q9550 system described earlier, I can get about this:
 
 $ dd if=/dev/zero of=testfile bs=64k
 ^C27148+0 records in
 27147+0 records out
 1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)
 
 While "gstat -I500ms" shows each disk going between 60MBytes/sec and
 140MBytes/sec.  "zpool iostat -v data 1" shows between 120-220MBytes/sec
 at the pool level, and showing around 65-110MBytes/sec on a per-disk
 level.
 
 Anyway, point being, things are faster with a large bs and from a source
 that doesn't churn interrupts.  But don't necessarily "pull a Linux" and
 start doing things like bs=1m -- as I said before, Linux dd is
 different, because the I/O is cached (without --direct), while on
 FreeBSD dd is always direct.
 
 Re 3: That sounds a bit on the slow side.  I would expect those disks,
 at least during writes, to do more.  If **all** the drives show this
 behaviour consistently in gstat, then you know the issue IS NOT with an
 individual disk, and is instead the issue lies elsewhere.  That rules
 out one piece of the puzzle, and that's good.
 
 Re 5: Did you mean to type 14MBytes/second, not 14mbits/second?  If so,
 yes, I would agree that's slow.  Scrubbing is not necessarily a good way
 to "benchmark" disks, but I understand for "benchmarking" ZFS it's the
 best you've got to some degree.
 
 Regarding dd'ing and 512 bytes -- as I described to you in my previous
 mail:
 
 > This speed will be "bursty" and "sporadic" due to the how ZFS ARC
 > works.  The interval at which "things are flushed to disk" is based on
 > the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
 > default to 5 (5 seconds).
 
 This is where your "4 secs or so" magic value comes from.  Please do not
 change this sysctl/value; keep it at 5.
 
 Finally, your vmstat -i output shows something of concern, UNLESS you
 did this WHILE you had the dd (doesn't matter what block size) going,
 and are using /dev/random or /dev/urandom (same thing on FreeBSD):
 
 > irq20: hpet0                      620136        328
 > irq259: ahci1                     849746        450
 
 These interrupt rates are quite high.  hpet0 refers to your event
 timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
 being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.
 
 Basically what's happening here is that you're generating a ton of
 interrupts doing dd if=/dev/urandom bs=512.  And it makes perfect sense
 to me why: because /dev/urandom has to harvest entropy from interrupt
 sources (please see random(4) man page), and you're generating a lot of
 interrupts to your AHCI controller for each individual 512-byte write.
 
 When you say "move a video from one dataset to another", please explain
 what it is you're moving from and to.  Specifically: what filesystems,
 and output from "zfs list".
 
 If you're moving a file from a ZFS filesystem to another ZFS filesystem
 on the same pool, then please state that.  That may help kernel folks
 figure out where your issue lies.
 
 At this stage, a kernel developer is going to need to step in and try to
 help you figure out where the actual bottleneck is occurring.  This is
 going to be very difficult/complex/very likely not possible with you
 using nas4free, because you will almost certainly be asked to rebuild
 world/kernel to include some new options and possibly asked to include
 DTrace/CTF support (for real-time debugging).  The situation is tricky.
 
 It would really help if you would/could remove nas4free from the picture
 and instead just run stock FreeBSD, because as I said, if there are some
 kind of kernel tunings or adjustment values the nas4free folks put in
 place that stock FreeBSD doesn't, those could be harming you.
 
 I can't be of more help here, I'm sorry to say.  The good news is that
 your disks sound fine.  Kernel developers will need to take this up.
 
 P.S. -- I would strongly recommend updating your nas4free forum post
 with a link to this conversation in this PR.  IMO, the nas4free people
 need to step up and take responsibility (and that almost certainly means
 talking/working with the FreeBSD folks).
 
 -- 
 | Jeremy Chadwick                                   jdc@koitsu.org |
 | UNIX Systems Administrator                http://jdc.koitsu.org/ |
 | Mountain View, CA, US                                            |
 | Making life hard for others since 1977.             PGP 4BD6C0CB |
 

From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 23:20:19 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id A87AFCC5
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 23:20:19 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ia0-x232.google.com (mail-ia0-x232.google.com
 [IPv6:2607:f8b0:4001:c02::232])
 by mx1.freebsd.org (Postfix) with ESMTP id 79612EB9
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 23:20:19 +0000 (UTC)
Received: by mail-ia0-f178.google.com with SMTP id y26so1457887iab.9
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 15:20:19 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:from:content-type:content-transfer-encoding:subject
 :message-id:date:to:mime-version:x-mailer;
 bh=mV634ayqjMBOx7SILnuvSR+59ucUPOAyAQgnk69w53U=;
 b=cvKZdBK8TuaL52Hpw3gkErDkhUMxpGbMyTCsWlwT0D6Y6nu4/sbHZBbS8sae0X7hHE
 AEJXfjMNsW+TQlu1LKOxKt/h32HDXEey6Q/eWXQkf89+XamYEzmGn0Q8nlccuuEtlc2Y
 NkR0rU8aR545WcEn2GJ3U8wcLxPViDhggoQUk=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:from:content-type:content-transfer-encoding:subject
 :message-id:date:to:mime-version:x-mailer:x-gm-message-state;
 bh=mV634ayqjMBOx7SILnuvSR+59ucUPOAyAQgnk69w53U=;
 b=TJv8YdYIq32YMay5tCTDPsnDGW+yb/+QWvEl0fNw7AG/6gWZEW1I1tx6xW/5fe0ePE
 PKRss0mdOP5mBFnN4xHMec7uNwpH/Y8E3J7kuy5r1jh8qQtS/+FGZocH4KKwC5xFPihN
 4utlKCSglWZoQmmK/UVrBxcEkVYvuU/v6NCpMFuHIYUXQSfGhMd8Clfn2BItYH0jrEp6
 d9MlfLHT/9UT3gqR5pI24a6BCt67iM03Id9XYEUxyt7oBUJU5Qb8yZOJ67zfcbdm5c1/
 z/JVuGY5hots0/fUmcM5EBiesKRfqNKXM3L/CgaE5Pe5AKLOzBCgJP3wNdRqQd1QgieM
 cklw==
X-Received: by 10.42.30.132 with SMTP id v4mr1808396icc.34.1359501619173;
 Tue, 29 Jan 2013 15:20:19 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id vq4sm2912997igb.10.2013.01.29.15.20.17
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Jan 2013 15:20:18 -0800 (PST)
From: Kevin Day <toasty@dragondata.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Subject: Improving ZFS performance for large directories
Message-Id: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
Date: Tue, 29 Jan 2013 17:20:15 -0600
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQlKSPeuKIj2/xZpRMC2K963YVXcKPluY/0NMQ+f5nOEn8076UECfrirKNoNH0eFdz2/QQWg
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 23:20:19 -0000


I'm trying to improve performance when using ZFS in large (>60000 files) =
directories. A common activity is to use "getdirentries" to enumerate =
all the files in the directory, then "lstat" on each one to get =
information about it. Doing an "ls -l" in a large directory like this =
can take 10-30 seconds to complete. Trying to figure out why, I did:

ktrace ls -l /path/to/large/directory
kdump -R |sort -rn |more

to see what sys calls were taking the most time, I ended up with:

 69247 ls       0.190729 STRU  struct stat {dev=3D846475008, =
ino=3D46220085, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196714, stime=3D1201004393, =
ctime=3D1333196714.547566024, birthtime=3D1333196714.547566024, =
size=3D30784, blksize=3D31232, blocks=3D62, flags=3D0x0 }
 69247 ls       0.180121 STRU  struct stat {dev=3D846475008, =
ino=3D46233417, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333197088, stime=3D1209814737, =
ctime=3D1333197088.913571042, birthtime=3D1333197088.913571042, =
size=3D3162220, blksize=3D131072, blocks=3D6409, flags=3D0x0 }
 69247 ls       0.152370 RET   getdirentries 4088/0xff8
 69247 ls       0.139939 CALL  stat(0x800d8f598,0x7fffffffcca0)
 69247 ls       0.130411 RET   __acl_get_link 0
 69247 ls       0.121602 RET   __acl_get_link 0
 69247 ls       0.105799 RET   getdirentries 4064/0xfe0
 69247 ls       0.105069 RET   getdirentries 4068/0xfe4
 69247 ls       0.096862 RET   getdirentries 4028/0xfbc
 69247 ls       0.085012 RET   getdirentries 4088/0xff8
 69247 ls       0.082722 STRU  struct stat {dev=3D846475008, =
ino=3D72941319, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1348686155, stime=3D1348347621, =
ctime=3D1348686155.768875422, birthtime=3D1348686155.768875422, =
size=3D6686225, blksize=3D131072, blocks=3D13325, flags=3D0x0 }
 69247 ls       0.070318 STRU  struct stat {dev=3D846475008, =
ino=3D46211679, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196475, stime=3D1240230314, =
ctime=3D1333196475.038567672, birthtime=3D1333196475.038567672, =
size=3D829895, blksize=3D131072, blocks=3D1797, flags=3D0x0 }
 69247 ls       0.068060 RET   getdirentries 4048/0xfd0
 69247 ls       0.065118 RET   getdirentries 4088/0xff8
 69247 ls       0.062536 RET   getdirentries 4096/0x1000
 69247 ls       0.061118 RET   getdirentries 4020/0xfb4
 69247 ls       0.055038 STRU  struct stat {dev=3D846475008, =
ino=3D46220358, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196720, stime=3D1274282669, =
ctime=3D1333196720.972567345, birthtime=3D1333196720.972567345, =
size=3D382344, blksize=3D131072, blocks=3D773, flags=3D0x0 }
 69247 ls       0.054948 STRU  struct stat {dev=3D846475008, =
ino=3D75025952, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1351071350, stime=3D1349726805, =
ctime=3D1351071350.800873870, birthtime=3D1351071350.800873870, =
size=3D2575559, blksize=3D131072, blocks=3D5127, flags=3D0x0 }
 69247 ls       0.054828 STRU  struct stat {dev=3D846475008, =
ino=3D65021883, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1335730367, stime=3D1332843230, =
ctime=3D1335730367.541567371, birthtime=3D1335730367.541567371, =
size=3D226347, blksize=3D131072, blocks=3D517, flags=3D0x0 }
 69247 ls       0.053743 STRU  struct stat {dev=3D846475008, =
ino=3D46222016, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196765, stime=3D1257110706, =
ctime=3D1333196765.206574132, birthtime=3D1333196765.206574132, =
size=3D62112, blksize=3D62464, blocks=3D123, flags=3D0x0 }
 69247 ls       0.052015 RET   getdirentries 4060/0xfdc
 69247 ls       0.051388 RET   getdirentries 4068/0xfe4
 69247 ls       0.049875 RET   getdirentries 4088/0xff8
 69247 ls       0.049156 RET   getdirentries 4032/0xfc0
 69247 ls       0.048609 RET   getdirentries 4040/0xfc8
 69247 ls       0.048279 RET   getdirentries 4032/0xfc0
 69247 ls       0.048062 RET   getdirentries 4064/0xfe0
 69247 ls       0.047577 RET   getdirentries 4076/0xfec
(snip)

the STRU are returns from calling lstat().

It looks like both getdirentries and lstat are taking quite a while to =
return. The shortest return for any lstat() call is 0.000004 seconds, =
the maximum is 0.190729 and the average is around 0.0004. Just from =
lstat() alone, that makes "ls" take over 20 seconds.

I'm prepared to try an L2arc cache device (with =
secondarycache=3Dmetadata), but I'm having trouble determining how big =
of a device I'd need. We've got >30M inodes now on this filesystem, =
including some files with extremely long names. Is there some way to =
determine the amount of metadata on a ZFS filesystem?


From owner-freebsd-fs@FreeBSD.ORG  Tue Jan 29 23:42:30 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 5A31C248
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 23:42:30 +0000 (UTC)
 (envelope-from matthew.ahrens@delphix.com)
Received: from mail-lb0-f179.google.com (mail-lb0-f179.google.com
 [209.85.217.179]) by mx1.freebsd.org (Postfix) with ESMTP id AF4BFF98
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 23:42:29 +0000 (UTC)
Received: by mail-lb0-f179.google.com with SMTP id j14so1406798lbo.24
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 15:42:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=delphix.com; s=google;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=4dOb06WmFFmGv//BJ/aIa1Nknh5ufuQM5mi+Ga3VnSE=;
 b=VyQiUBfj5DJ0MCTh5ethc/8eZZuncYteMgWu0H8X7PCl78Y7pzMfYyXwZ+l7O40Vse
 iZgMM7fbnRR08NnOmm5vS9wpHgJcZ9CwOvTLTTSluKwfCiOF7mcuCz6txlsO9Y0FKTXf
 bCRm4ogQG+8dhgn9AMjeYySEXJEdpgLPIIn1E=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:x-gm-message-state;
 bh=4dOb06WmFFmGv//BJ/aIa1Nknh5ufuQM5mi+Ga3VnSE=;
 b=PnT4pYXsrcWQeDvcvIszTQ9ElSCtOau6utyEFYLaFrXm7LkVhy9u961J+LHl/LDtJE
 MS6qTL6x4DtbstAAJkawT1Zf6wv6gfQxAc7yhFRW9TpU9EPgDA0jboYaLlYL1lbsdGfK
 oG8GNZuSS6eg+sxn2kXm1V9Jqbel5X7YQ5J0aah8mD89ml5oDjAvR9QFjxf5kL05RpCj
 EfY4DgJw4DWxOvfrUyx91eSe4c+qz0VRhN7wZjznIaVznyU6/YRGtOeIg0vEQfx836z6
 QpJ/S43a1EYgBgh2OnCuHEJUlCfhsRndM9IdPIn1zXfiO2cIt9GXtwahJHNWVUhaLIyW
 DNGg==
MIME-Version: 1.0
X-Received: by 10.152.144.202 with SMTP id so10mr2721142lab.9.1359502948362;
 Tue, 29 Jan 2013 15:42:28 -0800 (PST)
Received: by 10.114.68.109 with HTTP; Tue, 29 Jan 2013 15:42:28 -0800 (PST)
In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
Date: Tue, 29 Jan 2013 15:42:28 -0800
Message-ID: <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
Subject: Re: Improving ZFS performance for large directories
From: Matthew Ahrens <mahrens@delphix.com>
To: Kevin Day <toasty@dragondata.com>
X-Gm-Message-State: ALoCoQmoJx68nM8xURko1bIRspmB/I2cz6arNabSxA3deWbDWGQe5KWlDHVA5r1lc+FVZokLVdpR
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Jan 2013 23:42:30 -0000

On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toasty@dragondata.com> wrote:

> I'm prepared to try an L2arc cache device (with secondarycache=metadata),


You might first see how long it takes when everything is cached.  E.g. by
doing this in the same directory several times.  This will give you a lower
bound on the time it will take (or put another way, an upper bound on the
improvement available from a cache device).


> but I'm having trouble determining how big of a device I'd need. We've got
> >30M inodes now on this filesystem, including some files with extremely
> long names. Is there some way to determine the amount of metadata on a ZFS
> filesystem?


For a specific filesystem, nothing comes to mind, but I'm sure you could
cobble something together with zdb.  There are several tools to determine
the amount of metadata in a ZFS storage pool:

 - "zdb -bbb <pool>"
     but this is unreliable on pools that are in use
 - "zpool scrub <pool>; <wait for scrub to complete>; echo '::walk
spa|::zfs_blkstats' | mdb -k"
    the scrub is slow, but this can be mitigated by setting the global
variable zfs_no_scrub_io to 1.  If you don't have mdb or equivalent
debugging tools on freebsd, you can manually look at
<spa_t>->spa_dsl_pool->dp_blkstats.

In either case, the "LSIZE" is the size that's required for caching (in
memory or on a l2arc cache device).  At a minimum you will need 512 bytes
for each file, to cache the dnode_phys_t.

--matt

From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 00:06:06 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 16348650
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 00:06:06 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ia0-x22e.google.com (mail-ia0-x22e.google.com
 [IPv6:2607:f8b0:4001:c02::22e])
 by mx1.freebsd.org (Postfix) with ESMTP id D7221FE
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 00:06:05 +0000 (UTC)
Received: by mail-ia0-f174.google.com with SMTP id o25so1453847iad.5
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 16:06:05 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:message-id:references:to:x-mailer;
 bh=8sboX/ToX1azSjD1R2MMSsZH/dhn5fvGFOkTgIZv4jY=;
 b=jbE8aMdNgBQK7S8V2vlZbNCt1iQ9dHQRYxk4BiBv0kiz/0scoxWzMdVHrAummJl5Ts
 mUcdXnNm1dVmUmNbFLRiCCQrrVTZ/4FTsfc1/EkykJrdDpblGlfAYY9lVwqOqwKd4vf9
 TyDIq7fK7n+sNx+mc3O7SQ7v/wYuFNpDpL/uY=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:message-id:references:to:x-mailer:x-gm-message-state;
 bh=8sboX/ToX1azSjD1R2MMSsZH/dhn5fvGFOkTgIZv4jY=;
 b=G+laN224gQKTIm2dkhUxiJbxhFZ7zQe13nP/iOThgaF2Zonr1LlnEL+9/rcAXYEa0/
 gvatwHT0Nj0NWGsr5eJboVTpK17WRCkpCI+nwpykit98Iee503M4r4msS6z1cTrmWKhq
 yFiOnrWM83YJvYmq5ab0qpMTAr70jIVCdpLTVyPVumcW63GpGVnbLVlj6eEZx6DhEHXs
 IZ9cslDVDqNGJRgkgws2BAQ7fRDEU3k5ElvhfL3NCDUrRopBvUBMQbb14QGr/kQbZfc7
 GNdFVtov9te0XIyNe8Lspzalo0rknuVxYdw1W3IwfJi2jVaRymyWYfESAc7RPEv46k71
 5lSA==
X-Received: by 10.42.11.203 with SMTP id v11mr1911977icv.28.1359504365528;
 Tue, 29 Jan 2013 16:06:05 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id uj6sm3844598igb.4.2013.01.29.16.06.03
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Jan 2013 16:06:04 -0800 (PST)
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Improving ZFS performance for large directories
From: Kevin Day <toasty@dragondata.com>
In-Reply-To: <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
Date: Tue, 29 Jan 2013 18:06:01 -0600
Message-Id: <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
To: Matthew Ahrens <mahrens@delphix.com>
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQn+7di3aQMcA75dIfGldt7pYAFItZYBgEiliyBw3tVFZFyVAdaNgRG692kEa9iKr5URNyr/
Content-Type: text/plain;
	charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 00:06:06 -0000


On Jan 29, 2013, at 5:42 PM, Matthew Ahrens <mahrens@delphix.com> wrote:

> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toasty@dragondata.com> =
wrote:
> I'm prepared to try an L2arc cache device (with =
secondarycache=3Dmetadata),
>=20
> You might first see how long it takes when everything is cached.  E.g. =
by doing this in the same directory several times.  This will give you a =
lower bound on the time it will take (or put another way, an upper bound =
on the improvement available from a cache device).
> =20

Doing it twice back-to-back makes a bit of difference but it's still =
slow either way.

After not touching this directory for about 30 minutes:

# time ls -l >/dev/null
0.773u 2.665s 0:18.21 18.8%	35+2749k 3012+0io 0pf+0w

Immediately again:

# time ls -l > /dev/null
0.665u 1.077s 0:08.60 20.1%	35+2719k 556+0io 0pf+0w

18.2 vs 8.6 seconds is an improvement, but even the 8.6 seconds is =
longer than what I was expecting.

>=20
> For a specific filesystem, nothing comes to mind, but I'm sure you =
could cobble something together with zdb.  There are several tools to =
determine the amount of metadata in a ZFS storage pool:
>=20
>  - "zdb -bbb <pool>"
>      but this is unreliable on pools that are in use

I tried this and it consumed >16GB of memory after about 5 minutes so I =
had to kill it. I'll try it again during our next maintenance window =
where it can be the only thing running.

>  - "zpool scrub <pool>; <wait for scrub to complete>; echo '::walk =
spa|::zfs_blkstats' | mdb -k"
>     the scrub is slow, but this can be mitigated by setting the global =
variable zfs_no_scrub_io to 1.  If you don't have mdb or equivalent =
debugging tools on freebsd, you can manually look at =
<spa_t>->spa_dsl_pool->dp_blkstats.
>=20
> In either case, the "LSIZE" is the size that's required for caching =
(in memory or on a l2arc cache device).  At a minimum you will need 512 =
bytes for each file, to cache the dnode_phys_t.

Okay, thanks a bunch. I'll try this on the next chance I get too.


I think some of the issue is that nothing is being allowed to stay =
cached long. We have several parallel rsyncs running at once that are =
basically scanning every directory as fast as they can, combined with a =
bunch of rsync, http and ftp clients. I'm guessing with all that =
activity things are getting shoved out pretty quickly.


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 01:28:58 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id CA8297CD
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 01:28:58 +0000 (UTC)
 (envelope-from prvs=1742e413e9=killing@multiplay.co.uk)
Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 by mx1.freebsd.org (Postfix) with ESMTP id 679793FF
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 01:28:58 +0000 (UTC)
Received: from r2d2 ([188.220.16.49])
 by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23])
 (MDaemon PRO v10.0.4) with ESMTP id md50001920725.msg
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 01:28:56 +0000
X-Spam-Processed: mail1.multiplay.co.uk, Wed, 30 Jan 2013 01:28:56 +0000
 (not processed: message from valid local sender)
X-MDRemoteIP: 188.220.16.49
X-Return-Path: prvs=1742e413e9=killing@multiplay.co.uk
X-Envelope-From: killing@multiplay.co.uk
X-MDaemon-Deliver-To: freebsd-fs@freebsd.org
Message-ID: <9792709BF58143EFBDAABE638F769775@multiplay.co.uk>
From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Kevin Day" <toasty@dragondata.com>, "Matthew Ahrens" <mahrens@delphix.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
 <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com>
Subject: Re: Improving ZFS performance for large directories
Date: Wed, 30 Jan 2013 01:29:35 -0000
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
 reply-type=original
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 01:28:58 -0000


----- Original Message ----- 
From: "Kevin Day" <toasty@dragondata.com>

> I think some of the issue is that nothing is being allowed to stay cached long. We have several parallel rsyncs running at once 
> that are basically scanning every directory as fast as they can, combined with a bunch of rsync, http and ftp clients. I'm 
> guessing with all that activity things are getting shoved out pretty quickly.

zfs send / recv a possible replacements for the rsyncs?

    Regards
    Steve 


================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 02:24:58 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id E10623DF
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 02:24:58 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ie0-x235.google.com (mail-ie0-x235.google.com
 [IPv6:2607:f8b0:4001:c03::235])
 by mx1.freebsd.org (Postfix) with ESMTP id 737257C3
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 02:24:58 +0000 (UTC)
Received: by mail-ie0-f181.google.com with SMTP id 17so919412iea.12
 for <freebsd-fs@freebsd.org>; Tue, 29 Jan 2013 18:24:58 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:subject:mime-version:content-type:from:x-priority
 :in-reply-to:date:cc:content-transfer-encoding:message-id:references
 :to:x-mailer; bh=8Zzcq+JZK8DfDrybaLFcueH81JvX4Y2StmRgnRHs5C4=;
 b=YG6yuEdOTGr0886IDC2CRkM79p8JwMoNcX5wHp2qNdipOapVWvmCxaaSTXcQPbweIu
 62CmrPOFr+riuPrmEA3btybecCpdWuHqLiF4U4BFD6krE+aHmAkpQZQOoHkuTpRKS1HK
 i3gnJxaBgNGagFa/uAuuuGIwB91GKzydBSJhE=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:subject:mime-version:content-type:from:x-priority
 :in-reply-to:date:cc:content-transfer-encoding:message-id:references
 :to:x-mailer:x-gm-message-state;
 bh=8Zzcq+JZK8DfDrybaLFcueH81JvX4Y2StmRgnRHs5C4=;
 b=EuCe7QedueA8qTP/b5u1UencayIhMgGavcLPECP6/3IY8azBUf9cpF9oUwwsYH2ump
 OWRJzQ1YdX7ufpY/kCjGiaQAyaxTBTvg5go6UyzeoFMXfB/WnbRJ0u6cyCXth84x85AD
 6PqN8/LTwlh+x7fs+WyvxaMczHNMP3gysZPHcXRbnNSaon3vemaQcp0hI40DbVzxWqve
 zW/fHI10q8MrQdY3hCgrknbQoEUcxwZv8DaYCOGjFkJc7WR2Iy64YPHn6DINbUOAWhLj
 /+WgqPmI332EWEaN5TXWNBpQDnOomJnZi6m9/GNpiqCQD1vKAFHtajKR/pMKAlEnuhds
 PJmA==
X-Received: by 10.42.27.74 with SMTP id i10mr2064818icc.47.1359512698139;
 Tue, 29 Jan 2013 18:24:58 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id bg10sm3322632igc.6.2013.01.29.18.24.55
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 29 Jan 2013 18:24:57 -0800 (PST)
Subject: Re: Improving ZFS performance for large directories
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Content-Type: text/plain; charset=iso-8859-1
From: Kevin Day <toasty@dragondata.com>
X-Priority: 3
In-Reply-To: <9792709BF58143EFBDAABE638F769775@multiplay.co.uk>
Date: Tue, 29 Jan 2013 20:24:54 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <A8A14F21-1D9A-4B7A-93A0-266BD8099ECF@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
 <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com>
 <9792709BF58143EFBDAABE638F769775@multiplay.co.uk>
To: "Steven Hartland" <killing@multiplay.co.uk>
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQmoEPhlFq2Ko0MEwkfXrVoWjeJSF3zKexdL1H5iWwXSQ6a8mS6EPsPU9xyPK4jSzM1dVNW8
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>,
 Matthew Ahrens <mahrens@delphix.com>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 02:24:58 -0000


On Jan 29, 2013, at 7:29 PM, "Steven Hartland" <killing@multiplay.co.uk> =
wrote:

>=20
> ----- Original Message ----- From: "Kevin Day" <toasty@dragondata.com>
>=20
>> I think some of the issue is that nothing is being allowed to stay =
cached long. We have several parallel rsyncs running at once that are =
basically scanning every directory as fast as they can, combined with a =
bunch of rsync, http and ftp clients. I'm guessing with all that =
activity things are getting shoved out pretty quickly.
>=20
> zfs send / recv a possible replacements for the rsyncs?

Unfortunately not. We're pulling these files from a host that we do not =
control, and isn't running ZFS. We're also serving these files up via a =
public rsync daemon, and the vast majority of the clients receiving =
files from it are not running ZFS either.

Total data size is about 125TB now, growing to ~300TB in the near =
future. It's just a ton of data that really isn't being stored in the =
best manner for this kind of system, but we don't control the layout.

-- Kevin


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 09:43:34 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id CAA2F5F1;
 Wed, 30 Jan 2013 09:43:34 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from acme.spoerlein.net (acme.spoerlein.net
 [IPv6:2a01:4f8:131:23c2::1])
 by mx1.freebsd.org (Postfix) with ESMTP id 3A1EAAC8;
 Wed, 30 Jan 2013 09:43:34 +0000 (UTC)
Received: from localhost (acme.spoerlein.net [IPv6:2a01:4f8:131:23c2::1])
 by acme.spoerlein.net (8.14.6/8.14.6) with ESMTP id r0U9hQMG090453
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO);
 Wed, 30 Jan 2013 10:43:26 +0100 (CET) (envelope-from uqs@FreeBSD.org)
Date: Wed, 30 Jan 2013 10:43:26 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
To: Fabian Keil <freebsd-listen@fabiankeil.de>
Subject: Re: Zpool surgery
Message-ID: <20130130094326.GT35868@acme.spoerlein.net>
Mail-Followup-To: Fabian Keil <freebsd-listen@fabiankeil.de>,
 Dan Nelson <dnelson@allantgroup.com>,
 Peter Jeremy <peter@rulingia.com>, current@freebsd.org,
 fs@freebsd.org
References: <20130127103612.GB38645@acme.spoerlein.net>
 <1F0546C4D94D4CCE9F6BB4C8FA19FFF2@multiplay.co.uk>
 <20130127201140.GD29105@server.rulingia.com>
 <20130128085820.GR35868@acme.spoerlein.net>
 <20130128205802.1ffab53e@fabiankeil.de>
 <20130128214111.GA14888@dan.emsphone.com>
 <20130129155250.29d8f764@fabiankeil.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20130129155250.29d8f764@fabiankeil.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Dan Nelson <dnelson@allantgroup.com>, current@freebsd.org, fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 09:43:34 -0000

On Tue, 2013-01-29 at 15:52:50 +0100, Fabian Keil wrote:
> Dan Nelson <dnelson@allantgroup.com> wrote:
> 
> > In the last episode (Jan 28), Fabian Keil said:
> > > Ulrich Spörlein <uqs@FreeBSD.org> wrote:
> > > > On Mon, 2013-01-28 at 07:11:40 +1100, Peter Jeremy wrote:
> > > > > On 2013-Jan-27 14:31:56 -0000, Steven Hartland <killing@multiplay.co.uk> wrote:
> > > > > >----- Original Message ----- 
> > > > > >From: "Ulrich Spörlein" <uqs@FreeBSD.org>
> > > > > >> I want to transplant my old zpool tank from a 1TB drive to a new
> > > > > >> 2TB drive, but *not* use dd(1) or any other cloning mechanism, as
> > > > > >> the pool was very full very often and is surely severely
> > > > > >> fragmented.
> > > > > >
> > > > > >Cant you just drop the disk in the original machine, set it as a
> > > > > >mirror then once the mirror process has completed break the mirror
> > > > > >and remove the 1TB disk.
> > > > > 
> > > > > That will replicate any fragmentation as well.  "zfs send | zfs recv"
> > > > > is the only (current) way to defragment a ZFS pool.
> > > 
> > > It's not obvious to me why "zpool replace" (or doing it manually)
> > > would replicate the fragmentation.
> > 
> > "zpool replace" essentially adds your new disk as a mirror to the parent
> > vdev, then deletes the original disk when the resilver is done.  Since
> > mirrors are block-identical copies of each other, the new disk will contain
> > an exact copy of the original disk, followed by 1TB of freespace.
> 
> Thanks for the explanation.
> 
> I was under the impression that zfs mirrors worked at a higher
> level than traditional mirrors like gmirror but there seems to
> be indeed less magic than I expected.
> 
> Fabian

To wrap this up, while the zpool replace worked for the disk, I played
around with it some more, and using snapshots instead *did* work the
second time. I'm not sure what I did wrong the first time ...

So basically this:
# zfs send -R oldtank@2013-01-22 | zfs recv -F -d newtank
(takes ages, then do a final snapshot before unmounting and send the
incremental)
# zfs send -R -i 2013-01-22 oldtank@2013-01-29 | zfs recv -F -d newtank

Allows me to send snapshots up to 2013-01-29 to the "archive" pool from
either oldtank or newtank. Yay!

Cheers,
Uli

From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 10:20:05 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 9BC97CB6
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 10:20:05 +0000 (UTC)
 (envelope-from ronald-freebsd8@klop.yi.org)
Received: from smarthost1.greenhost.nl (smarthost1.greenhost.nl
 [195.190.28.78]) by mx1.freebsd.org (Postfix) with ESMTP id 3739CD31
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 10:20:04 +0000 (UTC)
Received: from smtp.greenhost.nl ([213.108.104.138])
 by smarthost1.greenhost.nl with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.69) (envelope-from <ronald-freebsd8@klop.yi.org>)
 id 1U0Ulu-0001bp-Hx
 for freebsd-fs@freebsd.org; Wed, 30 Jan 2013 11:20:03 +0100
Received: from [81.21.138.17] (helo=ronaldradial.versatec.local)
 by smtp.greenhost.nl with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.72) (envelope-from <ronald-freebsd8@klop.yi.org>)
 id 1U0Ulu-0007jg-Ew
 for freebsd-fs@freebsd.org; Wed, 30 Jan 2013 11:20:02 +0100
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
To: freebsd-fs@freebsd.org
Subject: Re: Improving ZFS performance for large directories
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
Date: Wed, 30 Jan 2013 11:20:02 +0100
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: "Ronald Klop" <ronald-freebsd8@klop.yi.org>
Message-ID: <op.wrpyzok18527sy@ronaldradial.versatec.local>
In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
User-Agent: Opera Mail/12.13 (Win32)
X-Virus-Scanned: by clamav at smarthost1.samage.net
X-Spam-Level: /
X-Spam-Score: 0.8
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.1
X-Scan-Signature: a9e4b997d6a751f3e45cb47a3c2b1d2c
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 10:20:05 -0000

On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day <toasty@dragondata.com>  
wrote:

>
> I'm trying to improve performance when using ZFS in large (>60000 files)  
> directories. A common activity is to use "getdirentries" to enumerate  
> all the files in the directory, then "lstat" on each one to get  
> information about it. Doing an "ls -l" in a large directory like this  
> can take 10-30 seconds to complete. Trying to figure out why, I did:
>
> ktrace ls -l /path/to/large/directory
> kdump -R |sort -rn |more

Does ls -lf /pat/to/large/directory make a difference. It makes ls not to  
sort the directory so it can use a more efficient way of traversing the  
directory.

Ronald.


>
> to see what sys calls were taking the most time, I ended up with:
>
>  69247 ls       0.190729 STRU  struct stat {dev=846475008, ino=46220085,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1333196714, stime=1201004393, ctime=1333196714.547566024,  
> birthtime=1333196714.547566024, size=30784, blksize=31232, blocks=62,  
> flags=0x0 }
>  69247 ls       0.180121 STRU  struct stat {dev=846475008, ino=46233417,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1333197088, stime=1209814737, ctime=1333197088.913571042,  
> birthtime=1333197088.913571042, size=3162220, blksize=131072,  
> blocks=6409, flags=0x0 }
>  69247 ls       0.152370 RET   getdirentries 4088/0xff8
>  69247 ls       0.139939 CALL  stat(0x800d8f598,0x7fffffffcca0)
>  69247 ls       0.130411 RET   __acl_get_link 0
>  69247 ls       0.121602 RET   __acl_get_link 0
>  69247 ls       0.105799 RET   getdirentries 4064/0xfe0
>  69247 ls       0.105069 RET   getdirentries 4068/0xfe4
>  69247 ls       0.096862 RET   getdirentries 4028/0xfbc
>  69247 ls       0.085012 RET   getdirentries 4088/0xff8
>  69247 ls       0.082722 STRU  struct stat {dev=846475008, ino=72941319,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1348686155, stime=1348347621, ctime=1348686155.768875422,  
> birthtime=1348686155.768875422, size=6686225, blksize=131072,  
> blocks=13325, flags=0x0 }
>  69247 ls       0.070318 STRU  struct stat {dev=846475008, ino=46211679,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1333196475, stime=1240230314, ctime=1333196475.038567672,  
> birthtime=1333196475.038567672, size=829895, blksize=131072,  
> blocks=1797, flags=0x0 }
>  69247 ls       0.068060 RET   getdirentries 4048/0xfd0
>  69247 ls       0.065118 RET   getdirentries 4088/0xff8
>  69247 ls       0.062536 RET   getdirentries 4096/0x1000
>  69247 ls       0.061118 RET   getdirentries 4020/0xfb4
>  69247 ls       0.055038 STRU  struct stat {dev=846475008, ino=46220358,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1333196720, stime=1274282669, ctime=1333196720.972567345,  
> birthtime=1333196720.972567345, size=382344, blksize=131072, blocks=773,  
> flags=0x0 }
>  69247 ls       0.054948 STRU  struct stat {dev=846475008, ino=75025952,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1351071350, stime=1349726805, ctime=1351071350.800873870,  
> birthtime=1351071350.800873870, size=2575559, blksize=131072,  
> blocks=5127, flags=0x0 }
>  69247 ls       0.054828 STRU  struct stat {dev=846475008, ino=65021883,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1335730367, stime=1332843230, ctime=1335730367.541567371,  
> birthtime=1335730367.541567371, size=226347, blksize=131072, blocks=517,  
> flags=0x0 }
>  69247 ls       0.053743 STRU  struct stat {dev=846475008, ino=46222016,  
> mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295,  
> atime=1333196765, stime=1257110706, ctime=1333196765.206574132,  
> birthtime=1333196765.206574132, size=62112, blksize=62464, blocks=123,  
> flags=0x0 }
>  69247 ls       0.052015 RET   getdirentries 4060/0xfdc
>  69247 ls       0.051388 RET   getdirentries 4068/0xfe4
>  69247 ls       0.049875 RET   getdirentries 4088/0xff8
>  69247 ls       0.049156 RET   getdirentries 4032/0xfc0
>  69247 ls       0.048609 RET   getdirentries 4040/0xfc8
>  69247 ls       0.048279 RET   getdirentries 4032/0xfc0
>  69247 ls       0.048062 RET   getdirentries 4064/0xfe0
>  69247 ls       0.047577 RET   getdirentries 4076/0xfec
> (snip)
>
> the STRU are returns from calling lstat().
>
> It looks like both getdirentries and lstat are taking quite a while to  
> return. The shortest return for any lstat() call is 0.000004 seconds,  
> the maximum is 0.190729 and the average is around 0.0004. Just from  
> lstat() alone, that makes "ls" take over 20 seconds.
>
> I'm prepared to try an L2arc cache device (with  
> secondarycache=metadata), but I'm having trouble determining how big of  
> a device I'd need. We've got >30M inodes now on this filesystem,  
> including some files with extremely long names. Is there some way to  
> determine the amount of metadata on a ZFS filesystem?
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 10:36:42 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 4900D351
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 10:36:42 +0000 (UTC)
 (envelope-from ndenev@gmail.com)
Received: from mail-bk0-f50.google.com (mail-bk0-f50.google.com
 [209.85.214.50]) by mx1.freebsd.org (Postfix) with ESMTP id BEC36DEE
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 10:36:41 +0000 (UTC)
Received: by mail-bk0-f50.google.com with SMTP id jg9so742690bkc.37
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 02:36:34 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=x-received:subject:mime-version:content-type:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=XwHHaBNRncPTtC58MSyZcTj2pLTLfUFOxLhKoPVoOyk=;
 b=fiJn30P9vkLKlpzRL2/tToD0BPwYJcAb0HbHvf9eQO4E3IgA/svj5KZcqzDixPUdit
 sozWw/Iz/IGkeLTBps/SR2CQQXVqZactSDQhte7thzxuu7S9+uIFE8m+5VylGm7xEQJf
 0jKamcJ6GQGm+dINt/LMFvqpzMZcNZxO8xgA6Qal2OyAZ9fp0xem6G/yHcIX4ueoeuPq
 9VZzcqCwbXj7q2gKafOIDdZanufwrbidtW+MwCp5KUGRZ/49AN5A7msKqjLkb/P+q6Q0
 kzQj0KIa0O8i3YNHEc9imX2CxIZ+pFGZOWsFwmda5hzQ+AQODLjQeipFHR3XN9ojlGSJ
 Cj6A==
X-Received: by 10.204.12.206 with SMTP id y14mr1081502bky.132.1359542194602;
 Wed, 30 Jan 2013 02:36:34 -0800 (PST)
Received: from ndenevsa.sf.moneybookers.net (g1.moneybookers.com.
 [217.18.249.148])
 by mx.google.com with ESMTPS id z5sm383371bkv.11.2013.01.30.02.36.33
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 30 Jan 2013 02:36:34 -0800 (PST)
Subject: Re: Improving ZFS performance for large directories
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Content-Type: text/plain; charset=us-ascii
From: Nikolay Denev <ndenev@gmail.com>
In-Reply-To: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
Date: Wed, 30 Jan 2013 12:36:35 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
To: Kevin Day <toasty@dragondata.com>
X-Mailer: Apple Mail (2.1499)
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 10:36:42 -0000

On Jan 30, 2013, at 1:20 AM, Kevin Day <toasty@dragondata.com> wrote:

>=20
> I'm trying to improve performance when using ZFS in large (>60000 =
files) directories. A common activity is to use "getdirentries" to =
enumerate all the files in the directory, then "lstat" on each one to =
get information about it. Doing an "ls -l" in a large directory like =
this can take 10-30 seconds to complete. Trying to figure out why, I =
did:
>=20
> ktrace ls -l /path/to/large/directory
> kdump -R |sort -rn |more
>=20
> to see what sys calls were taking the most time, I ended up with:
>=20
> 69247 ls       0.190729 STRU  struct stat {dev=3D846475008, =
ino=3D46220085, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196714, stime=3D1201004393, =
ctime=3D1333196714.547566024, birthtime=3D1333196714.547566024, =
size=3D30784, blksize=3D31232, blocks=3D62, flags=3D0x0 }
> 69247 ls       0.180121 STRU  struct stat {dev=3D846475008, =
ino=3D46233417, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333197088, stime=3D1209814737, =
ctime=3D1333197088.913571042, birthtime=3D1333197088.913571042, =
size=3D3162220, blksize=3D131072, blocks=3D6409, flags=3D0x0 }
> 69247 ls       0.152370 RET   getdirentries 4088/0xff8
> 69247 ls       0.139939 CALL  stat(0x800d8f598,0x7fffffffcca0)
> 69247 ls       0.130411 RET   __acl_get_link 0
> 69247 ls       0.121602 RET   __acl_get_link 0
> 69247 ls       0.105799 RET   getdirentries 4064/0xfe0
> 69247 ls       0.105069 RET   getdirentries 4068/0xfe4
> 69247 ls       0.096862 RET   getdirentries 4028/0xfbc
> 69247 ls       0.085012 RET   getdirentries 4088/0xff8
> 69247 ls       0.082722 STRU  struct stat {dev=3D846475008, =
ino=3D72941319, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1348686155, stime=3D1348347621, =
ctime=3D1348686155.768875422, birthtime=3D1348686155.768875422, =
size=3D6686225, blksize=3D131072, blocks=3D13325, flags=3D0x0 }
> 69247 ls       0.070318 STRU  struct stat {dev=3D846475008, =
ino=3D46211679, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196475, stime=3D1240230314, =
ctime=3D1333196475.038567672, birthtime=3D1333196475.038567672, =
size=3D829895, blksize=3D131072, blocks=3D1797, flags=3D0x0 }
> 69247 ls       0.068060 RET   getdirentries 4048/0xfd0
> 69247 ls       0.065118 RET   getdirentries 4088/0xff8
> 69247 ls       0.062536 RET   getdirentries 4096/0x1000
> 69247 ls       0.061118 RET   getdirentries 4020/0xfb4
> 69247 ls       0.055038 STRU  struct stat {dev=3D846475008, =
ino=3D46220358, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196720, stime=3D1274282669, =
ctime=3D1333196720.972567345, birthtime=3D1333196720.972567345, =
size=3D382344, blksize=3D131072, blocks=3D773, flags=3D0x0 }
> 69247 ls       0.054948 STRU  struct stat {dev=3D846475008, =
ino=3D75025952, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1351071350, stime=3D1349726805, =
ctime=3D1351071350.800873870, birthtime=3D1351071350.800873870, =
size=3D2575559, blksize=3D131072, blocks=3D5127, flags=3D0x0 }
> 69247 ls       0.054828 STRU  struct stat {dev=3D846475008, =
ino=3D65021883, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1335730367, stime=3D1332843230, =
ctime=3D1335730367.541567371, birthtime=3D1335730367.541567371, =
size=3D226347, blksize=3D131072, blocks=3D517, flags=3D0x0 }
> 69247 ls       0.053743 STRU  struct stat {dev=3D846475008, =
ino=3D46222016, mode=3D-rw-r--r-- , nlink=3D1, uid=3D0, gid=3D0, =
rdev=3D4294967295, atime=3D1333196765, stime=3D1257110706, =
ctime=3D1333196765.206574132, birthtime=3D1333196765.206574132, =
size=3D62112, blksize=3D62464, blocks=3D123, flags=3D0x0 }
> 69247 ls       0.052015 RET   getdirentries 4060/0xfdc
> 69247 ls       0.051388 RET   getdirentries 4068/0xfe4
> 69247 ls       0.049875 RET   getdirentries 4088/0xff8
> 69247 ls       0.049156 RET   getdirentries 4032/0xfc0
> 69247 ls       0.048609 RET   getdirentries 4040/0xfc8
> 69247 ls       0.048279 RET   getdirentries 4032/0xfc0
> 69247 ls       0.048062 RET   getdirentries 4064/0xfe0
> 69247 ls       0.047577 RET   getdirentries 4076/0xfec
> (snip)
>=20
> the STRU are returns from calling lstat().
>=20
> It looks like both getdirentries and lstat are taking quite a while to =
return. The shortest return for any lstat() call is 0.000004 seconds, =
the maximum is 0.190729 and the average is around 0.0004. Just from =
lstat() alone, that makes "ls" take over 20 seconds.
>=20
> I'm prepared to try an L2arc cache device (with =
secondarycache=3Dmetadata), but I'm having trouble determining how big =
of a device I'd need. We've got >30M inodes now on this filesystem, =
including some files with extremely long names. Is there some way to =
determine the amount of metadata on a ZFS filesystem?
>=20
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used =
sysctls?
Maybe increasing the limit can help?

Regards,


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 15:15:10 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 799B41B8
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 15:15:10 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ie0-x22c.google.com (ie-in-x022c.1e100.net
 [IPv6:2607:f8b0:4001:c03::22c])
 by mx1.freebsd.org (Postfix) with ESMTP id 490C9F94
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 15:15:10 +0000 (UTC)
Received: by mail-ie0-f172.google.com with SMTP id c10so1338113ieb.17
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 07:15:10 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=4jStnLPiLJlFlXWlcVd2u39urNx9ecuHBzR1WKnGeAY=;
 b=duvti6SSYFdBAFGyj87Qf3PSjr9+9PgNgdCCDIVygZjzzEunXnEYmsQum1cwd6iHmS
 hxaifO0JuzuzVGzG3suqMY+pYVVrvhXeCBQEUP07lo6YxguAMFciAqerdqFf3rBGwXda
 1ugktxPbzKf/p3np0p0Hsakib/9Uf4SZzGjUo=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer
 :x-gm-message-state;
 bh=4jStnLPiLJlFlXWlcVd2u39urNx9ecuHBzR1WKnGeAY=;
 b=oiSUN8jI/7GqNIRyYkCUIoLAyygk/qwDOYYkAM2hsjW/smXQaqv9d1EowrzQYJ7sOK
 9oxQaGph4jiSpE9xDXliaiikZRI8ht2XjGnT+7dGny1Cd5imThT8Dxl/WnvZEiXANAzz
 lnxV+uM3S2DvfC+TtDgHX8B6UeT6fTfNUdkW8Iylylmd/ohX/XekYTQmk5Z4r9ZuAOMq
 /DiWoJXuj2A4DCuauu7kBOndTGXDmhkg0PNRRHc81c21jVjbceCLu8bCLDZW3H+KJLtX
 VKgeTfh8Sx8Jo42fhlVCCzar3ZAvablXw6b95w1hxWcVYSxzf0E/01hyD1A1FVXXJG3c
 thrg==
X-Received: by 10.50.47.200 with SMTP id f8mr3951203ign.98.1359558909991;
 Wed, 30 Jan 2013 07:15:09 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id bg10sm4495947igc.6.2013.01.30.07.15.07
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 30 Jan 2013 07:15:08 -0800 (PST)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Improving ZFS performance for large directories
From: Kevin Day <toasty@dragondata.com>
In-Reply-To: <op.wrpyzok18527sy@ronaldradial.versatec.local>
Date: Wed, 30 Jan 2013 09:15:04 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <AB6FA392-D80D-4280-8B16-FB931D2AD35C@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <op.wrpyzok18527sy@ronaldradial.versatec.local>
To: "Ronald Klop" <ronald-freebsd8@klop.yi.org>
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQkruxYgS03kacjD54zkMx4V3qRKoiNaTXmQywWTjh4H6FRcoDDHNGsh5gUccTzxo4AX8ktz
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 15:15:10 -0000


On Jan 30, 2013, at 4:20 AM, "Ronald Klop" <ronald-freebsd8@klop.yi.org> =
wrote:

> On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day <toasty@dragondata.com> =
wrote:
>=20
>>=20
>> I'm trying to improve performance when using ZFS in large (>60000 =
files) directories. A common activity is to use "getdirentries" to =
enumerate all the files in the directory, then "lstat" on each one to =
get information about it. Doing an "ls -l" in a large directory like =
this can take 10-30 seconds to complete. Trying to figure out why, I =
did:
>>=20
>> ktrace ls -l /path/to/large/directory
>> kdump -R |sort -rn |more
>=20
> Does ls -lf /pat/to/large/directory make a difference. It makes ls not =
to sort the directory so it can use a more efficient way of traversing =
the directory.
>=20
> Ronald.

Nope, the sort seems to add a trivial amount of extra time to the entire =
operation. Nearly all the time is spent in lstat() or getdirentries(). =
Good idea though!

-- Kevin


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 15:19:59 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 8559B308
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 15:19:59 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ie0-x22c.google.com (ie-in-x022c.1e100.net
 [IPv6:2607:f8b0:4001:c03::22c])
 by mx1.freebsd.org (Postfix) with ESMTP id 515C7FDA
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 15:19:59 +0000 (UTC)
Received: by mail-ie0-f172.google.com with SMTP id c10so1369847ieb.3
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 07:19:59 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=r8f3Yu1HhqeZ9wHXx0xyu9zShuD73vpJQwFBGe3oISw=;
 b=qUbIKur38wWMrpJww/gOOCIYLFt90vPmxsMEGUiEsZPXgDgEuytOkpc3a7oGj3bnhM
 1/hP4ihpYbVWlSAMv5Y5q5CiG+Xj+HbL5zVmEM+WmqAxVO22TuxlPkxz2boseY6fjr5I
 9QHeX464fAAbOcFCpiWySrA/03PMOICKl5Uek=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer
 :x-gm-message-state;
 bh=r8f3Yu1HhqeZ9wHXx0xyu9zShuD73vpJQwFBGe3oISw=;
 b=PavDKmiLf/Sn1qOmSrsbmGHPgeoQS5LROkHLQLEEjBaGTAxDahWGqnkF6JvnkAVgGI
 S7A/gdGsI00skwd2VmVjZapgNlTywx5JdduWCrDWqcBtLzZJ/G+gkXhWhoU3iwb+JHUH
 H5NbctKDY+9xSqer5YcyiL3jVuHfHK3Z46NalGaZ6+T/op10FLh5hiHesi0bSen2z+Jv
 HaMmepWyojXOeiqWLXbz11gfqwfutPI8JIrcCIlwrWJhygizB9RON3NgI//Trq65MUOy
 +DxoEr0Yh6ZsfNIqUnmU9hdjjVWg7GLBBZSR9AAcQbx9kg5IAy9DcJ8k8Z+kLBcNJJMp
 3e+w==
X-Received: by 10.50.13.208 with SMTP id j16mr3837750igc.73.1359559199008;
 Wed, 30 Jan 2013 07:19:59 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id fb10sm2077564igb.1.2013.01.30.07.19.55
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 30 Jan 2013 07:19:57 -0800 (PST)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Improving ZFS performance for large directories
From: Kevin Day <toasty@dragondata.com>
In-Reply-To: <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
Date: Wed, 30 Jan 2013 09:19:52 -0600
Content-Transfer-Encoding: 7bit
Message-Id: <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
To: Nikolay Denev <ndenev@gmail.com>
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQkYnbcKjzrRuMXTbOa1TnXlK7LV6CYRTKJE5zTGGBvs7cMBaMH2lTMQXKldb04O4ZSDB54d
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 15:19:59 -0000


On Jan 30, 2013, at 4:36 AM, Nikolay Denev <ndenev@gmail.com> wrote:
> 
> 
> What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used sysctls?
> Maybe increasing the limit can help?


vfs.zfs.arc_meta_limit: 8199079936
vfs.zfs.arc_meta_used: 13965744408

Full output of zfs-stats:


------------------------------------------------------------------------
ZFS Subsystem Report				Wed Jan 30 15:16:54 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				901000 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		28
	ZFS Filesystem Version:			5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
 3:16PM  up 19 days, 19:44, 2 users, load averages: 0.91, 0.80, 0.68

------------------------------------------------------------------------

System Memory:

	12.44%	7.72	GiB Active,	6.04%	3.75	GiB Inact
	77.33%	48.01	GiB Wired,	2.25%	1.40	GiB Cache
	1.94%	1.21	GiB Free,	0.00%	1.21	MiB Gap

	Real Installed:				64.00	GiB
	Real Available:			99.97%	63.98	GiB
	Real Managed:			97.04%	62.08	GiB

	Logical Total:				64.00	GiB
	Logical Used:			90.07%	57.65	GiB
	Logical Free:			9.93%	6.35	GiB

Kernel Memory:					22.62	GiB
	Data:				99.91%	22.60	GiB
	Text:				0.09%	21.27	MiB

Kernel Memory Map:				54.28	GiB
	Size:				34.75%	18.86	GiB
	Free:				65.25%	35.42	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				430.91m
	Recycle Misses:				111.27m
	Mutex Misses:				2.49m
	Evict Skips:				647.25m

ARC Size:				87.63%	26.77	GiB
	Target Size: (Adaptive)		87.64%	26.77	GiB
	Min Size (Hard Limit):		12.50%	3.82	GiB
	Max Size (High Water):		8:1	30.54	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	58.64%	15.70	GiB
	Frequently Used Cache Size:	41.36%	11.07	GiB

ARC Hash Breakdown:
	Elements Max:				2.19m
	Elements Current:		86.15%	1.89m
	Collisions:				344.47m
	Chain Max:				17
	Chains:					552.47k

------------------------------------------------------------------------

ARC Efficiency:					21.94b
	Cache Hit Ratio:		97.00%	21.28b
	Cache Miss Ratio:		3.00%	657.23m
	Actual Hit Ratio:		73.15%	16.05b

	Data Demand Efficiency:		98.94%	1.32b
	Data Prefetch Efficiency:	14.83%	299.44m

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		23.03%	4.90b
	  Most Recently Used:		6.12%	1.30b
	  Most Frequently Used:		69.29%	14.75b
	  Most Recently Used Ghost:	0.50%	105.94m
	  Most Frequently Used Ghost:	1.07%	226.92m

	CACHE HITS BY DATA TYPE:
	  Demand Data:			6.11%	1.30b
	  Prefetch Data:		0.21%	44.42m
	  Demand Metadata:		69.29%	14.75b
	  Prefetch Metadata:		24.38%	5.19b

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			2.12%	13.90m
	  Prefetch Data:		38.80%	255.02m
	  Demand Metadata:		30.97%	203.56m
	  Prefetch Metadata:		28.11%	184.75m

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					24.08b
	Hit Ratio:			66.02%	15.90b
	Miss Ratio:			33.98%	8.18b

	Colinear:				8.18b
	  Hit Ratio:			0.01%	560.82k
	  Miss Ratio:			99.99%	8.18b

	Stride:					15.23b
	  Hit Ratio:			99.98%	15.23b
	  Miss Ratio:			0.02%	2.62m

DMU Misc:
	Reclaim:				8.18b
	  Successes:			0.08%	6.31m
	  Failures:			99.92%	8.17b

	Streams:				663.44m
	  +Resets:			0.06%	397.18k
	  -Resets:			99.94%	663.04m
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
	kern.maxusers                           384
	vm.kmem_size                            66662760448
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        329853485875
	vfs.zfs.l2c_only_size                   0
	vfs.zfs.mfu_ghost_data_lsize            2121007104
	vfs.zfs.mfu_ghost_metadata_lsize        7876605440
	vfs.zfs.mfu_ghost_size                  9997612544
	vfs.zfs.mfu_data_lsize                  10160539648
	vfs.zfs.mfu_metadata_lsize              17161216
	vfs.zfs.mfu_size                        11163991040
	vfs.zfs.mru_ghost_data_lsize            7235079680
	vfs.zfs.mru_ghost_metadata_lsize        11107812352
	vfs.zfs.mru_ghost_size                  18342892032
	vfs.zfs.mru_data_lsize                  4406255616
	vfs.zfs.mru_metadata_lsize              3924364288
	vfs.zfs.mru_size                        8893582336
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_size                       999424
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_write_boost               8388608
	vfs.zfs.l2arc_write_max                 8388608
	vfs.zfs.arc_meta_limit                  8199079936
	vfs.zfs.arc_meta_used                   14161977912
	vfs.zfs.arc_min                         4099539968
	vfs.zfs.arc_max                         32796319744
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.write_limit_override            0
	vfs.zfs.write_limit_inflated            206088929280
	vfs.zfs.write_limit_max                 8587038720
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_shift               3
	vfs.zfs.no_write_throttle               0
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.prefetch_disable                0
	vfs.zfs.mg_alloc_failures               12
	vfs.zfs.check_hostid                    1
	vfs.zfs.recover                         0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.time_shift                 6
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.max_pending                10
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.cache_flush_disable             0
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.version.zpl                     5
	vfs.zfs.version.spa                     28
	vfs.zfs.version.acl                     1
	vfs.zfs.debug                           0
	vfs.zfs.super_owner                     0

------------------------------------------------------------------------


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 16:34:48 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 8D63B27E
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 16:34:48 +0000 (UTC)
 (envelope-from ndenev@gmail.com)
Received: from mail-bk0-f51.google.com (mail-bk0-f51.google.com
 [209.85.214.51]) by mx1.freebsd.org (Postfix) with ESMTP id E7728774
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 16:34:47 +0000 (UTC)
Received: by mail-bk0-f51.google.com with SMTP id ik5so922431bkc.38
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 08:34:41 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=x-received:subject:mime-version:content-type:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=8RTCp/ASRpz9O9eFd1VM9MVuj+AlGo/ex9Alocsrpy0=;
 b=P0iibY5ikgCzzIMWXojRZpi357R608ZbSCwecaIvvR6Z6TO6BzbzMfJOkGq6HBKKM0
 SwaMEO/Q0T1U1nEkWz5ujhv+5GzuO2dG0vWcoPcq+JptER9vrgTBttVEjakDLp8vI/BH
 YuThDDe7qcw8MKSeXNrmPqXDM6bblgDlL728ZKAnqqUhfGRmU920keaNDAmIyHtstEfO
 7lLulsLYq/Sl4foceTHANPoEG7Zc0o0jvx8NuU32ewdA/lIfzWBlgha5vozRTo16vCH7
 0D3dO5jt4UVrfkAZSBjIsxhDzhtwQwnV7hn7wnzACBdggU9zkmg/aWbJo/r2IlCSWo2p
 NeaA==
X-Received: by 10.204.11.78 with SMTP id s14mr1431340bks.118.1359563681583;
 Wed, 30 Jan 2013 08:34:41 -0800 (PST)
Received: from imba-brutale-3.totalterror.net ([93.152.184.10])
 by mx.google.com with ESMTPS id z5sm819545bkv.11.2013.01.30.08.34.39
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 30 Jan 2013 08:34:40 -0800 (PST)
Subject: Re: Improving ZFS performance for large directories
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Content-Type: text/plain; charset=windows-1252
From: Nikolay Denev <ndenev@gmail.com>
In-Reply-To: <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com>
Date: Wed, 30 Jan 2013 18:34:41 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
 <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com>
To: Kevin Day <toasty@dragondata.com>
X-Mailer: Apple Mail (2.1499)
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 16:34:48 -0000

On Jan 30, 2013, at 5:19 PM, Kevin Day <toasty@dragondata.com> wrote:
>=20
> On Jan 30, 2013, at 4:36 AM, Nikolay Denev <ndenev@gmail.com> wrote:
>>=20
>>=20
>> What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used =
sysctls?
>> Maybe increasing the limit can help?
>=20
>=20
> vfs.zfs.arc_meta_limit: 8199079936
> vfs.zfs.arc_meta_used: 13965744408
>=20
> Full output of zfs-stats: [=85snipped=85]

Looks like you can try to increase arc_meta_limit to be let's say : half =
of arc_max. (16398159872 in your case).


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 20:59:19 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 56498EAD
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 20:59:19 +0000 (UTC)
 (envelope-from toasty@dragondata.com)
Received: from mail-ie0-x22f.google.com (mail-ie0-x22f.google.com
 [IPv6:2607:f8b0:4001:c03::22f])
 by mx1.freebsd.org (Postfix) with ESMTP id F221E6AC
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 20:59:18 +0000 (UTC)
Received: by mail-ie0-f175.google.com with SMTP id c12so1687626ieb.6
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 12:59:18 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=dragondata.com; s=google;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=cnUN69pWlUjCGxQg9jE+TzJzR5O+4orLIVM/h6z/A7U=;
 b=Cc3tS0JuX+tkIqPeXky9R/5+AQB69kVOHLPA8szy6+4ThBEScrvCd3F3g1ker8WB3q
 GWAnc5o2yIEnJo6aQGK6Ssks6g7lDUDLVLV1GeJ0qm2LUXCrNtLvG2w1PQ8B5rcmrSXL
 e76RZyZbm3+jLpmPTsJ4VG5iSWlUiwXSciDpA=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer
 :x-gm-message-state;
 bh=cnUN69pWlUjCGxQg9jE+TzJzR5O+4orLIVM/h6z/A7U=;
 b=GTt2W43YcjtaORYRsleZg45/xnFXvNselPfeI6Nr2QbZTKnFDXmrwM5z8PFRkvLENx
 kE1WoPku4buWmvrJxdaYWqivYAlm6Sq6exPVedwhpPkFcmQ/rOT9W3NisAzcPOES3Tzj
 /egkFpIqtCWaFPIX4qzsHZe1mUJiywX8sXUgAMzbCIjkhwDBM7BX6LAa/lo6NM612NPf
 XVMwBzb2ezCrqCD0tHeP8uDPlWVRw1U1mONq4I+YL7Uyt6OOMfv+kzmDqvr0Uyci7lGA
 vzuOTL3dUCmVN4TfQfgtLH6QViXN6czCqQMtvphUeI9ZKqBgBRHp/CoQxDVS1VsY1Dbd
 GufA==
X-Received: by 10.50.192.197 with SMTP id hi5mr4638121igc.45.1359579558523;
 Wed, 30 Jan 2013 12:59:18 -0800 (PST)
Received: from vpn132.rw1.your.org (vpn132.rw1.your.org. [204.9.51.132])
 by mx.google.com with ESMTPS id fa6sm3343316igb.2.2013.01.30.12.59.16
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 30 Jan 2013 12:59:17 -0800 (PST)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: Improving ZFS performance for large directories
From: Kevin Day <toasty@dragondata.com>
In-Reply-To: <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com>
Date: Wed, 30 Jan 2013 14:59:15 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <D606CCAE-511F-46DC-838B-E1994B973309@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
 <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com>
 <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com>
To: Nikolay Denev <ndenev@gmail.com>
X-Mailer: Apple Mail (2.1499)
X-Gm-Message-State: ALoCoQnkS+6kCreYGM448c9En3dGEp/g3iUGqVrmWf1fActc0Mdnx6jek0faAnU6k7bN8BfFSNBO
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 20:59:19 -0000


On Jan 30, 2013, at 10:34 AM, Nikolay Denev <ndenev@gmail.com> wrote:

>>=20
>> vfs.zfs.arc_meta_limit: 8199079936
>> vfs.zfs.arc_meta_used: 13965744408
>>=20
>> Full output of zfs-stats: [=85snipped=85]
>=20
> Looks like you can try to increase arc_meta_limit to be let's say : =
half of arc_max. (16398159872 in your case).
>=20
>=20

Okay, will give this a shot on the next reboot too.

Does anyone here understand the significance of "used" being higher than =
"limit"? Is the limit only a suggestion, or are there cases where =
there'a certain metadata that must be in arc, and it's particularly =
large here?

-- Kevin


From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 21:56:10 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 6C643145
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 21:56:10 +0000 (UTC)
 (envelope-from artemb@gmail.com)
Received: from mail-vb0-f51.google.com (mail-vb0-f51.google.com
 [209.85.212.51]) by mx1.freebsd.org (Postfix) with ESMTP id 30961940
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 21:56:10 +0000 (UTC)
Received: by mail-vb0-f51.google.com with SMTP id fq11so1298286vbb.38
 for <freebsd-fs@freebsd.org>; Wed, 30 Jan 2013 13:56:09 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:sender:in-reply-to:references:date
 :x-google-sender-auth:message-id:subject:from:to:cc:content-type;
 bh=wpSwv3N4jMIGGcP8n72/lz2cALhn/QdLAKhy7UP+hEs=;
 b=mtwbJOBxtUdF6vKUzQssPpCv1g6Z66uP9VaNfEUCyGfzsv4+X6xwWeU+g+NwXi9GCH
 nhiPiH3r56Iz9kjTx+ZM7dCGMTkHPbEcrXYl1RmoqhW+qI0TbUCMNoOIhYaY4HCagrWW
 nJonqyNbla7eJkX+qNE7k2BAbzxFEJV6vs2bAIafXk6L5QzWT4xWd8LDhM/73dlRZX8g
 P7mVU3G4ysL6UzyyB62ewF37cHpVWR9w2ld4hy3seAJmCFIPyCTJ+hdQiT/FOg6zYjIt
 aXH5ZPqZJUhaxsGHOCphYhdZfsLouNMglO39h/XufIL+eRoXj2rvJqWPRS7/ORSpnx1V
 +Q2w==
MIME-Version: 1.0
X-Received: by 10.220.108.2 with SMTP id d2mr6090053vcp.60.1359582969557; Wed,
 30 Jan 2013 13:56:09 -0800 (PST)
Sender: artemb@gmail.com
Received: by 10.220.123.2 with HTTP; Wed, 30 Jan 2013 13:56:09 -0800 (PST)
In-Reply-To: <D606CCAE-511F-46DC-838B-E1994B973309@dragondata.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <5267B97C-ED47-4AAB-8415-12D6987E9371@gmail.com>
 <47975CEB-EA50-4F6C-8C47-6F32312F34C4@dragondata.com>
 <23E6691A-F30C-4731-9F78-FD8ADDDA09AE@gmail.com>
 <D606CCAE-511F-46DC-838B-E1994B973309@dragondata.com>
Date: Wed, 30 Jan 2013 13:56:09 -0800
X-Google-Sender-Auth: UFa8nGEIxkYKzwrgcUrOaIq3oic
Message-ID: <CAFqOu6gmyQLD5VPOPpyXjWd6quUYh6mvavVbLDP3rLc4kBHbgg@mail.gmail.com>
Subject: Re: Improving ZFS performance for large directories
From: Artem Belevich <art@freebsd.org>
To: Kevin Day <toasty@dragondata.com>
Content-Type: text/plain; charset=ISO-8859-1
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 21:56:10 -0000

On Wed, Jan 30, 2013 at 12:59 PM, Kevin Day <toasty@dragondata.com> wrote:
>
> Does anyone here understand the significance of "used" being higher than "limit"? Is the limit only a suggestion, or are there cases where there'a certain metadata that must be in arc, and it's particularly large here?

arc_meta_limit is a soft limit which basically tells ARC to attempt
evicting metadata entries and reuse their buffers as opposed to
allocating new memory and growing ARC. According to the comment next
to arc_evict() function, it's a best-effort attempt and eviction is
not guaranteed. That could potentially allow meta_size to remain above
meta_limit.

--Artem

From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 30 22:16:19 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 93A218A7;
 Wed, 30 Jan 2013 22:16:19 +0000 (UTC)
 (envelope-from universite@ukr.net)
Received: from ffe17.ukr.net (ffe17.ukr.net [195.214.192.83])
 by mx1.freebsd.org (Postfix) with ESMTP id 225C0A11;
 Wed, 30 Jan 2013 22:16:18 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=ukr.net;
 s=ffe; 
 h=Date:Message-Id:From:To:References:In-Reply-To:Subject:Cc:Content-Type:Content-Transfer-Encoding:MIME-Version;
 bh=fV+Bh0N8JNjGFvYk/bb45iFUqDQptup1FImZtFB/6Y4=; 
 b=Q0ixxUY3cZf4anPbmhMYsDqjv1snjiiijqH2XruWglCDasyj6SNa9gw5XRB1FxfAadL4Izu8PVH+dwDODFd8CEUmO0JLUu5G3eZAtb7lYI3BmRCzYdzYDqul4dlpcTyCDc7S1ny+Vy0Cp5+/vP8OXYFb6/rk7zs2Vg8V1m4tamc=;
Received: from mail by ffe17.ukr.net with local ID 1U0fZJ-000N1z-3y
 ; Wed, 30 Jan 2013 23:51:45 +0200
MIME-Version: 1.0
Content-Disposition: inline
Content-Transfer-Encoding: binary
Content-Type: text/plain; charset="windows-1251"
Subject: Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ
In-Reply-To: <1359317924363-5781425.post@n5.nabble.com>
References: <70362.1359299605.3196836531757973504@ffe11.ukr.net>
 <16B555759C2041ED8185DF478193A59D@multiplay.co.uk>
 <917933DB5C9A490D93A739058C2507A1@multiplay.co.uk>
 <93308.1359297551.14145052969567453184@ffe15.ukr.net>
 <13391.1359029978.3957795939058384896@ffe16.ukr.net>
 <70578.1359313319.18126575192049975296@ffe16.ukr.net>
 <221B307551154F489452F89E304CA5F7@multiplay.co.uk>
 <1359317924363-5781425.post@n5.nabble.com>
To: "Beeblebrox" <zaphod@berentweb.com>
From: "Vladislav Prodan" <universite@ukr.net>
X-Mailer: freemail.ukr.net 4.0
Message-Id: <87448.1359582705.624376220320202752@ffe17.ukr.net>
X-Browser: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:18.0) Gecko/20100101 Firefox/18.0
Date: Wed, 30 Jan 2013 23:51:45 +0200
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 30 Jan 2013 22:16:19 -0000


> I once ran into a very severe AHCI timeout problem. After months of trying to
> figure it out and insane "Hardware_ECC_Recovered" error values, I found that
> the error was with the power connector plug / sata HDD interface. All errors
> disappeared after replacing that cable. Since you have error on more than 1
> HDD, I suggest:
> 1. Check smartctl output for each AND all HDD
> 2. Check whether your power supply unit is still healthy or if it is
> supplying inconsistent power.
> 3. Check the main power supply line and whether it shows any voltage
> fluctuations or if there is a new heavy consumer of amps on the same power
> line as the server is plugged to.
> 
> 

I've deliberately chose a different server that has a different chipset, and that there were no problems with the HDD.

Added kernel support:
device ahci # AHCI-compatible SATA controllers

And now, after 2.5 days fell off one HDD.

[3:14]beastie:root->/root# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/disk0            ONLINE       0     0     0
            gpt/disk2            ONLINE       0     0     0
          mirror-1               DEGRADED     0     0     0
            gpt/disk1            ONLINE       0     0     0
            4931885954389536913  REMOVED      0     0     0  was /dev/gpt/disk3

errors: No known data errors


Jan 30 09:49:28 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:49:28 beastie kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): Retrying command
Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: ahcich3: is 00000000 cs 00000000 ss 00000000 rs 20000000 tfd 58 serr 00000000 cmd 0004dd17
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Jan 30 09:51:31 beastie kernel: (ada3:ahcich3:0:0:0): lost device
Jan 30 09:51:31 beastie kernel: (pass3:ahcich3:0:0:0): passdevgonecb: devfs entry is gone


-- 
Vladislav V. Prodan            
System & Network Administrator 
http://support.od.ua           
+380 67 4584408, +380 99 4060508
VVP88-RIPE


From owner-freebsd-fs@FreeBSD.ORG  Thu Jan 31 03:51:26 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 9D8C6183;
 Thu, 31 Jan 2013 03:51:26 +0000 (UTC)
 (envelope-from wollman@hergotha.csail.mit.edu)
Received: from hergotha.csail.mit.edu
 (wollman-1-pt.tunnel.tserv4.nyc4.ipv6.he.net [IPv6:2001:470:1f06:ccb::2])
 by mx1.freebsd.org (Postfix) with ESMTP id 3678A804;
 Thu, 31 Jan 2013 03:51:25 +0000 (UTC)
Received: from hergotha.csail.mit.edu (localhost [127.0.0.1])
 by hergotha.csail.mit.edu (8.14.5/8.14.5) with ESMTP id r0V3pOUj092930;
 Wed, 30 Jan 2013 22:51:24 -0500 (EST)
 (envelope-from wollman@hergotha.csail.mit.edu)
Received: (from wollman@localhost)
 by hergotha.csail.mit.edu (8.14.5/8.14.4/Submit) id r0V3pOAr092927;
 Wed, 30 Jan 2013 22:51:24 -0500 (EST) (envelope-from wollman)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <20745.59964.60447.379943@hergotha.csail.mit.edu>
Date: Wed, 30 Jan 2013 22:51:24 -0500
From: Garrett Wollman <wollman@bimajority.org>
To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Subject: More on odd ZFS not-quite-deadlock
X-Mailer: VM 7.17 under 21.4 (patch 22) "Instant Classic" XEmacs Lucid
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (hergotha.csail.mit.edu [127.0.0.1]); Wed, 30 Jan 2013 22:51:24 -0500 (EST)
X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED
 autolearn=disabled version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on
 hergotha.csail.mit.edu
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Jan 2013 03:51:26 -0000

I posted a few days ago about what I thought was a ZFS-related
almost-deadlock.  I have a bit more information now, but I'm still
puzzled.  Hopefully someone else has seen this before.

While things are in the hung state, a "zfs recv" is running.  It's
receiving an empty snapshot to one of the many datasets on this file
server.  "zfs recv" reports that receiving this particular empty
snapshot takes just about half an hour.  When it finally completes,
everything starts working normally again.  (This particular
replication job will no longer be operational in a few hours, so this
may be the last time I can collect information about the issue for a
while.)  The same "zfs recv" takes only a few seconds 23 hours out of 24.

The kstacks of the processes that appear to possibly be involved look
like this:

  PID    TID COMM             TDNAME           KSTACK                       
    0 100061 kernel           thread taskq     mi_switch+0x196 sleepq_wait+0x42 _sx_slock_hard+0x3bb _sx_slock+0x3d zfs_reclaim_complete+0x38 taskqueue_run_locked+0x85 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe 
    7 100215 zfskern          arc_reclaim_thre mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c arc_reclaim_thread+0x29d fork_exit+0x11f fork_trampoline+0xe 
    7 100216 zfskern          l2arc_feed_threa mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c l2arc_feed_thread+0x1a8 fork_exit+0x11f fork_trampoline+0xe 
    7 100592 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 100593 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
    7 100989 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 100990 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
    7 101355 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 101356 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
   13 100053 geom             g_event          mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_run_events+0x430 fork_exit+0x11f fork_trampoline+0xe 
   13 100054 geom             g_up             mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_up+0xd8 g_up_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 
   13 100055 geom             g_down           mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_down+0x20e g_down_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 
   22 100225 syncer           -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_sync+0x63 sync_fsync+0x19d VOP_FSYNC_APV+0x4a sync_vnode+0x15e sched_sync+0x1c5 fork_exit+0x11f fork_trampoline+0xe 

93224 102554 zfs              -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dnode_next_offset_level+0x28d dnode_next_offset+0xb9 dmu_object_next+0x3e dsl_dataset_destroy+0x164 dmu_recv_end+0x184 zfs_ioc_recv+0x9f4 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xf0 amd64_syscall+0x5ea Xfast_syscall+0xf7 

[This is the zfs recv process that is applying the replication package
with an empty snapshot.]

93320 102479 df               -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_root+0x40 lookup+0xaa6 namei+0x535 kern_statfs+0xa4 sys_statfs+0x37 amd64_syscall+0x5ea Xfast_syscall+0xf7 
[7 more like this]

(I've deleted all of the threads that are clearly waiting for some
unrelated event, such as nanosleep() and select().)

-GAWollman


From owner-freebsd-fs@FreeBSD.ORG  Thu Jan 31 08:44:37 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 37701254;
 Thu, 31 Jan 2013 08:44:37 +0000 (UTC) (envelope-from lev@FreeBSD.org)
Received: from onlyone.friendlyhosting.spb.ru (onlyone.friendlyhosting.spb.ru
 [46.4.40.135]) by mx1.freebsd.org (Postfix) with ESMTP id DD28A2EE;
 Thu, 31 Jan 2013 08:44:36 +0000 (UTC)
Received: from lion.home.serebryakov.spb.ru (unknown
 [IPv6:2001:470:923f:1:2577:cf36:d0d4:4986])
 (Authenticated sender: lev@serebryakov.spb.ru)
 by onlyone.friendlyhosting.spb.ru (Postfix) with ESMTPA id B6CD34ACC7;
 Thu, 31 Jan 2013 12:44:28 +0400 (MSK)
Date: Thu, 31 Jan 2013 12:44:19 +0400
From: Lev Serebryakov <lev@FreeBSD.org>
Organization: FreeBSD
X-Priority: 3 (Normal)
Message-ID: <1291867.20130131124419@serebryakov.spb.ru>
To: freebsd-fs@FreeBSD.org, freebsd-stable@freebsd.org
Subject: 9.1-STABLE, live lock up,
 seems that it is ZFS lockup in "zfskern{txg_thread_enter}" state
 "tx->tx"
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: lev@FreeBSD.org
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Jan 2013 08:44:37 -0000

Hello, freebsd-fs.

  I have 9.1-STABLE (r244958) system, amd64, 8GiB memory.

  Two SATA disks, 750Gb each.

  Disks are partitoned into 7 (BSD) partitons (exactly the same), 5 of
 these pairs are joined into gmirrors for "system" FSes (UFS2), one
 pair is used for swaps and 7th pair is used as zmirror for /usr/home.

   Tonight system becomes unusable, as every process which try to read
 directories in /usr/home (like "ls ~" or "find /usr/home -type f")
 hangs forever. I could login to system, login shell starts, but if I
 run "ls" right after -- it hangs. Every periodic process, which try
 to read home FS (directories, not files!) hangs. It looks, like
 stat() calls on this FS hangs, but not open()/read()/write()/close().

  One thing I fins suspicious in different system diagnostics, is
  kernel thread "zfskern{txg_thread_enter}" which is shown in state
  "tx->tx" forever.

  Disks looks completely OK according to smartd/smartctl, no hardware
 errors in dmesg, etc.

===============================================
# zpool status
  pool: pool
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support feature
        flags.
  scan: resilvered 32.1G in 0h34m with 0 errors on Sat Jun  2 16:22:59 2012
config:

        NAME         STATE     READ WRITE CKSUM
        pool         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            ada0s1h  ONLINE       0     0     0
            ada1s1h  ONLINE       0     0     0

errors: No known data errors
================================================


-- 
// Black Lion AKA Lev Serebryakov <lev@FreeBSD.org>


From owner-freebsd-fs@FreeBSD.ORG  Thu Jan 31 16:03:30 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2EDBAA64
 for <freebsd-fs@freebsd.org>; Thu, 31 Jan 2013 16:03:30 +0000 (UTC)
 (envelope-from pluknet@gmail.com)
Received: from mail-qe0-f52.google.com (mail-qe0-f52.google.com
 [209.85.128.52]) by mx1.freebsd.org (Postfix) with ESMTP id D7DCFFD6
 for <freebsd-fs@freebsd.org>; Thu, 31 Jan 2013 16:03:29 +0000 (UTC)
Received: by mail-qe0-f52.google.com with SMTP id 6so1349003qeb.11
 for <freebsd-fs@freebsd.org>; Thu, 31 Jan 2013 08:03:22 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=x/z3mYX3Z1sloFJ+we21eSXbSGsC6DXMvBFRlRI8018=;
 b=j4w3TgznRpGt8LSD7eFli+GGKzVLlbAntX6h81gYpbG5E7sqrJyXnYvLj+kqJqzqEA
 9yYMdmOFfayRDaE44GL0CRHfPHjvK+K9vAHv+BpjrUJWL+Aih29vbLxcxOZVBQEIb2eA
 cT36kzn0qRKFhS/fxhV+VZ/5HsWbg7ekwCjB1L6jesTUIKFuzHDmdAWeKj2PvAWLr4vo
 K8wm/OnP51Hsl8Hrikp9x75XqQqlIKpz2VzQPN2ylq+9u5V3UhJY00nxXRVp1OZY4gHQ
 KzKNaRGrWNzO2Aw3iWpLlV2aNxQhwtiXV0nWm6TlBHrZV2DdZBAeJT68SrWI+P7Z+Wtm
 MpAg==
MIME-Version: 1.0
X-Received: by 10.229.78.97 with SMTP id j33mr2252518qck.107.1359648202844;
 Thu, 31 Jan 2013 08:03:22 -0800 (PST)
Received: by 10.229.78.96 with HTTP; Thu, 31 Jan 2013 08:03:22 -0800 (PST)
In-Reply-To: <1171241649.2066788.1358383377496.JavaMail.root@erie.cs.uoguelph.ca>
References: <CAE-mSOJk1HbxvF=ZpoSP21b9j65qMov=AE-OM6wcUkbadQeZbw@mail.gmail.com>
 <1171241649.2066788.1358383377496.JavaMail.root@erie.cs.uoguelph.ca>
Date: Thu, 31 Jan 2013 19:03:22 +0300
Message-ID: <CAE-mSOJK-_wex62e3GcHhBcMBJZETuUAU0ZRd4eH8GcN2NrRcg@mail.gmail.com>
Subject: Re: getcwd lies on/under nfs4-mounted zfs dataset
From: Sergey Kandaurov <pluknet@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Content-Type: text/plain; charset=ISO-8859-1
Cc: freebsd-fs@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Jan 2013 16:03:30 -0000

On 17 January 2013 04:42, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> pluknet@gmail.com wrote:
>> Hi.
>>
>> We stuck with the problem getting wrong current directory path
>> when sitting on/under zfs dataset filesystem mounted over NFSv4.
>> Both nfs server and client are 10.0-CURRENT from December or so.
>>
>> The component path "user3" unexpectedly appears to be "." (dot).
>> nfs-client:/home/user3 # pwd
>> /home/.
>> nfs-client:/home/user3/var/run # pwd
>> /home/./var/run
>>
> Ok, I've figured out what is going on. The algorithm in libc
> works, but vn_fullpath1() doesn't. The latter assumes that
> "mount points" are marked with VV_ROOT etc. For the
> "pseudo mount points" (which are mount points within the
> directory tree on the NFSv4 server), this isn't the case.
>
> If you:
> sysctl debug.disablecwd=1
> sysctl debug.disablefullpath=1
>
> it works. (At least for the UFS case I tested.)

Thank you very much, Rick!
As an interim solution, we've decided to go that way.

>
> I can't see how this can be made to work correctly
> for vn_fullpath1() unless it was re-written to use the
> same algorithm that lib/libc/gen/getcwd.c implements.
>
> I was pretty sure this used to work. Maybe the syscalls
> used to be disabled by default or weren't used by the
> libc functions?
>
> Anyhow, sorry about the cofusing posts while I figured
> out what was going on, rick
> ps: Don't use the patch I posted. It isn't needed and
>     will break stuff.
>
>> nfs-client:~ # procstat -f 3225
>> PID COMM FD T V FLAGS REF OFFSET PRO NAME
>> 3225 a.out text v r r-------- - - - /home/./var/a.out
>> 3225 a.out ctty v c rw------- - - - /dev/pts/2
>> 3225 a.out cwd v d r-------- - - - /home/./var
>> 3225 a.out root v d r-------- - - - /
>>
>> The used setup follows.
>>
>> 1. NFS Server with local ZFS:
>> # cat /etc/exports
>> V4: / -sec=sys
>>
>> # zfs list
>> pool1 10.4M 122G 580K /pool1
>> pool1/user3 on /pool1/user3 (zfs, NFS exported, local, nfsv4acls)
>>
>> Exports list on localhost:
>> /pool1/user3 109.70.28.0
>> /pool1 109.70.28.0
>>
>> # zfs get sharenfs pool1/user3
>> NAME PROPERTY VALUE SOURCE
>> pool1/user3 sharenfs -alldirs -maproot=root -network=109.70.28.0/24
>> local
>>
>> 2. pool1 is mounted on NFSv4 client:
>> nfs-server:/pool1 on /home (nfs, noatime, nfsv4acls)
>>
>> So that on NFS client the "pool1/user3" dataset comes at /home/user3.
>> / - ufs
>> /home - zpool-over-nfsv4
>> /home/user3 - zfs dataset "pool1/user3"
>>
>> At the same time it works as expected when we're not on zfs dataset,
>> but directly on its parent zfs pool (also over NFSv4), e.g.
>> nfs-client:/home/non_dataset_dir # pwd
>> /home/non_dataset_dir
>>
>> The ls command works as expected:
>> nfs-client:/# ls -dl /home/user3/var/
>> drwxrwxrwt+ 6 root wheel 6 Jan 10 16:19 /home/user3/var/
>>


-- 
wbr,
pluknet

From owner-freebsd-fs@FreeBSD.ORG  Fri Feb  1 07:44:05 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 543E88EC
 for <freebsd-fs@FreeBSD.org>; Fri,  1 Feb 2013 07:44:05 +0000 (UTC)
 (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
 by mx1.freebsd.org (Postfix) with ESMTP id 8AD26E8
 for <freebsd-fs@FreeBSD.org>; Fri,  1 Feb 2013 07:44:04 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
 [212.40.38.100])
 by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id JAA16602
 for <freebsd-fs@FreeBSD.org>; Fri, 01 Feb 2013 09:43:56 +0200 (EET)
 (envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
 by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
 id 1U1BHw-0000i0-9B
 for freebsd-fs@FreeBSD.org; Fri, 01 Feb 2013 09:43:56 +0200
Message-ID: <510B723A.2090404@FreeBSD.org>
Date: Fri, 01 Feb 2013 09:43:54 +0200
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130121 Thunderbird/17.0.2
MIME-Version: 1.0
To: freebsd-fs@FreeBSD.org
Subject: zfs hang/deadlock - what to do - how to report
X-Enigmail-Version: 1.4.6
Content-Type: text/plain; charset=x-viet-vps
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Feb 2013 07:44:05 -0000


Please first read the following https://wiki.freebsd.org/AvgZfsDeadlockDebug
Please follow the advices and do suggested preliminary analysis.
Please report accordingly.
Thank you.
-- 
Andriy Gapon

From owner-freebsd-fs@FreeBSD.ORG  Fri Feb  1 09:09:08 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 68218B5A
 for <fs@freebsd.org>; Fri,  1 Feb 2013 09:09:08 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au
 [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 25654698
 for <fs@freebsd.org>; Fri,  1 Feb 2013 09:09:06 +0000 (UTC)
Received: from c211-30-173-106.carlnfd1.nsw.optusnet.com.au
 (c211-30-173-106.carlnfd1.nsw.optusnet.com.au [211.30.173.106])
 by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r1198suE025477
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
 for <fs@freebsd.org>; Fri, 1 Feb 2013 20:08:57 +1100
Date: Fri, 1 Feb 2013 20:08:54 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: fs@freebsd.org
Subject: some fixes for msdosfs
Message-ID: <20130201182606.A1492@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=MscKcBme c=1 sm=1 a=kj9zAlcOel0A:10
 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=IPjz-GnoPKUA:10
 a=Nf6N-zW9uRpywVz5buAA:9 a=CjuIK1q_8ugA:10 a=ADzlCoBTbut52P-S:21
 a=qp1OgrSAsFp3MxqR:21 a=TEtd8y5WR3g2ypngnwZWYw==:117
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Feb 2013 09:09:08 -0000

Please commit some of these fixes.

1.
The directory entry for dotdot was corrupted in the FAT32 case when moving
a directory to a subdir of the root directory from somewhere else.

For all directory moves that change the parent directory, the dotdot
entry must be fixed up.  For msdosfs, the root directory is magic for
non-FAT32.  It is less magic for FAT32, but needs the same magic for
the dotdot fixup.  It didn't have it.

chkdsk and fsck_msdosfs fix the corrupt directory entries with no problems.

The fix is simple -- use the same magic for dotdot in msdosfs_rename()
as in msdosfs_mkdir().  But the patch is large due to related cleanups
and unrelated changes that -current already has.

@ Index: msdosfs_vnops.c
@ ===================================================================
@ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_vnops.c,v
@ retrieving revision 1.147
@ diff -u -2 -r1.147 msdosfs_vnops.c
@ --- msdosfs_vnops.c	4 Feb 2004 21:52:53 -0000	1.147
@ +++ msdosfs_vnops.c	31 Jan 2013 17:36:21 -0000
@ @@ -926,5 +978,5 @@
@  	int doingdirectory = 0, newparent = 0;
@  	int error;
@ -	u_long cn;
@ +	u_long cn, pcl;
@  	daddr_t bn;
@  	struct denode *fddep;	/* from file's parent directory	 */

This is in msdosfs_rename().

Use the same variable name as in mkdir(), instead of reusing cn.

@ @@ -1199,9 +1251,13 @@
@  		}
@  		dotdotp = (struct direntry *)bp->b_data + 1;
@ -		putushort(dotdotp->deStartCluster, dp->de_StartCluster);
@ +		pcl = dp->de_StartCluster;
@ +		if (FAT32(pmp) && pcl == pmp->pm_rootdirblk)
@ +			pcl = MSDOSFSROOT;
@ +		putushort(dotdotp->deStartCluster, pcl);
@  		if (FAT32(pmp))
@ -			putushort(dotdotp->deHighClust, dp->de_StartCluster >> 16);
@ -		error = bwrite(bp);
@ -		if (error) {
@ +			putushort(dotdotp->deHighClust, pcl >> 16);

Use the same code as in mkdir().  Don't comment on it again.

@ +		if (fvp->v_mount->mnt_flag & MNT_ASYNC)
@ +			bdwrite(bp);
@ +		else if ((error = bwrite(bp)) != 0) {
@  			/* XXX should really panic here, fs is corrupt */
@  			VOP_UNLOCK(fvp, 0, td);

Unrelated changes that -current already has.

@ @@ -1313,6 +1369,11 @@
@  	putushort(denp[0].deMTime, ndirent.de_MTime);
@  	pcl = pdep->de_StartCluster;
@ +	/*
@ +	 * Although the root directory has a non-magic starting cluster
@ +	 * number for FAT32, chkdsk and fsck_msdosfs still require
@ +	 * references to it in dotdot entries to be magic.
@ +	 */
@  	if (FAT32(pmp) && pcl == pmp->pm_rootdirblk)
@ -		pcl = 0;
@ +		pcl = MSDOSFSROOT;
@  	putushort(denp[1].deStartCluster, pcl);
@  	putushort(denp[1].deCDate, ndirent.de_CDate);

This is in msdosfs_mkdir().

Document the magic there.  Don't hard-code 0.

@ @@ -1324,9 +1385,10 @@
@  	if (FAT32(pmp)) {
@  		putushort(denp[0].deHighClust, newcluster >> 16);
@ -		putushort(denp[1].deHighClust, pdep->de_StartCluster >> 16);
@ +		putushort(denp[1].deHighClust, pcl >> 16);
@  	}

Don't depend on magic soft-coding of 0.  For the FAT32 root directory,
pdep->de_StartCluster is usually < 65536.  Perhaps it is always small.
The old code depended on this to get a result of 0 when the value is
shifted.

@ 
@ -	error = bwrite(bp);
@ -	if (error)
@ +	if (ap->a_dvp->v_mount->mnt_flag & MNT_ASYNC)
@ +		bdwrite(bp);
@ +	else if ((error = bwrite(bp)) != 0)
@  		goto bad;
@

Unrelated changes that -current already has.

Further cleanups: the condition for being the root directory should
probably be written as (DETOV(pdep)->v_vflag & VV_ROOT).  msdosfs uses
this in some places, but it prefers to test if a denode's first cluster
number is MSDOSFSROOT.  The latter is simpler and was equivalent before
FAT32 existed.  msdosfs still uses it a lot, but it now means that the
denode is for a non-FAT32 root directory.  This is quite confusing.
The old test often gives the correct classification because for the
FAT32 case where it differs, the root directory is not magic, and the
code under the test is really handling the magic case.  Comments add
to the confusion because they are mostly unchanged and still say that
the root directory is always magic.  Cases where the root directory
is magic for FAT32 are mostly classified using the (FAT32(pmp) && cn
== pmp->pm_rootdirblk) condition.  Apparently there is some magic that
requires the FAT32(pmp) condition before pmp->pm_rootdirblk can be
trusted.  The VV_ROOT condition seems better for these cases.

============

2.
mountmsdosfs() had an insane sanity test.

While testing the above, I tried FAT32 on a small partition.  This
failed to mount because pmp->pm_Sectors was nonzero.  Normally,
FAT32 file systems are so large that the 16-bit pm_Sectors can't
hold the size.  This is indicated by setting it to 0 and using
only pm_HugeSectors.  But at least old versions of newfs_msdos
use the 16-bit field if possible, and msdosfs supports this except
for breaking its own support in the sanity check.  This is quite
different from the handling of pm_FATsecs -- now the 16-bit value
is always ignored for FAT32 except for checking that it is 0,
and newfs_msdos doesn't use the 16-bit value for FAT32.

I just removed the sanity test.

@ Index: msdosfs_vfsops.c
@ ===================================================================
@ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_vfsops.c,v
@ retrieving revision 1.120
@ diff -u -2 -r1.120 msdosfs_vfsops.c
@ --- msdosfs_vfsops.c	16 Jun 2004 09:47:03 -0000	1.120
@ +++ msdosfs_vfsops.c	31 Jan 2013 17:53:25 -0000
@ @@ -431,5 +459,4 @@
@  		if (bsp->bs710.bsBootSectSig2 != BOOTSIG2
@  		    || bsp->bs710.bsBootSectSig3 != BOOTSIG3
@ -		    || pmp->pm_Sectors
@  		    || pmp->pm_FATsecs
@  		    || getushort(b710->bpbFSVers)) {

The sanity tests of the signatures have already been removed in -current.

============

3.
Backup FATs were sometimes marked dirty by copying their first block
from the primary FAT, and then they were not marked clean on unmount.

This bug has been known for a long time, and always happened while
testing (1), so I fixed it.  My tests were mostly to create a new file
system, 1 mkdir and move the new directory forth and back from the root
partition to corrupt it.  Since all the FAT entries are in the first
block of the FAT, backing this up always marks the backups as unclean.

chkdsk and fsck_msdosfs fix this, but it gives them extra work and
uninspires confidence in the backups.

@ Index: msdosfs_fat.c
@ ===================================================================
@ RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_fat.c,v
@ retrieving revision 1.50
@ diff -u -2 -r1.50 msdosfs_fat.c
@ --- msdosfs_fat.c	1 Sep 2008 13:18:16 -0000	1.50
@ +++ msdosfs_fat.c	31 Jan 2013 15:07:41 -0000
@ @@ -337,6 +338,6 @@
@  	u_long fatbn;
@  {
@ -	int i;
@  	struct buf *bpn;
@ +	int cleanfat, i;
@ 
@  #ifdef MSDOSFS_DEBUG
@ @@ -378,4 +356,10 @@
@  		 * bwrite()'s and really slow things down.
@  		 */
@ +		if (fatbn != pmp->pm_fatblk || FAT12(pmp))
@ +			cleanfat = 0;
@ +		else if (FAT16(pmp))
@ +			cleanfat = 16;
@ +		else
@ +			cleanfat = 32;
@  		for (i = 1; i < pmp->pm_FATs; i++) {
@  			fatbn += pmp->pm_FATsecs;
@ @@ -384,5 +368,10 @@
@  			    0, 0, 0);
@  			bcopy(bp->b_data, bpn->b_data, bp->b_bcount);
@ -			if (pmp->pm_flags & MSDOSFSMNT_WAITONFAT)
@ +			/* Force the clean bit on in the other copies. */
@ +			if (cleanfat == 16)
@ +				((u_int8_t *)bpn->b_data)[3] |= 0x80;
@ +			else if (cleanfat == 32)
@ +				((u_int8_t *)bpn->b_data)[7] |= 0x08;
@ +			if (pmp->pm_mountp->mnt_flag & MNT_SYNCHRONOUS)
@  				bwrite(bpn);
@  			else

Unrelated change for the bwrite() condition.  The MSDOSFSMNT_WAITONFAT
flag is bogus and broken.  It does less than track the MNT_SYNCHRONOUS
flag.  It is set to the latter at mount time but not updated by
MNT_UPDATE.  You could exploit this to set it to a different value
than the current MNT_SYNCHRONOUS setting, but this is undocumented and
fragile.  (FAT updates should be sync by default, but this is too slow,
so the default is async (delayed) FAT, async (delayed or async) file
data and sync metadata for denodes, which is probably a worse
combination than async everything.  But you could change the FAT write
policy to sync by mounting with sync and then MNT_UPDATEing with nosync
to get nosync (default) for file data.)

@ @@ -394,11 +383,10 @@
@  	 * Write out the first (or current) fat last.
@  	 */
@ -	if (pmp->pm_flags & MSDOSFSMNT_WAITONFAT)
@ +	if (pmp->pm_mountp->mnt_flag & MNT_SYNCHRONOUS)
@  		bwrite(bp);
@  	else
@  		bdwrite(bp);

Fixing the condition is more important for the primary FAT.

The backups should probably always be written with async or even delayed
writes.  (async would be not much different from sync, since we don't
check for success.  It would still be very slow.)

@ -	/*
@ -	 * Maybe update fsinfo sector here?
@ -	 */
@ +
@ +	pmp->pm_fmod |= 1;

Unrelated changes.  (I moved all fsinfo updates from here, and use pm_fmod
as a set of flags, with the 1 flag indicating the old (unused) condition
of a modified FAT.)

@  }
@

@ @@ -1097,5 +1085,5 @@
@   * manipulating the upper bit of the FAT entry for cluster 1.  Note that
@   * this bit is not defined for FAT12 volumes, which are always assumed to
@ - * be dirty.
@ + * be clean.
@   *
@   * The fatentry() routine only works on cluster numbers that a file could

Vaguely related -- fix a backwards comment in markvoldirty().

markvoldirty() is too specialized.  The bug wouldn't have existed if
updatefats() had been used instead of the direct bread()/bwrite() of
a single block in markvoldirty().  There are also some locking and
blocksize problems with this special i/o.  However, I prefer not to
write the all backups for marking the volume dirty, and it was easy
to keep them marked clean in updatefats().  Writing the primary FAT
when only it is dirty is already 1 too many writes.  So it is a feature
that markvoldirty() only writes 1 block.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Fri Feb  1 19:24:38 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 5C6C27E6
 for <freebsd-fs@freebsd.org>; Fri,  1 Feb 2013 19:24:38 +0000 (UTC)
 (envelope-from peter@rulingia.com)
Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au
 [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id E7B77A73
 for <freebsd-fs@freebsd.org>; Fri,  1 Feb 2013 19:24:37 +0000 (UTC)
Received: from server.rulingia.com
 (c220-239-236-213.belrs5.nsw.optusnet.com.au [220.239.236.213])
 by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id r11JONrl039865
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Sat, 2 Feb 2013 06:24:24 +1100 (EST)
 (envelope-from peter@rulingia.com)
X-Bogosity: Ham, spamicity=0.000000
Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1])
 by server.rulingia.com (8.14.5/8.14.5) with ESMTP id r11JOIN8027080
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Sat, 2 Feb 2013 06:24:18 +1100 (EST)
 (envelope-from peter@server.rulingia.com)
Received: (from peter@localhost)
 by server.rulingia.com (8.14.5/8.14.5/Submit) id r11JOGLh027079;
 Sat, 2 Feb 2013 06:24:16 +1100 (EST) (envelope-from peter)
Date: Sat, 2 Feb 2013 06:24:16 +1100
From: Peter Jeremy <peter@rulingia.com>
To: Kevin Day <toasty@dragondata.com>
Subject: Re: Improving ZFS performance for large directories
Message-ID: <20130201192416.GA76461@server.rulingia.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <CAJjvXiE+8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com>
 <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="envbJBWh7q8WU6mo"
Content-Disposition: inline
In-Reply-To: <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com>
X-PGP-Key: http://www.rulingia.com/keys/peter.pgp
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Feb 2013 19:24:38 -0000


--envbJBWh7q8WU6mo
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2013-Jan-29 18:06:01 -0600, Kevin Day <toasty@dragondata.com> wrote:
>On Jan 29, 2013, at 5:42 PM, Matthew Ahrens <mahrens@delphix.com> wrote:
>> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toasty@dragondata.com> wrote:
>> I'm prepared to try an L2arc cache device (with secondarycache=3Dmetadat=
a),
>>=20
>> You might first see how long it takes when everything is cached.  E.g. b=
y doing this in the same directory several times.  This will give you a low=
er bound on the time it will take (or put another way, an upper bound on th=
e improvement available from a cache device).
>> =20
>
>Doing it twice back-to-back makes a bit of difference but it's still slow =
either way.

ZFS can very conservative about caching data and twice might not be enough.
I suggest you try 8-10 times, or until the time stops reducing.

>I think some of the issue is that nothing is being allowed to stay cached =
long.

Well ZFS doesn't do any time-based eviction so if things aren't
staying in the cache, it's because they are being evicted by things
that ZFS considers more deserving.

Looking at the zfs-stats you posted, it looks like your workload has
very low locality of reference (the data hitrate is very) low.  If
this is not what you expect then you need more RAM.  OTOH, your
vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
ZFS really wants to cache more metadata (by default ZFS has a 25%
metadata, 75% data split in ARC to prevent metadata caching starving
data caching).  I would go even further than the 50:50 split suggested
later and try 75:25 (ie, triple the current vfs.zfs.arc_meta_limit).

Note that if there is basically no locality of reference in your
workload (as I suspect), you can even turn off data caching for
specific filesystems with zfs set primarycache=3Dmetadata tank/foo
(note that you still need to increase vfs.zfs.arc_meta_limit to
allow ZFS to use the the ARC to cache metadata).

--=20
Peter Jeremy

--envbJBWh7q8WU6mo
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iEYEARECAAYFAlEMFmAACgkQ/opHv/APuIecWACgn5H+MWNyBmOSD6dCkZOrkIF7
mUgAn0tVC7elSQq2Z22FqQ5/wNi+0Fvn
=u4yZ
-----END PGP SIGNATURE-----

--envbJBWh7q8WU6mo--