Subject: Re: Hours of tiny transfers at the end of a ZFS resilver?
From: Paul Kraus <paul@kraus-haus.org>
Date: Mon, 15 Feb 2016 10:05:45 -0500
To: Andrew Reilly
Cc: freebsd-fs@freebsd.org

On Feb 15, 2016, at 5:18, Andrew Reilly wrote:

> Hi Filesystem experts,
>
> I have a question about the nature of ZFS and the resilvering
> that occurs after a drive replacement from a raidz array.

How many snapshots do you have? I have seen this behavior on pools
with many snapshots and ongoing creation of snapshots during the
resilver. The resilver gets to somewhere above 95% (usually 99.xxx%
for me) and then slows to a crawl, often for days.

Most of the ZFS pools I manage have automated jobs to create hourly
snapshots, so I am always creating snapshots. More below...
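As a quick check, you can count how many snapshots the pool is
carrying; the pool name "tank" below is just a placeholder:

    # count every snapshot on the pool, all datasets included
    zfs list -H -t snapshot -r tank | wc -l

    # or list them sorted by creation time, newest last
    zfs list -t snapshot -r tank -s creation

If an automated job (cron, periodic scripts, or whatever snapshot tool
you use) is still taking hourly snapshots while the resilver runs,
pausing it until the resilver finishes may shorten that long tail.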
> I have a fairly simple home file server that (by way of
> [...]
> have had the system off-line for many hours (I guess).
>
> Now, one thing that I didn't realise at the start of this
> process was that the zpool has the original 512B sector size
> baked in at a fairly low level, so it is using some sort of
> work-around for the fact that the new drives actually have 4096B
> sectors (although they lie about that in smartctl -i queries):

Running 4K native drives in a 512B pool will cause a performance hit.
When I ran into this I rebuilt the pool from scratch as a 4K native
pool. If there is at least one 4K native drive in a given vdev, the
vdev will be created native 4K (at least under FreeBSD 10.x). My home
server has a pool of mixed 512B and 4K drives; I made sure each vdev
was built 4K.

The code in the drive that emulates 512B behavior has not been very
fast, and that is the crux of the performance issues. I just had to
rebuild a pool because the 2TB WD Red Pro are 4K while the 2TB WD RE
are 512B.
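If you want to confirm what you are actually dealing with before the
next rebuild, something like the following should show it. The device
and pool names are only examples, and the min_auto_ashift sysctl is,
as far as I recall, only present in FreeBSD 10.1 and later:

    # what the drive reports: logical sector size vs. physical
    # ("stripesize"); a drive that really lies may still show 512 here
    diskinfo -v /dev/ada1 | egrep 'sectorsize|stripesize'

    # what the existing vdevs were built with:
    # ashift=9 means 512B, ashift=12 means 4K
    zdb -C tank | grep ashift

    # before rebuilding, ask for 4K-aligned vdevs even if the drives
    # claim 512B sectors
    sysctl vfs.zfs.min_auto_ashift=12

With min_auto_ashift set to 12, vdevs created afterwards should come
out ashift=12 regardless of what the drives report, which avoids the
512B-emulation penalty on the rebuilt pool.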
> While clearly sub-optimal, I expect that the performance will
> still be good enough for my purposes: I can build a new,
> properly aligned file system when I do the next re-build.
>
> The odd thing is that after charging through the resilver using
> large blocks (around 64k according to systat), when they get to
> the end, as this one is now, the process drags on for hours with
> millions of tiny, sub-2K transfers:

Yup. The resilver process walks through the transaction groups (TXGs),
replaying them onto the new (replacement) drive. This is different
from other, traditional resync methods. It also means that the early
TXGs will be large (from when you first loaded the data) and that the
size of later TXGs will vary with the size of the data written.

> So there's a problem with the zpool status output: it's
> predicting half an hour to go based on the averaged 67M/s over
> the whole drive, not the <2MB/s that it's actually doing, and
> will probably continue to do so for several hours, if tonight
> goes the same way as last night. Last night zpool status said
> "0h05m to go" for more than three hours, before I gave up
> waiting to start the next drive.

Yup, the code that estimates time to go is based on the overall
average transfer rate, not the current one. In my experience the
transfer rate peaks somewhere in the middle of the resilver.

> Is this expected behaviour, or something bad and peculiar about
> my system?

Expected? I'm not sure if the designers of ZFS expected this
behavior :-) But it is the typical behavior and is correct.

> I'm confused about how ZFS really works, given this state. I
> had thought that the zpool layer did parity calculation in big
> 256k-ish stripes across the drives, and the zfs filesystem layer
> coped with that large block size because it had lots of caching
> and wrote everything in log-structure. Clearly that mental
> model must be incorrect, because then it would only ever be
> doing large transfers. Anywhere I could go to find a nice
> write-up of how ZFS is working?

You really can't think about ZFS the same way as older systems with a
volume manager and a separate filesystem; the two layers are fully
integrated. For example, the stripe size (across all the top-level
vdevs) is dynamic, changing with each write operation. I believe ZFS
tries to include every top-level vdev in each write operation. In your
case that does not apply, as you only have one top-level vdev, but
note that performance really scales with the number of top-level vdevs
more than with the number of drives per vdev.

Also note that striping within a RAIDz vdev is separate from the
striping across top-level vdevs. Take a look at
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for a good
discussion of ZFS striping for RAIDz vdevs, and don't forget to follow
the links at the bottom of the page for more details.

P.S. For performance it is generally recommended to use mirrors, while
for capacity RAIDz is recommended, all tempered by the mean time to
data loss (MTTDL) you need. Hint: a 3-way mirror has about the same
MTTDL as a RAIDz2.

--
Paul Kraus
paul@kraus-haus.org