Subject: Re: Hours of tiny transfers at the end of a ZFS resilver?
From: Paul Kraus <paul@kraus-haus.org>
Date: Mon, 15 Feb 2016 10:05:45 -0500
To: Andrew Reilly
Cc: freebsd-fs@freebsd.org

On Feb 15, 2016, at 5:18, Andrew Reilly wrote:

> Hi Filesystem experts,
>
> I have a question about the nature of ZFS and the resilvering
> that occurs after a drive replacement from a raidz array.

How many snapshots do you have? I have seen this behavior on pools
with many snapshots and ongoing creation of snapshots during the
resilver. The resilver gets to somewhere above 95% (usually 99.xxx%
for me) and then slows to a crawl, often for days.

Most of the ZFS pools I manage have automated jobs to create hourly
snapshots, so I am always creating snapshots. More below...
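As a quick check, you can count how many snapshots the pool is
carrying; the pool name "tank" below is just a placeholder:

    # count every snapshot on the pool, all datasets included
    zfs list -H -t snapshot -r tank | wc -l

    # or list them sorted by creation time, newest last
    zfs list -t snapshot -r tank -s creation

If an automated job (cron, periodic scripts, or whatever snapshot tool
you use) is still taking hourly snapshots while the resilver runs,
pausing it until the resilver finishes may shorten that long tail.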
> I have a fairly simple home file server that (by way of
> [...]
> have had the system off-line for many hours (I guess).
>
> Now, one thing that I didn't realise at the start of this
> process was that the zpool has the original 512B sector size
> baked in at a fairly low level, so it is using some sort of
> work-around for the fact that the new drives actually have 4096B
> sectors (although they lie about that in smartctl -i queries):

Running 4K native drives in a 512B pool will cause a performance hit.
When I ran into this I rebuilt the pool from scratch as a 4K native
pool. If there is at least one 4K native drive in a given vdev, the
vdev will be created native 4K (at least under FreeBSD 10.x). My home
server has a pool of mixed 512B and 4K drives; I made sure each vdev
was built 4K.

The code in the drive that emulates 512B behavior has not been very
fast, and that is the crux of the performance issues. I just had to
rebuild a pool because the 2TB WD Red Pro are 4K while the 2TB WD RE
are 512B.
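If you want to confirm what you are actually dealing with before the
next rebuild, something like the following should show it. The device
and pool names are only examples, and the min_auto_ashift sysctl is,
as far as I recall, only present in FreeBSD 10.1 and later:

    # what the drive reports: logical sector size vs. physical
    # ("stripesize"); a drive that really lies may still show 512 here
    diskinfo -v /dev/ada1 | egrep 'sectorsize|stripesize'

    # what the existing vdevs were built with:
    # ashift=9 means 512B, ashift=12 means 4K
    zdb -C tank | grep ashift

    # before rebuilding, ask for 4K-aligned vdevs even if the drives
    # claim 512B sectors
    sysctl vfs.zfs.min_auto_ashift=12

With min_auto_ashift set to 12, vdevs created afterwards should come
out ashift=12 regardless of what the drives report, which avoids the
512B-emulation penalty on the rebuilt pool.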
> While clearly sub-optimal, I expect that the performance will
> still be good enough for my purposes: I can build a new,
> properly aligned file system when I do the next re-build.
>
> The odd thing is that after charging through the resilver using
> large blocks (around 64k according to systat), when they get to
> the end, as this one is now, the process drags on for hours with
> millions of tiny, sub-2K transfers:

Yup. The resilver process walks through the transaction groups (TXGs),
replaying them onto the new (replacement) drive. This is different
from other, traditional resync methods. It also means that the early
TXGs will be large (from when you first loaded the data) and that the
size of later TXGs will vary with the size of the data written.

> So there's a problem with the zpool status output: it's
> predicting half an hour to go based on the averaged 67M/s over
> the whole drive, not the <2MB/s that it's actually doing, and
> will probably continue to do so for several hours, if tonight
> goes the same way as last night. Last night zpool status said
> "0h05m to go" for more than three hours, before I gave up
> waiting to start the next drive.

Yup, the code that estimates time to go is based on the overall
average transfer rate, not the current one. In my experience the
transfer rate peaks somewhere in the middle of the resilver.

> Is this expected behaviour, or something bad and peculiar about
> my system?

Expected? I'm not sure if the designers of ZFS expected this
behavior :-) But it is the typical behavior and is correct.

> I'm confused about how ZFS really works, given this state. I
> had thought that the zpool layer did parity calculation in big
> 256k-ish stripes across the drives, and the zfs filesystem layer
> coped with that large block size because it had lots of caching
> and wrote everything in log-structure. Clearly that mental
> model must be incorrect, because then it would only ever be
> doing large transfers. Anywhere I could go to find a nice
> write-up of how ZFS is working?

You really can't think about ZFS the same way as older systems with a
volume manager and a separate filesystem; the two layers are fully
integrated. For example, the stripe size (across all the top-level
vdevs) is dynamic, changing with each write operation. I believe ZFS
tries to include every top-level vdev in each write operation. In your
case that does not apply, as you only have one top-level vdev, but
note that performance really scales with the number of top-level vdevs
more than with the number of drives per vdev.

Also note that striping within a RAIDz vdev is separate from the
striping across top-level vdevs. Take a look at
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for a good
discussion of ZFS striping for RAIDz vdevs, and don't forget to follow
the links at the bottom of the page for more details.

P.S. For performance it is generally recommended to use mirrors, while
for capacity RAIDz is recommended, all tempered by the mean time to
data loss (MTTDL) you need. Hint: a 3-way mirror has about the same
MTTDL as a RAIDz2.

--
Paul Kraus
paul@kraus-haus.org