Date: Mon, 15 Feb 2016 10:05:45 -0500
From: Paul Kraus <paul@kraus-haus.org>
To: Andrew Reilly <areilly@bigpond.net.au>
Cc: freebsd-fs@freebsd.org
Subject: Re: Hours of tiny transfers at the end of a ZFS resilver?
Message-ID: <44B57B63-C9C5-4166-8737-D4866E6A9D08@kraus-haus.org>
In-Reply-To: <120226C8-3003-4334-9F5F-882CCB0D28C5@bigpond.net.au>
References: <120226C8-3003-4334-9F5F-882CCB0D28C5@bigpond.net.au>

On Feb 15, 2016, at 5:18, Andrew Reilly <areilly@bigpond.net.au> wrote:

> Hi Filesystem experts,
>
> I have a question about the nature of ZFS and the resilvering
> that occurs after a drive replacement in a raidz array.

How many snapshots do you have? I have seen this behavior on pools
with many snapshots and ongoing creation of snapshots during the
resilver. The resilver gets to somewhere above 95% (usually 99.xxx%
for me) and then slows to a crawl, often for days.

Most of the ZFS pools I manage have automated jobs to create hourly
snapshots, so I am always creating snapshots. More below...

>
> I have a fairly simple home file server that (by way of
<snip>
> have had the system off-line for many hours (I guess).
>
> Now, one thing that I didn't realise at the start of this
> process was that the zpool has the original 512B sector size
> baked in at a fairly low level, so it is using some sort of
> work-around for the fact that the new drives actually have 4096B
> sectors (although they lie about that in smartctl -i queries):

Running 4K native drives in a 512B pool will cause a performance hit.
When I ran into this I rebuilt the pool from scratch as a 4K native
pool. If there is at least one 4K native drive in a given vdev, the
vdev will be created native 4K (at least under FBSD 10.x). My home
server has a pool of mixed 512B and 4K drives. I made sure each vdev
was built 4K.

The code in the drive that emulates 512B behavior has not been very
fast, and that is the crux of the performance issues. I just had to
rebuild a pool because 2TB WD Red Pro drives are 4K while 2TB WD RE
drives are 512B.
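
For anyone following along, a minimal sketch of how to check drive
sector sizes and pool alignment on FreeBSD (the pool name "tank" and
device "ada1" are hypothetical; substitute your own):

    # Report the drive's logical and physical sector sizes
    diskinfo -v /dev/ada1 | grep -E 'sectorsize|stripesize'

    # Show the ashift baked into each top level vdev (9 = 512B, 12 = 4K)
    zdb -C tank | grep ashift

    # On FreeBSD 10.x, force 4K alignment for vdevs created after this
    # point (zpool create / zpool add); existing vdevs are unaffected
    sysctl vfs.zfs.min_auto_ashift=12

Note that a vdev keeps the ashift it was built with, which is why the
pool rebuild from scratch mentioned above was needed.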

<snip>
> While clearly sub-optimal, I expect that the performance will
> still be good enough for my purposes: I can build a new,
> properly aligned file system when I do the next re-build.
>
> The odd thing is that after charging through the resilver using
> large blocks (around 64k according to systat), when they get to
> the end, as this one is now, the process drags on for hours with
> millions of tiny, sub-2K transfers:

Yup. The resilver process walks through the transaction groups (TXGs),
replaying them onto the new (replacement) drive. This is different
from other traditional resync methods. It also means that the early
TXGs will be large (as you loaded data) and then the size of the TXGs
will vary with the size of the data written.

<snip>
> So there's a problem with the zpool status output: it's
> predicting half an hour to go based on the averaged 67M/s over
> the whole drive, not the <2MB/s that it's actually doing, and
> will probably continue to do so for several hours, if tonight
> goes the same way as last night. Last night zpool status said
> "0h05m to go" for more than three hours, before I gave up
> waiting to start the next drive.

Yup, the code that estimates the time to go is based on the overall
average transfer rate, not the current one. In my experience the
transfer rate peaks somewhere in the middle of the resilver.

> Is this expected behaviour, or something bad and peculiar about
> my system?

Expected? I'm not sure if the designers of ZFS expected this
behavior :-) But it is the typical behavior, and it is correct.

> I'm confused about how ZFS really works, given this state. I
> had thought that the zpool layer did parity calculation in big
> 256k-ish stripes across the drives, and the zfs filesystem layer
> coped with that large block size because it had lots of caching
> and wrote everything in log-structure. Clearly that mental
> model must be incorrect, because then it would only ever be
> doing large transfers. Anywhere I could go to find a nice
> write-up of how ZFS is working?

You really can't think about ZFS the same way as older systems, with
a volume manager and a filesystem; they are fully integrated. For
example, the stripe size (across all the top level vdevs) is dynamic,
changing with each write operation. I believe that it tries to include
every top level vdev in each write operation. In your case that does
not apply, as you only have one top level vdev, but note that
performance really scales with the number of top level vdevs more than
with the number of drives per vdev.

Also note that striping within a RAIDz<n> vdev is separate from the
top level vdev striping. Take a look here:
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for a good
discussion of ZFS striping for RAIDz<n> vdevs. And don't forget to
follow the links at the bottom of the page for more details.

P.S. For performance it is generally recommended to use mirrors, while
for capacity use RAIDz<n>, all tempered by the mean time to data loss
(MTTDL) you need. Hint: a 3-way mirror has about the same MTTDL as a
RAIDz2.

--
Paul Kraus
paul@kraus-haus.org
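
To make the mirror vs. RAIDz<n> trade-off above concrete, a minimal
sketch of the alternative layouts (the pool name "tank" and devices
da0..da5 are hypothetical, and the commands are alternatives, not a
sequence):

    # Three mirror top level vdevs: performance scales with vdev count
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

    # One RAIDz2 top level vdev from the same six drives: more usable
    # capacity, but only a single vdev's worth of random IOPS
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5

    # A 3-way mirror vdev, with roughly the same MTTDL as a RAIDz2
    zpool create tank mirror da0 da1 da2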