Date: Fri, 8 Aug 2014 12:03:54 -0400
From: Paul Kraus <paul@kraus-haus.org>
To: Scott Bennett <bennett@sdf.org>, FreeBSD Questions !!!! <freebsd-questions@freebsd.org>
Cc: Andrew Berg <aberg010@my.hennepintech.edu>
Subject: Re: some ZFS questions
Message-ID: <40AF5B49-80AF-4FE2-BA14-BFF86164EAA8@kraus-haus.org>
In-Reply-To: <201408070816.s778G9ug015988@sdf.org>
References: <201408070816.s778G9ug015988@sdf.org>
On Aug 7, 2014, at 4:16, Scott Bennett <bennett@sdf.org> wrote:

> If two pools use different partitions on a drive and both pools are
> rebuilding those partitions at the same time, then how could ZFS *not*
> be hammering the drive? The access arm would be doing almost nothing but
> endless series of long seeks back and forth between the two partitions
> involved.

How is this different from real production use with, for example, a
large database? Even with a single vdev per physical drive you generate
LOTS of RANDOM I/O during a resilver. Remember that a ZFS resilver is
NOT like other RAID resync operations. It is NOT a sequential copy of
existing data. It is functionally a replay of all the data written to
the zpool as it walks the UberBlock. The major difference between a
resilver and a scrub is that the resilver expects to be writing data to
one (or more) vdevs, while the scrub is mainly a read operation (still
generating LOTS of random I/O) looking for errors in the read data (and
correcting such errors when found).

> When you're talking about hundreds of gigabytes to be written
> to each partition, it could take months or even years to complete, during
> which time something else is almost certain to fail and halt the rebuilds.

In my experience it is not the amount of data to be re-written that
matters, but the amount of writes that created the data. For example, a
zpool that is mostly write-once (a media library, say, where each CD is
written once, never changed, and read lots) will resilver much faster
than a zpool with lots of small random writes and lots of deletions
(like a busy database). See my blog post here:
http://pk1048.com/zfs-resilver-observations/ for the most recent
resilver I had to do on my home server. I needed to scan 2.84 TB of data
to rewrite 580 GB, and it took just under 17 hours.

If I had two (or more) vdevs on each device (and I *have* done that when
I needed to), I would have issued the first zpool replace command,
waited for it to complete, and then issued the other (see the sketch
below). If I had more than one drive fail, I would have handled the
replacement of BOTH drives on one zpool first and then moved on to the
second. This is NOT because I want to be nice and easy on my drives :-),
it is simply because I expect that running the two operations in
parallel will be slower than running them in series, for the major
reason that long seeks are slower than short seeks.

Also note from the data in my blog entry that the only drive being
pushed close to its limits is the newly replaced drive that is handling
the writes. The read drives are not being pushed that hard. YMMV, as
this is a 5-drive RAIDZ2; in the case of a 2-way mirror the read drive
and the write drive will be more closely loaded.

> That looks good. What happens if a "zpool replace failingdrive newdrive"
> is running when the failingdrive actually fails completely?

A zpool replace is not a simple copy from the failing device to the new
one, it is a rebuild of the data onto the new device, so if the failing
device fails completely it just keeps rebuilding. The example in my blog
was of a drive that went offline with no warning. I put the new drive in
the same physical slot (I did not have any open slots) and issued the
replace command.

Note that having the FreeBSD device driver echo the Vendor info,
including drive P/N and S/N, to the system log is a HUGE help when
replacing bad drives.
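For what it is worth, here is a rough sketch of the serial approach I
mean. The pool and device names (tank1/tank2 sharing one physical drive,
ada3p1/ada3p2 failing, ada6p1/ada6p2 on the new drive) are made up for
illustration; substitute your own:

    # Replace the failed partition in the first pool, then wait for the
    # resilver to finish before touching the second pool.
    zpool replace tank1 ada3p1 ada6p1
    zpool status tank1        # watch the resilver until it completes

    # Only then start on the second pool that shares the same drive.
    zpool replace tank2 ada3p2 ada6p2
    zpool status tank2

Running them one after the other keeps the heads doing mostly short
seeks within a single partition at a time instead of bouncing back and
forth between the two.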
>> memory pressure more gracefully, but it's not committed yet. I highly recommend
>> moving to 64-bit as soon as possible.
>
> I intend to do so, but "as soon as possible" will be after all this
> disk trouble and disk reconfiguration have been resolved. It will be done
> via an in-place upgrade from source, so I need to have a place to run
> buildworld and buildkernel.

So the real world intrudes on perfection yet again :-) We do what we
have to in order to get the job done, but make sure to understand the
limitations and compromises you are making along the way.

> Before doing an installkernel and installworld,
> I need also to have a place to run full backups. I have not had a place to
> store new backups for the last three months, which is making me more unhappy
> by the day. I really have to get the disk work *done* before I can move
> forward on anything else, which is why I'm trying to find out whether I can
> actually use ZFS raidzN in that cause while still on i386.

Yes, you can. I have used ZFS on 32-bit systems (OK, they were really
32-bit VMs, but I was still running ZFS there, still am today, and it
has saved my butt at least once already).

> Performance
> will not be an issue that I can see until later if ever.

I have run ZFS on systems with as little as 1 GB total RAM, just do NOT
expect stellar (or even good) performance. Keep a close watch on the ARC
size (FreeBSD 10 makes this easy with the additional status line in top
for the ZFS ARC and L2ARC). You can also use arcstat.pl (get the FreeBSD
version here:
https://code.google.com/p/jhell/downloads/detail?name=arcstat.pl ) to
track ARC usage over time. On my most critical production server I leave
it running with a 60-second sample so if something goes south I can see
what happened just before.

Tune vfs.zfs.arc_max in /boot/loader.conf (there is a minimal sketch at
the end of this message). If I had less than 4 GB of RAM I would limit
the ARC to 1/2 of RAM, unless this were solely a fileserver; in that
case I would watch how much memory I needed outside ZFS and set the ARC
to slightly less than what is left over. Take a look at the
recommendations here https://wiki.freebsd.org/ZFSTuningGuide for low-RAM
situations.

> I just need to
> know whether I can use it at all with my presently installed OS or will
> instead have to use gvinum(8) raid5 and hope for minimal data corruption.
> (At least only one .eli device would be needed in that case, not the M+N
> .eli devices that would be required for a raidzN pool.) Unfortunately,
> ideal conditions for ZFS are not an available option for now.

I am a big believer in ZFS, so I think the short-term disadvantages are
outweighed by the ease of migration and the long-term advantages. So I
would go the ZFS route.
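To put some flesh on the loader.conf tuning above, here is a minimal
sketch. The 2 GB machine and the resulting 1 GB cap are only example
numbers (half of RAM, per the rule of thumb above); adjust them for your
own box:

    # /boot/loader.conf -- cap the ZFS ARC at 1 GB on a 2 GB i386 box
    vfs.zfs.arc_max="1024M"

    # After a reboot, keep an eye on actual ARC usage:
    top                   # FreeBSD 10 adds an ARC/L2ARC status line
    perl arcstat.pl 60    # one sample every 60 seconds

Remember that vfs.zfs.arc_max set in /boot/loader.conf is read at boot,
so the new cap only takes effect after a reboot.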
--
Paul Kraus
paul@kraus-haus.org