From: Paul Kraus
Subject: Re: some ZFS questions
Date: Fri, 8 Aug 2014 12:03:54 -0400
To: Scott Bennett, FreeBSD Questions
Cc: Andrew Berg

On Aug 7, 2014, at 4:16, Scott Bennett wrote:

> If two pools use different partitions on a drive and both pools are
> rebuilding those partitions at the same time, then how could ZFS *not*
> be hammering the drive? The access arm would be doing almost nothing but
> endless series of long seeks back and forth between the two partitions
> involved.

How is this different from real production use with, for example, a large
database? Even with a single vdev per physical drive you generate LOTS of
RANDOM I/O during a resilver. Remember that a ZFS resilver is NOT like
other RAID resync operations. It is NOT a sequential copy of existing
data. It is functionally a replay of all the data written to the zpool as
it walks the UberBlock.
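For reference, a resilver (and, for comparison, a scrub) is started and
watched with the ordinary zpool tooling. A minimal sketch, with a made-up
pool name "tank" and made-up device names:

    # zpool replace tank da2 da3   (swap da3 in for da2; this starts the resilver)
    # zpool scrub tank             (read and verify every block in the pool)
    # zpool status tank            (progress, errors, and estimated time remaining)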
The major difference between a resilver and a scrub is that the resilver
is expecting to be writing data to one (or more) vdevs, while the scrub is
mainly a read operation (still generating LOTS of random I/O) looking for
errors in the read data (and correcting such when found).

> When you're talking about hundreds of gigabytes to be written
> to each partition, it could take months or even years to complete, during
> which time something else is almost certain to fail and halt the rebuilds.

In my experience it is not the amount of data to be re-written that
matters, but the amount of writes that created the data. For example, a
zpool that is mostly write-once (a media library, for example, where each
CD is written once, never changed, and read lots) will resilver much
faster than a zpool with lots of small random writes and lots of deletions
(like a busy database). See my blog post here:
http://pk1048.com/zfs-resilver-observations/ for the most recent resilver
I had to do on my home server. I needed to scan 2.84TB of data to rewrite
580GB, and it took just under 17 hours.

If I had two (or more) vdevs on each device (and I *have* done that when I
needed to), I would have issued the first zpool replace command, waited
for it to complete, and then issued the other. If I had more than one
drive fail, I would have handled the replacement of BOTH drives on one
zpool first and then moved on to the second. This is NOT because I want to
be nice and easy on my drives :-), it is simply because I expect that
running the two operations in parallel will be slower than running them in
series, for the major reason that long seeks are slower than short seeks.

Also note from the data in my blog entry that the only drive being pushed
close to its limits is the newly replaced drive that is handling the
writes. The read drives are not being pushed that hard. YMMV, as this is a
5-drive RAIDz2; in the case of a 2-way mirror the read drive and the write
drive will be more evenly loaded.

> That looks good. What happens if a "zpool replace failingdrive newdrive"
> is running when the failingdrive actually fails completely?

A zpool replace is not a simple copy from the failing device to the new
one; it is a rebuild of the data on the new device, so if the device fails
completely it just keeps rebuilding. The example in my blog was of a drive
that just went offline with no warning. I put the new drive in the same
physical slot (I did not have any open slots) and issued the resilver
command.

Note that having the FreeBSD device driver echo the vendor info, including
drive P/N and S/N, to the system log is a HUGE help when replacing bad
drives.
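A few ways to pull that same information from a running system (the device
name ada0 below is just a placeholder):

    # camcontrol devlist    (vendor/model strings for each CAM device)
    # geom disk list        (the "descr:" and "ident:" fields give model and serial number)
    # dmesg | grep ada0     (the original probe messages, model and serial number included)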
>> memory pressure more gracefully, but it's not committed yet. I highly
>> recommend moving to 64-bit as soon as possible.
>
> I intend to do so, but "as soon as possible" will be after all this
> disk trouble and disk reconfiguration have been resolved. It will be done
> via an in-place upgrade from source, so I need to have a place to run
> buildworld and buildkernel.

So the real world intrudes on perfection yet again :-) We do what we have
to in order to get the job done, but make sure to understand the
limitations and compromises you are making along the way.

> Before doing an installkernel and installworld, I need also to have a
> place to run full backups. I have not had a place to store new backups
> for the last three months, which is making me more unhappy by the day. I
> really have to get the disk work *done* before I can move forward on
> anything else, which is why I'm trying to find out whether I can actually
> use ZFS raidzN in that cause while still on i386.

Yes, you can. I have used ZFS on 32-bit systems (OK, they were really
32-bit VMs, but I was still running ZFS there; I still am today, and it
has saved my butt at least once already).

> Performance will not be an issue that I can see until later if ever.

I have run ZFS on systems with as little as 1GB total RAM, just do NOT
expect stellar (or even good) performance. Keep a close watch on the ARC
size (FreeBSD 10 makes this easy with the additional status line in top
for the ZFS ARC and L2ARC). You can also use arcstat.pl (get the FreeBSD
version here: https://code.google.com/p/jhell/downloads/detail?name=arcstat.pl )
to track ARC usage over time. On my most critical production server I
leave it running with a 60 second sample, so if something goes south I can
see what happened just before.

Tune vfs.zfs.arc_max in /boot/loader.conf.

If I had less than 4GB of RAM I would limit the ARC to 1/2 of RAM, unless
this were solely a fileserver, in which case I would watch how much memory
I needed outside ZFS and set the ARC to slightly less than that. Take a
look at the recommendations here https://wiki.freebsd.org/ZFSTuningGuide
for low-RAM situations.

> I just need to know whether I can use it at all with my presently
> installed OS or will instead have to use gvinum(8) raid5 and hope for
> minimal data corruption. (At least only one .eli device would be needed
> in that case, not the M+N .eli devices that would be required for a
> raidzN pool.) Unfortunately, ideal conditions for ZFS are not an
> available option for now.

I am a big believer in ZFS, so I think the short-term disadvantages are
outweighed by the ease of migration and the long-term advantages. So I
would go the ZFS route.

--
Paul Kraus
paul@kraus-haus.org