From: Borja Marcos <borjam@sarenet.es>
Subject: RFC: Suggesting ZFS "best practices" in FreeBSD
Date: Tue, 22 Jan 2013 12:03:59 +0100
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: Scott Long <scottl@samsco.org>

(Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS
mailing list, and this is a SCSI/FS issue.)


Hi :)

I hope nobody will hate me too much, but ZFS usage under FreeBSD is
still chaotic. We badly need a well-proven "doctrine" in order to avoid
problems. In particular, we need to avoid the braindead Linux
HOWTO-esque crap of endless commands for which no rationale is offered
at all, and which pass off personal preferences and even misconceptions
as "advice" (I saw one of those howtos which suggested disabling
checksums "because they are useless").

ZFS is a very different beast from other filesystems, and the setup can
involve some non-obvious decisions. Worse, Windows-oriented server
vendors insist on bundling servers with crappy RAID controllers, which
only add to the trouble.

Since I started using ZFS on FreeBSD (from the first versions) I have
noticed several serious problems. I will try to explain some of them,
along with my suggestions for a solution. We should collect more use
cases and issues and try to reach a consensus.


1- Dynamic disk naming -> We should use static naming (GPT labels, for
instance)

ZFS was born in a system with static device naming (Solaris). When you
plug in a disk it gets a fixed name. As far as I know, at least from my
experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic
naming can be very problematic.

For example, imagine that I have 16 disks, da0 to da15. One of them,
say da5, dies. When I reboot the machine, every device from da6 to da15
will be renumbered one lower (da6 becomes da5, and so on). Potential
for trouble, at the very least.
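
To make the shift concrete, here is a small simulation of that example.
It is only a sketch: it assumes that devices are numbered consecutively
in probe order at boot, and it touches no real hardware.

```shell
#!/bin/sh
# renumber FAILED TOTAL
# Print "old name -> new name" for every surviving disk after the disk
# numbered FAILED (out of TOTAL disks, da0..da(TOTAL-1)) has disappeared
# and the machine has been rebooted.
renumber() {
    failed=$1
    total=$2
    old=0
    new=0
    while [ "$old" -lt "$total" ]; do
        if [ "$old" -ne "$failed" ]; then
            echo "da$old -> da$new"
            new=$((new + 1))
        fi
        old=$((old + 1))
    done
}

# The case from the text: 16 disks, da5 dies.
renumber 5 16
```

Every line from "da6 -> da5" onwards is a disk that a pool built on raw
device names would now find under a different name.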

After several different installations, I have come to prefer static
naming. Doing it with some care can really help to make pools portable
from one system to another. I create a GPT partition on each drive and
label it with a readable name. Thus, imagine I label each big partition
(which takes the whole available space) as pool-vdev-disk, for example
pool-raidz1-disk1.
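
The labelling scheme above could be scripted roughly as follows. This is
a dry-run sketch, not a tested procedure: it only prints the gpart and
zpool command lines so that they can be reviewed first. The pool name
"tank" and the devices da0 to da2 are hypothetical.

```shell
#!/bin/sh
# Dry run: nothing is executed, the commands are only printed.

# label_cmds POOL VDEV DISK...
# For each disk, print the commands that create a GPT scheme and one
# freebsd-zfs partition filling the disk, labelled POOL-VDEV-diskN
# (N counts from 1).
label_cmds() {
    pool=$1
    vdev=$2
    shift 2
    n=1
    for disk in "$@"; do
        echo "gpart create -s gpt $disk"
        echo "gpart add -t freebsd-zfs -l $pool-$vdev-disk$n $disk"
        n=$((n + 1))
    done
}

# pool_cmd POOL VDEV COUNT
# Print the matching "zpool create" line, which refers to the disks by
# their gpt/ label only, never by device number.
pool_cmd() {
    pool=$1
    vdev=$2
    count=$3
    line="zpool create $pool $vdev"
    n=1
    while [ "$n" -le "$count" ]; do
        line="$line gpt/$pool-$vdev-disk$n"
        n=$((n + 1))
    done
    echo "$line"
}

label_cmds tank raidz1 da0 da1 da2
pool_cmd tank raidz1 3
```

Because the pool only ever sees the gpt/ names, it does not matter which
daN number a disk receives on the next boot.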

When creating a pool, I use these names instead of dealing with device
numbers. For example:

% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
config:

	NAME                 STATE     READ WRITE CKSUM
	rpool                ONLINE       0     0     0
	  mirror-0           ONLINE       0     0     0
	    gpt/rpool-disk1  ONLINE       0     0     0
	    gpt/rpool-disk2  ONLINE       0     0     0
	logs
	  gpt/zfs-log        ONLINE       0     0     0
	cache
	  gpt/zfs-cache      ONLINE       0     0     0

Using a unique name for each disk within your organization is
important. That way, you can safely move the disks to a different
server, which might also be using ZFS, and still be able to import the
pool without name collisions. Of course you could use gptids, which, as
far as I know, are unique, but they are difficult to use, and in case
of a disk failure it's not easy to determine which disk to replace.
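
With readable labels, going from a name shown by zpool status back to a
device node is a short lookup. The sketch below assumes the usual
three-column layout (Name, Status, Components) of "glabel status"
output. The sample text is made up for illustration, and the helper
name label_to_dev is hypothetical.

```shell
#!/bin/sh
# label_to_dev LABEL
# Read "glabel status"-style text on stdin and print the component
# (for example da1p1) that carries the label gpt/LABEL.
label_to_dev() {
    awk -v want="gpt/$1" '$1 == want { print $3 }'
}

# Made-up sample in the layout of "glabel status". On a real system:
#   glabel status | label_to_dev rpool-disk2
sample='                Name  Status  Components
     gpt/rpool-disk1     N/A  da0p1
     gpt/rpool-disk2     N/A  da1p1'

printf '%s\n' "$sample" | label_to_dev rpool-disk2
```

If the same label is also written on the drive tray, the device name
hardly matters when a disk has to be pulled.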


2- RAID cards.

Simply: avoid them like the plague. ZFS is designed to operate on bare
disks, and it does an amazingly good job. Any additional software layer
you put between it and the disks will compromise it. I have had bad
experiences with "mfi" and "aac" cards.

There are two solutions adopted by RAID card users. Neither of them is
good. The first and most obvious one is to create a RAID5 volume, taking
advantage of the battery-backed cache (if present). It works, but it
loses some of the advantages of ZFS. Moreover, trying different cards,
I have been forced to reboot whole servers in order to do something as
trivial as replacing a failed disk. Yes, there are software tools to
control some of the cards, but they are at the very least cumbersome
and confusing.

The second "solution" is to create a RAID0 volume for each disk (some
RAID card manufacturers even dare to call it JBOD). I haven't seen a
single instance of this working flawlessly. Again, a replaced disk can
be a headache. At the very least, you have to deal with a cumbersome
and complicated management program to replace a disk, and you often
have to reboot the server.

The biggest reason to avoid these stupid cards, anyway, is plain and
simple: those cards, at least the ones I have tried, bundled by Dell as
PERC(insert a random number here) or by Sun, hide the ASC/ASCQ sense
codes from the filesystem. Pure crap.

Years ago, fighting this issue, and when ZFS was still rather
experimental, I asked for help and Scott Long sent me a simple "don't
try this at home" patch that makes the disks available to the CAM
layer, bypassing the RAID card. He warned me of potential issues and
lost sense codes but, so far, so good. And indeed the sense codes are
lost when a RAID card creates a volume, even in the misnamed "JBOD"
configuration.


http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679

Anyway, even if there might be some issues due to command handling, the
end-to-end verification performed by ZFS should ensure that, at a
minimum, any corruption of the data on the disks will be detected. I
would much rather have ZFS deal with it than work on a sort of
"virtual" disk implemented on the RAID card.

Another *strong* reason to avoid those cards, even in "JBOD"
configurations, is disk portability. The RAID card labels the disks.
Moving one disk from one machine to another will result in a funny
situation of confusing "import foreign config/ignore" messages when
rebooting the destination server (mandatory in order to be able to
access the transferred disk). Once again: additional complexity,
useless layering and more reboots. That may be acceptable for Metoosoft
crap, not for Unix systems.

Summarizing: I would *strongly* recommend avoiding RAID cards and
getting proper host adapters without any fancy functionality instead.
The one sold by Dell as H200 seems to work very well. No need to create
any JBOD or fancy thing at all. It will just expose the drives as
normal SAS/SATA ones. A host adapter without fancy firmware is the best
guarantee against failures caused by fancy firmware.

But, in case that's not possible, I am still leaning towards the kludge
of bypassing the RAID functionality, and even avoiding the JBOD/RAID0
thing, by patching the driver. There is one issue, though: on reboot
the RAID cards freeze, and I am not sure why. Maybe that could be
fixed; it happens on machines on which I am not using the RAID
functionality at all. They should become "transparent" but they don't.

Also, I think that the so-called JBOD thing would impair the correct
operation of a ZFS health daemon doing things such as automatic
replacement of failed disks by hot spares, etc. And there won't be a
real ASC/ASCQ log message for diagnosis.

(See (*) at the bottom for a problem I have just had with a "JBOD"
configuration.)


3- Installation, boot, etc.

Here I am not sure. Before zfsboot became available, I used to create a
zfs-on-root system by doing, more or less, this:

- Install the base system on a pendrive. After the installation, just
/boot will be used from the pendrive, and /boot/loader.conf will

- Create the ZFS pool.

- Create and populate the root hierarchy. I used to create something
like:

pool/root
pool/root/var
pool/root/usr
pool/root/tmp

Why pool/root instead of simply "pool"? Because it's easier to
understand, snapshot, send/receive, etc. Why in a hierarchy? Because,
if needed, it's possible to snapshot the whole "system" tree
atomically.

I also set the mountpoint of the "system" tree to legacy, and rely on
/etc/fstab. Why? In order to avoid an accidental "auto mount" of
critical filesystems in case, for example, I boot off a pendrive in
order to tinker.
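
The hierarchy and the legacy mountpoints described above could be set up
along these lines. Again this is a dry-run sketch that only prints
commands and /etc/fstab lines. It assumes that child datasets inherit
mountpoint=legacy from pool/root, and that the root dataset itself is
named to the loader rather than listed in fstab; both are assumptions
about the setup, not details given in this message. The snapshot name
"before-upgrade" is hypothetical.

```shell
#!/bin/sh
# Dry run: nothing is executed, commands and fstab lines are printed.

# Child datasets under POOL/root, as in the list above.
datasets="var usr tmp"

# root_cmds POOL
# Print the "zfs create" commands for the system tree. Only the top
# dataset sets mountpoint=legacy; the children are assumed to inherit it.
root_cmds() {
    pool=$1
    echo "zfs create -o mountpoint=legacy $pool/root"
    for fs in $datasets; do
        echo "zfs create $pool/root/$fs"
    done
}

# fstab_lines POOL
# Print one /etc/fstab line per child dataset (tab-separated fields).
fstab_lines() {
    pool=$1
    for fs in $datasets; do
        printf '%s\t%s\tzfs\trw\t0\t0\n' "$pool/root/$fs" "/$fs"
    done
}

# snapshot_cmd POOL NAME
# Print the command that snapshots the whole system tree in one
# recursive (-r) operation.
snapshot_cmd() {
    echo "zfs snapshot -r $1/root@$2"
}

root_cmds pool
fstab_lines pool
snapshot_cmd pool before-upgrade
```

With legacy mountpoints, importing the pool from a rescue system mounts
nothing by itself, which is the point of the exercise.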

For the last system I installed, I tried zfsboot instead of booting off
the /boot directory of a FFS partition.


(*) An example of RAID/JBOD-induced crap, and of the problem of not
using static naming, follows.

I am using a Sun server running FreeBSD. It has sixteen 160 GB SAS
disks and one of those cards I worship: this particular one is
controlled by the aac driver.

As I was going to tinker a lot, I decided to create a RAID-based mirror
for the system, so that I could boot off it and have swap even with a
failed disk, and to use the other 14 disks as a pool with two raidz
vdevs of 6 disks each, leaving two disks as hot spares. Later I removed
one of the hot spares and installed an SSD with two partitions to try
it as L2ARC and log device. As I had gone for the JBOD pain, of course
replacing that disk meant rebooting the server in order to do something
as illogical as creating a "logical" volume on top of it. These cards
just love to be rebooted.

  pool: pool
 state: ONLINE
  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
config:

	NAME             STATE     READ WRITE CKSUM
	pool             ONLINE       0     0     0
	  raidz1-0       ONLINE       0     0     0
	    aacd1        ONLINE       0     0     0
	    aacd2        ONLINE       0     0     0
	    aacd3        ONLINE       0     0     0
	    aacd4        ONLINE       0     0     0
	    aacd5        ONLINE       0     0     0
	    aacd6        ONLINE       0     0     0
	  raidz1-1       ONLINE       0     0     0
	    aacd7        ONLINE       0     0     0
	    aacd8        ONLINE       0     0     0
	    aacd9        ONLINE       0     0     0
	    aacd10       ONLINE       0     0     0
	    aacd11       ONLINE       0     0     0
	    aacd12       ONLINE       0     0     0
	logs
	  gpt/zfs-log    ONLINE       0     0     0
	cache
	  gpt/zfs-cache  ONLINE       0     0     0
	spares
	  aacd14         AVAIL

errors: No known data errors


The fun began when a disk failed. When it happened, I offlined it and
replaced it with the remaining hot spare. But something had changed,
and the pool remained in this state:

% zpool status
  pool: pool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
config:

	NAME                        STATE     READ WRITE CKSUM
	pool                        DEGRADED     0     0     0
	  raidz1-0                  DEGRADED     0     0     0
	    spare-0                 DEGRADED     0     0     0
	      13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
	      aacd14                ONLINE       0     0     0
	    aacd2                   ONLINE       0     0     0
	    aacd3                   ONLINE       0     0     0
	    aacd4                   ONLINE       0     0     0
	    aacd5                   ONLINE       0     0     0
	    aacd6                   ONLINE       0     0     0
	  raidz1-1                  ONLINE       0     0     0
	    aacd7                   ONLINE       0     0     0
	    aacd8                   ONLINE       0     0     0
	    aacd9                   ONLINE       0     0     0
	    aacd10                  ONLINE       0     0     0
	    aacd11                  ONLINE       0     0     0
	    aacd12                  ONLINE       0     0     0
	logs
	  gpt/zfs-log               ONLINE       0     0     0
	cache
	  gpt/zfs-cache             ONLINE       0     0     0
	spares
	  2388350688826453610       INUSE     was /dev/aacd14

errors: No known data errors
%


ZFS was somewhat confused by the JBOD volumes, and it was impossible to
get out of this situation. A reboot revealed that the card had
apparently changed volume numbers. Thanks to the resiliency of ZFS I
didn't lose a single bit of data, but the situation seemed risky.
Finally I managed to fix it by replacing the failed disk, rebooting the
whole server, of course, and doing a zpool replace. But the card added
some confusion, and I still don't know what the disk failure was. No
trace of a meaningful error message.
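
For comparison, this is roughly what the same replacement might look
like on a plain host adapter with labelled disks. It is a dry-run sketch
under stated assumptions, not the procedure I actually followed: it
assumes the new disk shows up as a normal da device without a reboot,
and the device da16 and both labels are hypothetical.

```shell
#!/bin/sh
# Dry run: nothing is executed, the commands are only printed.

# replace_cmds POOL OLD_LABEL NEW_DISK NEW_LABEL
# Print the sequence: take the failed member offline, label the new
# disk, ask ZFS to resilver onto it, then check the pool state.
replace_cmds() {
    pool=$1
    old=$2
    disk=$3
    new=$4
    echo "zpool offline $pool gpt/$old"
    echo "gpart create -s gpt $disk"
    echo "gpart add -t freebsd-zfs -l $new $disk"
    echo "zpool replace $pool gpt/$old gpt/$new"
    echo "zpool status $pool"
}

replace_cmds pool pool-raidz1-disk1 da16 pool-raidz1-disk1b
```

No volume has to be created on a controller first, so in principle no
reboot is involved.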


Best regards,


Borja.