From: Borja Marcos
To: FreeBSD Filesystems
Cc: Scott Long
Subject: RFC: Suggesting ZFS "best practices" in FreeBSD
Date: Tue, 22 Jan 2013 12:03:59 +0100
Message-Id: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>

(Scott, I hope you don't mind being CC'd. I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue.)

Hi :)

I hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well proven "doctrine" in order to avoid problems. In particular, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which pass off personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").

ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions.
Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers, which tend to make things worse.

Since I started using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I will try to explain some of them, along with my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.

1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)

ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.

For example, imagine that I have 16 disks, da0 to da15. One of them, say da5, dies. When I reboot the machine, each device from da6 to da15 will be renumbered one lower (da6 becomes da5, and so on). That is potential for trouble, at the very least.

After several different installations, I have come to prefer static naming. Done with some care, it can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. For instance, I label each big partition (which takes the whole available space) as pool-vdev-disk, for example pool-raidz1-disk1.

When creating a pool, I use these names instead of dealing with device numbers. For example:

% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
config:

        NAME                 STATE     READ WRITE CKSUM
        rpool                ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            gpt/rpool-disk1  ONLINE       0     0     0
            gpt/rpool-disk2  ONLINE       0     0     0
        logs
          gpt/zfs-log        ONLINE       0     0     0
        cache
          gpt/zfs-cache      ONLINE       0     0     0

Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might already be using ZFS, and still be able to import the pool without name collisions.
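
To make the labeling scheme above concrete, here is a minimal dry-run sketch. It only prints the gpart and zpool commands instead of running them, so nothing touches a disk. The device names (da0, da1, da2), the pool name and the helper functions label_for and plan_pool are hypothetical, chosen for illustration; the pool-vdev-diskN pattern is the one described above. Review the printed commands carefully before running any of them on a real system.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands that would put a
# labeled GPT partition on each disk and build a pool from the labels.
# Device names and the pool name are hypothetical examples.

# label_for POOL VDEV N  ->  POOL-VDEV-diskN
label_for() {
    printf '%s-%s-disk%s\n' "$1" "$2" "$3"
}

# plan_pool POOL VDEV DEVICE...
# Prints two gpart commands per device, then a single zpool create
# that refers to the disks by their gpt/ labels.
plan_pool() {
    pool=$1; vdev=$2; shift 2
    n=1
    members=""
    for dev in "$@"; do
        label=$(label_for "$pool" "$vdev" "$n")
        echo "gpart create -s gpt $dev"
        echo "gpart add -t freebsd-zfs -l $label $dev"
        members="$members gpt/$label"
        n=$((n + 1))
    done
    echo "zpool create $pool $vdev$members"
}

# Example: a three-disk raidz1 vdev for a pool called "pool".
plan_pool pool raidz1 da0 da1 da2
```

The example call prints two gpart lines per disk followed by one zpool create line that names gpt/pool-raidz1-disk1 through gpt/pool-raidz1-disk3. Because the pool is built from labels, a renumbering of daN devices after a disk failure does not change the names shown by zpool status.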
Of course you could use gptids, which, as far as I know, are unique, but they are difficult to use, and in case of a disk failure it's not easy to determine which disk to replace.

2- RAID cards.

Simply: avoid them like the plague. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional layer you put between ZFS and the disks will compromise it. I have had bad experiences with "mfi" and "aac" cards.

There are two solutions adopted by RAID card users. Neither is good. The first and obvious one is to create a RAID5 volume, taking advantage of the battery-backed cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something as trivial as replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.

The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least, you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.

The biggest reason to avoid these stupid cards, anyway, is plain and simple: those cards, at least the ones I have tried, bundled by Dell as PERC(insert a random number here) or by Sun, hide the ASC/ASCQ sense codes from the filesystem. Pure crap.

Years ago, fighting this issue, and when ZFS was still rather experimental, I asked for help and Scott Long sent me a simple "don't try this at home" patch that makes the disks available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but so far so good.
And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.

http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679

Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, at a minimum, the data on the disks won't be corrupted and that, if it is, the corruption will be detected. I would rather have ZFS deal with it than work on a sort of "virtual" disk implemented on the RAID card.

Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving a disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (a reboot that is mandatory in order to be able to access the transferred disk). Once again: additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.

Summarizing: I would *strongly* recommend avoiding RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as the H200 seems to work very well. No need to create any JBOD or fancy thing at all. It will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.

But, in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing, by patching the driver. There is one issue, though. On reboot, the RAID cards freeze, and I am not sure why. Maybe that could be fixed. It happens on machines on which I am not using the RAID functionality at all.
They should become "transparent" but they don't.

Also, I think that the so-called JBOD thing would impair the correct operation of a ZFS health daemon doing things such as automatic replacement of failed disks by hot spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.

(See (*) at the end of this message for a problem I have just had with a "JBOD" configuration.)

3- Installation, boot, etc.

Here I am not sure. Before zfsboot became available, I used to create a ZFS-on-root system by doing, more or less, this:

- Install the base system on a pendrive. After the installation, just /boot will be used from the pendrive, and /boot/loader.conf will
- Create the ZFS pool.
- Create and populate the root hierarchy. I used to create something like:

    pool/root
    pool/root/var
    pool/root/usr
    pool/root/tmp

Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically.

I also set the mountpoint of the "system" tree to legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems in case, for example, I boot off a pendrive in order to tinker.

For the last system I installed, I tried zfsboot instead of booting off the /boot directory of an FFS partition.

(*) An example of RAID/JBOD induced crap, and of the problem of not using static naming, follows.

I am using a Sun server running FreeBSD. It has sixteen 160 GB SAS disks, and one of those cards I worship: this particular example is controlled by the aac driver.

As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and to use the other 14 disks as a pool with two raidz vdevs of 6 disks each, leaving two disks as hot spares.
Later I removed one of the hot spares and installed an SSD with two partitions to try and make it work as L2ARC and log. As I had gone for the JBOD pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.

  pool: pool
 state: ONLINE
  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
config:

        NAME               STATE     READ WRITE CKSUM
        pool               ONLINE       0     0     0
          raidz1-0         ONLINE       0     0     0
            aacd1          ONLINE       0     0     0
            aacd2          ONLINE       0     0     0
            aacd3          ONLINE       0     0     0
            aacd4          ONLINE       0     0     0
            aacd5          ONLINE       0     0     0
            aacd6          ONLINE       0     0     0
          raidz1-1         ONLINE       0     0     0
            aacd7          ONLINE       0     0     0
            aacd8          ONLINE       0     0     0
            aacd9          ONLINE       0     0     0
            aacd10         ONLINE       0     0     0
            aacd11         ONLINE       0     0     0
            aacd12         ONLINE       0     0     0
        logs
          gpt/zfs-log      ONLINE       0     0     0
        cache
          gpt/zfs-cache    ONLINE       0     0     0
        spares
          aacd14           AVAIL

errors: No known data errors

The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot spare. But something had changed, and the pool remained in this state:

% zpool status
  pool: pool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        pool                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            spare-0                 DEGRADED     0     0     0
              13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
              aacd14                ONLINE       0     0     0
            aacd2                   ONLINE       0     0     0
            aacd3                   ONLINE       0     0     0
            aacd4                   ONLINE       0     0     0
            aacd5                   ONLINE       0     0     0
            aacd6                   ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            aacd7                   ONLINE       0     0     0
            aacd8                   ONLINE       0     0     0
            aacd9                   ONLINE       0     0     0
            aacd10                  ONLINE       0     0     0
            aacd11                  ONLINE       0     0     0
            aacd12                  ONLINE       0     0     0
        logs
          gpt/zfs-log               ONLINE       0     0     0
        cache
          gpt/zfs-cache             ONLINE       0     0     0
        spares
          2388350688826453610       INUSE     was /dev/aacd14

errors: No known data errors

ZFS was somewhat confused by the JBOD volumes, and it was impossible to get out of this situation. A reboot revealed that the card had apparently changed the volume numbers. Thanks to the resiliency of ZFS, I didn't lose a single bit of data, but the situation seemed risky. Finally I could fix it by replacing the failed disk, rebooting the whole server (of course) and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure was. There was no trace of a meaningful error message.

Best regards,

Borja.