Date: Tue, 22 Jan 2013 17:27:13 -0800
From: Michael DeMan <freebsd@deman.com>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-ID: <AAE9CC17-B5C4-43DC-B86B-2F498FCA5AD4@deman.com>
In-Reply-To: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>
References: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>

I think this would be awesome. Googling around, it is extremely
difficult to know what to do and which practices are current or
obsolete, etc.

I would suggest maybe some separate sections so the information is
organized well and can be easily maintained?

MAIN:
- recommend that anybody using ZFS have a 64-bit processor and 8GB RAM.
- I don't know, but it seems to me that much of what would go in here is
  fairly well known now and probably not changing much?

ROOT ON ZFS:
- section just for this

32-bit AND/OR TINY MEMORY:
- all the tuning needed for the people that aren't following the
  recommended 64-bit+8GB RAM setup.
- probably there are enough such people, even though it seems pretty
  obvious that in a couple more years nobody will have 32-bit or less
  than 8GB RAM?

A couple more things for subsections in topic MAIN - lots of stuff to go
in there...

PARTITIONING:
I could be misinformed here, but my understanding is that best practice
is to use gpart + gnop to:
#1. Ensure proper alignment for 4K sector drives - the latest Western
    Digitals still report as 512.
#2. Ensure a little extra space is left on the drive, since if the whole
    drive is used, a replacement may be a tiny bit smaller and will not
    work.
#3. Label the disks so you know what is what.

MAPPING PHYSICAL DRIVES:
Particularly an issue with SATA drives - basically force the mapping so
if the system reboots with a drive missing (or you add drives) you know
what is what.
- http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/011039.html
- so you can put a label on the disk caddies and when the system says
  'diskXYZ' died - you can just look at the label on the front of the box
  and change 'diskXYZ'.
- also without this - if you reboot after adding disks or with a disk
  missing - all the adaXYZ numbering shifts :(

SPECIFIC TUNABLES
- there are still a myriad of specific tunables that can be very helpful
  even with a 8GB+

ZFS GENERAL BEST PRACTICES
- address the regular ZFS stuff here
- why the ZIL is a good thing even if you think it kills your NFS
  performance
- no vdevs > 8 disks, raidz1 best with 5 disks, raidz2 best with 6
  disks, etc.
- striping over raidz1/raidz2 pools
- striping over mirrors
- etc...

On Jan 22, 2013, at 3:03 AM, Borja Marcos <borjam@sarenet.es> wrote:

> (Scott, I hope you don't mind being CC'd, I'm not sure you read the
> -FS mailing list, and this is a SCSI//FS issue)
>
>
> Hi :)
>
> Hope nobody will hate me too much, but ZFS usage under FreeBSD is
> still chaotic. We badly need a well proven "doctrine" in order to avoid
> problems. Especially, we need to avoid the braindead Linux HOWTO-esque
> crap of endless commands for which no rationale is offered at all, and
> which mix personal preferences and even misconceptions as "advice" (I
> saw one of those howtos which suggested disabling checksums "because
> they are useless").
>
> ZFS is a very different beast from other filesystems, and the setup
> can involve some non-obvious decisions. Worse, Windows oriented server
> vendors insist on bundling servers with crappy raid controllers which
> tend to make things worse.
>
> Since I've been using ZFS on FreeBSD (from the first versions) I have
> noticed several serious problems. I will try to explain some of them,
> and my suggestions for a solution. We should collect more use cases and
> issues and try to reach a consensus.
>
>
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for
> instance)
>
> ZFS was born in a system with static device naming (Solaris). When you
> plug a disk it gets a fixed name.
> As far as I know, at least from my experience with Sun boxes, c1t3d12
> is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>
> For example, imagine that I have 16 disks, da0 to da15. One of them,
> say, da5, dies. When I reboot the machine, all the devices from da6 to
> da15 will be renamed to their device number minus one. Potential for
> trouble, as a minimum.
>
> After several different installations, I prefer to rely on static
> naming. Doing it with some care can really help to make pools portable
> from one system to another. I create a GPT partition in each drive,
> and label it with a readable name. Thus, imagine I label each big
> partition (which takes the whole available space) as pool-vdev-disk,
> for example, pool-raidz1-disk1.
>
> When creating a pool, I use these names instead of dealing with
> device numbers. For example:
>
> % zpool status
>   pool: rpool
>  state: ONLINE
>   scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan 7 16:25:47 2013
> config:
>
>         NAME                 STATE     READ WRITE CKSUM
>         rpool                ONLINE       0     0     0
>           mirror-0           ONLINE       0     0     0
>             gpt/rpool-disk1  ONLINE       0     0     0
>             gpt/rpool-disk2  ONLINE       0     0     0
>         logs
>           gpt/zfs-log        ONLINE       0     0     0
>         cache
>           gpt/zfs-cache      ONLINE       0     0     0
>
> Using a unique name for each disk within your organization is
> important. That way, you can safely move the disks to a different
> server, which might be using ZFS, and still be able to import the pool
> without name collisions. Of course you could use gptids, which, as far
> as I know, are unique, but they are difficult to use and in case of a
> disk failure it's not easy to determine which disk to replace.
>
>
> 2- RAID cards.
>
> Simply: Avoid them like the plague. ZFS is designed to operate on bare
> disks. And it does an amazingly good job. Any additional software layer
> you add on top will compromise it. I have had bad experiences with
> "mfi" and "aac" cards.
>
> There are two solutions adopted by RAID card users.
> Neither of them is good. The first, an obvious one, is to create a
> RAID5 taking advantage of the battery-backed cache (if present). It
> works, but it loses some of the advantages of ZFS. Moreover, trying
> different cards, I have been forced to reboot whole servers in order to
> do something trivial like replacing a failed disk. Yes, there are
> software tools to control some of the cards, but they are at the very
> least cumbersome and confusing.
>
> The second "solution" is to create a RAID0 volume for each disk (some
> RAID card manufacturers even dare to call it JBOD). I haven't seen a
> single instance of this working flawlessly. Again, a replaced disk can
> be a headache. At the very least, you have to deal with a cumbersome
> and complicated management program to replace a disk, and you often
> have to reboot the server.
>
> The biggest reason to avoid these stupid cards, anyway, is plain and
> simple: those cards, at least the ones I have tried, bundled by Dell as
> PERC(insert a random number here) or by Sun, isolate the ASC/ASCQ sense
> codes from the filesystem. Pure crap.
>
> Years ago, fighting this issue, and when ZFS was still rather
> experimental, I asked for help and Scott Long sent me a simple "don't
> try this at home" patch, so that the disks become available to the CAM
> layer, bypassing the RAID card. He warned me of potential issues and
> lost sense codes, but, so far so good. And indeed the sense codes are
> lost when a RAID card creates a volume, even in the misnamed "JBOD"
> configuration.
>
> http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
> http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679
>
> Anyway, even if there might be some issues due to command handling,
> the end to end verification performed by ZFS should ensure that, as a
> minimum, the data on the disks won't be corrupted and, in case it
> happens, it will be detected.
> I would rather have ZFS deal with it, instead of working on a sort of
> "virtual" disk implemented on the RAID card.
>
> Another *strong* reason to avoid those cards, even in "JBOD"
> configurations, is disk portability. The RAID card labels the disks.
> Moving one disk from one machine to another will result in a funny
> situation of confusing "import foreign config/ignore" messages when
> rebooting the destination server (mandatory in order to be able to
> access the transferred disk). Once again, additional complexity,
> useless layering and more reboots. That may be acceptable for Metoosoft
> crap, not for Unix systems.
>
> Summarizing: I would *strongly* recommend avoiding the RAID cards and
> getting proper host adapters without any fancy functionality instead.
> The one sold by Dell as H200 seems to work very well. No need to create
> any JBOD or fancy thing at all. It will just expose the drives as
> normal SAS/SATA ones. A host adapter without fancy firmware is the best
> guarantee against failures caused by fancy firmware.
>
> But, in case that's not possible, I am still leaning towards the kludge
> of bypassing the RAID functionality, and even avoiding the JBOD/RAID0
> thing by patching the driver. There is one issue, though. In case of
> reboot, the RAID cards freeze, I am not sure why. Maybe that could be
> fixed; it happens on machines on which I am not using the RAID
> functionality at all. They should become "transparent" but they don't.
>
> Also, I think that the so-called JBOD thing would impair the correct
> performance of a zfs health daemon doing things such as automatic
> replacement of failed disks by hot-spares, etc. And there won't be a
> real ASC/ASCQ log message for diagnosis.
>
> (See at the bottom to read about a problem I have just had with a
> "JBOD" configuration)
>
>
> 3- Installation, boot, etc.
>
> Here I am not sure.
> Before zfsboot became available, I used to create a zfs-on-root system
> by doing, more or less, this:
>
> - Install base system on a pendrive. After the installation, just
>   /boot will be used from the pendrive, and /boot/loader.conf will
>
> - Create the ZFS pool.
>
> - Create and populate the root hierarchy. I used to create something
>   like:
>
> pool/root
> pool/root/var
> pool/root/usr
> pool/root/tmp
>
> Why pool/root instead of simply "pool"? Because it's easier to
> understand, snapshot, send/receive, etc. Why in a hierarchy? Because,
> if needed, it's possible to snapshot the whole "system" tree
> atomically.
>
> I also set the mountpoint of the "system" tree as legacy, and rely on
> /etc/fstab. Why? In order to avoid an accidental "auto mount" of
> critical filesystems in case, for example, I boot off a pendrive in
> order to tinker.
>
> For the last system I installed, I tried with zfsboot instead of
> booting off the /boot directory of a FFS partition.
>
>
> (*) An example of RAID/JBOD induced crap and the problem of not using
> static naming follows.
>
> I am using a Sun server running FreeBSD. It has 16 160 GB SAS disks,
> and one of those cards I worship: this particular example is controlled
> by the aac driver.
>
> As I was going to tinker a lot, I decided to create a raid-based
> mirror for the system, so that I can boot off it and have swap even
> with a failed disk, and use the other 14 disks as a pool with two raidz
> vdevs of 6 disks, leaving two disks as hot-spares. Later I removed one
> of the hot-spares and installed an SSD with two partitions to try and
> make it work as L2ARC and log. As I had gone for the jbod pain, of
> course replacing that disk meant rebooting the server in order to do
> something as illogical as creating a "logical" volume on top of it.
> These cards just love to be rebooted.
>
>   pool: pool
>  state: ONLINE
>   scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         pool             ONLINE       0     0     0
>           raidz1-0       ONLINE       0     0     0
>             aacd1        ONLINE       0     0     0
>             aacd2        ONLINE       0     0     0
>             aacd3        ONLINE       0     0     0
>             aacd4        ONLINE       0     0     0
>             aacd5        ONLINE       0     0     0
>             aacd6        ONLINE       0     0     0
>           raidz1-1       ONLINE       0     0     0
>             aacd7        ONLINE       0     0     0
>             aacd8        ONLINE       0     0     0
>             aacd9        ONLINE       0     0     0
>             aacd10       ONLINE       0     0     0
>             aacd11       ONLINE       0     0     0
>             aacd12       ONLINE       0     0     0
>         logs
>           gpt/zfs-log    ONLINE       0     0     0
>         cache
>           gpt/zfs-cache  ONLINE       0     0     0
>         spares
>           aacd14         AVAIL
>
> errors: No known data errors
>
>
> The fun began when a disk failed. When it happened, I offlined it, and
> replaced it with the remaining hot-spare. But something had changed,
> and the pool remained in this state:
>
> % zpool status
>   pool: pool
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: resilvered 192K in 0h0m with 0 errors on Wed Dec 5 08:31:57 2012
> config:
>
>         NAME                        STATE     READ WRITE CKSUM
>         pool                        DEGRADED     0     0     0
>           raidz1-0                  DEGRADED     0     0     0
>             spare-0                 DEGRADED     0     0     0
>               13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
>               aacd14                ONLINE       0     0     0
>             aacd2                   ONLINE       0     0     0
>             aacd3                   ONLINE       0     0     0
>             aacd4                   ONLINE       0     0     0
>             aacd5                   ONLINE       0     0     0
>             aacd6                   ONLINE       0     0     0
>           raidz1-1                  ONLINE       0     0     0
>             aacd7                   ONLINE       0     0     0
>             aacd8                   ONLINE       0     0     0
>             aacd9                   ONLINE       0     0     0
>             aacd10                  ONLINE       0     0     0
>             aacd11                  ONLINE       0     0     0
>             aacd12                  ONLINE       0     0     0
>         logs
>           gpt/zfs-log               ONLINE       0     0     0
>         cache
>           gpt/zfs-cache             ONLINE       0     0     0
>         spares
>           2388350688826453610       INUSE     was /dev/aacd14
>
> errors: No known data errors
> %
>
>
> ZFS was somewhat confused by the JBOD volumes, and it was impossible
> to end this situation.
> A reboot revealed that the card, apparently, had changed volume
> numbers. Thanks to the resiliency of ZFS, I didn't lose a single bit of
> data, but the situation seemed to be risky. Finally I could fix it by
> replacing the failed disk, rebooting the whole server, of course, and
> doing a zpool replace. But the card added some confusion, and I still
> don't know what the disk failure was. No traces of a meaningful error
> message.
>
>
> Best regards,
>
>
> Borja.
>
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
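
A rough sketch of the gpart + gnop procedure listed under PARTITIONING,
using the pool-vdev-disk label scheme from the quoted message. It is a
dry run that only prints the commands. The device names (da0 to da4), the
pool name and the partition size are hypothetical, and the gnop step
assumes the commonly described trick of creating the pool on
4096-byte-sector .nop providers so that ZFS uses 4K alignment. Please
check gpart(8), gnop(8) and zpool(8) on your release before running
anything.

```sh
#!/bin/sh
# Dry run: prints the commands instead of executing them.
# All names and sizes below are hypothetical placeholders.

POOL="pool"
VDEV="raidz1"
DISKS="da0 da1 da2 da3 da4"   # 5 disks, per the raidz1 suggestion above
PART_SIZE="2790G"             # assumed: a bit less than a 3 TB drive holds

# Print the partitioning commands for one disk.
# $1 = device name, $2 = GPT label
plan_disk() {
    echo "gpart create -s gpt $1"
    # -a aligns the partition start, -s leaves slack at the end of the
    # drive, -l sets the static label that ZFS will see as gpt/<label>.
    echo "gpart add -t freebsd-zfs -a 1m -s $PART_SIZE -l $2 $1"
    # Temporary provider that reports 4096-byte sectors.
    echo "gnop create -S 4096 /dev/gpt/$2"
}

# Print the whole plan: partition every disk, create the pool on the
# .nop providers, then drop them and re-import by label.
plan_pool() {
    n=1
    members=""
    for disk in $DISKS; do
        label="$POOL-$VDEV-disk$n"
        plan_disk "$disk" "$label"
        members="$members gpt/$label.nop"
        n=$((n + 1))
    done
    echo "zpool create $POOL raidz$members"
    echo "zpool export $POOL"
    n=1
    for disk in $DISKS; do
        echo "gnop destroy /dev/gpt/$POOL-$VDEV-disk$n.nop"
        n=$((n + 1))
    done
    echo "zpool import -d /dev/gpt $POOL"
}

plan_pool
```

Review the printed commands and, once they match your hardware, run them
as root. Creating the pool destroys whatever is on the listed disks.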