FreeBSD Mail Archives

Date:      Tue, 22 Jan 2013 17:27:13 -0800
From:      Michael DeMan <freebsd@deman.com>
To:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject:   Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Message-ID:  <AAE9CC17-B5C4-43DC-B86B-2F498FCA5AD4@deman.com>
In-Reply-To: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>
References:  <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>


I think this would be awesome.  Even after Googling around, it is extremely difficult to know what to do and which practices are current and which are obsolete.

I would suggest maybe some separate sections so the information is organized well and can be easily maintained?


MAIN: 
- recommend that anybody using ZFS have a 64-bit processor and 8GB RAM.
- I don't know, but it seems to me that much of what would go in here is fairly well known now and probably not changing much?

ROOT ON ZFS:
- section just for this

32-bit AND/OR TINY MEMORY:
- all the tuning needed for the people that aren't following recommended 64-bit+8GB RAM setup.
- there are probably still enough of these people to justify it, even though it seems pretty obvious that in a couple more years nobody will have 32-bit or less than 8GB RAM?



A couple more things for subsections in topic MAIN - lots of stuff to go in there...


PARTITIONING:
I could be misinformed here, but my understanding is that best practice is to use gpart + gnop to:
#1.  Ensure proper alignment for 4K sector drives - the latest Western Digitals still report as 512.
#2.  Ensure a little extra space is left on the drive since if the whole drive is used, a replacement may be a tiny bit smaller and will not work.
#3.  Label the disks so you know what is what.
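
To make points #1 and #2 concrete, here is a rough sketch of the arithmetic involved. It is only an illustration: the 1 MiB alignment, the 200 MiB of slack, and the function name plan_partition are assumptions for the example, not values taken from this thread.

```python
MIB = 1024 * 1024

def plan_partition(disk_bytes, align=MIB, reserve=200 * MIB, first_usable=34 * 512):
    """Return (start, size) in bytes for one data partition on a disk.

    align        -- assumed alignment; any multiple of 4096 suits 4K-sector drives
    reserve      -- assumed slack left unused at the end of the disk
    first_usable -- assumed first byte available after the partition table
    """
    # Round the start up to the next alignment boundary.
    start = -(-first_usable // align) * align
    # Keep `reserve` bytes free at the end so a slightly smaller
    # replacement drive can still hold the same partition.
    usable = disk_bytes - reserve - start
    if usable <= 0:
        raise ValueError("disk too small for the requested reserve")
    # Round the size down so the partition also ends on a boundary.
    size = usable // align * align
    return start, size
```

Both the start and the size come out as multiples of the alignment, so the partition is 4K-aligned even when the drive reports 512-byte sectors.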

MAPPING PHYSICAL DRIVES:
Particularly an issue with SATA drives - basically force the mapping so that if the system reboots with a drive missing (or you add drives) you know what is what.
- http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/011039.html
- so you can put a label on the disk caddies and when the system says 'diskXYZ' died - you can just look at the label on the front of the box and change 'diskXYZ'.
- also without this - if you reboot after adding disks or with a disk missing - all the adaXYZ numbering shifts :(
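
A toy model of the renumbering problem, in case it helps whoever writes this section. It assumes, as a simplification, that device names are handed out in probe order with no gaps; the bay numbers and the helper names probe_names and shifted_bays are made up for the example.

```python
def probe_names(present_bays, prefix="ada"):
    """Name disks in probe order with no gaps (simplified model)."""
    return {bay: f"{prefix}{i}" for i, bay in enumerate(sorted(present_bays))}

def shifted_bays(before, after):
    """Bays whose device name differs between two boots."""
    return sorted(bay for bay in after
                  if bay in before and before[bay] != after[bay])

# Boot 1: four bays populated. Boot 2: the disk in bay 1 is missing.
boot1 = probe_names([0, 1, 2, 3])
boot2 = probe_names([0, 2, 3])
for bay in shifted_bays(boot1, boot2):
    print(f"bay {bay}: {boot1[bay]} -> {boot2[bay]}")
```

In this model every disk after the missing one gets a new name, while a label written on the disk itself (and on the caddy) stays the same across reboots.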


SPECIFIC TUNABLES
- there are still a myriad of specific tunables that can be very helpful even with a 8GB+ 

ZFS GENERAL BEST PRACTICES - address the regular ZFS stuff here 
- why the ZIL is a good thing even if you think it kills your NFS performance
- no vdevs > 8 disks, raidz1 best with 5 disks, raidz2 best with 6 disks, etc.
- striping over raidz1/raidz2 pools
- striping over mirrors
- etc...
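
For the vdev width item, a small sketch of the arithmetic behind the "5 disks for raidz1, 6 for raidz2" numbers. The usual explanation I have seen is that both leave a power-of-two number of data disks; that rationale is my assumption here, and the calculation ignores padding and metadata overhead. The function name raidz_layout is made up.

```python
def raidz_layout(width, parity):
    """Rough numbers for one raidz vdev of `width` disks with `parity` parity disks."""
    data = width - parity
    if data < 1:
        raise ValueError("a vdev needs at least one data disk")
    return {
        "data_disks": data,
        # Fraction of raw space left for data, ignoring padding and metadata.
        "usable_fraction": data / width,
        # True when the data disk count is a power of two (1, 2, 4, 8, ...).
        "power_of_two_data": data & (data - 1) == 0,
    }

for width, parity in [(5, 1), (6, 2), (6, 1)]:
    print(width, parity, raidz_layout(width, parity))
```

Both recommended widths come out with 4 data disks, while a 6-disk raidz1 has 5.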


On Jan 22, 2013, at 3:03 AM, Borja Marcos <borjam@sarenet.es> wrote:

> (Scott, I hope you don't mind being CC'd; I'm not sure you read the -FS mailing list, and this is a SCSI/FS issue)
> 
> 
> 
> Hi :)
> 
> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
> 
> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows oriented server vendors insist on bundling servers with crappy raid controllers which tend to make things worse.
> 
> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus. 
> 
> 
> 
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
> 
> ZFS was born in a system with static device naming (Solaris). When you plug a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
> 
> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed to the device number -1. Potential for trouble as a minimum.
> 
> After several different installations, I have come to prefer static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. For example, I label each big partition (which takes the whole available space) as pool-vdev-disk, e.g. pool-raidz1-disk1.
> 
> When creating a pool, I use these names instead of dealing with device numbers. For example:
> 
> % zpool status
>  pool: rpool
> state: ONLINE
>  scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
> config:
> 
> 	NAME             STATE     READ WRITE CKSUM
> 	rpool            ONLINE       0     0     0
> 	  mirror-0       ONLINE       0     0     0
> 	    gpt/rpool-disk1       ONLINE       0     0     0
> 	    gpt/rpool-disk2       ONLINE       0     0     0
> 	logs
> 	  gpt/zfs-log    ONLINE       0     0     0
> 	cache
> 	  gpt/zfs-cache  ONLINE       0     0     0
> 
> Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might be using ZFS, and still be able to import the pool without name collisions. Of course  you could use gptids, which, as far as I know, are unique, but they are difficult to use and in case  of a disk failure it's not easy to determine which disk to replace.
> 
> 
> 
> 
> 2- RAID cards.
> 
> Simply: Avoid them like the plague. ZFS is designed to operate on bare disks. And it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards. 
> 
> There are two solutions adopted by RAID card users. Neither of them is good. The first and obvious one is to create a RAID5, taking advantage of the battery-backed cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something as trivial as replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.
> 
> The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least, you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.
> 
> The biggest reason to avoid these stupid cards, anyway, is plain simple: Those cards, at least the ones I have tried bundled by Dell as PERC(insert a random number here) or Sun, isolate the ASC/ASCQ sense codes from the filesystem. Pure crap.
> 
> Years ago, fighting this issue, and when ZFS was still rather experimental, I asked for help and Scott Long sent me a "don't try this at home" simple patch, so that the disks become available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but, so far so good. And indeed the sense codes are lost when a RAID card creates a volume, even if in the misnamed "JBOD" configuration. 
> 
> 
> http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
> http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679
> 
> Anyway, even if there might be some issues due to command handling, the end to end verification performed by ZFS should ensure that, as a minimum, the data on the disks won't  be corrupted and, in case it happens, it will be detected. I rather prefer to have ZFS deal with it, instead of working on a sort of "virtual" disk implemented on the RAID card.
> 
> Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving one disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (mandatory in order to be able to access the transferred disk). Once again, additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.
> 
> Summarizing: I would *strongly* recommend avoiding the RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as H200 seems to work very well. No need to create any JBOD or fancy thing at all. It will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.
> 
> But, in case that's not possible, I am still leaning to the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing by patching the driver. There is one issue, though. In case of reboot, the RAID cards freeze, I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent" but they don't. 
> 
> Also, I think that  the so-called JBOD thing would impair the correct performance of a zfs health daemon doing things such as automatic failed disk replacement by hot-spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.
> 
> (See at the bottom to read about a problem I have just had with a "JBOD" configuration)
> 
> 
> 
> 
> 3- Installation, boot, etc.
> 
> Here I am not sure. Before zfsboot became available, I used to create a zfs-on-root system by doing, more or less, this:
> 
> - Install base system on a pendrive. After the installation, just /boot will be used  from the pendrive, and /boot/loader.conf will 
> 
> - Create the ZFS pool.
> 
> - Create and populate the root hierarchy. I used to create something like:
> 
> pool/root
> pool/root/var
> pool/root/usr
> pool/root/tmp
> 
> Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why in a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically. 
> 
> I also set the mountpoint of the "system" tree as legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount"  of critical filesystems in case, for example, I boot off a pendrive in order to tinker. 
> 
> For the last system I installed, I tried with zfsboot instead of booting off the /boot directory of a FFS partition.
> 
> 
> 
> 
> (*) An example of RAID/JBOD induced crap and the problem of not using static naming follows, 
> 
> I am using a Sun server running FreeBSD. It has 16 160 GB SAS disks, and one of those cards I worship: this particular example is controlled by the aac driver. 
> 
> As I was going to tinker a lot, I decided to create a raid-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and use the other 14 disks as a pool with two raidz vdevs of 6 disks, leaving two disks as hot-spares. Later  I removed one of the hot-spares and I installed a SSD disk with two partitions to try and make it work as L2ARC  and log. As I had gone for the jbod pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.
> 
>  pool: pool
> state: ONLINE
>  scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
> config:
> 
> 	NAME             STATE     READ WRITE CKSUM
> 	pool             ONLINE       0     0     0
> 	  raidz1-0       ONLINE       0     0     0
> 	    aacd1        ONLINE       0     0     0
> 	    aacd2        ONLINE       0     0     0
> 	    aacd3        ONLINE       0     0     0
> 	    aacd4        ONLINE       0     0     0
> 	    aacd5        ONLINE       0     0     0
> 	    aacd6        ONLINE       0     0     0
> 	  raidz1-1       ONLINE       0     0     0
> 	    aacd7        ONLINE       0     0     0
> 	    aacd8        ONLINE       0     0     0
> 	    aacd9        ONLINE       0     0     0
> 	    aacd10       ONLINE       0     0     0
> 	    aacd11       ONLINE       0     0     0
> 	    aacd12       ONLINE       0     0     0
> 	logs
> 	  gpt/zfs-log    ONLINE       0     0     0
> 	cache
> 	  gpt/zfs-cache  ONLINE       0     0     0
> 	spares
> 	  aacd14         AVAIL   
> 
> errors: No known data errors
> 
> 
> 
> The fun began when a disk failed. When it happened, I offlined it and replaced it with the remaining hot-spare. But something had changed, and the pool remained in this state:
> 
> % zpool status
>  pool: pool
> state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
> 	Sufficient replicas exist for the pool to continue functioning in a
> 	degraded state.
> action: Online the device using 'zpool online' or replace the device with
> 	'zpool replace'.
>  scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
> config:
> 
> 	NAME                        STATE     READ WRITE CKSUM
> 	pool                        DEGRADED     0     0     0
> 	  raidz1-0                  DEGRADED     0     0     0
> 	    spare-0                 DEGRADED     0     0     0
> 	      13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
> 	      aacd14                ONLINE       0     0     0
> 	    aacd2                   ONLINE       0     0     0
> 	    aacd3                   ONLINE       0     0     0
> 	    aacd4                   ONLINE       0     0     0
> 	    aacd5                   ONLINE       0     0     0
> 	    aacd6                   ONLINE       0     0     0
> 	  raidz1-1                  ONLINE       0     0     0
> 	    aacd7                   ONLINE       0     0     0
> 	    aacd8                   ONLINE       0     0     0
> 	    aacd9                   ONLINE       0     0     0
> 	    aacd10                  ONLINE       0     0     0
> 	    aacd11                  ONLINE       0     0     0
> 	    aacd12                  ONLINE       0     0     0
> 	logs
> 	  gpt/zfs-log               ONLINE       0     0     0
> 	cache
> 	  gpt/zfs-cache             ONLINE       0     0     0
> 	spares
> 	  2388350688826453610       INUSE     was /dev/aacd14
> 
> errors: No known data errors
> % 
> 
> 
> ZFS was somewhat confused by the JBOD volumes, and it was impossible to end this situation. A reboot revealed that the card, apparently, had changed volume numbers. Thanks to the resiliency of ZFS, I didn't lose a single bit of data, but the situation seemed to be risky. Finally I could fix it by replacing the failed disk, rebooting the whole server, of course, and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure was. No traces of a meaningful error message. 
> 
> 
> 
> 
> Best regards,
> 
> 
> 
> 
> 
> 
> Borja.
> 
> 
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AAE9CC17-B5C4-43DC-B86B-2F498FCA5AD4>
