From: Stephen McKay
To: Eitan Adler
Cc: FreeBSD FS, Stephen McKay
Subject: ZFS FAQ (Was: SSD recommendations for ZFS cache/log)
Date: Fri, 16 Nov 2012 15:58:27 +1100

On Thursday, 15th November 2012, Eitan Adler wrote:

>Can people here please tell me what is wrong in the following content?

A few things.  I'll intersperse them.

>Is there additional data or questions to add?

The whole ZFS world desperately needs good documentation.  There are
misconceptions everywhere.  There are good tuning hints and bad (or out
of date) ones.  Further, it depends on your target application whether
the defaults are fairly good or plain suck.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
is one of the places to go, but it's quite Solaris specific.  It's also
starting to get a bit dated.

The existing http://wiki.freebsd.org/ZFS and ZFSTuningGuide are a
mixture of old and recent information and are not a useful intro to the
subject for a system administrator.

As a project, we should have some short pithy "How to do ZFS right"
document that targets workstations, file servers, database servers and
web servers (as a start).  There are so many things to balance it's not
obvious what to do.  Simply whacking in a cheap SSD and a bunch of slow
disks is rarely the right answer.

Included in this hypothetical guide would also be the basics of
partitioning using gpart (GPT) with labels and boot partitions to allow
complex setups.  For example, I've set up small servers where a small
slice of each disk becomes a massively mirrored root pool while the
majority of each disk becomes a raidz2 pool.  This makes sense where you
need maximum flexibility and can afford only a few spindles.

>Please note that I've never used ZFS so being whacked by a cluebat
>would be helpful.

My cluebat is small. :-)  I've used ZFS for real for a few different
target applications but by no means have I covered all uses.  I think we
need input from many people to make a useful ZFS FAQ.

>+ What is the ZIL and when does it get used?
>+
>+ The ZIL (ZFS
>+ intent log) is a write cache for ZFS.  All writes get
>+ recorded in the ZIL.  Eventually ZFS will perform a
>+ Transaction Group Commit in which it
>+ flushes out the data in the ZIL to disk.
>+

The ZIL is not a cache.  It is only used for synchronous writes, not for
all writes.  It is only read during crash recovery.  Its purpose is data
integrity.

Async writes (most writes) are kept in RAM and bundled into
transactions.  Transactions are written to disk in an atomic fashion.
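As a rough illustration of the sync/async distinction, here is a minimal
Python sketch.  It is not ZFS-specific, and the function names are made
up for this example.  The first function does a plain write, which the
OS is free to acknowledge while the data is still only in RAM.  The
second adds fsync(), so it must not return until the OS reports the data
is on stable storage, which is the kind of write the ZIL exists to
serve.

```python
import os
import tempfile


def write_async(path, data):
    """Plain buffered write: may return while the data is still only in RAM."""
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


def write_sync(path, data):
    """Write, then fsync(): returns only after the OS reports the data stable."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's own buffer down to the kernel
        os.fsync(f.fileno())   # ask the kernel to commit it to stable storage
    return len(data)


if __name__ == "__main__":
    # Hypothetical scratch files, only to exercise both paths.
    workdir = tempfile.mkdtemp()
    write_async(os.path.join(workdir, "async.dat"), b"x" * 4096)
    write_sync(os.path.join(workdir, "sync.dat"), b"x" * 4096)
```

Both calls leave the same bytes in the file.  The only difference is
when the caller is told the write is done, and therefore what is
guaranteed to survive a crash.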
The ZIL is needed for writes that have been acknowledged as written but
which are not yet on disk as part of a transaction.

Sync writes will result from fsync() calls, being an NFS server, most
netatalk stuff I've seen, and probably a lot of other stuff.  But
crucially, you get none from editing, compiling, playing nethack, web
browsing and many other things.  So you may not need a separate fast
ZIL, which was basically the question that started all this off. :-)

So, I guess you need a "Do I need an SSD for ZIL?" question here
somewhere, and a similar "Do I need an SSD for L2ARC?" to go with it.

>+ What is the L2ARC?
>+
>+ The L2ARC is a read cache stored
>+ on a fast device such as an SSD.  It is
>+ used to speed up operations such as deduplication or
>+ encryption.  This cache is not persisent across
>+ reboots.  Note that RAM is used as the first layer
>+ of cache and the L2ARC is only needed if there is
>+ insufficient RAM.
>+

The L2ARC is a general read cache which happens to speed up dedup
because the dedup table typically gets very large and will not fit into
ARC.  It's not primarily there to help dedup.

I don't think it's anything to do with encryption.  Or compression, if
that's what you meant to write.

I wish I could expand on the "you don't need L2ARC if you have enough
RAM" idea, but that's basically true.  L2ARC isn't a free lunch either
as it needs space in the ARC to index it.  So, perversely, a working set
that fits perfectly in the ARC will not fit perfectly any more if an
L2ARC is used because part of the ARC is holding the index, pushing part
of the working set into the L2ARC which is presumably slower than RAM.

I think people could still write research papers on this aspect of ZFS.
It's also an area where the defaults seem poorly tuned, at least if we
believe our own ZFSTuningGuide wiki page.

>+ Is enabling deduplication advisable?
>+
>+ The answer very much depends on the expected workload.
>+ Deduplication takes up a signifigent amount of RAM and CPU
>+ time and may slow down read and write disk access times.
>+ Unless one is storing data that is very heavily
>+ duplicated (such as virtual machine images, or user
>+ backups) it is likely that deduplication will do more harm
>+ than good.  Another consideration is the inability to
>+ revert deduplication status.  If deduplication is enabled,
>+ data written, and then dedup is disabled, those blocks
>+ which were deduplicated will not be duplicated until
>+ they are next modified.
>+

s/signifigent/significant/

I've got a really short answer to whether or not you should enable
dedup which I give people who ask: No.  I have a longer answer too, but
I think the short answer is better than typing all day. :-)

I like your version, but would be tempted to make it more scary so
people don't discover too late the long term pain dedup causes.  People
rarely expect, for example, that deleting stuff can be slow, but with
dedup it can be glacial, especially if your dedup table doesn't fit in
RAM.  Perhaps you should start with the words "Generally speaking, no."

Cheers,

Stephen.