Date: Thu, 12 May 2011 01:34:29 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Daniel Kalchev <daniel@digsys.bg>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS: How to enable cache and logs.
Message-ID: <20110512083429.GA58841@icarus.home.lan>
In-Reply-To: <4DCB7F22.4060008@digsys.bg>
References: <4DCA5620.1030203@dannysplace.net> <4DCB455C.4020805@dannysplace.net> <alpine.GSO.2.01.1105112146500.20825@freddy.simplesystems.org> <20110512033626.GA52047@icarus.home.lan> <4DCB7F22.4060008@digsys.bg>
On Thu, May 12, 2011 at 09:33:06AM +0300, Daniel Kalchev wrote:
> On 12.05.11 06:36, Jeremy Chadwick wrote:
> >On Wed, May 11, 2011 at 09:51:58PM -0500, Bob Friesenhahn wrote:
> >>On Thu, 12 May 2011, Danny Carroll wrote:
> >>>Replying to myself in order to summarise the recommendations (when
> >>>using v28):
> >>>- Don't use SSD for the Log device.  Write speed tends to be a
> >>>  problem.
> >>
> >>DO use SSD for the log device.  The log device is only used for
> >>synchronous writes.  Except for certain usages (e.g. database and
> >>NFS server), most writes will be asynchronous and never be written
> >>to the log.  Huge synchronous writes will also bypass the SSD log
> >>device.  The log device is for reducing latency on small
> >>synchronous writes.
> >
> >Bob, please correct me if I'm wrong, but as I understand it a log
> >device (ZIL) effectively limits the overall write speed of the pool
> >itself.
>
> Perhaps I misstated it in my first post, but there is nothing wrong
> with using SSD for the SLOG.
>
> You can of course create a usage/benchmark scenario where a (cheap)
> SSD-based SLOG will be worse than a (fast) HDD-based SLOG, especially
> if you are not concerned about latency.  The SLOG resolves two
> issues: it increases the pool throughput (primary storage) by
> removing small synchronous writes from it, which would otherwise
> introduce unnecessary head movement and more IOPS, and it provides
> low latency for small synchronous writes.

I've been reading about this in detail here:

http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained

I had no idea the primary point of a SLOG was to deal with applications
that make use of O_SYNC.  I thought it was supposed to improve write
performance for both asynchronous and synchronous writes.  Obviously
I'm wrong here.

The author's description (at that URL) of an example scenario makes
little sense to me; there's a story he tells about a bank and a US$699
financial transaction which got cached in RAM before the system lost
power -- and how the intent log on a filesystem would be replayed
during reboot.  What guarantee is there that the intent log -- which is
written to the disk -- actually got written to the disk in the middle
of a power failure?  There's a lot of focus there on the idea that "the
intent log will fix everything, but may lose writes", but what
guarantee do I have that the intent log isn't corrupt or botched during
a power failure?  I guess this is why others have mentioned the
importance of BBUs and supercaps, but I don't know what guarantee there
is that during a power failure there won't be some degree of filesystem
corruption or lost data.  There's a lot about ensuring/guaranteeing
filesystem integrity I have yet to learn.
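(Side note for anyone following the thread who wants to experiment: as
of v28 a dedicated log device can be added to and removed from a live
pool, so trying an SSD SLOG is fairly low-risk.  A rough sketch -- the
pool name "tank" and the GPT label gpt/slog0 are placeholders of mine,
not anything from this thread:

    # Attach a single SSD partition as a dedicated log (SLOG) device
    zpool add tank log gpt/slog0

    # Watch per-vdev activity; synchronous writes land on the log vdev,
    # asynchronous writes go straight to the primary storage vdevs
    zpool iostat -v tank 5

    # Log device removal exists since pool v19, so it can be detached
    zpool remove tank gpt/slog0

For anything you care about you would presumably mirror the log --
"zpool add tank log mirror gpt/slog0 gpt/slog1" -- since, as I
understand it, losing an unmirrored SLOG on pre-v19 pools could render
the pool unimportable.)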
> The latter is only valid if the SSD is sufficiently write-optimized.
> Most consumer SSDs end up saturated by writes.  Sequential write IOPS
> is what matters here.

Oh, I absolutely agree on this point.  So basically consumer-level SSDs
that don't provide extreme write speed benefits (compared to a classic
mechanical HDD) -- not discussing seek times here, we all know SSDs win
there -- probably aren't good candidates for SLOGs.

What's interesting about the focus on IOPS is that Intel SSDs, in the
consumer class, still trump their competitors.  But given that your
above statement focuses on sequential writes, and the site I provided
is quite clear about what happens to sequential writes on an Intel SSD
that doesn't have TRIM..... yeah, you get where I'm going with this. :-)

> About TRIM.  As was already mentioned, you will use only a small
> portion of a (for example) 32GB SSD for the SLOG.  If you do not
> allocate the entire SSD, then wear leveling will be able to work
> well and it is very likely you will not suffer any performance
> degradation.

That sounds ideal, though I'm not sure about the "won't suffer ANY
performance degradation" part.  I think degradation is just less likely
to be witnessed.

I should clarify what "allocate" in the above paragraph means (for
readers, not for you Daniel :-) ): it means disk space actually used
(LBAs actually written to).  Wear levelling works better when there's
more available (unused) flash.  The fuller the disk (filesystem(s)) is,
the worse the wear levelling algorithm performs.
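In practical terms that just means leaving most of the SSD
unpartitioned.  A minimal sketch of what I mean -- the device name
ada2, the 4GB size and the label are illustrative assumptions on my
part, not a recommendation:

    # Create a GPT scheme on the (assumed) 32GB SSD
    gpart create -s gpt ada2

    # Give the SLOG a small 4GB partition and deliberately leave the
    # remaining ~28GB unallocated; LBAs that are never written stay
    # free for the controller's wear levelling and garbage collection
    gpart add -t freebsd-zfs -l slog0 -s 4G ada2

    # Then attach /dev/gpt/slog0 as the log vdev, as in the earlier
    # sketch

As I understand it the SLOG only ever holds a few seconds' worth of
not-yet-committed synchronous writes anyway, so a small partition isn't
much of a sacrifice.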
> By the way, I do not believe a Windows benchmark has any significance
> in our ZFS usage for the SSDs.  How is TRIM implemented in Windows?
> How does it relate to SSD usage as SLOG and L2ARC?

Yeah, I knew someone would go down this road.  Sigh.

I strongly believe it does have relevance.  The relevance is in the
fact that the non-TRIM benchmarks (read: an OS that has TRIM support
but an SSD that does not, therefore TRIM cannot be used) are strong
indicators that the performance of the SSD -- sequential reads and
writes both -- greatly degrades without TRIM over time.  This is also
why you'll find people (who cannot use TRIM) regularly advocating a
full format (writing zeros to all LBAs on the disk) after prolonged use
without TRIM.

I don't know how TRIM is implemented with NTFS in Windows.

> How can TRIM support ever influence reading from the drive?!

I guess you want more proof, so here you go.  Again, the authors wrote
a bunch of data to the filesystem, took a sequential read benchmark,
then issued TRIM and took another sequential read benchmark.  The
difference is obvious.  This is an X25-V, however, which is the
"low-end" of the consumer series, so the numbers are much worse -- but
this is a drive that runs for around US$100, making it appealing to
people:

http://www.anandtech.com/show/3756/2010-value-ssd-100-roundup-kingston-and-ocz-take-on-intel/5

I imagine the reason this happens is similar to why memory performance
degrades under fragmentation or when there's a lot of "middle-man
stuff" going on.  "Middle-man stuff" in this case means the FTL inside
of the SSD, which is used to correlate LBAs with physical NAND flash
pages (and the physically separate chips; it's not just one big flash
chip, you know).  NAND erase blocks tend to be something like 256KByte
or 512KByte in size, so erasing one means no part of it should be in
use by the OS or underlying filesystem.  How does the SSD know what's
used by the OS?  It has to literally keep track of all the LBAs written
to.  I imagine that list is extremely large and takes time to iterate
over.

TRIM allows the OS to tell the underlying SSD "LBAs x-y aren't in use
any more", which probably removes an entry from the FTL flash<->LBA
map, and even lets the drive do things like move data around between
flash pages so that it can erase a whole NAND erase block.  It can do
the latter given the role of the FTL acting as a "middle-man" as noted
above.

> TRIM is a slow operation.  How often are these issued?

Good question, for which I have no answer.  The same could be asked of
any OS however, not just Windows.  And I've asked the same question
about SSDs' internal "garbage collection" too.  I have no answers, so
you and I are both wondering the same thing.  And yes, I am aware TRIM
is a costly operation.

There's a description I found of the process that makes a lot of
sense, so rather than re-word it I'll just include it here:

http://www.enterprisestorageforum.com/technology/article.php/11182_3910451_2/Fixing-SSD-Performance-Degradation-Part-1.htm

See the paragraph starting with "Another long-awaited technique".

> What is the impact of issuing TRIM to an otherwise loaded SSD?

I'm not sure if "loaded" means "heavy I/O load" or "heavily used"
(space-wise).

If you meant "heavy I/O load": as I understand it -- following forums,
user experiences, etc. -- a heavily-used drive which hasn't had TRIM
issued tends to perform worse as time goes on.  Most people with OSes
that don't have TRIM (OS X, Windows XP, etc.) tend to resort to a full
format of the SSD (every LBA written zero, e.g. the -E flag to newfs)
every so often.  The interval at which TRIM should be performed is
almost certainly up for discussion, but I can't provide any advice
because no OS I run or use seems to implement it (aside from FreeBSD
UFS, and that seems to issue TRIM on BIO_DELETE via GEOM).

(Inline EDIT: Holy crap, I just realised TRIM support has to be enabled
via tunefs on UFS filesystems.  I started digging through the code and
I found the FS_TRIM bit; gee, maybe I should use tunefs -t.  I wish I
had known this; I thought it just did this automatically if the
underlying storage device provided TRIM support.  Sigh)

Here's some data which probably won't mean much to you since it's from
a Windows machine, but the important part is that it's from a Windows
XP SP3 machine -- XP has no TRIM support.

Disk:  Intel 320-series SSD; model SSDSA2CW080G3; 80GB, MLC
SB:    Intel ICH9, in "Enhanced" mode (non-AHCI, non-RAID)
OS:    Windows XP SP3
FS:    NTFS, 4KB cluster size, NTFS atime turned off, NTFS partition
       properly 4KB-aligned
Space: Approximately 6GB of 80GB used.

This disk is very new (only 436 power-on hours).  Here are details of
the disk:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_01.png

And a screen shot of a sequential read benchmark which should speak for
itself.  Block read size is 64KBytes.  This is a raw device read and
not a filesystem-level read, meaning NTFS isn't in the picture here.
What's interesting is the degradation in performance around the 16GB
region:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_02.png

Next, a screen shot of a filesystem-based benchmark.  This is writing
and reading a 256MByte file (to the NTFS filesystem) using different
block sizes.  The horizontal axis is block size, the vertical axis is
speed.  Reads are the blue bars, writes are the orange:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_03.png

And finally, the same device-level sequential read benchmark performed
again to show what effect the write benchmarks may have had on the
disk:

http://jdc.parodius.com/freebsd/i320ssd/ssdsa2cw080g3_04.png

Sadly I can't test sequential writes because it's an OS disk.

So, my findings more or less mimic what other people are seeing as
well.  Given that the read benchmarks are device-level and not
filesystem-level, one shouldn't be pondering Windows -- one should be
pondering the implications of the lack of TRIM and what's going on
within the drive itself.

I also have an Intel 320-series SSD in my home FreeBSD box as an OS
disk (UFS2 / UFS2+SU filesystems).  The amount of space used there is
lower (~4GB).  Do you know of some benchmarking utilities which do
device-level reads and can plot or provide metrics for LBA offsets or
equivalent?
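The crudest approach I can think of is just driving dd by hand.  A
sketch, assuming the SSD shows up as ada1 (reads only, but obviously
double-check the device node before pointing it at anything):

    #!/bin/sh
    # Sample 64MB of sequential reads every 1GB across the first 32GB
    # of the device and print dd's throughput line for each offset.
    # skip= is counted in 64KB blocks, so 1GB = 16384 blocks.
    dev=/dev/ada1
    off=0
    while [ "$off" -lt 32 ]; do
        printf '%2d GB: ' "$off"
        dd if="$dev" of=/dev/null bs=64k count=1024 \
            skip=$((off * 16384)) 2>&1 | grep 'bytes transferred'
        off=$((off + 1))
    done

diskinfo -t also does a quick transfer-rate test at the outside, middle
and inside of the device, which is a rough equivalent of the graphs
above, minus the plotting.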
I could compare that to the Windows benchmarks, but still, I think
we're barking up the wrong tree.  I'm really not comparing ZFS to NTFS
here; I'm saying that TRIM addresses performance problems (to some
degree) regardless of filesystem type.

Anyway, I think that's enough from me for now.  I've written this over
the course of almost 2 hours.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP 4BD6C0CB |