Date: Thu, 12 May 2011 03:40:59 -0500
From: "Matthew D. Fuller" <fullermd@over-yonder.net>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-fs@freebsd.org, Jason Hellenthal <jhell@DataIX.net>
Subject: Re: ZFS: How to enable cache and logs.
Message-ID: <20110512084058.GP90856@over-yonder.net>
In-Reply-To: <20110512010433.GA48863@icarus.home.lan>
References: <4DCA5620.1030203@dannysplace.net>
 <20110511100655.GA35129@icarus.home.lan>
 <4DCA66CF.7070608@digsys.bg>
 <20110511105117.GA36571@icarus.home.lan>
 <4DCA7056.20200@digsys.bg>
 <20110511120830.GA37515@icarus.home.lan>
 <20110511223849.GA65193@DataIX.net>
 <20110512010433.GA48863@icarus.home.lan>

On Wed, May 11, 2011 at 06:04:33PM -0700 I heard the voice of
Jeremy Chadwick, and lo! it spake thus:
>
> (What confuses me about the "idle GC" method is how it determines
> what it can erase -- if the OS didn't tell it what it's using, how
> does it know it can erase the page?)

I'm no expert either, but the following is my understanding...

Remember that SSDs (like ZFS, a layer higher up) don't overwrite
blocks; they write new data to a new block and update the pointers at
the level above them (the disk LBA in this case) to point at the new
location.  So when you overwrite LBA 12345 on the disk with new data,
what actually happens is that the SSD writes that data to currently
empty flash $SOMEWHERE, and updates its internal table so that LBA
12345 requests go there.  The bit of flash that was previously
considered LBA 12345 still contains the old data, but is now "free" as
far as the drive is concerned (though not immediately writable, as it
needs to be erased first).  Sorta like rm'ing a file doesn't actually
delete its contents, just the name pointing to it.

Where GC comes in is that the size you can write/address is smaller
than the size flash has to be erased in.  To pick numbers that are in
the right ballpark (it will vary per drive), you have 512 byte blocks
that you can read/write (like any other drive), but you can only erase
a page of 8k at a time.

So let's suppose you write 16 kB of data to a fresh drive.  You've
written 32 512-byte blocks, which completely fill up 2 8k pages.  All
nice and compact.  Now let's suppose you overwrite from 4k-8k and
12k-16k.  Now we have 8k of remaining useful data, but it's spread out
over 2 8k pages (4k in each).  We can't write new stuff to those two
now "empty" 4k sections, because we have to erase before we can write,
and we can only erase the whole 8k page.
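
To make the 16 kB example concrete, here is a minimal Python sketch of
the out-of-place write behaviour described above.  Everything in it is
an assumption made for illustration: the ToyFlash class, its fields,
and the sequential fill policy are hypothetical, and the 512-byte/8k
sizes are just the ballpark numbers from the text, not how any real
firmware is laid out.

```python
SECTOR = 512                  # smallest unit the host reads/writes
PAGE = 8 * 1024               # erase unit, using the ballpark figure above
PER_PAGE = PAGE // SECTOR     # 16 sectors fit in one erase unit


class ToyFlash:
    """Toy model of out-of-place writes (hypothetical, for illustration)."""

    def __init__(self, num_pages):
        # slots[p][i] holds the LBA stored there, "stale" for dead data,
        # or None while the slot is still erased and writable.
        self.slots = [[None] * PER_PAGE for _ in range(num_pages)]
        self.lba_map = {}     # LBA -> (page, index): the internal table
        self.cursor = 0       # next erased slot; flash is filled in order

    def write(self, lba):
        old = self.lba_map.get(lba)
        if old is not None:
            # The old copy stays in flash but nothing points at it now.
            self.slots[old[0]][old[1]] = "stale"
        page, idx = divmod(self.cursor, PER_PAGE)
        self.slots[page][idx] = lba
        self.lba_map[lba] = (page, idx)
        self.cursor += 1

    def live(self, page):
        return sum(isinstance(s, int) for s in self.slots[page])

    def stale(self, page):
        return self.slots[page].count("stale")


flash = ToyFlash(num_pages=4)
for lba in range(32):                       # 16 kB fills pages 0 and 1
    flash.write(lba)
# Overwrite 4k-8k (sectors 8-15) and 12k-16k (sectors 24-31).
for lba in list(range(8, 16)) + list(range(24, 32)):
    flash.write(lba)

for p in range(4):
    print(f"page {p}: live={flash.live(p)} stale={flash.stale(p)}")
```

In this sketch pages 0 and 1 each end up half live and half stale,
while the rewritten sectors sit in page 2.  The stale halves can't be
reused until their whole page is erased, which is the situation the
next part of the explanation starts from.
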
This is where the GC kicks in; it knows (because those two LBA ranges
have been overwritten) that they're no longer needed, and can notice
that all the remaining important data in those two pages can actually
fit in a single page.  So, it can read 0k-4k and 8k-12k, and write
them into a new empty page.  It then updates its LBA map to point
those logical addresses over to the new in-flash location, and now the
entirety of those two original 8k pages is unused.  So now it can go
ahead and erase them both, and put them on the "ready for reuse" list.

Now, as for TRIM.  There are two ways that a block (or set of blocks)
can become "no longer needed".  One is that they're overwritten with
new data; the drive knows that and can mark them as unused like above.
The other is that they contain data for a file that's deleted.  But
the drive has no idea what files being deleted means.  All that
happens from the drive's perspective is an overwrite of some LBAs
that, to the OS, contain directory info.  It has no way of knowing
that impacts these other LBAs that held a file.  TRIM allows the OS to
say "OK, these LBAs?  Yeah, you can trash 'em now."  And so they end
up on the dead list, ready for the GC to collapse them away like
above.

So neither TRIM nor GC is a replacement for the other.  GC is about
collapsing away reapable space (and also serves a purpose in
wear-levelling, but that's unimportant in this discussion).  The drive
automatically knows about space that's reapable because it was
rewritten.  TRIM lets it know about space that's reapable because of
deletion.  Without that, you could delete a file (so LBA 54321 no
longer contains useful info, and doesn't need to be preserved), but
since the drive doesn't know that, not only can the GC not compact
away that space, it has to go ahead and re-copy that block as if it
held good data when it shuffles stuff around, so you're creating extra
wear.  GC can't make TRIM "unnecessary", any more than a book can make
a flashlight unnecessary.
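
As a rough illustration of the point about extra copying, the Python
sketch below runs the 16 kB example through a toy GC twice, once
without and once with TRIM for a deleted file.  It is a sketch under
loud assumptions: ToyFlash, its method names, the "file in LBAs 0-7",
and the pick-any-erased-page policy are all hypothetical, and real
firmware will differ.

```python
PER_PAGE = 16   # 512-byte sectors per 8k erase unit (ballpark from the text)


class ToyFlash:
    """Toy flash translation layer with GC and TRIM (hypothetical)."""

    def __init__(self, num_pages):
        self.pages = [[] for _ in range(num_pages)]  # slot: LBA or "stale"
        self.lba_map = {}     # LBA -> (page, index)
        self.active = 0       # page currently receiving writes
        self.copies = 0       # sectors the GC had to re-copy (extra wear)

    def _append(self, lba):
        if len(self.pages[self.active]) == PER_PAGE:
            # Current page is full: move on to any erased (empty) page.
            self.active = next(p for p, s in enumerate(self.pages) if not s)
        page = self.pages[self.active]
        page.append(lba)
        self.lba_map[lba] = (self.active, len(page) - 1)

    def _invalidate(self, lba):
        loc = self.lba_map.pop(lba, None)
        if loc is not None:
            self.pages[loc[0]][loc[1]] = "stale"

    def write(self, lba):
        self._invalidate(lba)     # an overwrite tells the drive by itself
        self._append(lba)

    def trim(self, lba):
        self._invalidate(lba)     # the OS says "you can trash this now"

    def gc(self, victim):
        """Copy whatever still looks live out of `victim`, then erase it."""
        assert victim != self.active
        live = [s for s in self.pages[victim] if s != "stale"]
        for lba in live:
            self._append(lba)     # remaps the LBA to its new location
        self.copies += len(live)
        self.pages[victim] = []   # erased, ready for reuse


def copies_needed(send_trim):
    flash = ToyFlash(num_pages=4)
    for lba in range(32):                     # 16 kB fills pages 0 and 1
        flash.write(lba)
    for lba in list(range(8, 16)) + list(range(24, 32)):
        flash.write(lba)                      # overwrite 4k-8k and 12k-16k
    # Assume the filesystem now deletes a file that lived in LBAs 0-7.
    if send_trim:
        for lba in range(8):
            flash.trim(lba)
    flash.gc(victim=0)
    flash.gc(victim=1)
    return flash.copies


print("sectors re-copied without TRIM:", copies_needed(False))
print("sectors re-copied with TRIM:   ", copies_needed(True))
```

Without TRIM the toy GC still carries the deleted file's 8 sectors
along (16 copies in total); with TRIM it only moves the 8 sectors that
are actually still in use.
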
TRIM is one of the ways you provide info for the GC to use.

One thing that CAN make TRIM less important is writing in a "compact"
manner (e.g., always write new data to the lowest available LBA).
Assuming you oscillate around a steady disk usage (or slowly
increase), that means that you'll tend to overwrite space for deleted
files relatively soon, so the drive gets to know about the reapable
space that way.  With more random or other LBA allocation, or if you
shrink the used space significantly, a deleted block may hang around
without being overwritten for much longer, and so have more chance for
the GC to unnecessarily recopy and recopy it.

This leaves entirely to one side annoying implementation issues.  I'm
given to understand that due to some combination of "dumb firmware
implementation" and "dumb standardized requirements", TRIM can be an
unbelievably expensive command, so doing it as part of e.g. 'rm' may
damage performance outrageously.  That may point to a better
implementation being "rack up a list of LBAs and flush periodically",
or "scan filesystem weekly and send TRIMs for all empty LBAs" or the
like.  But again, that's implementation.


-- 
Matthew Fuller     (MF4839)   |  fullermd@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
           On the Internet, nobody can hear you scream.