From: Warner Losh
Date: Sat, 21 Jan 2017 22:16:42 -0700
Subject: Re: how to measure microsd wear
To: Karl Denninger
Cc: "freebsd-arm@freebsd.org"

On Sat, Jan 21, 2017 at 9:29 PM, Karl Denninger wrote:
> On 1/21/2017 18:24, Hal Murray wrote:
>> karl@denninger.net said:
>>> and this one is not a low-hour failure either, nor is it an off-brand --
>>> it's a Sandisk Ultra 32Gb and the machine has roughly a year of 24x7x365
>>> uptime on it.
>> Any idea how many writes it did?
> Offhand, no. I did not expect this particular device to have a problem
> given its workload, but it did. It could have been a completely random
> event (e.g. cosmic ray hits the "wrong" place in the controller's mapping
> tables, damages the data in a critical way, and the controller throws
> up its hands and says "screw you, it's over.") There's no real way to
> know - the card is effectively junk as the controller has write-locked
> it, so all I can do (and did) is get the config files and application it
> runs under the OS off it and put them on the new one.

If you are lucky, the SD card will fail 'read only' rather than 'read
never'. :) You're correct that once you go into that mode, however, you
can't get out of it with standard interfaces. There are rumors of
vendor-specific ones that are used to diagnose failure modes, but I've
never been able to find out more about them as they are
firmware-specific. The NAND chips, however, remain generally readable
and you can do a hunt to see what's what. I say generally, though,
because the list of failure modes for NAND chips is scary....

> The other failures were less-surprising; in particular the box on my
> desk, given that I compile on it frequently and that produces a lot of
> small write I/O activity, didn't shock me all that much when it failed.
>
> One of the big problems with NAND flash (in any form) is that it can
> only be written to "zeros." That is, a blank page is all "1s" at a bit
> level, and a write actually just writes the zeros.

Yes and no. An erased page is all 1's. Programming a page will move the
charge nodes from the 'erased' state to the 'programmed' state. What
this means varies a lot based on the type of NAND. SLC, sure, it goes
from 1 to 0 (although the node for 1 still moves a bit). For MLC or
TLC, it's a lot more complicated because you're encoding 2 or 3 bits
into discrete voltage levels. You have to do the proper dance and
program the pages in the correct order with the correct
'randomizations' in the data (either inside the chip, or external to
it) to make sure that the 'white balance' of bits is very close to
50/50. Also added in this pipeline is the error-correction coding (ECC,
such as LDPC) to ensure that the crappy NAND can recover from the
inevitable bit errors that you know will happen.

> This leads to what
> is called "write amplification" because changing one byte in a page
> requires reading the page in and writing an entire new page out, then
> (usually later) erasing the former page; you cannot update in-place.

The erase and program causes this, yes. But write amplification happens
when the drive has to garbage collect blocks off its log to free up
blocks for new writes. That is different from the effect you are
talking about, which seems to ignore the LBA to physical translation
layer that's in the drive and hidden from the user.

> If
> a page is 4k in size then writing a single byte results in an actual
> write of 4k bytes, or ~4,000 times as much as you think you wrote.

No, that's not how it works. First off, there's no interface for
writing one byte (just programming a page). Second, the OS will
translate writing one byte to writing one block, which will cause a new
page to be written to the end of the log. Flash memory is erased BLOCKS
at a time (usually a few hundred pages) and programmed a page at a
time. When you write a new block, it gets appended to the end of the
"log" with a note about the new LBA to physical mapping. Also, pages
are usually larger than logical blocks, so you often wind up with
multiple blocks living inside a single page. Write amp is caused by
LBAs being rewritten, which creates "holes" in the map. The erase
blocks that are most empty are usually selected to be garbage collected
(the valid blocks are rewritten to the end of the log and the erase
block is erased for use by new writes). So write amp tends to trend as
the inverse of the number of spare blocks in the system...

> This
> is also one of the reasons that random small-block write performance is
> much slower than big writes; if you write an even multiple of an on-card
> block the controller can simply lay down the new data onto pre-erased
> space, where if you write small pieces of data it cannot do that and
> winds up doing a lot of read/write cycling.

Actually, that's crap too.
The reason that small writes are lower performance is down to internal
structures on SSDs that block the physical writes (say 512 bytes or 4k)
into a larger page (say 32k or 64k) to keep the metadata on the LBA to
physical mapping down. So they do a read-modify-write, which adds a
tREAD and some buffering time and increases write amp.

> It gets worse (by a lot) if
> there's file metadata to update with each write as well because that
> metadata almost-certainly winds up carrying a (large) amount of write
> amplification irrespective of the file data itself. All of this is a
> big part of why write I/O performance to these cards for actual
> filesystem use is stinky in the general case compared against
> pretty-much anything else.

Blocks are blocks. Metadata doesn't matter. Blocks get appended to the
device log, so when you write them, it doesn't matter where on the disk
they logically live.

> The controller's internal logic has much voodoo in it from a user's
> perspective; the manufacturers consider exactly how they do what they do
> to be proprietary and simply present to you an opaque block-level
> interface. There are rumors that some controllers "know" about certain
> filesystems (specifically exFAT) and are optimized for it, which implies
> they may behave less-well if you're using something else. How true this
> actually might be is unknown but a couple of years ago I had a card that
> appeared dead -- until it was reformatted with exFAT, at which point it
> started working again. I didn't trust it, needless to say.

Different drives have different strategies. But the exFAT special case,
when it is still used at all, is confined to SD cards; SSDs don't use
it anymore.

> SSDs typically have a published endurance rating and a reasonable
> interface to get a handle on how much "wear" they have experienced.
> I've never seen either in any meaningful form for SD cards of any sort.
> In addition SSDs can (and do) "cheat" in that they all have RAM in them
> and thus can collate writes together before physically committing them
> in some instances, plus they typically will report that a write is
> "complete" when it is in RAM (and not actually in NAND!) Needless to
> say if there's no proper power protection sufficient to flush that RAM
> if the power fails unexpectedly very bad things will happen to your
> data, and very few SSDs have said proper power protection (Intel 7xx and
> 3xxx series are two that are known to do this correctly; I have a bunch
> of the 7xx series drives in service and have never had a problem with
> any of them even under intentional cord-yank scenarios intended to test
> their power-loss protection.) I'm unaware of SD cards that do any of
> this and I suspect their small size precludes it, never mind that they
> were not designed for a workload where this would be terribly useful.
> The use envisioned for most SD cards, and their intent when designed, is
> the sequential writing of anywhere from large to huge files (video or
> still pictures) and the later sequential reading back of same, all under
> some form of a FAT filesystem (exFAT for the larger cards now available.)

That's kinda true. SD cards have no room for super caps. However, the
black art that goes into SD cards has allowed for this and they have
reliability guarantees of their own, but they go slower for it. They
generally don't have large DRAM buffers, and generally write throttle
to NAND rate pretty quickly. If you aren't lying to the OS by saying
the write is complete to win on IOPS benchmarks, the reliability issues
go away. Where SD cards fall down is that their firmware is usually
rubbish at recovering from certain kinds of errors. The usual case of a
camera that loses power while writes are going on generally works
(it's basically a sequential workload) because the metadata they need
for bookkeeping is usually written in a reliable way.
Usually, but not always, which is why I usually rate them as rubbish.

> IMHO the best you can do with these cards in this application is to
> minimize writes to the extent you can, especially small and frequent
> writes of little actual value (e.g. mount with noatime!) and make sure
> you can reasonably recover from failures in a rational fashion.

That's true, but for none of the reasons that you suggest. The reasons
have more to do with the log structure the devices are forced to use,
coupled with newer NAND nodes that have lower and lower endurance. The
good 3D NAND that brings endurance back up again is reserved for NVMe
and SSD drives, but it is starting to show up in high-performance SD
cards as well, though they are still pricy.

BTW, I worked at Fusion-io for a few years writing their 'on load'
driver that did all these things and learning about all the tricks that
are used in the industry. My main area of focus was the NAND
reliability models that were used to get better performance later in
life out of crappy NAND and to ensure the drives could go 5x the
manufacturer's stated endurance numbers with lots of clever tricks.
Secondarily, I worked on garbage collection and radix tree design to
help improve our drives' performance under a variety of workloads.

Warner
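
For anyone who wants to play with the garbage-collection point made
earlier in this message (write amp trending with the inverse of the
spare space), here is a minimal toy sketch. It is not any vendor's
firmware: the function name simulate_write_amp, its parameters, the one
LBA per page layout, the uniformly random overwrite workload and the
"pick the emptiest closed erase block" policy are all assumptions made
purely for illustration.

```python
import random


def simulate_write_amp(spare_fraction, num_blocks=64, pages_per_block=32,
                       host_writes=20_000, seed=0):
    """Toy log-structured flash layer with greedy garbage collection.

    Hypothetical model: one LBA per page, uniformly random overwrites.
    Returns (physical page writes) / (host page writes).
    """
    rng = random.Random(seed)
    num_lbas = int(num_blocks * pages_per_block * (1.0 - spare_fraction))
    # Two erase blocks are kept in reserve, so some spare space is required.
    if not 0 < num_lbas < (num_blocks - 2) * pages_per_block:
        raise ValueError("need more spare space for this toy model")

    valid = [set() for _ in range(num_blocks)]  # live LBAs in each erase block
    where = {}                                  # LBA -> block with its live copy
    free = list(range(1, num_blocks))           # erased blocks ready for use
    state = {"active": 0, "used": 0, "physical": 0}

    def append(lba):
        """Program one page at the end of the log and update the map."""
        if state["used"] == pages_per_block:    # log head is full: open a new block
            state["active"] = free.pop()
            state["used"] = 0
        old = where.get(lba)
        if old is not None:
            valid[old].discard(lba)             # the old copy becomes a "hole"
        valid[state["active"]].add(lba)
        where[lba] = state["active"]
        state["used"] += 1
        state["physical"] += 1

    def collect_one():
        """Relocate the live data of the emptiest closed block, then erase it."""
        free_set = set(free)
        closed = [b for b in range(num_blocks)
                  if b != state["active"] and b not in free_set]
        victim = min(closed, key=lambda b: len(valid[b]))
        for lba in list(valid[victim]):
            append(lba)                         # these extra writes are the write amp
        free.append(victim)

    def host_write(lba):
        while len(free) < 2:                    # keep one block in reserve for GC
            collect_one()
        append(lba)

    for lba in range(num_lbas):                 # precondition: fill the device once
        host_write(lba)
    state["physical"] = 0                       # only count steady-state writes
    for _ in range(host_writes):
        host_write(rng.randrange(num_lbas))
    return state["physical"] / host_writes
```

Calling it with a few values, say simulate_write_amp(0.1),
simulate_write_amp(0.3) and simulate_write_amp(0.5), should show the
ratio dropping toward 1 as the spare fraction grows. The exact numbers
depend entirely on the toy assumptions above and say nothing about a
real card.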