Date: Thu, 28 Jul 2011 07:59:17 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Questions about erasing an ssd to restore performance under FreeBSD
Message-ID: <20110728145917.GA37805@icarus.home.lan>
In-Reply-To: <A6828B6CE6764E13A44B1ABF61CF3FED@multiplay.co.uk>
References: <13BEC27B17D24D0CBF2E6A98FD3227F3@multiplay.co.uk> <20110728012437.GA23430@icarus.home.lan> <FD3A11BEFD064193AA24C1DF09EDD719@multiplay.co.uk> <20110728103234.GA33275@icarus.home.lan> <A6828B6CE6764E13A44B1ABF61CF3FED@multiplay.co.uk>
On Thu, Jul 28, 2011 at 02:22:21PM +0100, Steven Hartland wrote:
> ----- Original Message ----- From: "Jeremy Chadwick" <freebsd@jdc.parodius.com>
> >Well, on FreeBSD /dev/urandom is a symlink to /dev/random.  I've
> >discussed in the past why I use /dev/urandom instead of /dev/random (I
> >happen to work in a heterogeneous OS environment at work, where urandom
> >and random are different things).
> >
> >I was mainly curious why you were using if=/some/actual/file rather than
> >if=/dev/urandom directly.  'tis okay, not of much importance.
>
> /dev/urandom seems to bottleneck at ~60MB/s; a cached file generated
> from it doesn't, e.g.:
>
> dd if=/dev/random of=/dev/null bs=1m count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 16.152686 secs (64916509 bytes/sec)
>
> dd if=/dev/random of=/data/test bs=1m count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 16.178811 secs (64811685 bytes/sec)
>
> dd if=/data/test of=/dev/null bs=1m
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 0.240348 secs (4362738865 bytes/sec)

/dev/urandom is highly CPU-bound.  For example, on my home box it tops
out at about 79MBytes/sec.  I tend to use /dev/zero for I/O testing,
since I really don't need the CPU tied up generating random data from
entropy sources.  The difference in speed is dramatic.

So yes, if you want to test high write speeds with purely randomised
data as your source, creating a temporary file with content from
/dev/urandom first is your best bet.  (Assuming, of course, that the
source you plan to read from can transfer as fast as the writes to the
destination, but that goes without saying.)
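To make that concrete, the kind of thing I have in mind looks roughly
like this (a sketch only -- the device name, file path, and sizes are
examples, and the write test will happily destroy whatever is on the
target device):

    # Pure write-throughput test; no CPU spent generating entropy:
    dd if=/dev/zero of=/dev/adaX bs=1m count=1000

    # If you specifically need random data as the source, generate it
    # once up front, then reuse the (now cached) file for the write test:
    dd if=/dev/urandom of=/var/tmp/random.dat bs=1m count=1000
    dd if=/var/tmp/random.dat of=/dev/adaX bs=1m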
> >Okay, so it sounds like what happened -- if I understand correctly --
> >is that your ZFS-based Corsair SSD volume (/ssd) recently had a bunch
> >of data copied to it.  It still had 60% free space available.  After,
> >the SSD performance for writes really plummeted (~20MByte/sec), but
> >reads were still decent.  Performing an actual ATA-level secure erase
> >brought the drive back to normal write performance (~190MByte/sec).
>
> Yes this is correct.
>
> >If all of that is correct, then I would say the issue is that the
> >internal GC on the Corsair SSD in question sucks.  With 60% of the
> >drive still available, performance should not have dropped to such an
> >abysmal rate; the FTL and wear levelling should have, ideally, dealt
> >with this just fine.  But it didn't.
>
> Agreed
>
> >Why I'm focusing on the GC aspect: because ZFS (or GEOM; whatever,
> >that's an engineering discussion for elsewhere) lacks TRIM.  The
> >underlying filesystem is therefore unable to tell the drive "hey, these
> >LBAs aren't used any more, you can consider them free and perform a
> >NAND page erase when an entire NAND page is unused".  The FTL has to
> >track all LBAs you've written to, because erasing a NAND page which
> >still had used data in it (from the filesystem's point of view) would
> >result in loss of data.
> >
> >So in summary I'm not too surprised by this situation happening, but I
> >*AM* surprised at just how horrible writes became for you.  The white
> >paper I linked you goes over this to some degree -- it talks about how
> >everyone thinks SSDs are "so amazingly fast" yet nobody does benchmarks
> >or talks about how horribly they perform when very little free space is
> >available, or if the GC is badly implemented.  Maybe Corsair's GC is
> >badly implemented -- I don't know.
>
> Agreed again, we've seen a few disks now drop to this level of
> performance.  At first we thought the disk was failing, as the newfs -E
> didn't fix it when the man page indicates it should.  But that seems to
> be explained now: it only works if it's ada not da, and it also isn't
> quite as good as a secure erase.

I guess the newfs(8) man page should be rephrased then.  When I read the
description for the -E option, I see this paragraph:

     Erasing may take a long time as it writes to every sector on the
     disk.

And immediately think "Oh, all it does is write zeros to every LBA,
probably in blocks of some size that's unknown to me (vs. 512 bytes)".

I can submit a PR + patch for this, but I'd propose the man page
description for -E be changed to this:

     -E      Erase the content of the disk before making the filesystem.
             The reserved area in front of the superblock (for bootcode)
             will not be erased.  This option writes zeros to every
             sector (LBA) on the disk, in transfer sizes of, at most,
             65536 * sectorsize bytes.

Basically remove the mention of wear-levelling and "intended for use
with flash devices".  Any device can use this option; it's a UFS-esque
equivalent of dd if=/dev/zero of=/dev/device bs=..., sans the exclusions
mentioned.
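To spell that out as a rough sketch (the device name is just an example;
this destroys everything on the device, and unlike newfs -E it does not
skip the bootcode area in front of the superblock):

    # For a disk advertising 512-byte sectors, the -E behaviour is
    # roughly equivalent to zeroing the device in 32MByte writes:
    dd if=/dev/zero of=/dev/da0 bs=32m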
The tricky part is the "transfer sizes of, at most..." line.  I'm
certain someone will ask me where I got that from, so I'll explain it.
Sorry for the long-winded stuff, but this is more or less how I learn,
and I hope it benefits someone in the process.  And man, I sure hope I'm
reading this code right...

<ignore-if-you-dont-care>

Down the rabbit hole we go: newfs(8) calls berase(3), which is part of
libufs:

501     if (Eflag && !Nflag) {
...
505             berase(&disk, sblock.fs_sblockloc / disk.d_bsize,
506                 sblock.fs_size * sblock.fs_fsize - sblock.fs_sblockloc);

The man page for berase(3) doesn't tell you the size of the I/O transfer
(the "block size") when it asks the kernel to effectively write zeros to
the device.  Looking at src/lib/libufs/block.c, we find this:

143 berase(struct uufsd *disk, ufs2_daddr_t blockno, ufs2_daddr_t size)
...
154         ioarg[0] = blockno * disk->d_bsize;
155         ioarg[1] = size;
156         rv = ioctl(disk->d_fd, DIOCGDELETE, ioarg);

This ioctl(2) (DIOCGDELETE) is not documented anywhere in the entire
source code tree (grep -r DIOCGDELETE /usr/src returns absolutely no
documentation references).  Furthermore, at this point we still have no
idea how the arguments being passed to ioctl are used; is "size" the
total size, or is it the transfer size of the write we're going to
issue?

DIOCGDELETE is handled in src/sys/geom/geom_dev.c, where we finally get
some answers:

293     case DIOCGDELETE:
294             offset = ((off_t *)data)[0];
295             length = ((off_t *)data)[1];
...
303             while (length > 0) {
304                     chunk = length;
305                     if (chunk > 65536 * cp->provider->sectorsize)
306                             chunk = 65536 * cp->provider->sectorsize;
307                     error = g_delete_data(cp, offset, chunk);
308                     length -= chunk;
309                     offset += chunk;

So ioarg[0] is the offset, and ioarg[1] represents the actual TOTAL SIZE
of what we want erased, NOT the transfer block size itself.  The
transfer block size is calculated on lines 305-306: 65536 * the GEOM
provider's "advertised sector size".  On SSDs this would be 512 bytes
(no, I am not kidding).

But we're still not finished.  What is g_delete_data?  It's an internal
GEOM function which does what it's told (heh :-) ).
src/sys/geom/geom_io.c sheds light on that:

739 g_delete_data(struct g_consumer *cp, off_t offset, off_t length)
740 {
741         struct bio *bp;
742         int error;
743
744         KASSERT(length > 0 && length >= cp->provider->sectorsize,
745             ("g_delete_data(): invalid length %jd", (intmax_t)length));
746
747         bp = g_alloc_bio();
748         bp->bio_cmd = BIO_DELETE;
749         bp->bio_done = NULL;
750         bp->bio_offset = offset;
751         bp->bio_length = length;
752         bp->bio_data = NULL;
753         g_io_request(bp, cp);
754         error = biowait(bp, "gdelete");
...

Okay, so without going into g_io_request() (did I not say something
about rabbit holes earlier?), we can safely assume that's even more
abstraction around a BIO_DELETE call.  bp->bio_length is the size of the
data to tinker with, in bytes.

So in summary, with a 512-byte "advertised sector" disk, the erase would
happen in 32MByte "transfer size" blocks.

Let's test that theory with an mdconfig(8) "disk" and a slightly
modified version of newfs(8) that tells us the value of the 3rd argument
it passes to berase(3):

# mdconfig -a -t malloc -s 256m -o reserve -u 0
md0
# sysctl -b kern.geom.conftxt | strings | grep md0
0 MD md0 268435456 512 u 0 s 512 f 0 fs 0 l 268435456 t malloc

Sector size of the md0 pseudo-disk is 512 bytes (5th parameter).  Now
the modified newfs:

# ~jdc/tmp/newfs/newfs -E /dev/md0
/dev/md0: 256.0MB (524288 sectors) block size 16384, fragment size 2048
        using 4 cylinder groups of 64.02MB, 4097 blks, 8256 inodes.
Erasing sectors [128...524287]
berase() 3rd arg: 268369920
super-block backups (for fsck -b #) at:
 160, 131264, 262368, 393472

There's the printf() I added ("berase()...").  So the argument passed to
berase() is 268369920: the size of the pseudo-disk, sans the area in
front of the superblock (65536 bytes; 268435456 - 268369920 == 65536).

Now back to the geom_dev.c code with the data we know:

- Line 295 would assign length to 268369920
- Line 304 would assign chunk to 268369920
- Line 305's conditional would prove true; 268369920 > 33554432
  (65536*512), so chunk becomes 33554432
- Line 307 (and the code it calls) does the actual zeroing

</ignore-if-you-dont-care>

The reason the man page can't say 32MBytes explicitly is because it's
dynamic (based on sector size).  I imagine, somewhere down the road, we
WILL have disks that start advertising non-512-byte sector sizes.  As of
this writing, none I have seen do (neither SSDs nor WD -EARS drives).
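If you want to see what that works out to for a given disk, something
like this would do it (a quick sketch; I'm assuming diskinfo(8)'s
default output format, where the second field is the sector size, and
/dev/md0 is just the example device from above):

    #!/bin/sh
    # Print the erase "transfer size" the DIOCGDELETE path would use
    # for a device: 65536 * the provider's advertised sector size.
    dev=${1:-/dev/md0}
    sectorsize=$(diskinfo "$dev" | awk '{print $2}')
    echo "$dev: sectorsize=$sectorsize, erase chunk=$((sectorsize * 65536)) bytes"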
> >I would see if there are any F/W updates for that model of drive.  The
> >firmware controls the GC model/method.  Otherwise, if this issue is
> >reproducible, I'll add this model of Corsair SSD to my list of drives
> >to avoid.
>
> It's the latest firmware version, already checked that.  The
> performance has been good till now, and I suspect it could be a generic
> SandForce thing if it's a firmware issue.

SandForce-based SSDs have a history of being extremely good with their
GC, but I've never used one.  However, if I remember right (something I
read not more than a week ago, I just can't remember where!), it's very
rare that any SF-based SSD vendor uses the stock SF firmware.  They
modify the hell out of it.  Meaning: two SSDs using the exact same model
of SF controller won't necessarily behave the exact same.  Hmm, I
probably read this on some SSD review site, maybe Anandtech.  I imagine
the same applies to Marvell-based SSD controllers too.

> >Is it possible to accomplish Secure Erase via "camcontrol cmd" with
> >ada(4)?  Yes, but the procedure will be extremely painful, drawn out,
> >and very error-prone.
> >
> >Given that you've followed the procedure on the Linux hdparm/ATA Secure
> >Erase web page, you're aware of the security and "locked" status one
> >has to deal with when using password-protection to accomplish the
> >erase.  hdparm makes this easy because it's just a bunch of
> >command-line flags; the "heavy lifting" on the ATA layer is done
> >elsewhere.  With "camcontrol cmd", you get to submit the raw ATA CDB
> >yourself, multiple times, at different phases.  Just how familiar with
> >the ATA protocol are you? :-)
> >
> >Why I sound paranoid: a typo could potentially "brick" your drive.  If
> >you issue a set-password on the drive, ***ALL*** LBA accesses (read and
> >write) return I/O errors from that point forward.  Make a typo in the
> >password, formulate the CDB wrong, whatever -- suddenly you have a
> >drive that you can't access or use any more because the password was
> >wrong, etc.  If the user doesn't truly understand what they're doing
> >(including the formulation of the CDB), then they're going to panic.
> >
> >camcontrol and atacontrol could both be modified to do the heavy
> >lifting, adding similar options/arguments that would mimic hdparm in
> >operation.  This would greatly diminish the risks, but the *EXACT
> >PROCEDURE* would need to be explained in the man page.  But keep
> >reading for why that may not be enough.
> >
> >I've been in the situation where I've gone through the procedure you
> >followed on said web page, only to run into a quirk with the ATA/IDE
> >subsystem on Windows XP, requiring a power-cycle of the system.  The
> >secure erase finished, but I was panicking when I saw the drive
> >spitting out I/O errors on every LBA.  I realised that I needed to
> >unlock the drive using --security-unlock, then disable security using
> >--security-disable.  Once I did that it was fine.  The web page omits
> >that part, for the case where an emergency arises or anomalies are
> >witnessed.  This ordeal happened to me today, no joke, while tinkering
> >with my new Intel 510 SSD.  So here's a better page:
> >
> >http://tinyapps.org/docs/wipe_drives_hdparm.html
> >
> >Why am I pointing this out?  Because, in effect, an entire "HOW TO DO
> >THIS AND WHAT TO DO IF IT GOES HORRIBLY WRONG" section would need to be
> >added to camcontrol/atacontrol to ensure people don't end up with
> >"bricked" drives and blame FreeBSD.  Trust me, it will happen.  Give
> >users tools to shoot themselves in the foot and they will do so.
> >
> >Furthermore, SCSI drives (which is what camcontrol has historically
> >been for, up until recently) have a completely different secure erase
> >CDB command.  ATA has SECURITY ERASE UNIT, SCSI has SECURITY
> >INITIALIZE -- and in the SCSI realm, this feature is optional!  So
> >there's that error-prone issue as well.  Do you know how many times
> >I've issued "camcontrol inquiry" instead of "camcontrol identify" on my
> >ada(4)-based systems?  Too many.  Food for thought. :-)
> >
> >Anyway, this is probably the only time you will ever find me saying
> >this, but: if improving camcontrol/atacontrol to accomplish the above
> >is what you want, patches are welcome.  I could try to spend some time
> >on this if there is great interest in the community for such (I'm more
> >familiar with atacontrol's code given my SMART work in the past), and I
> >do have an unused Intel 320-series SSD which I can test with.
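(For anyone following along at home, the recovery sequence I'm
describing above, done from a Linux boot with hdparm, goes roughly like
this -- a sketch only; the device name and password are examples, and
the tinyapps page above has the authoritative walkthrough:

    # Confirm the drive's security state first ("not locked", "not frozen"):
    hdparm -I /dev/sdX

    # If the drive is left locked after an interrupted secure erase,
    # unlock it with the same password, then turn security back off:
    hdparm --user-master u --security-unlock SOMEPASS /dev/sdX
    hdparm --user-master u --security-disable SOMEPASS /dev/sdX

Those last two steps are what saved my Intel 510 today.)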
> This is of definite interest here, and I suspect to the rest of the
> community as well.  I'm not at all familiar with ATA codes etc., so I
> expect it would take me ages to come up with this.
>
> In our case SSDs are a must, as HDs don't have the IOPS to deal with
> our application; we'll just need to manage the write speed drop-offs.
>
> Performing offline maintenance to have them run at good speed is not
> ideal, but it's much easier and more acceptable than booting another
> OS, which would be a total PITA as some machines don't have IPMI with
> virtual media, so it means remote hands etc.
>
> Using a Backup -> Erase -> Restore cycle directly from BSD would hence
> be my preferred workaround until TRIM support is added, but I guess
> that could well be some time away for ZFS.

Understood.  I'm off work this week, so I'll see if I can dedicate some
time to it.  Too many non-work projects I'm juggling right now, argh.

I'll have to start with camcontrol, since the test system I have uses
ada(4) and not classic ata(4).  I'm not even sure what I'm really in
for, given that I've never looked at camcontrol's code before.  If I
"brick" my SSD I'll send you a bill, Steven.  Kidding.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |