Date: Thu, 28 Jul 2011 07:59:17 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Questions about erasing an ssd to restore performance under FreeBSD
Message-ID: <20110728145917.GA37805@icarus.home.lan>
In-Reply-To: <A6828B6CE6764E13A44B1ABF61CF3FED@multiplay.co.uk>
References: <13BEC27B17D24D0CBF2E6A98FD3227F3@multiplay.co.uk> <20110728012437.GA23430@icarus.home.lan> <FD3A11BEFD064193AA24C1DF09EDD719@multiplay.co.uk> <20110728103234.GA33275@icarus.home.lan> <A6828B6CE6764E13A44B1ABF61CF3FED@multiplay.co.uk>
On Thu, Jul 28, 2011 at 02:22:21PM +0100, Steven Hartland wrote:
> ----- Original Message ----- From: "Jeremy Chadwick" <freebsd@jdc.parodius.com>
> >Well, on FreeBSD /dev/urandom is a symlink to /dev/random.  I've
> >discussed in the past why I use /dev/urandom instead of /dev/random (I
> >happen to work in a heterogeneous OS environment at work, where urandom
> >and random are different things).
> >
> >I was mainly curious why you were using if=/some/actual/file rather than
> >if=/dev/urandom directly.  'tis okay, not of much importance.
>
> /dev/urandom seems to bottleneck at ~60MB/s; a cached file generated
> from it doesn't, e.g.:
>
> dd if=/dev/random of=/dev/null bs=1m count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 16.152686 secs (64916509 bytes/sec)
>
> dd if=/dev/random of=/data/test bs=1m count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 16.178811 secs (64811685 bytes/sec)
>
> dd if=/data/test of=/dev/null bs=1m
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 0.240348 secs (4362738865 bytes/sec)

/dev/urandom is highly CPU-bound.  For example, on my home box it tops
out at about 79MBytes/sec.  I tend to use /dev/zero for I/O testing,
since I really don't need the CPU tied up generating random data from
entropy sources.  The difference in speed is dramatic.

So yes, if you want to test high write speeds with purely randomised
data as your source, creating a temporary file with content from
/dev/urandom first is your best bet.  (Assuming, of course, that the
source you plan to read from can transfer as fast as the writes to the
destination, but that goes without saying.)
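To make that concrete, the kind of thing I have in mind looks roughly
like this (a sketch only -- the device name, file path, and sizes are
examples, and the write test will happily destroy whatever is on the
target device):

    # Pure write-throughput test; no CPU spent generating entropy:
    dd if=/dev/zero of=/dev/adaX bs=1m count=1000

    # If you specifically need random data as the source, generate it
    # once up front, then reuse the (now cached) file for the write test:
    dd if=/dev/urandom of=/var/tmp/random.dat bs=1m count=1000
    dd if=/var/tmp/random.dat of=/dev/adaX bs=1m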
> >Okay, so it sounds like what happened -- if I understand correctly --
> >is that your ZFS-based Corsair SSD volume (/ssd) recently had a bunch
> >of data copied to it.  It still had 60% free space available.  After,
> >the SSD performance for writes really plummeted (~20MByte/sec), but
> >reads were still decent.  Performing an actual ATA-level secure erase
> >brought the drive back to normal write performance (~190MByte/sec).
>
> Yes this is correct.
>
> >If all of that is correct, then I would say the issue is that the
> >internal GC on the Corsair SSD in question sucks.  With 60% of the
> >drive still available, performance should not have dropped to such an
> >abysmal rate; the FTL and wear levelling should have, ideally, dealt
> >with this just fine.  But it didn't.
>
> Agreed
>
> >Why I'm focusing on the GC aspect: because ZFS (or GEOM; whatever,
> >that's an engineering discussion for elsewhere) lacks TRIM.  The
> >underlying filesystem is therefore unable to tell the drive "hey, these
> >LBAs aren't used any more, you can consider them free and perform a
> >NAND page erase when an entire NAND page is unused".  The FTL has to
> >track all LBAs you've written to, because erasing a NAND page which
> >still had used data in it (from the filesystem's point of view) would
> >result in loss of data.
> >
> >So in summary I'm not too surprised by this situation happening, but I
> >*AM* surprised at just how horrible writes became for you.  The white
> >paper I linked you goes over this to some degree -- it talks about how
> >everyone thinks SSDs are "so amazingly fast" yet nobody does benchmarks
> >or talks about how horribly they perform when very little free space is
> >available, or if the GC is badly implemented.  Maybe Corsair's GC is
> >badly implemented -- I don't know.
>
> Agreed again, we've seen a few disks now drop to this level of
> performance.  At first we thought the disk was failing, as the newfs -E
> didn't fix it when the man page indicates it should.  But that seems to
> be explained now: it only works if it's ada not da, and it also isn't
> quite as good as a secure erase.

I guess the newfs(8) man page should be rephrased then.  When I read the
description for the -E option, I see this paragraph:

     Erasing may take a long time as it writes to every sector on the
     disk.

And immediately think "Oh, all it does is write zeros to every LBA,
probably in blocks of some size that's unknown to me (vs. 512 bytes)".

I can submit a PR + patch for this, but I'd propose the man page
description for -E be changed to this:

     -E      Erase the content of the disk before making the filesystem.
             The reserved area in front of the superblock (for bootcode)
             will not be erased.  This option writes zeros to every
             sector (LBA) on the disk, in transfer sizes of, at most,
             65536 * sectorsize bytes.

Basically remove the mention of wear-levelling and "intended for use
with flash devices".  Any device can use this option; it's a UFS-esque
equivalent of dd if=/dev/zero of=/dev/device bs=..., sans the exclusions
mentioned.
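To spell that out as a rough sketch (the device name is just an example;
this destroys everything on the device, and unlike newfs -E it does not
skip the bootcode area in front of the superblock):

    # For a disk advertising 512-byte sectors, the -E behaviour is
    # roughly equivalent to zeroing the device in 32MByte writes:
    dd if=/dev/zero of=/dev/da0 bs=32m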
The tricky part is the "transfer sizes of, at most..." line.  I'm
certain someone will ask me where I got that from, so I'll explain it.
Sorry for the long-winded stuff, but this is more or less how I learn,
and I hope it benefits someone in the process.  And man, I sure hope I'm
reading this code right...

<ignore-if-you-dont-care>

Down the rabbit hole we go: newfs(8) calls berase(3), which is part of
libufs:

501     if (Eflag && !Nflag) {
...
505             berase(&disk, sblock.fs_sblockloc / disk.d_bsize,
506                 sblock.fs_size * sblock.fs_fsize - sblock.fs_sblockloc);

The man page for berase(3) doesn't tell you the size of the I/O transfer
(the "block size") when it asks the kernel to effectively write zeros to
the device.  Looking at src/lib/libufs/block.c, we find this:

143 berase(struct uufsd *disk, ufs2_daddr_t blockno, ufs2_daddr_t size)
...
154         ioarg[0] = blockno * disk->d_bsize;
155         ioarg[1] = size;
156         rv = ioctl(disk->d_fd, DIOCGDELETE, ioarg);

This ioctl(2) (DIOCGDELETE) is not documented anywhere in the entire
source code tree (grep -r DIOCGDELETE /usr/src returns absolutely no
documentation references).  Furthermore, at this point we still have no
idea how the arguments being passed to ioctl are used; is "size" the
total size, or is it the transfer size of the write we're going to
issue?

DIOCGDELETE is handled in src/sys/geom/geom_dev.c, where we finally get
some answers:

293     case DIOCGDELETE:
294             offset = ((off_t *)data)[0];
295             length = ((off_t *)data)[1];
...
303             while (length > 0) {
304                     chunk = length;
305                     if (chunk > 65536 * cp->provider->sectorsize)
306                             chunk = 65536 * cp->provider->sectorsize;
307                     error = g_delete_data(cp, offset, chunk);
308                     length -= chunk;
309                     offset += chunk;

So ioarg[0] is the offset, and ioarg[1] represents the actual TOTAL SIZE
of what we want erased, NOT the transfer block size itself.  The
transfer block size is calculated on lines 305-306: 65536 * the GEOM
provider's "advertised sector size".  On SSDs this would be 512 bytes
(no, I am not kidding).

But we're still not finished.  What is g_delete_data?  It's an internal
GEOM function which does what it's told (heh :-) ).
src/sys/geom/geom_io.c sheds light on that:

739 g_delete_data(struct g_consumer *cp, off_t offset, off_t length)
740 {
741         struct bio *bp;
742         int error;
743
744         KASSERT(length > 0 && length >= cp->provider->sectorsize,
745             ("g_delete_data(): invalid length %jd", (intmax_t)length));
746
747         bp = g_alloc_bio();
748         bp->bio_cmd = BIO_DELETE;
749         bp->bio_done = NULL;
750         bp->bio_offset = offset;
751         bp->bio_length = length;
752         bp->bio_data = NULL;
753         g_io_request(bp, cp);
754         error = biowait(bp, "gdelete");
...

Okay, so without going into g_io_request() (did I not say something
about rabbit holes earlier?), we can safely assume that's even more
abstraction around a BIO_DELETE call.  bp->bio_length is the size of the
data to tinker with, in bytes.

So in summary, with a 512-byte "advertised sector" disk, the erase would
happen in 32MByte "transfer size" blocks.

Let's test that theory with an mdconfig(8) "disk" and a slightly
modified version of newfs(8) that tells us the value of the 3rd argument
it passes to berase(3):

# mdconfig -a -t malloc -s 256m -o reserve -u 0
md0
# sysctl -b kern.geom.conftxt | strings | grep md0
0 MD md0 268435456 512 u 0 s 512 f 0 fs 0 l 268435456 t malloc

Sector size of the md0 pseudo-disk is 512 bytes (5th parameter).  Now
the modified newfs:

# ~jdc/tmp/newfs/newfs -E /dev/md0
/dev/md0: 256.0MB (524288 sectors) block size 16384, fragment size 2048
        using 4 cylinder groups of 64.02MB, 4097 blks, 8256 inodes.
Erasing sectors [128...524287]
berase() 3rd arg: 268369920
super-block backups (for fsck -b #) at:
 160, 131264, 262368, 393472

There's the printf() I added ("berase()...").  So the argument passed to
berase() is 268369920: the size of the pseudo-disk, sans the area in
front of the superblock (65536 bytes; 268435456 - 268369920 == 65536).

Now back to the geom_dev.c code with the data we know:

- Line 295 would assign length to 268369920
- Line 304 would assign chunk to 268369920
- Line 305's conditional would prove true; 268369920 > 33554432
  (65536*512), so chunk becomes 33554432
- Line 307 (and the code it calls) does the actual zeroing

</ignore-if-you-dont-care>

The reason the man page can't say 32MBytes explicitly is because it's
dynamic (based on sector size).  I imagine, somewhere down the road, we
WILL have disks that start advertising non-512-byte sector sizes.  As of
this writing, none I have seen do (neither SSDs nor WD -EARS drives).
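If you want to see what that works out to for a given disk, something
like this would do it (a quick sketch; I'm assuming diskinfo(8)'s
default output format, where the second field is the sector size, and
/dev/md0 is just the example device from above):

    #!/bin/sh
    # Print the erase "transfer size" the DIOCGDELETE path would use
    # for a device: 65536 * the provider's advertised sector size.
    dev=${1:-/dev/md0}
    sectorsize=$(diskinfo "$dev" | awk '{print $2}')
    echo "$dev: sectorsize=$sectorsize, erase chunk=$((sectorsize * 65536)) bytes"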
> >I would see if there are any F/W updates for that model of drive.  The
> >firmware controls the GC model/method.  Otherwise, if this issue is
> >reproducible, I'll add this model of Corsair SSD to my list of drives
> >to avoid.
>
> It's the latest firmware version, already checked that.  The
> performance has been good till now, and I suspect it could be a generic
> SandForce thing if it's a firmware issue.

SandForce-based SSDs have a history of being extremely good with their
GC, but I've never used one.  However, if I remember right (something I
read not more than a week ago, I just can't remember where!), it's very
rare that any SF-based SSD vendor uses the stock SF firmware.  They
modify the hell out of it.  Meaning: two SSDs using the exact same model
of SF controller won't necessarily behave the exact same.  Hmm, I
probably read this on some SSD review site, maybe Anandtech.  I imagine
the same applies to Marvell-based SSD controllers too.

> >Is it possible to accomplish Secure Erase via "camcontrol cmd" with
> >ada(4)?  Yes, but the procedure will be extremely painful, drawn out,
> >and very error-prone.
> >
> >Given that you've followed the procedure on the Linux hdparm/ATA Secure
> >Erase web page, you're aware of the security and "locked" status one
> >has to deal with when using password-protection to accomplish the
> >erase.  hdparm makes this easy because it's just a bunch of
> >command-line flags; the "heavy lifting" on the ATA layer is done
> >elsewhere.  With "camcontrol cmd", you get to submit the raw ATA CDB
> >yourself, multiple times, at different phases.  Just how familiar with
> >the ATA protocol are you? :-)
> >
> >Why I sound paranoid: a typo could potentially "brick" your drive.  If
> >you issue a set-password on the drive, ***ALL*** LBA accesses (read and
> >write) return I/O errors from that point forward.  Make a typo in the
> >password, formulate the CDB wrong, whatever -- suddenly you have a
> >drive that you can't access or use any more because the password was
> >wrong, etc.  If the user doesn't truly understand what they're doing
> >(including the formulation of the CDB), then they're going to panic.
> >
> >camcontrol and atacontrol could both be modified to do the heavy
> >lifting, adding similar options/arguments that would mimic hdparm in
> >operation.  This would greatly diminish the risks, but the *EXACT
> >PROCEDURE* would need to be explained in the man page.  But keep
> >reading for why that may not be enough.
> >
> >I've been in the situation where I've gone through the procedure you
> >followed on said web page, only to run into a quirk with the ATA/IDE
> >subsystem on Windows XP, requiring a power-cycle of the system.  The
> >secure erase finished, but I was panicking when I saw the drive
> >spitting out I/O errors on every LBA.  I realised that I needed to
> >unlock the drive using --security-unlock, then disable security using
> >--security-disable.  Once I did that it was fine.  The web page omits
> >that part, for the case where an emergency arises or anomalies are
> >witnessed.  This ordeal happened to me today, no joke, while tinkering
> >with my new Intel 510 SSD.  So here's a better page:
> >
> >http://tinyapps.org/docs/wipe_drives_hdparm.html
> >
> >Why am I pointing this out?  Because, in effect, an entire "HOW TO DO
> >THIS AND WHAT TO DO IF IT GOES HORRIBLY WRONG" section would need to be
> >added to camcontrol/atacontrol to ensure people don't end up with
> >"bricked" drives and blame FreeBSD.  Trust me, it will happen.  Give
> >users tools to shoot themselves in the foot and they will do so.
> >
> >Furthermore, SCSI drives (which is what camcontrol has historically
> >been for, up until recently) have a completely different secure erase
> >CDB command.  ATA has SECURITY ERASE UNIT, SCSI has SECURITY
> >INITIALIZE -- and in the SCSI realm, this feature is optional!  So
> >there's that error-prone issue as well.  Do you know how many times
> >I've issued "camcontrol inquiry" instead of "camcontrol identify" on my
> >ada(4)-based systems?  Too many.  Food for thought. :-)
> >
> >Anyway, this is probably the only time you will ever find me saying
> >this, but: if improving camcontrol/atacontrol to accomplish the above
> >is what you want, patches are welcome.  I could try to spend some time
> >on this if there is great interest in the community for such (I'm more
> >familiar with atacontrol's code given my SMART work in the past), and I
> >do have an unused Intel 320-series SSD which I can test with.
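(For anyone following along at home, the recovery sequence I'm
describing above, done from a Linux boot with hdparm, goes roughly like
this -- a sketch only; the device name and password are examples, and
the tinyapps page above has the authoritative walkthrough:

    # Confirm the drive's security state first ("not locked", "not frozen"):
    hdparm -I /dev/sdX

    # If the drive is left locked after an interrupted secure erase,
    # unlock it with the same password, then turn security back off:
    hdparm --user-master u --security-unlock SOMEPASS /dev/sdX
    hdparm --user-master u --security-disable SOMEPASS /dev/sdX

Those last two steps are what saved my Intel 510 today.)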
> This is of definite interest here, and I suspect to the rest of the
> community as well.  I'm not at all familiar with ATA codes etc., so I
> expect it would take me ages to come up with this.
>
> In our case SSDs are a must, as HDs don't have the IOPS to deal with
> our application; we'll just need to manage the write speed drop-offs.
>
> Performing offline maintenance to have them run at good speed is not
> ideal, but it's much easier and more acceptable than booting another
> OS, which would be a total PITA as some machines don't have IPMI with
> virtual media, so it means remote hands etc.
>
> Using a Backup -> Erase -> Restore cycle directly from BSD would hence
> be my preferred workaround until TRIM support is added, but I guess
> that could well be some time away for ZFS.

Understood.  I'm off work this week, so I'll see if I can dedicate some
time to it.  Too many non-work projects I'm juggling right now, argh.

I'll have to start with camcontrol, since the test system I have uses
ada(4) and not classic ata(4).  I'm not even sure what I'm really in
for, given that I've never looked at camcontrol's code before.  If I
"brick" my SSD I'll send you a bill, Steven.  Kidding.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |