Date: Mon, 18 Jul 2011 21:05:44 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Glen Barber <glen.j.barber@gmail.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: Status of support for 4KB disk sectors
Message-ID: <20110719040544.GA9607@icarus.home.lan>
In-Reply-To: <4E24FC18.3010605@gmail.com>
References: <CAN6yY1uaUqk2ifiNViJyMFJWf60a4DmCiVs3Z=--_TjtzseABQ@mail.gmail.com> <20110718234124.GA5626@icarus.home.lan> <4E24FC18.3010605@gmail.com>

On Mon, Jul 18, 2011 at 11:38:00PM -0400, Glen Barber wrote:
> On 7/18/11 7:41 PM, Jeremy Chadwick wrote:
> > On Mon, Jul 18, 2011 at 03:50:15PM -0700, Kevin Oberman wrote:
> >> I just want to check on the status of 4K sector support in FreeBSD.
> >> I read a long thread on the topic from a while back and it looks
> >> like I might hit some issues if I'm not REALLY careful. Since I will
> >> be keeping the existing Windows installation, I need to be sure that
> >> I can set up the disk correctly without screwing up Windows 7.
> >>
> >> I was planning on just DDing the W7 slice over, but I am not sure
> >> how well this would play with GPT. Or should I not try to use GPT at
> >> all? I'd like to, as this laptop spreads Windows 7 over two slices
> >> and adds a third for the recovery system, leaving only one for
> >> FreeBSD, and I'd like to put my files in a separate slice. GPT would
> >> offer that fifth slice.
> >>
> >> I have read the handbook and don't see any reference to 4K sectors
> >> and only a one-liner about gpart(8) and GPT. Once I get this all
> >> figured out, I'll see about writing an update about this, as GPT
> >> looks like the way to go in the future.
> >
> > When you say "4KB sector support", what do you mean by this? All
> > drives on the market as of this writing, that I've seen, all claim a
> > physical/logical sector size of 512 bytes -- yes, even SSDs, and EARS
> > drives which we know use 4KB sectors. They do this to guarantee full
> > compatibility with existing software.
> >
> > Since you're talking about gpart and "4KB sector support", did you mean
> > to ask "what's the state of FreeBSD and aligned partition support to
> > ensure decent performance with 4KB-sector drives?"
> >
> > If so: there have been some commits in recent days to RELENG_8 to help
> > try to address the shortcomings of the existing utilities and GEOM
> > infrastructure.
> > Read the most recent commit text carefully:
> >
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sbin/geom/class/part/geom_part.c
> >
> > But the currently "known method" is to use gnop(8). Here's an example:
> >
> > http://www.leidinger.net/blog/2011/05/03/another-root-on-zfs-howto-optimized-for-4k-sector-drives/
> >
>
> Notice: I'm reading this as "how badly do 'green drives' suck?"

It's important to note that not all WD Caviar Green drives use 4KB
sectors. WD, as of this writing, uses the 4-letter "EARS" string in the
drive model that denotes use of 4KB sectors.

The Green series do have other problems that people have experienced,
such as bugs/quirks in the firmware causing the drive to repetitively
park its heads in the landing zone (witnessed as either really bad drive
performance, or the drive falling off the bus + reattaching). You can
detect this situation by looking at SMART attribute 193
(Load_Cycle_Count). A very high number (in the tens or hundreds of
thousands for a drive that has only been in use for a week or so) is an
indicator of the problem.

WD apparently has given people firmware updates to fix the issue.
However the drive firmware version number does not change after updating
the microcode, but it does fix the problem. (For what it's worth,
Samsung pulled this same manoeuvre when it came to firmware updates for
a catastrophic bug on their SpinPoint F4 drives.) What I'm saying is
there's no way to detect whether or not your drive is running the fixed
firmware, other than looking at said SMART attribute.

I do have references for this issue, but it will take me some time to
dig up the URLs and so on.

> FWIW, I've recently done the gnop(8) trick to two "green" drives in one
> of my machines because I was seeing horrifying performance problems with
> what I consider to be basic stuff, like 'portsnap extract', but more
> severely with copying large data (file-backed bacula files to be exact)
> into said datasets.
> I have yet to retry my read/write tests with drives I have not
> converted with gnop(8).

I imagine this would have a tremendous effect on performance.

With SSDs, the estimated performance impact is between 30% and 50%,
depending on the workload; that is, SSDs with aligned partitions perform
30-50% better. When you read about how NAND cells and NAND flash pages
work (look it up on Wikipedia, look for FTL (flash translation layer))
it makes sense. With mechanical HDDs, I'm not sure what the performance
hit is, but I imagine it's large.

Furthermore, talking about SSDs again: I want to make folks aware of the
fact that Intel SSDs use an 8KB NAND flash page (not 4KB!). NAND pages
are erased 256 pages at a time (8KByte*256 = 2MByte). When it comes to
alignment, flash page size is what's of concern. So for Intel SSDs (X25
series, 320 series, and 510 series), 8KByte-aligned is the way to go.

> I have not conclusively tested all possible combinations of
> configurations, nor reverted the changes to the drives to retest, but if
> it is of any interest, here's what I'm seeing.
>
> I have comparisons between WD "green" and "black" drives.
> Unfortunately, the machines are not completely similar - one is a
> Core2Quad, the other Core2Duo; one has 6GB RAM, the other 8GB RAM; also,
> 'orion' is running a month-old 8-STABLE; 'kaos' is running a 2-week-old
> -CURRENT. Both machines are using ZFSv28:
>
> orion % sysctl -n hw.ncpu; sysctl -n hw.physmem
> 4
> 6353416192
>
> kaos % sysctl -n hw.ncpu; sysctl -n hw.physmem
> 2
> 8534401024
>
> The drives in 'orion' are 1TB WD green drives in a ZFS mirror; the
> drives in 'kaos' are 1TB WD black drives in a raidz1 (3 drives).
>
> First the read test:
>
> kaos % sh -c 'time find /usr/src -type f -name \*.\[1-9\] >/dev/null'
>        12.94 real         0.60 user        11.95 sys
>
> orion % sh -c 'time find /usr/src -type f -name \*.\[1-9\] >/dev/null'
>       118.02 real         0.46 user         8.74 sys
>
> I guess no real surprise here.
> 'kaos' has more spindles to read from, on top of faster seek times.
>
> Next the write test:
>
> The 'compressed' and 'dedup' datasets referenced below are 'lzjb' and
> 'sha256,verify', respectively. I'd wait for the 'compressed+dedup'
> tests to finish, but I have to wake up tomorrow morning.
>
> orion# sh -c 'time portsnap extract -p /zstore/perftest >/dev/null'
>       306.71 real        44.37 user       110.28 sys
>
> orion# sh -c 'time portsnap extract -p /zstore/perftest_compress >/dev/null'
>       166.62 real        43.87 user       109.49 sys
>
> orion# sh -c 'time portsnap extract -p /zstore/perftest_dedup >/dev/null'
>      3576.43 real        44.98 user       109.12 sys
>
> kaos# sh -c 'time portsnap extract -p /perftest >/dev/null'
>       311.31 real        51.23 user       193.37 sys
>
> kaos# sh -c 'time portsnap extract -p /perftest_compress >/dev/null'
>       269.85 real        49.55 user       191.56 sys
>
> kaos# sh -c 'time portsnap extract -p /perftest_dedup >/dev/null'
>      4655.73 real        51.86 user       196.22 sys
>
> Like I said, I have not yet had the time to retest this on drives
> without the gnop(8) fix (another similar zpool with 2 drives), so maybe
> the data I'm providing isn't relevant, but since the gnop(8) fix for 4K
> sector drives was mentioned, I thought it might be relevant to a point.

The problem with what you're testing here is that it's not really
"testing the drive" -- it's testing multiple drives with ZFS in the
middle. Using dd would address that. For testing "non-aligned" offsets
(for the EARS drive), use the seek= parameter. I would also recommend
picking an awkwardly-sized bs= value, such as 61340.

> > Now, that's for ZFS, but I'm under the impression the exact same is
> > needed for FFS/UFS.
> >
> > <rant> Do I bother doing this with my SSDs? No. Am I suffering in
> > performance? Probably. Why do I not care?
> > Because the level of annoyance is extremely high -- remember, all of
> > this has to be done from within the installer environment (referring
> > to "Emergency Shell"), which on FreeBSD lacks an incredible amount of
> > usability, and is even worse to deal with when doing a remote install
> > via PXE/serial. Fixit is the only decent environment. Given that
> > floppies are more or less gone, I don't understand why the Fixit
> > environment doesn't replace the "Emergency Shell". </rant>
> >
>
> Not that it necessarily helps in a PXE environment, but a memstick of
> 9-CURRENT has helped me recover minor "oops" situations a few times over
> the past few months. What are these "floppies" you speak of, again? :)

Sure, USB flash drives work great. But it's a little hard to install a
USB flash drive when you're 3000 miles away. :-)

mm's mfsBSD is also useful for recovery situations:

http://mfsbsd.vx.sk/

My point, though, was this: Fixit was separate from Emergency Shell
because of space concerns on floppy disks (Fixit wouldn't fit). Since
floppies really aren't used much any more, this concern should be
revised. IMHO Fixit should be removed and Emergency Shell should provide
the same environment/utilities/etc. as Fixit.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
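
The alignment arithmetic discussed in this message can be sketched in
sh. This is only an illustration and not something posted in the thread:
is_aligned is a hypothetical helper, the start sectors are example
values, and it assumes the drive reports 512-byte logical sectors as
described above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): report whether a partition
# starting at a given 512-byte logical sector sits on a given boundary.
# Use 4096 for a 4KB-sector drive, 8192 for an 8KB NAND flash page.
is_aligned() {
    start_sector=$1
    boundary_bytes=$2
    # Assumption: logical sectors are 512 bytes.
    offset_bytes=$((start_sector * 512))
    if [ $((offset_bytes % boundary_bytes)) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

is_aligned 63 4096      # 63 * 512 = 32256, not a multiple of 4096
is_aligned 40 4096      # 40 * 512 = 20480 = 5 * 4096
is_aligned 40 8192      # 20480 is not a multiple of 8192
is_aligned 2048 8192    # 2048 * 512 = 1MByte, a multiple of 8192
```

The sector 40 case shows why the boundary matters: an offset that is
fine for a 4KB-sector drive can still be misaligned for an 8KB flash
page.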