Date: Mon, 29 Oct 2012 03:59:05 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: freebsd-questions@freebsd.org
Subject: Re: 9.1 and gmirror with GPT?
Message-ID: <20121029105905.GA358@icarus.home.lan>

(I won't be responding to any public or private mails relating to this topic after this point, just as an FYI.)

Just a reminder for readers: if you're truly using disks with 4096-byte sectors -- specifically MECHANICAL hard disks (MHDDs) -- 4KByte alignment is fine. But if you ever plan on using an SSD in the future, you need to align things to 1MByte or 2MBytes.

I have read on the mailing lists that some users "don't know why / what the justification is" behind this, so I'll explain it:

The reason is that the flash translation layers (FTLs) within SSDs do not issue erases (resetting the cells to their erased state) on a per-flash-page basis (a flash page is commonly 4KBytes), but on a "block" basis (a group of pages). This is usually referred to as the "NAND erase block size".

Let me make this clear: this is not the same thing as filesystem block size or any similar "block size" you might see mentioned throughout the zillions of layers of I/O abstraction in a *IX system and its kernel. Do not mix up the terms (yes, I know it's confusing). Anyway...

Most SSD vendors do not disclose the NAND erase block size of their products, and that's disappointing. However, poking and prodding (usually performance testing) has shown that most vendors use either 1MByte or 2MByte NAND erase block sizes (as of this writing). I haven't seen larger in the field yet, for consumer products anyway (i.e. don't ask me about FusionIO).

This is why Windows Vista and Windows 7 align their partitions to 1MByte boundaries.

...and quite honestly FreeBSD should too. I am aware 9.1-RELEASE supposedly addresses this -- however, I have not determined whether the alignment size chosen by the committer was 4096 bytes or 1MB/2MB. I have a gut feeling it's the former, and that's bad.

With 1MByte or 2MByte alignment, performance on 512-byte MHDDs would be fine, performance on 4096-byte MHDDs would be fine, and performance on SSDs would be fine. If folks want to be on the "extra super duper safe side", align to 2MB. Otherwise align to 1MB and don't worry about it.
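
To make the "fine on all three" claim concrete, here is a minimal Python sketch (not anything FreeBSD ships) that checks a partition start offset against each boundary. The function names are made up for illustration, and the 1MByte/2MByte erase block sizes are the commonly observed values mentioned above, not vendor-published figures.

```python
# Sizes in bytes. The erase block sizes are assumptions based on the
# commonly observed values discussed in the text, not vendor specs.
BOUNDARIES = {
    "512-byte sector": 512,
    "4096-byte sector": 4096,
    "1 MiB erase block": 1024 * 1024,
    "2 MiB erase block": 2 * 1024 * 1024,
}


def is_aligned(offset_bytes: int, boundary_bytes: int) -> bool:
    """Return True if offset_bytes falls exactly on a boundary_bytes boundary."""
    return offset_bytes % boundary_bytes == 0


def satisfied_boundaries(offset_bytes: int) -> dict:
    """Map each boundary of interest to whether offset_bytes is aligned to it."""
    return {
        name: is_aligned(offset_bytes, size)
        for name, size in BOUNDARIES.items()
    }


if __name__ == "__main__":
    # Compare a 4 KiB-aligned start with 1 MiB- and 2 MiB-aligned starts.
    for start in (4096, 1024 * 1024, 2 * 1024 * 1024):
        print(start, satisfied_boundaries(start))
```

Because 512 and 4096 both divide 1MByte evenly, and 1MByte divides 2MBytes, a 2MByte-aligned start satisfies every boundary in the table, while a 4096-byte-aligned start only satisfies the two sector sizes.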

Lack of proper alignment to the NAND erase block size can result in excess wear/tear on the NAND flash, which reduces both the effectiveness of wear levelling and the performance of your drive. Do not ask me for numbers; I do not have them. Read Wikipedia's article on wear levelling for details.

Next: in case it's not clear to readers from Warren's statements, the magical "8" divisor he's using comes from 4096/512 ("how many 512-byte sectors are there in a 4096-byte sector"). Thus, for 1MByte alignment the value would be 1048576/512, or 2048. For 2MByte alignment the value would be 2097152/512, or 4096.

My general rule of thumb is to use GPT, start my FreeBSD partitions at LBA 4096, and make sure all the partition sizes are divisible by 2MBytes.

If there is a GPT+GEOM conflict, I tend to recommend that people, now that graid(8) has been introduced, make use of BIOS-level RAID and then use GPT. There is one known caveat to this (as of this writing): a ZFS root filesystem used on top of graid(8) results in a problem, but mav@ is looking into that. And don't ask me why you'd want to do that anyway -- some people apparently like complicating their lives and shunning the KISS principle entirely.

P.S. -- Linux md solved their equivalent of the "GEOM vs. GPT" issue with the introduction of md superblock version 1.2 (superblock=metadata in this context). They stuck the superblock 4096 bytes after the start of the device. This does limit the number of GPT partitions supported (from 128 down to 8), but I question the reasoning/sanity of anyone who's got more than 8 GPT partitions on a single disk anyway (use a volume manager already).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
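
For the divisor arithmetic described in the message (4096/512 = 8, 1048576/512 = 2048, 2097152/512 = 4096), a minimal Python sketch follows. It assumes LBAs are counted in 512-byte units, as in the message; the function names and the example LBA are hypothetical and only illustrate the calculation.

```python
LBA_SIZE = 512  # bytes per LBA, as assumed in the message


def alignment_in_sectors(alignment_bytes: int, sector_bytes: int = LBA_SIZE) -> int:
    """Convert a byte alignment into a count of sectors (the "divisor")."""
    if alignment_bytes % sector_bytes != 0:
        raise ValueError("alignment must be a whole number of sectors")
    return alignment_bytes // sector_bytes


def round_up_lba(lba: int, alignment_sectors: int) -> int:
    """Round an LBA up to the next multiple of alignment_sectors."""
    # Ceiling division using integer arithmetic only.
    return -(-lba // alignment_sectors) * alignment_sectors


if __name__ == "__main__":
    for size in (4096, 1048576, 2097152):
        print(size, "bytes =", alignment_in_sectors(size), "sectors")
    # An arbitrary example start LBA, rounded up to a 2MByte boundary.
    print(round_up_lba(34, alignment_in_sectors(2097152)))
```

Rounding a candidate start LBA up to a multiple of 4096 sectors is the same thing as starting the partition on a 2MByte boundary, which is where the "start at LBA 4096" rule of thumb comes from.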
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121029105905.GA358>