Date: Wed, 10 Sep 2014 09:48:40 +0200
From: Stefan Esser <se@freebsd.org>
To: Aristedes Maniatis <ari@ish.com.au>, freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: getting to 4K disk blocks in ZFS
Message-ID: <54100258.2000505@freebsd.org>
In-Reply-To: <540FF3C4.6010305@ish.com.au>
References: <540FF3C4.6010305@ish.com.au>
Am 10.09.2014 um 08:46 schrieb Aristedes Maniatis:
> As we all know, it is important to ensure that modern disks are set
> up properly with the correct block size. Everything is good if all
> the disks and the pool are "ashift=9" (512 byte blocks). But as soon
> as one new drive requires 4K blocks, performance drops through the
> floor of the entire pool.
>
> In order to upgrade there appear to be two separate things that must
> be done for a ZFS pool.
>
> 1. Create partitions on 4K boundaries. This is simple with the
> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
> time from a pool, reformat them on the right boundaries and put them
> back. Hopefully you've left a few spare bytes on the disk to ensure
> that your partition doesn't get smaller when you reinsert it into the
> pool.
>
> 2. Create a brand new pool which has ashift=12 and zfs send|receive
> all the data over.
>
> I guess I don't understand enough about zpool to know why the pool
> itself has a block size, since I understood ZFS to have variable
> stripe widths.

I'm not a ZFS internals expert, just a long-time user, but I'll try to
answer your questions.

ZFS is based on a copy-on-write paradigm, which ensures that no data is
ever overwritten in place. All writes go to new blank blocks, and only
after the last reference to an "old" block is lost (when no TXG or
snapshot refers to it any longer) is the old block freed and returned
to the free block map.

ZFS uses variable block sizes by breaking large blocks down into
smaller fragments as suitable for the data to be stored. The largest
block size is configurable (128 KByte by default) and the smallest
fragment is the sector size (i.e. 512 or 4096 bytes), as configured by
"ashift".

The problem with 4K sector disks that report 512 byte sectors is that
ZFS still assumes no data is overwritten in place, while the disk drive
does exactly that behind the curtains. ZFS thinks it can atomically
write 512 bytes, but the drive reads 4K, places the 512 bytes of data
within that 4K physical sector in the drive's cache, and then writes
the 4K of data back in one go. The cost is not only the latency of this
read-modify-write sequence, but also that an elementary ZFS assumption
is violated: data in the other (logical) 512 byte sectors of the
physical 4 KByte sector can be lost if that write operation fails,
resulting in loss of data in files that happen to share the physical
sector with the one that received the write. This may never hit you,
but ZFS is built on the assumption that it cannot happen at all, which
is no longer true with 4 KB drives used with ashift=9.

> The problem with step 2 is that you need to have enough hard disks
> spare to create a whole new pool and throw away the old disks. Plus
> a disk controller with lots of spare ports. Plus the ability to take
> the system offline for hours or days while the migration happens.
>
> One way to reduce this slightly is to create a new pool with reduced
> redundancy. For example, create a RAIDZ2 with two fake disks, then
> off-line those disks.

Both methods are dangerous! Studies have found that the risk of another
disk failure during resilvering is substantial; that was the reason for
the higher-redundancy RAIDZ groups (raidz2, raidz3). With 1) you have
to copy the data multiple times, and the load could lead to the loss of
one of the source drives (and since you are in the process of
overwriting the drive that provided redundancy, you lose your pool that
way).
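As an aside on your points 1 and 2: the 4K-aligned partitioning and
forcing ashift=12 at pool creation look roughly like this on FreeBSD.
This is only a sketch from memory; the device name (ada1), the label
and the pool name are placeholders, the min_auto_ashift sysctl only
exists on newer systems, and the classic gnop detour works everywhere:

  # 4K-aligned partition; add -s if you want to leave a little
  # headroom at the end of the disk
  gpart create -s gpt ada1
  gpart add -t freebsd-zfs -a 4k -l newdisk0 ada1

  # either (newer systems): force new top-level vdevs to ashift >= 12
  sysctl vfs.zfs.min_auto_ashift=12
  zpool create newpool gpt/newdisk0

  # or (older systems): pretend the provider has 4K sectors via gnop
  gnop create -S 4096 /dev/gpt/newdisk0
  zpool create newpool /dev/gpt/newdisk0.nop
  zpool export newpool
  gnop destroy /dev/gpt/newdisk0.nop
  zpool import newpool

  # verify what actually ended up in the pool labels
  zdb -C newpool | grep ashift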
The copying to a degraded pool that you describe in 2) is a possibility
(and I've done it, once). You should make sure that all source data is
still available until the "new" pool has successfully resilvered with
the fake disks replaced. You could do this by moving the redundant
disks from the old pool to the new pool (i.e. degrading the old pool,
after all data has been copied, and using its redundant drives to
complete the new pool). But this assumes that the technologies of the
drives match. I'll soon go from 4*2TB to 3*4TB (raidz1 in both cases),
since I had 2 of the 2TB drives fail over the course of the last year
(replaced under warranty).

> So, given how much this problem sucks (it is extremely easy to add
> a 4K disk by mistake as a replacement for a failed disk), and how
> painful the workaround is... will ZFS ever gain the ability to change
> block size for the pool? Or is this so deep in the internals of ZFS
> it is as likely as being able to dynamically add disks to an existing
> zvol in the "never going to happen" basket?

You can add a 4 KB physical drive that emulates 512 byte sectors
(nearly all such drives do) to an ashift=9 ZFS pool, but performance
will suffer and you'll be violating a ZFS assumption, as explained
above.

> And secondly, is it also bad to have ashift 9 disks inside an ashift
> 12 pool? That is, do we need to replace all our disks in one go and
> forever keep big sticky labels on each disk so we never mix them?

The ashift parameter is per pool, not per disk. You can have a drive
with emulated 512 byte sectors in an ashift=9 pool, but you cannot
change the ashift value of a pool after creation.

Regards, Stefan
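P.S. For the archives, a rough sketch of the "RAIDZ2 with two fake
disks" migration using sparse file vdevs. All device names, paths and
sizes are made up, and zpool may insist on -f when mixing file and disk
vdevs; treat this as an outline, not a tested recipe:

  # sparse files stand in for the two missing raidz2 members; they are
  # taken offline before any real data is written to them
  truncate -s 4T /var/tmp/fake0 /var/tmp/fake1
  zpool create newpool raidz2 gpt/new0 gpt/new1 gpt/new2 gpt/new3 \
      /var/tmp/fake0 /var/tmp/fake1
  zpool offline newpool /var/tmp/fake0
  zpool offline newpool /var/tmp/fake1

  # copy everything over with a recursive snapshot
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -Fdu newpool

  # only after the copy has completed, take a redundant disk from the
  # old pool and let the new pool resilver onto it
  zpool offline oldpool gpt/old4
  zpool replace newpool /var/tmp/fake0 gpt/old4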