From owner-freebsd-stable@FreeBSD.ORG Wed Sep 10 06:47:15 2014
Message-ID: <540FF3C4.6010305@ish.com.au>
Date: Wed, 10 Sep 2014 16:46:28 +1000
From: Aristedes Maniatis
To: freebsd-stable
Subject: getting to 4K disk blocks in ZFS

As we all know, it is important to ensure that modern disks are set up
with the correct block size. Everything is fine while all the disks and
the pool are ashift=9 (512-byte blocks), but as soon as one new drive
requires 4K blocks, the performance of the entire pool drops through
the floor.

To upgrade, there appear to be two separate things that must be done
for a ZFS pool:

1. Create partitions on 4K boundaries. This is simple with the "-a 4k"
option in gpart, and it isn't hard to remove disks one at a time from a
pool, reformat them on the right boundaries and put them back.
Hopefully you've left a few spare bytes on the disk to ensure that your
partition doesn't get smaller when you return it to the pool.

2. Create a brand new pool which has ashift=12 and zfs send|receive all
the data over. I guess I don't understand enough about zpool to know
why the pool itself has a block size, since I understood ZFS to have
variable stripe widths.

The problem with step 2 is that you need enough spare hard disks to
create a whole new pool and throw away the old ones. Plus a disk
controller with lots of spare ports. Plus the ability to take the
system offline for hours or days while the migration happens. One way
to reduce the pain slightly is to create the new pool with reduced
redundancy: for example, create a RAIDZ2 with two fake disks, then
offline those disks. Something like the sketch below.
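Roughly what I mean, in case it helps (an untested sketch: the device
names da0/da1, the partition size and the GPT labels are invented, and
vfs.zfs.min_auto_ashift only exists on reasonably recent stable/10 --
on older systems the gnop trick does the same job):

  # Make new vdevs come up with 4K sectors (ashift=12). Without this
  # sysctl, "gnop create -S 4096 da0p1" and building the pool on the
  # resulting .nop devices achieves the same thing.
  sysctl vfs.zfs.min_auto_ashift=12

  # Partition a new disk on 4K boundaries, leaving a little slack at
  # the end so a future replacement can never come up short.
  gpart create -s gpt da0
  gpart add -t freebsd-zfs -a 4k -s 3900G -l new0 da0
  # ... repeat for da1 with label new1 ...

  # Two sparse files stand in for the disks we don't have yet.
  truncate -s 4T /tmp/fake0 /tmp/fake1

  # Build the RAIDZ2 from two real disks and two fakes, then offline
  # the fakes so the pool runs degraded.
  zpool create newpool raidz2 gpt/new0 gpt/new1 /tmp/fake0 /tmp/fake1
  zpool offline newpool /tmp/fake0
  zpool offline newpool /tmp/fake1
  rm /tmp/fake0 /tmp/fake1

  # Copy everything across.
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -F newpool

  # Once the old disks are freed up and repartitioned, swap them in
  # for the fakes with "zpool replace".

The catch, of course, is that a four-disk RAIDZ2 with two members
offline has no redundancy left at all, so a single disk failure during
the copy loses the new pool.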
So, given how much this problem sucks (it is extremely easy to add a 4K
disk by mistake as a replacement for a failed disk), and how painful
the workaround is... will ZFS ever gain the ability to change the block
size of an existing pool? Or is this so deep in the internals of ZFS
that it sits in the "never going to happen" basket, alongside
dynamically adding disks to an existing vdev?

And secondly, is it also bad to have ashift=9 disks inside an ashift=12
pool? That is, do we need to replace all our disks in one go, and
forever keep big sticky labels on each disk so we never mix them?

Thanks for any advice

Ari Maniatis

-- 
-------------------------->
Aristedes Maniatis
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A