Date: Wed, 12 Oct 2011 10:29:12 -0700
From: Jeremy Chadwick
To: Daniel Kalchev
Cc: freebsd-fs@freebsd.org
Subject: Re: AF (4096 byte sector) drives: Can you mix/match in a ZFS pool?
Message-ID: <20111012172912.GA27013@icarus.home.lan>
In-Reply-To: <4E95C546.70904@digsys.bg>
References: <4E95AE08.7030105@lerctr.org> <20111012155938.GA24649@icarus.home.lan> <4E95C546.70904@digsys.bg>

On Wed, Oct 12, 2011 at 07:50:14PM +0300, Daniel Kalchev wrote:
>
>
> On 12.10.11 18:59, Jeremy Chadwick wrote:
> >On Wed, Oct 12, 2011 at 10:11:04AM -0500, Larry Rosenman wrote:
> >>I have a root on ZFS box with 6 drives, all 400G (except one 500G)
> >>in a pool.
> >>
> >>I want to upgrade to 2T or 3T drives, but was wondering if you can
> >>mix/match while doing the drive by drive
> >>replacement.
> >>
> >>This is on 9.0-BETA3 if that matters.
> >This is a very good question, and opens a large can of worms. My gut
> >feeling tells me this discussion is going to be very long.
> >
> >I'm going to say that no, mixing 512-byte and 4096-byte sector drives in
> >a single vdev is a bad idea. Here's why:
>
> This was not the original question. The original question is whether
> replacing 512-byte sector drives in a 512-byte sector aligned zpool
> with 4096-byte sector drives is possible.
>
> It is possible, of course, as most 4096-byte drives today emulate
> 512-byte drives and some even pretend to be 512-byte sector drives.
>
> Performance might degrade, this depends on the workload. In some
> cases the performance might be way bad.
>
> >
> >The procedure I've read for doing this is as follows:
> >
> >ada0 = 512-byte sector disk
> >ada1 = 4096-byte sector disk
> >ada2 = 512-byte sector disk
> >
> >gnop create -S 4096 ada1
> >zpool create mypool raidz ada0 ada1.nop ada2
> >zdb | grep ashift
> ><confirm ashift is 12 (4096-byte alignment), not 9 (512-byte alignment)>
> >zpool export mypool
> >gnop destroy ada1.nop
> >zpool import mypool
>
> It is not important which of the underlying drives will be gnop-ed.
> You may well gnop all of these. The point is, that ZFS uses the
> largest sector size of any of the underlying devices to determine
> the ashift value. That is the "minimum write" value, or the smallest
> unit of data ZFS will write in an I/O.
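
(Side note for anyone following along at home: whatever ashift ends up
controlling under the hood, you can at least see what value an existing
pool was created with.  A rough sketch -- "mypool" here is just a
placeholder pool name, and the exact output formatting may differ
between ZFS versions:

  # zdb -C mypool | grep ashift
              ashift: 12

2^9 = 512 and 2^12 = 4096, so 9 vs. 12 is the thing to look for.)
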
>
> >Circling back to the procedure I stated above: this would result in an
> >ashift=12 alignment for all I/O to all underlying disks. How do you
> >think your 512-byte sector drives are going to perform when doing reads
> >and writes? (Answer: badly)
>
> The gnop trick is used not because you will ask a 512-byte sector
> drive to write 8 sectors with one I/O, but because you may ask an
> 4096-byte sector drive to write only 512 bytes -- which for the
> drive means it has to read 4096 bytes, modify 512 of these bytes and
> write back 4096 bytes.

If I'm reading this correctly, you're effectively stating that ashift
actually just defines (or helps in calculating) an LBA offset for the
start of the pool-related data on that device?

"ashift" seems like a badly-named term/variable for what this does, but
oh well. I was always under the impression the term "ashift" stood for
"align shift" and was applied to the block size of data read from a
disk in a single request -- and keep reading (specifically the last
part of my mail).

> >So my advice is do not mix-match 512-byte and 4096-byte sector disks in a
> >vdev that consists of multiple disks.
>
> The proper way to handle this is to create your zpool with 4096-byte
> alignment, that is, for the time being by using the above gnop
> 'hack'.

...which brings into question why this is needed at all -- meaning, why
the ZFS code cannot be changed to default to an ashift value that's
calculated as 12 (or equivalent) regardless of whether 512-byte or
4096-byte sector drives are used. I guess changing this would get into
a discussion about whether or not it could (not would) badly impact
other forms of media (CF drives, etc.), but if it's literally just a
starting LBA offset adjustment value then it shouldn't matter.

How was this addressed on Solaris/OpenSolaris? I really need to know
this, mainly because we use SSDs on Solaris 10 at my workplace, and
because our Solaris 10 boxes are using 1TB disks that will (in many
months to come) be upgraded to 2TB disks, which almost certainly means
we'll end up with 4096-byte sector drives. The last thing I need to
deal with is our entire division complaining about crummy I/O
throughput because our disk imaging process didn't force ashift to be
12. If I have to deal with Oracle then so be it, but I imagine someone
lurking here knows... :-)

> This way, you are sure to not have performance implications no
> matter what (512 or 4096 byte) drives you use in the vdev.
>
> There should be no implications to having one vdev with 512 byte
> alignment and another with 4096 byte alignment. ZFS is smart enough
> to issue minimum of 512 byte writes to the former and 4096 bytes to
> the latter thus not creating any bottleneck.

How does ZFS determine this? I was under the impression that this
behaviour was determined by (or "assisted by") ashift. Surely ZFS
cannot ask the underlying storage provider (e.g. GEOM on FreeBSD) what
logical vs. physical sector size to use (e.g. for SATA, what's returned
in the ATA IDENTIFY payload), because on SSDs such as Intel SSDs *both*
of those sizes are reported as 512 bytes (camcontrol identify confirms
this).

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, US  |
| Making life hard for others since 1977.             PGP 4BD6C0CB   |
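
P.S. -- For anyone who wants to see what their drives actually claim,
this is roughly the kind of output I'm referring to (a sketch only;
ada0 is just an example device, the output is trimmed, and the field
layout varies between FreeBSD versions):

  # camcontrol identify ada0 | grep "sector size"
  sector size           logical 512, physical 4096, offset 0

  # diskinfo -v /dev/ada0 | egrep "sectorsize|stripesize"
          512             # sectorsize
          4096            # stripesize

A drive that truly has 512-byte sectors -- and the Intel SSDs I
mentioned -- will report 512 for both values, which is exactly why ZFS
can't blindly trust what the hardware advertises.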