From owner-svn-src-all@FreeBSD.ORG Mon Dec 6 21:10:44 2010
From: Alexander Motin
Message-ID: <4CFD514E.8010103@FreeBSD.org>
Date: Mon, 06 Dec 2010 23:10:38 +0200
To:
John Baldwin
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Pawel Jakub Dawidek, Ivan Voras
Subject: Re: svn commit: r216230 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs
In-Reply-To: <201012061518.49835.jhb@freebsd.org>
References: <201012061218.oB6CI3oW032770@svn.freebsd.org> <20101206195327.GD1936@garage.freebsd.pl> <201012061518.49835.jhb@freebsd.org>

On 06.12.2010 22:18, John Baldwin wrote:
> On Monday, December 06, 2010 2:53:27 pm Pawel Jakub Dawidek wrote:
>> On Mon, Dec 06, 2010 at 08:35:36PM +0100, Ivan Voras wrote:
>>> Please persuade me on technical grounds why ashift, a property
>>> intended for address alignment, should not be set in this way. If your
>>> answer is "I don't know but you are still wrong because I say so" I
>>> will respect it and back it out, but only until I/we discuss the
>>> question with the upstream ZFS developers.
>>
>> No. You persuade me why changing ashift in ZFS, which, as the comment
>> clearly states, is the "device's minimum transfer size", is better and
>> less hackish than presenting the disk with a properly configured sector
>> size. This can affect disks that still use 512-byte sectors, and it
>> doesn't fix the problem at all. It just works around the problem in ZFS
>> when configured on top of raw disks.

Both the ATA and SCSI standards implement support for different logical and physical sector sizes. It is not a hack; it seems to be the way the manufacturers have decided to go, at least according to their own words.
IMHO, the hack in this situation would be to report to GEOM some fake sector size, different from the one reported by the device. Either way, the sector size is the main visible disk characteristic, independent of what the firmware does inside.

>> What about other file systems? What about other GEOM classes? GELI is
>> a great example here, as people use ZFS on top of GELI a lot. GELI
>> integrity verification works in such a way that not reporting the disk
>> sector size properly will have a huge negative performance impact.
>> ZFS' ashift won't change that.
>
> I am mostly on your side here, but I wonder if GELI shouldn't prefer the
> stripesize anyway? For example, if you ran GELI on top of RAID-5, I
> imagine it would be far more performant for it to use stripe-size
> logical blocks instead of individual sectors for the underlying media.
>
> The RAID-5 argument also suggests that other filesystems should probably
> prefer stripe sizes to physical sector sizes when picking block sizes,
> etc.

Looking further, I can see a use even for several "stripesize" values along the way, unrelated to the logical sector size. Let's take an example: 5 disks with 4K physical sectors in a RAID5 with a 64K strip. We'll have three sizes to align to: 4K, 64K and 256K. Aligning to 4K avoids read-modify-write at the disk level; aligning to 64K avoids request splitting, and so increases (up to doubles) parallel random read performance; aligning to 256K significantly increases write speed by avoiding read-modify-write on the RAID5.

How can this be used? We can easily align a partition to the biggest of them, 256K, to give any file system the maximum chance of aligning properly. UFS allocates space and writes data with block granularity; depending on the specific situation we may wish to increase the block size to 64K, but that is quite a big value, so it depends. We can safely increase the fragment size to 4K. Also, we could make the UFS read-ahead and write-back code align I/Os at run time to the reported boundaries.
Depending on the situation, both 64K and 256K could be reasonable candidates for that. Sure, the solution is somewhat an engineering (not an absolute) one in each case, but IMHO a reasonable one. The specific usage of these values (512, 4K, 64K and 256K) depends on the abilities of the specific partitioning scheme and file system. Neither the disk driver nor GEOM can know what will be more useful at each next level. 512 bytes is the only critically important value in this situation; everything else is only an optimization.

-- 
Alexander Motin