From owner-freebsd-fs@FreeBSD.ORG Fri Jul 5 14:53:53 2013
Date: Fri, 5 Jul 2013 07:53:32 -0700
From: Jeremy Chadwick
To: Daniel Kalchev
Subject: Re: Slow resilvering with mirrored ZIL
Message-ID: <20130705145332.GA5449@icarus.home.lan>
In-Reply-To: <51D6A206.2020303@digsys.bg>
Cc:
freebsd-fs@freebsd.org

On Fri, Jul 05, 2013 at 01:37:58PM +0300, Daniel Kalchev wrote:
>
> On 05.07.13 02:28, Steven Hartland wrote:
> >
> >
> >If anyone wants my current patches which add switch to 4k ashift
> >by default
> >as a sysctl + works with QUIRKS too, just let me know.
> >
> >They are well tested, just we want more options before putting in
> >the tree.
>
> Is it not easier to add this as an option to zpool create, instead
> of an sysctl?
>
> That is, I believe we have two scenarios here:
>
> 1. Having an sysctl that instructs ZFS to look at the FreeBSD quirks
> to decide what the ashift should be, instead of only querying the
> 'sectorsize' property of the storage. I believe we might not even
> need an sysctl here, just make it default to obey the quirks --- but
> sysctl for the interim period will not hurt (with the proper
> default).

I can expand on this one (specifically "relying on sectorsize of the
media"): no, this will not work reliably, for two reasons.  Hear me out:

1. You're operating under the assumption that every disk/device
advertises both logical and physical sector sizes separately.  That is
far from the case.  I know you're aware that all devices advertise a
logical size of 512 (even if they are 4K physical) to remain fully
compatible with legacy OSes, but what a lot of people don't know is
that many disks (I'm speaking about ATA here because I don't do much
with SAS/SCSI) don't implement the necessary bits of the ATA IDENTIFY
CDB result that define separate logical and physical sector sizes (per
the T13 ATA-8 Working Draft specification).
The problem is that these vendors see the Working Draft as "beta/alpha"
and therefore don't bother honouring some of the more useful (I would
say critical in this case) features of it -- like this one.  :-(

You will find many disks on the market today -- including SSDs -- that
are like this.  Some are even big-name brands you would expect better
from.  If I had to guess, I would say probably 30% of them are this
way; it's a substantial number.  If you want some examples/proof, or
want to see the spec yourself, just let me know and I can give you
some/point you to the relevant docs.

2. A common rebuttal is "well that's what quirks are for".  Absolutely!
But how do you think those quirks are added?  When someone tells a
committer "hey, there's a new disk out which doesn't implement physical
sector size in ATA IDENTIFY, here's a patch".  Otherwise it never gets
added.  And sometimes that addition takes months given people's FreeBSD
time vs. real life time.  So in effect quirks are *always* outdated.

I'm not "damning FreeBSD", I'm just stating that this is the reality of
the situation.  For example -- only recently in stable/9 (maybe
stable/8, didn't look) were quirks added for Intel SSDs which came to
market over 2 years ago (BTW thanks for adding those Steve).

Even if there were a way to rectify that scenario in an efficient
manner (the quirks right now are hard-coded in kernel space), it
wouldn't change this scenario:

- User buys a disk which advertises logical only/lacks 4K quirks; say
  the disk was RTM 2 years ago
- Installs FreeBSD on it / uses it, lots of data on it now
- Notices performance problems or "other anomalies" (this thread is an
  acceptable example, although there are literally 8 or 9 problems
  going on with this situation that are all compounded)
- User posts on FreeBSD forum or mailing list asking for help, not sure
  what the problem is (very common)
- Response from community/devs is: "you get to repartition/reinstall
  your entire OS".
I know *I* sure wouldn't want to be told that...

I've pondered some solutions to this dilemma, but really none of them
are plausible or realistic, or they carry too many potential risks in
exchange.  It sort of reminds me of the gmirror/GPT conflict problem**.

> 2. Have an option to zpool create and zpool add, that specifies the
> ashift value. Here my thinking is that it should let you specify an
> ashift equal or larger than the computed one, which is based on the
> largest sector size of all devices in a vdev.

I'm very much a supporter of the option being added to one of the ZFS
commands.

I'm not against Steve's sysctl, but the problem with that is more of a
social one: features like this (if committed) never end up being
announced to the world in a useful manner, so nobody knows they exist
until it's too late.  It would also just make me wonder "why bother
with the sysctl at all, just use 4096 universally going forward, and
have whatever code/bits still support cases where existing setups use
512" (the last part probably sounds easier than it is, not sure).

As for the "basing things on sector size" -- see my above explanation
for why/how this isn't entirely reliable.  Manufacturers, argh!  :-)

But something like "zpool create -a 12 ..." would be a blessing,
because I'd just use that all the time.  If changing the default from 9
to 12 isn't plausible, then at least offering what I just described
would be a good/worthwhile stepping stone.

Though, I guess really it's not much different from the gnop approach,
just that you now don't have to use gnop.  But you still have to be
aware of the flag (ex. -a 12), just like you have to be aware of gnop
and that ordeal.  I'm trying to think about it from the viewpoint of a
user not having to know about/do *any* of that.

> Don't know, but always wondered.. how hard is it to change the
> ashift value on the fly? Does it impact reads of data already on the
> vdev, or does it impact only writes? If only writes, it should be
> trivial, really....
I've wondered this too, but I don't have enough familiarity with the
ZFS innards or filesystems at a low level to be able to speak to it.

** -- Linux md had this same problem (though at the beginning of the
device, not the end), and they solved it cleanly with md 1.2 (the
version number is stored in the superblock/metadata) where they skip
the first 4096 bytes on the disk and store the superblock there:

https://raid.wiki.kernel.org/index.php/RAID_superblock_formats

Sections 1.3 and 1.5 are most relevant/educational.  You can see
clearly, though, that they've had to change their approach given the
same stuff FreeBSD is dealing with.

I looked at the gmirror code late last week to see if this was possible
to do, and at first glance it appears to be, but I don't think there
would be a clean "upgrade path" -- I'm fairly certain it would require
a full gmirror recreation (as in fully start over), because otherwise
changes would conflict with existing partition sizes and other
whatnots.  See Section 1.6 in the above document for this situation on
Linux -- just remember that Linux md is a bit of a different beast than
gmirror (GEOM is more versatile, md is more rigid/static).

I'm sure Pawel has thought about all of this many times over though and
that it's more of an issue of time than anything else.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
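
Addendum: to make the sector-size point in item 1 and the "ashift equal
or larger than the computed one" idea concrete, here is a minimal Python
sketch.  It is only an illustration under stated assumptions, not how
FreeBSD or ZFS actually do it.  The function names are hypothetical, the
bit layout follows my reading of IDENTIFY DEVICE word 106 in ATA8-ACS,
and the `minimum` parameter merely models the proposed (hypothetical)
"-a 12" style option.

```python
def decode_sector_sizes(word106, words_117_118=0):
    """Return (logical_bytes, physical_bytes) from IDENTIFY DEVICE data.

    Hypothetical helper. Falls back to 512/512 when the drive leaves
    word 106 unimplemented, which is the failure mode described above.
    """
    logical = 512
    # Word 106 is only meaningful when bit 14 is set and bit 15 is clear.
    if (word106 & 0xC000) != 0x4000:
        return logical, logical
    # Bit 12: logical sector is longer than 256 words; the real size is
    # reported in words 117-118, counted in 16-bit words.
    if word106 & (1 << 12) and words_117_118:
        logical = words_117_118 * 2
    physical = logical
    # Bit 13: several logical sectors per physical sector; bits 3:0 hold
    # the power-of-two exponent.
    if word106 & (1 << 13):
        physical = logical << (word106 & 0x000F)
    return logical, physical


def ashift_for(sector_bytes):
    """ashift is log2 of the sector size (512 -> 9, 4096 -> 12)."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


def vdev_ashift(physical_sizes, minimum=9):
    """Largest ashift across the vdev's devices, never below `minimum`.

    `minimum` is an assumption standing in for a user-supplied override.
    """
    computed = max(ashift_for(size) for size in physical_sizes)
    return max(computed, minimum)
```

A drive that reports word 106 as 0x6003 decodes to 512 logical / 4096
physical, while one that reports 0x0000 looks like a plain 512-byte
disk even if it is 4K internally.  Only a quirk entry or an explicit
override such as vdev_ashift(sizes, minimum=12) would catch that case.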