Date:      Fri, 5 Jul 2013 07:53:32 -0700
From:      Jeremy Chadwick <jdc@koitsu.org>
To:        Daniel Kalchev <daniel@digsys.bg>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Slow resilvering with mirrored ZIL
Message-ID:  <20130705145332.GA5449@icarus.home.lan>
In-Reply-To: <51D6A206.2020303@digsys.bg>
References:  <CBCA1716-A3EC-4E3B-AE0A-3C8028F6AACF@alumni.chalmers.se> <20130704000405.GA75529@icarus.home.lan> <C8C696C0-2963-4868-8BB8-6987B47C3460@alumni.chalmers.se> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se> <20130704191203.GA95642@icarus.home.lan> <43015E9015084CA6BAC6978F39D22E8B@multiplay.co.uk> <CAOjFWZ4obK1cSmvTpW%2Bt4xKdMf%2BkJV5w-sujDT1AZoepj%2B5YrA@mail.gmail.com> <3CFB4564D8EB4A6A9BCE2AFCC5B6E400@multiplay.co.uk> <51D6A206.2020303@digsys.bg>

On Fri, Jul 05, 2013 at 01:37:58PM +0300, Daniel Kalchev wrote:
> 
> On 05.07.13 02:28, Steven Hartland wrote:
> >
> >
> >If anyone wants my current patches which add switch to 4k ashift
> >by default
> >as a sysctl + works with QUIRKS too, just let me know.
> >
> >They are well tested, just we want more options before putting in
> >the tree.
> 
> Is it not easier to add this as an option to zpool create, instead
> of an sysctl?
> 
> That is, I believe we have two scenarios here:
> 
> 1. Having an sysctl that instructs ZFS to look at the FreeBSD quirks
> to decide what the ashift should be, instead of only querying the
> 'sectorsize' property of the storage. I believe we might not even
> need an sysctl here, just make it default to obey the quirks --- but
> sysctl for the interim period will not hurt (with the proper
> default).

I can expand on this one (specifically "relying on sectorsize of the
media"): no, this will not work reliably, for two reasons.  Hear me out:

1. You're operating under the assumption that every disk/device
advertises both logical and physical sector sizes separately.  That is
far from the case.

I know you're aware that all devices advertise a logical size of 512
(even if they are 4K physical) to remain fully compatible with legacy
OSes, but what a lot of people don't know is that many disks (I'm
speaking about ATA here because I don't do much with SAS/SCSI) don't
implement the bits of the ATA IDENTIFY result that define
separate logical and physical sector sizes (per T13 ATA-8 Working Draft
specification).  The problem is that these vendors see the Working Draft
as "beta/alpha" and therefore don't bother honouring some of the more
useful (I would say critical in this case) features of it -- like this
one.  :-(
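
For readers who want to see what "separate logical and physical sector
sizes" means in practice, here is a minimal Python sketch of decoding
them from the 256 16-bit words of IDENTIFY data.  It assumes the ATA-8
layout (word 106 describes the logical/physical relationship, words
117-118 hold a non-default logical size); the function name and the
sample values are hypothetical, and the spec should be treated as the
authority rather than this sketch.

```python
def decode_sector_sizes(identify_words):
    """Return (logical_bytes, physical_bytes, reported).

    identify_words: sequence of 256 16-bit integers from ATA IDENTIFY.
    reported is False when the drive did not fill in word 106, in which
    case both sizes fall back to the legacy 512 bytes.
    """
    w106 = identify_words[106]
    logical = 512
    physical = 512

    # Assumption: word 106 is only meaningful when bit 15 is clear
    # and bit 14 is set.
    reported = (w106 & 0xC000) == 0x4000
    if reported:
        if w106 & (1 << 12):
            # Logical sector is larger than 256 words; words 117-118
            # give its size in 16-bit words.
            size_in_words = identify_words[117] | (identify_words[118] << 16)
            logical = size_in_words * 2
        if w106 & (1 << 13):
            # Bits 3:0 give log2(logical sectors per physical sector).
            physical = logical << (w106 & 0x000F)
        else:
            physical = logical
    return logical, physical, reported


# Hypothetical drive that reports 512 logical / 4096 physical.
good = [0] * 256
good[106] = 0x6003

# Hypothetical drive that leaves word 106 empty (the problem case).
silent = [0] * 256
```

A drive like `silent` is indistinguishable from a genuine 512-byte
drive at this level, which is the whole problem.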

You will find many disks on the market today -- including SSDs -- that
are like this.  Some are even big-name brands you would expect better
from.  If I had to guess, I would say probably 30% of them are this way;
it's a substantial number.

If you want some examples/proof, or want to see the spec yourself, just
let me know and I can give you some/point you to the relevant docs.

2. A common rebuttal is "well that's what quirks are for".  Absolutely!
But how do you think those quirks are added?  When someone tells a
committer "hey, there's a new disk out which doesn't implement physical
sector size in ATA IDENTIFY, here's a patch".  Otherwise it never gets
added.  And sometimes that addition takes months given people's FreeBSD
time vs. real life time.  So in effect quirks are *always* outdated.
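
To illustrate why a stale quirk list matters, here is a rough Python
sketch of the decision being described.  Everything in it is
hypothetical: the model patterns, the table, and the order of checks
(drive-reported size first, then quirk, then 512) are assumptions for
illustration and are not taken from the FreeBSD kernel source.

```python
from fnmatch import fnmatchcase

# Hypothetical quirk entries: (model glob, physical sector size in bytes).
QUIRKS_4K = [
    ("EXAMPLE SSD-A*", 4096),
    ("EXAMPLE HDD-4K??", 4096),
]


def physical_sector_size(model, reported=None, quirks=QUIRKS_4K):
    """Pick a physical sector size for a drive.

    reported: size the drive itself advertised, or None if it did not.
    """
    if reported:
        # The drive told us; nothing to guess.
        return reported
    for pattern, size in quirks:
        if fnmatchcase(model, pattern):
            # Someone submitted a quirk for this model.
            return size
    # No report and no quirk: we silently assume a 512-byte drive.
    return 512
```

Any 4K drive that neither reports its size nor has a quirk entry yet
falls through to the last line, and the user only finds out later.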

I'm not "damning FreeBSD", I'm just stating that this is the reality of
the situation.  For example -- only recently in stable/9 (maybe
stable/8, didn't look) were quirks added for Intel SSDs which came to
market over 2 years ago (BTW thanks for adding those Steve).

Even if there was a way to rectify that scenario in an efficient manner
(the quirks right now are hard-coded in kernel space), it wouldn't
change this scenario:

- User buys a disk which advertises logical only/lacks 4K quirks; say
  the disk was RTM 2 years ago
- Installs FreeBSD on it / uses it, lots of data on it now
- Notices performance problems or "other anomalies" (this thread is an
  acceptable example, although there are literally 8 or 9 problems going
  on with this situation that are all compounded)
- User posts on FreeBSD forum or mailing list asking for help, not sure
  what the problem is (very common)
- Response from community/devs is: "you get to repartition/reinstall
  your entire OS".

I know *I* sure wouldn't want to be told that...

I've pondered some solutions to this dilemma, but really none of them
are plausible or realistic; they all carry too many potential risks.

It sort of reminds me of the gmirror/GPT conflict problem**.

> 2. Have an option to zpool create and zpool add, that specifies the
> ashift value. Here my thinking is that it should let you specify an
> ashift equal or larger than the computed one, which is based on the
> largest sector size of all devices in a vdev.

I'm very much a supporter of the option being added to one of the ZFS
commands.  I'm not against Steve's sysctl, but the problem with that is
more of a social one: features like this (if committed) never end up
being announced to the world in a useful manner, so nobody knows they
exist until it's too late.  It would also just make me wonder "why
bother with the sysctl at all, just use 4096 universally going forward,
and have whatever code/bits still support cases where existing setups
use 512" (that last part is probably easier said than done, not sure).

As for the "basing things on sector size" -- see my above explanation
for why/how this isn't entirely reliable.  Manufacturers, argh!  :-)

But something like "zpool create -a 12 ..." would be a blessing, because
I'd just use that all the time.  If changing the default from 9 to 12
isn't plausible, then at least offering what I just described would be a
good/worthwhile stepping stone.
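
For reference, ashift is just log2 of the sector size (9 -> 512,
12 -> 4096).  Here is a minimal Python sketch of the rule proposed in
the quoted text: compute the ashift from the largest sector size in the
vdev, and accept a user-supplied value only if it is equal or larger.
The function names are hypothetical, and the "-a" flag above is a
proposal, not an existing zpool option.

```python
def ashift_for(sector_size):
    """Return log2(sector_size) for a power-of-two size >= 512."""
    if sector_size < 512 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a power of two >= 512")
    return sector_size.bit_length() - 1


def choose_ashift(sector_sizes, requested=None):
    """Pick the ashift for a vdev from its devices' sector sizes.

    requested: optional user override; must not be smaller than the
    value computed from the largest sector size in the vdev.
    """
    computed = max(ashift_for(size) for size in sector_sizes)
    if requested is None:
        return computed
    if requested < computed:
        raise ValueError("requested ashift is smaller than the devices need")
    return requested
```

With this rule, `choose_ashift([512, 512], requested=12)` is allowed
(forcing 4K alignment on drives that claim 512), while asking for 9 on
a vdev containing a 4096-byte device is rejected.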

Though, I guess really it's not much different from the gnop approach,
just that you now don't have to use gnop.  But you still have to be
aware of the flag (ex. -a 12), just like you have to be aware of gnop
and that ordeal.  I'm trying to think about it from the viewpoint of a
user not having to know about/do *any* of that.

> Don't know, but always wondered.. how hard is it to change the
> ashift value on the fly? Does it impact reads of data already on the
> vdev, or does it impact only writes? If only writes, it should be
> trivial, really....

I've wondered this too, but I'm not familiar enough with the ZFS
innards, or with filesystems at a low level, to be able to speak to it.



** -- Linux md had this same problem (though at the beginning of the
device, not the end), and they solved it cleanly with md 1.2 (the
version number is stored in the superblock/metadata) where they skip the
first 4096 bytes of the device and store the superblock right after that:

https://raid.wiki.kernel.org/index.php/RAID_superblock_formats

Sections 1.3 and 1.5 are most relevant/educational.  You can see
clearly, though, that they've had to change their approach given the
same stuff FreeBSD is dealing with.

I looked at the gmirror code late last week to see if this was possible
to do, and at first glance it appears to be, but I don't think there
would be a clean "upgrade path" -- I'm fairly certain it would require a
full gmirror recreation (as in fully start over), because otherwise
changes would conflict with existing partition sizes and other whatnots.
See Section 1.6 in the above document for this situation on Linux --
just remember that Linux md is a bit of a different beast than gmirror
(GEOM is more versatile, md is more rigid/static).  I'm sure Pawel has
thought about all of this many times over though and that it's more of
an issue of time than anything else.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



