From owner-freebsd-stable@FreeBSD.ORG Fri Feb 17 02:10:21 2012
Date: Thu, 16 Feb 2012 18:10:19 -0800
From: Jeremy Chadwick
To: Warren Block
Cc: Mike Andrews, freebsd-stable@freebsd.org,
 Miroslav Lachman <000.fbsd@quip.cz>
Subject: Re: New BSD Installer
Message-ID: <20120217021019.GA61420@icarus.home.lan>
References: <4F35743B.4020302@os2.kiev.ua> <4F37DBA3.7030304@cran.org.uk>
 <20120213195554.O46120@sola.nimnet.asn.au>
 <092c01cceb40$2dc8f240$895ad6c0$@fisglobal.com>
 <095a01cceb54$04a38fb0$0deaaf10$@fisglobal.com>
 <4F3ACDE7.8060003@bit0.com> <4F3D9A7C.7080900@quip.cz>
 <20120217001829.GA59869@icarus.home.lan>

On Thu, Feb 16, 2012 at 06:34:53PM -0700, Warren Block
wrote:
> On Thu, 16 Feb 2012, Jeremy Chadwick wrote:
> >On Fri, Feb 17, 2012 at 01:08:28AM +0100, Miroslav Lachman wrote:
> >>
> >>Please don't mix two things together. gpart can replace fdisk and
> >>bsdlabel, but GPT vs. MBR is a different thing. GPT doesn't play
> >>nice with GEOM classes which store their metadata on last sector.
> >>For example, you can't use gmirror of a whole drives and use GPT on
> >>top of this mirror. (and gmirror is not the only one)
> >
> >This is quite possibly the most concise, clearest definition of a major
> >(borderline catastrophic) situation pertaining to GPT + GEOM
> >combinations.
> >
> >I'm going to be more bold than usual: who is fixing this, and when is it
> >going to be MFC'd to 9, 8, and probably 7 would be a good idea? If
> >nobody is fixing this, someone had better light a fire under someone's
> >ass to fix it. I'm absolutely amazed this is still a problem.
>
> How can it be fixed? GPT only has two points of reference, the
> start and end of the disk. To do more it would have to be aware of
> a lot of possible disk formats.

The GPT aspect of it cannot be fixed. The GEOM aspect of it should be
fixed. The "let's store the metadata in the last sector" mentality is
what needs to be addressed. There has to be a better way of doing this.

I'm surprised that, given the nature of these two bits (GPT vs. GEOM),
the GEOM layer cannot simply lie about the full capacity of the
partition, or something to that effect.

Consider this: Linux's md driver has the capability to do, in effect,
the same thing GEOM classes (gmirror, etc.) do. It obviously must store
metadata somewhere too. How did they do it?

http://www.mjmwired.net/kernel/Documentation/md.txt
http://linux.die.net/man/4/md
http://linux.die.net/man/8/mdadm

Quoting mdadm:

>> The devicesize option will rarely be of use.
>> It applies to version 1.1 and 1.2 metadata only (where the metadata
>> is at the start of the device) and is only useful when the component
>> device has changed size (typically become larger). The version 1
>> metadata records the amount of the device that can be used to store
>> data, so if a device in a version 1.1 or 1.2 array becomes larger,
>> the metadata will still be visible, but the extra space will not. In
>> this case it might be useful to assemble the array with
>> --update=devicesize. This will cause mdadm to determine the maximum
>> usable amount of space on each device and update the relevant field
>> in the metadata.

Quoting md:

>> The common format -- known as version 0.90 -- has a superblock that is
>> 4K long and is written into a 64K aligned block that starts at least 64K
>> and less than 128K from the end of the device (i.e. to get the address
>> of the superblock round the size of the device down to a multiple of 64K
>> and then subtract 64K). The available size of each device is the amount
>> of space before the super block, so between 64K and 128K is lost when a
>> device is incorporated into an MD array. This superblock stores
>> multi-byte fields in a processor-dependent manner, so arrays cannot
>> easily be moved between computers with different processors.
>>
>> The new format -- known as version 1 -- has a superblock that is
>> normally 1K long, but can be longer. It is normally stored between 8K
>> and 12K from the end of the device, on a 4K boundary, though variations
>> can be stored at the start of the device (version 1.1) or 4K from the
>> start of the device (version 1.2). This metadata format stores multibyte
>> data in a processor-independent format and supports up to hundreds of
>> component devices (version 0.90 only supports 28).

So with version 0.90 of their metadata format, you lose about 64-128
KBytes of drive capacity, since that space is needed for metadata.
For version 1.0, I'm not sure.
For version 1.1 it looks like the metadata can be stored at the
beginning.

So overall, this sounds to me like the equivalent of GEOM "lying" about
the actual capacities of the devices when using classes that require
use of metadata (gmirror, etc.).

> On the other hand, GEOM stuff works inside GPT partitions. And if
> that's not acceptable, MBR partitions will be around for a long
> time.

MBR partitions don't scale past 2TB. Arguing that use of MBR is an
acceptable workaround is equivalent to burying one's head in the sand.
Let's try to accept the future, not feign ignorance.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
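
For anyone who wants to check the version 0.90 numbers quoted from md
above, here is a minimal Python sketch of that arithmetic. It follows
only the quoted rule (round the device size down to a multiple of 64K,
then subtract 64K). The function names and the sample device size are
made up for illustration; this is not code from md, mdadm, or GEOM.

```python
KIB = 1024
ALIGN = 64 * KIB  # version 0.90 superblocks sit in a 64K-aligned block


def md090_superblock_offset(device_size: int) -> int:
    """Byte offset of the 0.90 superblock, per the quoted md text:
    round the device size down to a multiple of 64K, then subtract 64K."""
    if device_size < ALIGN:
        raise ValueError("device too small for this sketch")
    return (device_size // ALIGN) * ALIGN - ALIGN


def md090_usable_size(device_size: int) -> int:
    """Space before the superblock, i.e. the capacity exposed for data.
    This is the smaller size the layer reports instead of the raw size."""
    return md090_superblock_offset(device_size)


def md090_lost_bytes(device_size: int) -> int:
    """Bytes given up at the end of the device for metadata."""
    return device_size - md090_usable_size(device_size)


if __name__ == "__main__":
    size = 1_000_000_000  # hypothetical device size in bytes
    print("superblock offset:", md090_superblock_offset(size))
    print("usable size:      ", md090_usable_size(size))
    print("lost to metadata: ", md090_lost_bytes(size))
```

The point of the sketch is that the usable size is always reported as
smaller than the raw device, so whatever sits on top never sees the
sectors that hold the metadata.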