From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 01:20:51 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A5DA81065673; Sun, 28 Jun 2009 01:20:51 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout023.mac.com (asmtpout023.mac.com [17.148.16.98]) by mx1.freebsd.org (Postfix) with ESMTP id 7F2CF8FC22; Sun, 28 Jun 2009 01:20:51 +0000 (UTC) (envelope-from xcllnt@mac.com) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii; format=flowed; delsp=yes Received: from macbook-pro.lan.xcllnt.net (mail.xcllnt.net [75.101.29.67]) by asmtp023.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0KLX007KJD2QER50@asmtp023.mac.com>; Sat, 27 Jun 2009 18:20:51 -0700 (PDT) From: Marcel Moolenaar In-reply-to: Date: Sat, 27 Jun 2009 18:20:49 -0700 Message-id: <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> To: Ivan Voras X-Mailer: Apple Mail (2.1067.4) Cc: freebsd-current@freebsd.org, freebsd-questions@freebsd.org, freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 01:20:52 -0000 On Jun 27, 2009, at 4:15 AM, Ivan Voras wrote: > Marcel Moolenaar wrote: >> On Jun 25, 2009, at 4:02 AM, Anton Shterenlikht wrote: >>> dev_taste(DEV,mirror/gm0) >>> g_part_taste(PART,mirror/gm0) >>> >>> GEOM: mirror/gm0: the secondary GPT table is corrupt or invalid. >>> GEOM: mirror/gm0: using the primary only -- recovery suggested. 
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> You created the mirror after the GPT, which means you destroyed
>> the GPT backup header. gmirror uses the last sector on the disk
>> for metadata and that by itself is a cause for various problems.
>> It's better to use gmirror per partition.
>
> Or create the GPT partition inside the gmirror device - then the GPT
> backup table will be at last_sector-1, but...
>
>> You could run into a race condition between GPT and gmirror and
>> GPT winning (again the result of gmirror using the last sector
>> on a disk for metadata).
>
> unfortunately this could still happen, and will lead to the same
> error if GPT is tasted first, since it is embedded in the first
> sector and will assume the whole drive is available to GPT, and will
> then proceed to not find its backup data in the last sector.
>
> It looks to me like GEOM classes should have a "priority" field for
> tasting. Any objections to that idea?

Using the last sector is not only flawed because it creates a race
condition, it's flawed in the assumption that you can always make
a geom part of a mirror by storing meta-data on the geom without
causing corruption. This whole idea of using the last sector was
so that a fully partitioned disk with data could be turned into a
mirrored disk. A neat idea, but hardly the basis for a generic
mirroring implementation when it silently corrupts a disk.

I think it's better to change gmirror to use the first sector on the
provider. This never creates a race condition and as such, you don't
need to invent a priority scheme, that has its own set of flaws on
top of it. The only downside is that it's not easy to make a fully
partitioned and populated disk part of a mirror: one would need to
move the data forward one sector to free the first sector. This we
can actually do by inserting a GEOM that does it while I/O is still
ongoing.
The good thing is: we need a class that does exactly this for implementing the "move" verb in gpart. In other words: Solving the problem that putting the metadata in the first sector creates, can and will be re-used in implementing the gpart "move partition" feature. I doubt anyone will complain that the creation of a mirror brings with it a few hours of disk activity that does not inhibit normal operation... -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 08:35:20 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EBD1E1065675 for ; Sun, 28 Jun 2009 08:35:20 +0000 (UTC) (envelope-from andrew@modulus.org) Received: from email.octopus.com.au (email.octopus.com.au [122.100.2.232]) by mx1.freebsd.org (Postfix) with ESMTP id AC03B8FC0A for ; Sun, 28 Jun 2009 08:35:20 +0000 (UTC) (envelope-from andrew@modulus.org) Received: by email.octopus.com.au (Postfix, from userid 1002) id 6D34E172D8; Sun, 28 Jun 2009 18:16:49 +1000 (EST) X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on email.octopus.com.au X-Spam-Level: X-Spam-Status: No, score=-0.3 required=10.0 tests=ALL_TRUSTED,DNS_FROM_DOB, RCVD_IN_DOB autolearn=no version=3.2.3 Received: from [10.20.30.102] (60.218.233.220.static.exetel.com.au [220.233.218.60]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: admin@email.octopus.com.au) by email.octopus.com.au (Postfix) with ESMTP id 0DC3917258; Sun, 28 Jun 2009 18:16:41 +1000 (EST) Message-ID: <4A4725FA.80505@modulus.org> Date: Sun, 28 Jun 2009 18:12:42 +1000 From: Andrew Snow User-Agent: Thunderbird 2.0.0.6 (X11/20070926) MIME-Version: 1.0 To: Dan Naumov References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: 
read/write benchmarking: UFS2 vs ZFS vs EXT3 vs ZFS RAIDZ vs Linux MDRAID X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 08:35:21 -0000

> Contiguous Write Performance:
> http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-contig-write.png

What confuses me about these results is that the '5 disk' performance was
barely higher than the 'single disk' performance. All figures are also
lower than I get from a single modern SATA disk.

My own testing with dd from /dev/zero with FreeBSD ZFS on an Intel ICH10
chipset motherboard with a Core 2 Duo at 2.66 GHz showed RAIDZ performance
scaling linearly with the number of disks:

What               Write   Read
--------------------------------
7 disk RAIDZ2      220     305
6 disk RAIDZ2      173     260
5 disk RAIDZ2      120     213

Only the on-board controllers were used, with Seagate disks of around
250GB capacity. System had 8GB RAM.

These results are so different in absolute terms to your results that I
don't know how to interpret your set.
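
As a rough sanity check on the scaling claim, the figures in the table above can be normalised by the number of data disks. The sketch below is illustrative only and was not part of the original benchmark. It assumes that a RAIDZ2 vdev gives up two disks' worth of capacity to parity, so an N-disk vdev has N - 2 data disks, and it leaves the units unspecified because the post does not state them. The names RESULTS, PARITY_DISKS and per_data_disk are made up for this example.

```python
# Figures copied from the table above: (total disks, write, read).
# Units are whatever dd reported; the original post does not state them.
RESULTS = [(7, 220, 305), (6, 173, 260), (5, 120, 213)]

# Assumption: RAIDZ2 dedicates two disks' worth of space to parity.
PARITY_DISKS = 2


def per_data_disk(results, parity=PARITY_DISKS):
    """Return {total_disks: (write_per_data_disk, read_per_data_disk)}."""
    out = {}
    for disks, write, read in results:
        data_disks = disks - parity
        out[disks] = (write / data_disks, read / data_disks)
    return out


if __name__ == "__main__":
    for disks, (w, r) in per_data_disk(RESULTS).items():
        print(f"{disks} disks: write {w:.2f}, read {r:.2f} per data disk")
```

With these figures the write rate per data disk is 44.0, 43.25 and 40.0 for the 7-, 6- and 5-disk configurations, which is roughly constant and so consistent with near-linear write scaling. The read rate per data disk (61.0, 65.0 and 71.0) drops somewhat as disks are added.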
- Andrew From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 08:50:00 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B12E51065670; Sun, 28 Jun 2009 08:50:00 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (chello087206192061.chello.pl [87.206.192.61]) by mx1.freebsd.org (Postfix) with ESMTP id EE5238FC14; Sun, 28 Jun 2009 08:49:59 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id DAD9245B36; Sun, 28 Jun 2009 10:49:55 +0200 (CEST) Received: from localhost (chello087206192061.chello.pl [87.206.192.61]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id DA48545683; Sun, 28 Jun 2009 10:49:49 +0200 (CEST) Date: Sun, 28 Jun 2009 10:49:57 +0200 From: Pawel Jakub Dawidek To: Marcel Moolenaar Message-ID: <20090628084957.GB4159@garage.freebsd.pl> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="d6Gm4EdcadzBjdND" Content-Disposition: inline In-Reply-To: <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 8.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL autolearn=no version=3.0.4 Cc: freebsd-current@freebsd.org, freebsd-questions@freebsd.org, freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list 
List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 08:50:01 -0000

--d6Gm4EdcadzBjdND
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Sat, Jun 27, 2009 at 06:20:49PM -0700, Marcel Moolenaar wrote:
> Using the last sector is not only flawed because it creates a race
> condition, it's flawed in the assumption that you can always make
> a geom part of a mirror by storing meta-data on the geom without
> causing corruption. This whole idea of using the last sector was
> so that a fully partitioned disk with data could be turned into a
> mirrored disk. A neat idea, but hardly the basis for a generic
> mirroring implementation when it silently corrupts a disk.

This wasn't the idea :) People started putting gmirror on top of
partitioned disks because it was easier/simpler/faster than creating a
mirror, partitioning and copying the data. I for one never put a mirror
on an already partitioned disk.

That said, it is sometimes safe to use the last sector. Gjournal already
looks for UFS and, if UFS is in place, it figures out whether the last
sector is in use - it isn't if the partition size is not a multiple of
the UFS block size.

> I think it's better to change gmirror to use the first sector on the
> provider. This never creates a race condition and as such, you don't
> need to invent a priority scheme, that has its own set of flaws on
> top of it. The only downside is that it's not easy to make a fully
> partitioned and populated disk part of a mirror: one would need to
> move the data forward one sector to free the first sector. This we
> can actually do by inserting a GEOM that does it while I/O is still
> ongoing. The good thing is: we need a class that does exactly this
> for implementing the "move" verb in gpart.

There were two reasons to use the last sector instead of the first:

1. You want to be able to boot from gmirror. If all your data is
   moved forward, your boot sectors and kernel will be harder to find.

2. For recovery reasons you may want to turn off gmirror and still be
   able to access your data.

Note that gmirror can handle the case where disk, slice and partition
share the same last sector - it simply stores the provider size in its
metadata, so once it gets the disk for tasting it detects that it's too
big and ignores it; then the slice will be given for tasting, but it also
has a larger size than expected and will be ignored as well. Finally the
partition will be tasted and gmirror configured.

-- 
Pawel Jakub Dawidek http://www.wheel.pl
pjd@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

--d6Gm4EdcadzBjdND
Content-Type: application/pgp-signature Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFKRy61ForvXbEpPzQRAmOAAJ44Mp928wYkoBPD3p64vr3tA0aW9gCcDqWO
Dr4QaHHEB5I33pAqDmt6CWQ=
=6fRJ
-----END PGP SIGNATURE-----

--d6Gm4EdcadzBjdND--

From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 10:30:27 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 85E76106568D; Sun, 28 Jun 2009 10:30:27 +0000 (UTC) (envelope-from dan.naumov@gmail.com) Received: from mail-yx0-f181.google.com (mail-yx0-f181.google.com [209.85.210.181]) by mx1.freebsd.org (Postfix) with ESMTP id 2EC788FC24; Sun, 28 Jun 2009 10:30:27 +0000 (UTC) (envelope-from dan.naumov@gmail.com) Received: by yxe11 with SMTP id 11so2764255yxe.3 for ; Sun, 28 Jun 2009 03:30:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=j6IbRWSGP4VxhJUPN5R97FXETiJgSZKKWcuKyQrn6Wc=; b=LvT/zQIOA6DH64RwpVctPUYYS9fuwSm8AcR6hrxQZtid9Gk6gn6mFDLTiW7LkKnyyx
VWOuRuI6fCwNcMGxVWqSyNaW/yn+fIeULxPSCvIEoHXvZ1jO9rXsNyY3epiNm1ajBLTk yAbgTGHmdKlsH9VuQy1xMjxorYwWPMUDBXaCg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=aQM+5VBu+4VYZ8BQAAW+q9iHwD/0RDqwJZdWwCJiSMuThkbNDbCyMtIbndXBdIfcoH PbJ0GfOSfumdRn6vKJjMPbXCsNfMoRg8Eih4ut30Q4gW4//K5dFRX0AkulE72GwCloV+ OCsvyr14n59mlGzBkbR4Vfv616jbfZWdbJQx0= MIME-Version: 1.0 Received: by 10.100.46.18 with SMTP id t18mr7516635ant.54.1246185026686; Sun, 28 Jun 2009 03:30:26 -0700 (PDT) In-Reply-To: <4A4725FA.80505@modulus.org> References: <4A4725FA.80505@modulus.org> Date: Sun, 28 Jun 2009 13:30:26 +0300 Message-ID: From: Dan Naumov To: Andrew Snow Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: read/write benchmarking: UFS2 vs ZFS vs EXT3 vs ZFS RAIDZ vs Linux MDRAID X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 10:30:28 -0000

> What confuses me about these results is that the '5 disk' performance was
> barely higher than the 'single disk' performance. All figures are also
> lower than I get from a single modern SATA disk.
>
> My own testing with dd from /dev/zero with FreeBSD ZFS on an Intel ICH10
> chipset motherboard with a Core 2 Duo at 2.66 GHz showed RAIDZ performance
> scaling linearly with the number of disks:
>
> What               Write   Read
> --------------------------------
> 7 disk RAIDZ2      220     305
> 6 disk RAIDZ2      173     260
> 5 disk RAIDZ2      120     213

What's confusing is that your results are actually out of place with
how ZFS numbers are supposed to look, not mine :) When using ZFS
RAIDZ, due to the way parity checking works in ZFS, your pool is
SUPPOSED to have throughput of the average single disk from that pool
and not some numbers growing skyhigh in a linear fashion.

The numbers that did surprise me the most were actually gmirror reads
(results posted earlier to this list): a gmirror is consistently SLOWER
for reading than a single disk (and it only gets progressively worse the
more disks you have in your gmirror). Read performance of all other
mirroring implementations pretty much scales up linearly with the number
of disks present in the mirror.
- Sincerely, Dan Naumov From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 10:39:57 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BCF19106564A for ; Sun, 28 Jun 2009 10:39:57 +0000 (UTC) (envelope-from andrew@modulus.org) Received: from email.octopus.com.au (email.octopus.com.au [122.100.2.232]) by mx1.freebsd.org (Postfix) with ESMTP id 7B9388FC19 for ; Sun, 28 Jun 2009 10:39:57 +0000 (UTC) (envelope-from andrew@modulus.org) Received: by email.octopus.com.au (Postfix, from userid 1002) id B0A8817348; Sun, 28 Jun 2009 20:40:21 +1000 (EST) X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on email.octopus.com.au X-Spam-Level: X-Spam-Status: No, score=-0.3 required=10.0 tests=ALL_TRUSTED,DNS_FROM_DOB, RCVD_IN_DOB autolearn=no version=3.2.3 Received: from [10.20.30.102] (60.218.233.220.static.exetel.com.au [220.233.218.60]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: admin@email.octopus.com.au) by email.octopus.com.au (Postfix) with ESMTP id 66C601721F; Sun, 28 Jun 2009 20:40:13 +1000 (EST) Message-ID: <4A4747A0.6040902@modulus.org> Date: Sun, 28 Jun 2009 20:36:16 +1000 From: Andrew Snow User-Agent: Thunderbird 2.0.0.6 (X11/20070926) MIME-Version: 1.0 To: Dan Naumov References: <4A4725FA.80505@modulus.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: read/write benchmarking: UFS2 vs ZFS vs EXT3 vs ZFS RAIDZ vs Linux MDRAID X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 10:39:58 -0000 > What's confusing is that your results are actually out of place 
with > how ZFS numbers are supposed to look, not mine :) When using ZFS > RAIDZ, due to the way parity checking works in ZFS, your pool is > SUPPOSED to have throughput of the average single disk from that pool > and not some numbers growing skyhigh in a linear fashion. Could you please elaborate on this and explain it? - Andrew From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 11:02:04 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 54B5F106564A; Sun, 28 Jun 2009 11:02:04 +0000 (UTC) (envelope-from dan.naumov@gmail.com) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.245]) by mx1.freebsd.org (Postfix) with ESMTP id E79F38FC08; Sun, 28 Jun 2009 11:02:03 +0000 (UTC) (envelope-from dan.naumov@gmail.com) Received: by an-out-0708.google.com with SMTP id d14so933252and.13 for ; Sun, 28 Jun 2009 04:02:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=+YkaB5f87hEajaJGUm2wtGXumnLWaCI1qwMrNc3VXcA=; b=ejs+AZ276WHeCBZ1UOnLYMAvEoAaPK7x4MgVmARF3je7HteCDeQsiDfijg8Z2giShY GptBwHQAfQilzjS7m9B6m2h53dEZXa7wWWozcdVZdacJb5NO95iPWCk2LglpUIYMYAs9 DrCWs0PFOFuztf5PEAJvoucGobv6UPQXiqZuQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=wwxu82stT2ouvAjW5jf6ElNZiSSc+r2xKBCqTFOhwPbfBjfZRlEMiLPHA0vcjL3JxT F9paT4kddeJjH7yK4u9/kvc1qC6xILB4yL1lNUQEHYGcy9+vDqJKK1rAWcwtr5aAGmdN j3ajhaN6e1gaPP10zEe9PBDhNaUPdMObsf6qM= MIME-Version: 1.0 Received: by 10.100.11.14 with SMTP id 14mr7540531ank.81.1246186923267; Sun, 28 Jun 2009 04:02:03 -0700 (PDT) In-Reply-To: <4A4747A0.6040902@modulus.org> References: <4A4725FA.80505@modulus.org> 
<4A4747A0.6040902@modulus.org> Date: Sun, 28 Jun 2009 14:02:03 +0300 Message-ID: From: Dan Naumov To: Andrew Snow Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: read/write benchmarking: UFS2 vs ZFS vs EXT3 vs ZFS RAIDZ vs Linux MDRAID X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 11:02:04 -0000 "Now we come to the crucial decision ZFS has made for raidz and raidz2: in raidz and raidz2, the data block is striped across all of the disks. Instead of a model where a parity stripe is a bunch of data blocks, each with an independent checksum, ZFS stripes a single data block (and its parity), with a single checksum, across all the disks (or as many of them as necessary). This is a rational implementation decision, but when combined with the need to verify checksums, it has an important consequence: in ZFS, reads always involve all disks, because ZFS always must verify the data block's checksum, which requires reading all of the data block, which is spread across all of the drives. This is unlike normal RAID-5 or RAID-6, in which a small enough read will only touch one drive, and means that adding more disks to a ZFS raidz pool does not increase how many random reads you can do per second. (A normal RAID-5 or RAID-6 array has a (theoretical) random read IO capacity equal to the sum of the random IO operations rate of each of the disks in the array, and so adding another disk adds its IOPs per second to your read capacity. A ZFS raidz or raidz2 pool instead has a capacity equal to the slowest disk's IOPs per second, and adding another disk does nothing to help. 
Effectively a raidz ZFS gives you a single disk's read IOPs per second
rate.)"

This was on the blog of a Sun engineer (although the post is a few years
old). Unfortunately I don't have the link; I actually had to go through
my posting history on the Ars Technica forum to even find this quote in
the first place. If the situation has changed and the above quote no
longer holds true, it would be nice if someone more knowledgeable about
the performance implications could elaborate on what kind of performance
is to be expected on a raidz system :)

- Sincerely,
Dan Naumov

On Sun, Jun 28, 2009 at 1:36 PM, Andrew Snow wrote:
>> What's confusing is that your results are actually out of place with
>> how ZFS numbers are supposed to look, not mine :) When using ZFS
>> RAIDZ, due to the way parity checking works in ZFS, your pool is
>> SUPPOSED to have throughput of the average single disk from that pool
>> and not some numbers growing skyhigh in a linear fashion.
>
> Could you please elaborate on this and explain it?
>
> - Andrew
>

From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 11:37:17 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 954EB1065674 for ; Sun, 28 Jun 2009 11:37:17 +0000 (UTC) (envelope-from andrew@modulus.org) Received: from email.octopus.com.au (email.octopus.com.au [122.100.2.232]) by mx1.freebsd.org (Postfix) with ESMTP id 52BBD8FC0C for ; Sun, 28 Jun 2009 11:37:17 +0000 (UTC) (envelope-from andrew@modulus.org) Received: by email.octopus.com.au (Postfix, from userid 1002) id 9296F172FE; Sun, 28 Jun 2009 21:37:41 +1000 (EST) X-Spam-Checker-Version: SpamAssassin 3.2.3 (2007-08-08) on email.octopus.com.au X-Spam-Level: X-Spam-Status: No, score=-0.3 required=10.0 tests=ALL_TRUSTED,DNS_FROM_DOB, RCVD_IN_DOB autolearn=no version=3.2.3 Received: from [10.20.30.102] (60.218.233.220.static.exetel.com.au [220.233.218.60]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: admin@email.octopus.com.au) by email.octopus.com.au (Postfix) with ESMTP id AD78817255; Sun, 28 Jun 2009 21:37:33 +1000 (EST) Message-ID: <4A475511.5000700@modulus.org> Date: Sun, 28 Jun 2009 21:33:37 +1000 From: Andrew Snow User-Agent: Thunderbird 2.0.0.6 (X11/20070926) MIME-Version: 1.0 To: Dan Naumov References: <4A4725FA.80505@modulus.org> <4A4747A0.6040902@modulus.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: read/write benchmarking: UFS2 vs ZFS vs EXT3 vs ZFS RAIDZ vs Linux MDRAID X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 11:37:17 -0000

OK, I thought we were talking about a single-threaded sequential write,
which is what my benchmark was.

It sounds like the graphs you published were of multi-threaded writers -
how many processes were running in parallel in the case of the
"Contiguous Write Performance" here?

http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-contig-write.png

- Andrew

From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 13:18:37 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 75ADA1065674; Sun, 28 Jun 2009 13:18:37 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: from mail-ew0-f213.google.com (mail-ew0-f213.google.com [209.85.219.213]) by mx1.freebsd.org (Postfix) with ESMTP id A482B8FC19; Sun, 28 Jun 2009 13:18:36 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: by ewy9 with SMTP id 9so2920753ewy.43 for ; Sun, 28 Jun 2009 06:18:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:from:date:x-google-sender-auth:message-id:subject:to:cc :content-type:content-transfer-encoding; bh=XnO120P/HEuzlrl0SjLrmgGQAT1/OAERpkN0C9fVH4k=; b=ClsfAbS8JfgxAZ2JRYUSnVrmLYLmr9fKoXSLItX09wA/Fo+D7Rg/cBvmi2WwQLgy8k 1TdxxEnO6BjY7HgrHQOZVAj/Y7QNM3I1tjDiQ6qOWm3js1IwufyVF+1co74mpEF7RZTt wY2H4lzK7mUQe/JQkzeHKWcQKpZdCWPRCIbSY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:cc:content-type :content-transfer-encoding; b=irAcrXBOHnWjJq2i0np42qbeZyIsFat0bje6pcbj9GZIYbHyejQxUptF3gnzH5UKhb 7hRyLyhQWta8f/27IcJmQe84eYieMh23bYzhYjvfsRQIecmaW7vDHhQfuW3IpX4NIKK7 rJDOrWAW/NVUC8d92drewyMX3iyRqsG+HThvY= MIME-Version: 1.0 Sender: ivoras@gmail.com Received: by 10.210.91.7 with SMTP id o7mr2068474ebb.69.1246193482137; Sun, 28 Jun 2009 05:51:22 -0700 (PDT) In-Reply-To:
<2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> From: Ivan Voras Date: Sun, 28 Jun 2009 14:51:02 +0200 X-Google-Sender-Auth: 0a1d4c3dc2ba9046 Message-ID: <9bbcef730906280551r26e30b61oc84acdd02d94743e@mail.gmail.com> To: Marcel Moolenaar Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-current@freebsd.org, freebsd-questions@freebsd.org, freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 13:18:38 -0000

2009/6/28 Marcel Moolenaar :
> Using the last sector is not only flawed because it creates a race
> condition, it's flawed in the assumption that you can always make
> a geom part of a mirror by storing meta-data on the geom without
> causing corruption. This whole idea of using the last sector was
> so that a fully partitioned disk with data could be turned into a
> mirrored disk. A neat idea, but hardly the basis for a generic
> mirroring implementation when it silently corrupts a disk.
>
> I think it's better to change gmirror to use the first sector on the
> provider.

Yes, it would be cleaner to implement but it would also make the
mirrored devices unbootable.
But maybe the class of users needing the functionality is smaller now.

> This never creates a race condition and as such, you don't
> need to invent a priority scheme, that has its own set of flaws on
> top of it. The only downside is that it's not easy to make a fully
> partitioned and populated disk part of a mirror: one would need to
> move the data forward one sector to free the first sector.
This we > can actually do by inserting a GEOM that does it while I/O is still > ongoing. The good thing is: we need a class that does exactly this > for implementing the "move" verb in gpart. Looks too complicated and fragile. Maybe there's a need for metadata-less automatic mirrors in some way, by storing the configuration somewhere else, possibly in /etc. From owner-freebsd-geom@FreeBSD.ORG Sun Jun 28 13:43:51 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 61D28106568B; Sun, 28 Jun 2009 13:43:51 +0000 (UTC) (envelope-from spambox@haruhiism.net) Received: from fujibayashi.jp (karas.fujibayashi.jp [77.221.159.4]) by mx1.freebsd.org (Postfix) with ESMTP id 1359F8FC1C; Sun, 28 Jun 2009 13:43:50 +0000 (UTC) (envelope-from spambox@haruhiism.net) Received: from [192.168.0.2] (ppp91-122-47-189.pppoe.avangarddsl.ru [91.122.47.189]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by fujibayashi.jp (Postfix) with ESMTPSA id 6464178F98; Sun, 28 Jun 2009 17:25:30 +0400 (MSD) Message-ID: <4A476F56.2030504@haruhiism.net> Date: Sun, 28 Jun 2009 17:25:42 +0400 From: Aisaka Taiga User-Agent: Thunderbird 2.0.0.22 (Windows/20090605) MIME-Version: 1.0 To: Ivan Voras References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <9bbcef730906280551r26e30b61oc84acdd02d94743e@mail.gmail.com> In-Reply-To: <9bbcef730906280551r26e30b61oc84acdd02d94743e@mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-geom@freebsd.org, Marcel Moolenaar , freebsd-questions@freebsd.org, freebsd-current@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific 
discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 28 Jun 2009 13:43:52 -0000 Ivan Voras wrote: > Yes, it would be cleaner to implement but it would also make the > mirrored devices unbootable. > But maybe the class of users needing the functionality is smaller now. > Most dedicated server providers can't afford to use hardware RAID systems because that would drastically increase the price of a single system; yet many customers want mirroring. > Looks too complicated and fragile. Maybe there's a need for > metadata-less automatic mirrors in some way, by storing the > configuration somewhere else, possibly in /etc. This might be dangerous in some cases. Imagine booting with two drives swapped; such a configuration might lead to data corruption on a volume which was enumerated incorrectly or swapped. -- Kamigishi Rei KREI-RIPE From owner-freebsd-geom@FreeBSD.ORG Mon Jun 29 11:06:59 2009 Return-Path: Delivered-To: freebsd-geom@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 10EBA1065673 for ; Mon, 29 Jun 2009 11:06:59 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id E88488FC19 for ; Mon, 29 Jun 2009 11:06:58 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id n5TB6wNK046333 for ; Mon, 29 Jun 2009 11:06:58 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.3/8.14.3/Submit) id n5TB6wrZ046329 for freebsd-geom@FreeBSD.org; Mon, 29 Jun 2009 11:06:58 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 29 Jun 2009 11:06:58 GMT Message-Id: <200906291106.n5TB6wrZ046329@freefall.freebsd.org> 
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-geom@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-geom@FreeBSD.org X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Jun 2009 11:06:59 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/135898  geom       [geom] Severe filesystem corruption - large files or l
o kern/135874  geom       [geom] [patch] geom_linux_lvm misses newer fedora defa
o kern/134922  geom       [gmirror] [panic] kernel panic when use fdisk on disk
o kern/134113  geom       [geli] Problem setting secondary GELI key
o kern/134044  geom       [geom] gmirror(8) overwrites fs with stale data from r
o kern/133931  geom       [geli] [request] intentionally wrong password to destr
o bin/132845   geom       [geom] [patch] ggated(8) does not close files opened a
o kern/132273  geom       glabel(8): [patch] failing on journaled partition
o kern/132242  geom       [gmirror] gmirror.ko fails to fully initialize
o kern/131353  geom       [geom] gjournal(8) kernel lock
o kern/131037  geom       [geli] Unable to create disklabel on .eli-Device
p docs/130548  geom       [patch] gjournal(8) man page is missing sysctls
o kern/130528  geom       gjournal fsck during boot
o kern/129674  geom       [geom] gjournal root did not mount on boot
o kern/129645  geom       gjournal(8): GEOM_JOURNAL causes system to fail to boo
o kern/129245  geom       [geom] gcache is more suitable for suffix based provid
o bin/128398   geom       [patch] glabel(8): teach geom_label to recognise gpt l
f kern/128276  geom       [gmirror] machine lock up when gmirror module is used
o kern/126902  geom       [geom] geom_label: kernel panic during install boot
o kern/124973  geom       [gjournal] [patch] boot order affects geom_journal con
o kern/124969  geom       gvinum(8): gvinum raid5 plex does not detect missing s
o kern/124294  geom       [geom] gmirror(8) have inappropriate logic when workin
o kern/124130  geom       [gmirror] [usb] gmirror fails to start usb devices tha
o kern/123962  geom       [panic] [gjournal] gjournal (455Gb data, 8Gb journal),
o kern/123630  geom       [patch] [gmirror] gmirror doesnt allow the original dr
o kern/123122  geom       [geom] GEOM / gjournal kernel lock
o kern/122738  geom       [geom] gmirror list "losts consumers" after gmirror de
f kern/122415  geom       [geom] UFS labels are being constantly created and rem
o kern/122067  geom       [geom] [panic] Geom crashed during boot
o kern/121559  geom       [patch] [geom] geom label class allows to create inacc
o kern/121481  geom       [gmirror] data rot on disk with gmirror
o kern/121364  geom       [gmirror] Removing all providers create a "zombie" mir
o kern/120231  geom       [geom] GEOM_CONCAT error adding second drive
o kern/120091  geom       [geom] [geli] [gjournal] geli does not prompt for pass
o kern/120044  geom       [msdosfs] [geom] incorrect MSDOSFS label fries adminis
o kern/120021  geom       [geom] [panic] net-p2p/qbittorrent crashes system when
o kern/119743  geom       [geom] geom label for cds is keeped after dismount and
p kern/116896  geom       [geom] [patch] Typo in a kassert in GEOM
o kern/115856  geom       [geli] ZFS thought it was degraded when it should have
o kern/115547  geom       [geom] [patch] [request] let GEOM Eli get password fro
o kern/114532  geom       [geom] GEOM_MIRROR shows up in kldstat even if compile
o kern/113957  geom       [gmirror] gmirror is intermittently reporting a degrad
o kern/113885  geom       [gmirror] [patch] improved gmirror balance algorithm
o kern/113837  geom       [geom] unable to access 1024 sector size storage
o kern/113419  geom       [geom] geom fox multipathing not failing back
p bin/110705   geom       gmirror(8) control utility does not exit with correct
o kern/107707  geom       [geom] [patch] [request] add new class geom_xbox360 to
o kern/104389  geom       [geom] [patch] sys/geom/geom_dump.c doesn't encode XML
o kern/98034   geom       [geom] dereference of NULL pointer in acd_geom_detach
o kern/94632   geom       [geom] Kernel output resets input while GELI asks for
o kern/90582   geom       [geom] [panic] Restore cause panic string (ffs_blkfree
o bin/90093    geom       fdisk(8) incapable of altering in-core geometry
a kern/89660   geom       [vinum] [patch] [panic] due to g_malloc returning null
o kern/89546   geom       [geom] GEOM error
s kern/89102   geom       [geom] [panic] panic when forced unmount FS from unplu
o kern/88601   geom       [geli] geli cause kernel panic under heavy disk usage
o kern/87544   geom       [gbde] mmaping large files on a gbde filesystem deadlo
o kern/84556   geom       [geom] [panic] GBDE-encrypted swap causes panic at shu
o bin/81779    geom       misleading error messages in geom(8) utilities.
o kern/79251   geom       [2TB] newfs fails on 2.6TB gbde device
o kern/79035   geom       [vinum] gvinum unable to create a striped set of mirro
o bin/78131    geom       gbde(8) "destroy" not working.
s kern/73177   geom       kldload geom_* causes panic due to memory exhaustion

63 problems total.
From owner-freebsd-geom@FreeBSD.ORG Mon Jun 29 21:26:46 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2F3B11065672 for ; Mon, 29 Jun 2009 21:26:46 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: from hamlet.setfilepointer.com (hamlet.SetFilePointer.com [63.224.10.2]) by mx1.freebsd.org (Postfix) with SMTP id AA8B98FC14 for ; Mon, 29 Jun 2009 21:26:45 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: (qmail 13230 invoked from network); 29 Jun 2009 16:00:03 -0500 Received: from keira.kiwi-computer.com (HELO kiwi-computer.com) (63.224.10.3) by hamlet.setfilepointer.com with SMTP; 29 Jun 2009 16:00:03 -0500 Received: (qmail 24217 invoked by uid 2001); 29 Jun 2009 21:00:03 -0000 Date: Mon, 29 Jun 2009 16:00:03 -0500 From: "Rick C. Petty" To: Marcel Moolenaar Message-ID: <20090629210003.GA24038@keira.kiwi-computer.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> User-Agent: Mutt/1.4.2.3i Cc: Ivan Voras , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: rick-freebsd2008@kiwi-computer.com List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Jun 2009 21:26:46 -0000 [[ Removing the double cross-post, since this is GEOM-specific ]] On Sat, Jun 27, 2009 at 06:20:49PM -0700, Marcel Moolenaar wrote: > > Using the last sector is not only flawed because it creates a race > condition, It shouldn't create a race condition. 
If you add a gpt to a mirror, the gpt backup will be the last sector in the mirror, which is the last sector of the disk minus 1. If you want to create the gpt first in anticipation of putting a mirror around it, you would need to do gpt add -s ... where "numsectors" is the output of: diskinfo | awk '{print $4}' In general GEOM requires you to build your topology inward: you're not supposed to create a mirror out of a previously-generated gpt because that's like inserting a new class in between the provider and consumer which doesn't make sense. > it's flawed in the assumption that you can always make > a geom part of a mirror by storing meta-data on the geom without > causing corruption. This whole idea of using the last sector was > so that a fully partitioned disk with data could be turned into a > mirrored disk. A neat idea, but hardly the basis for a generic > mirroring implementation when it silently corrupts a disk. I don't believe this was the assumption, nor the reason for this "feature". > I think it's better to change gmirror to use the first sector on the > provider. Ack! That would be quite disruptive as you would have to increment the sector offset number in the partition table (gpt or mbr) to make the partition bootable. I thought most geom schemes added their metadata at the end, since it fits the "container" philosophy. One exception is gvinum and it's quite painful to make a gvinum root partition bootable. It's even more painful if you ever wish to "undo" it. I've had terrible luck trying to remove gvinum from the picture, whereas gmirror and others are quite easy. > In other words: Solving the problem that putting the metadata in the > first sector creates, can and will be re-used in implementing the > gpart "move partition" feature. I doubt anyone will complain that > the creation of a mirror brings with it a few hours of disk activity > that does not inhibit normal operation... 
No, but people would roar rather loudly if the partition isn't bootable. To make it bootable, knowledge of each GEOM provider would have to be embedded into boot2, which is already quite full. Sure it's okay for some providers (like raid5 or striping) which you can't boot to already, but such things as simple as mirroring should work to some extent with or without GEOM. -- Rick C. Petty From owner-freebsd-geom@FreeBSD.ORG Tue Jun 30 21:38:10 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2F6CE1065673; Tue, 30 Jun 2009 21:38:10 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout029.mac.com (asmtpout029.mac.com [17.148.16.104]) by mx1.freebsd.org (Postfix) with ESMTP id 185338FC1C; Tue, 30 Jun 2009 21:38:09 +0000 (UTC) (envelope-from xcllnt@mac.com) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii; format=flowed Received: from macbook-pro.lan.xcllnt.net ([75.101.29.67]) by asmtp029.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0KM2004Y0MR7RQ90@asmtp029.mac.com>; Tue, 30 Jun 2009 14:38:06 -0700 (PDT) From: Marcel Moolenaar In-reply-to: <20090629210003.GA24038@keira.kiwi-computer.com> Date: Tue, 30 Jun 2009 14:37:55 -0700 Message-id: <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> To: rick-freebsd2008@kiwi-computer.com X-Mailer: Apple Mail (2.1068) Cc: Ivan Voras , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Tue, 30 Jun 2009 21:38:10 -0000 On Jun 29, 2009, at 2:00 PM, Rick C. Petty wrote: > [[ Removing the double cross-post, since this is GEOM-specific ]] > > On Sat, Jun 27, 2009 at 06:20:49PM -0700, Marcel Moolenaar wrote: >> >> Using the last sector is not only flawed because it creates a race >> condition, > > It shouldn't create a race condition. It does. Answer the following: foo0 is a provider with 3 sectors. bar is a geom class that puts meta-data in the first sector. baz is a geom class that puts meta-data in the last sector. Both bar and baz get to taste foo0. Which one should go first? -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-geom@FreeBSD.ORG Tue Jun 30 21:53:47 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E6E6106564A for ; Tue, 30 Jun 2009 21:53:47 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: from hamlet.setfilepointer.com (hamlet.SetFilePointer.com [63.224.10.2]) by mx1.freebsd.org (Postfix) with SMTP id A91198FC22 for ; Tue, 30 Jun 2009 21:53:46 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: (qmail 40424 invoked from network); 30 Jun 2009 16:53:46 -0500 Received: from keira.kiwi-computer.com (HELO kiwi-computer.com) (63.224.10.3) by hamlet.setfilepointer.com with SMTP; 30 Jun 2009 16:53:46 -0500 Received: (qmail 34420 invoked by uid 2001); 30 Jun 2009 21:53:45 -0000 Date: Tue, 30 Jun 2009 16:53:45 -0500 From: "Rick C. 
Petty" To: Marcel Moolenaar Message-ID: <20090630215345.GC33849@keira.kiwi-computer.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> User-Agent: Mutt/1.4.2.3i Cc: Ivan Voras , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: rick-freebsd2008@kiwi-computer.com List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 30 Jun 2009 21:53:47 -0000 On Tue, Jun 30, 2009 at 02:37:55PM -0700, Marcel Moolenaar wrote: > > On Jun 29, 2009, at 2:00 PM, Rick C. Petty wrote: > > >[[ Removing the double cross-post, since this is GEOM-specific ]] > > > >On Sat, Jun 27, 2009 at 06:20:49PM -0700, Marcel Moolenaar wrote: > >> > >>Using the last sector is not only flawed because it creates a race > >>condition, > > > >It shouldn't create a race condition. > > It does. I didn't say it didn't, I said it shouldn't. > Answer the following: > > foo0 is a provider with 3 sectors. > bar is a geom class that puts meta-data in the first sector. > baz is a geom class that puts meta-data in the last sector. > > Both bar and baz get to taste foo0. Which one should go first? Both bar and baz should validate their metadata and it should be pretty apparent that one of them has a smaller size. If the one that is smaller fits perfectly into the one that is bigger, the taste should pass to the latter first. 
Yes, it would be more complicated to implement, but there *should not* be a race condition because one container fits inside the other. It *should* only be a race condition if one of the classes does not decrease its size to store its metadata. I know of no geom providers which do this. In other words, I meant that it's deterministic. If GEOM can't handle that situation (which *should* be deterministic), then I believe it's broken. -- Rick C. Petty From owner-freebsd-geom@FreeBSD.ORG Tue Jun 30 22:08:47 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8DB3A1065678 for ; Tue, 30 Jun 2009 22:08:47 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: from mail-ew0-f213.google.com (mail-ew0-f213.google.com [209.85.219.213]) by mx1.freebsd.org (Postfix) with ESMTP id 1739D8FC27 for ; Tue, 30 Jun 2009 22:08:46 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: by ewy9 with SMTP id 9so571271ewy.43 for ; Tue, 30 Jun 2009 15:08:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:from:date:x-google-sender-auth:message-id:subject:to:cc :content-type:content-transfer-encoding; bh=2TvTXJKDDzVFZp82L+b9y/58mvwOFrVpzJJvnvtM9+c=; b=TkpVB32l26nAv4f7o6oZtfl9vXwosUohLSsBrGyOwQxvgeI1P14B3s8C3wyyKvP8FE /G4JXEy/AahzXGGYj9zOQnGZxH91APDcFi6sAPthYXnuw1PDr9Njb9iCZwgHRKzv8NOT ZZenu1pf9LP9EXV4EMWpvfToImALpmtqyFQi0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:cc:content-type :content-transfer-encoding; b=R3gtiR7Qkdop66F/MJcQVznISFq9H4qLlgJrRjAAGCg75FYQuBmjzRKyGFdq6PRxoD cHbdoTQH7qe7/UXL02YKqgMFxvABTDGSb/8yCMnzAJieFfXZslztM4ubSvRKfS3g7U3M kS0Chw4nUBP8OobcXBKQCjDW2cIEocIThrr9o= MIME-Version: 1.0 Sender: ivoras@gmail.com Received: by 
10.210.78.16 with SMTP id a16mr374358ebb.73.1246399726062; Tue, 30 Jun 2009 15:08:46 -0700 (PDT) In-Reply-To: <20090630215345.GC33849@keira.kiwi-computer.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090630215345.GC33849@keira.kiwi-computer.com> From: Ivan Voras Date: Wed, 1 Jul 2009 00:08:25 +0200 X-Google-Sender-Auth: fc83062afb204565 Message-ID: <9bbcef730906301508l6f2ae344tff8f7495e870049e@mail.gmail.com> To: rick-freebsd2008@kiwi-computer.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: Marcel Moolenaar , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 30 Jun 2009 22:08:49 -0000

2009/6/30 Rick C. Petty :
> On Tue, Jun 30, 2009 at 02:37:55PM -0700, Marcel Moolenaar wrote:

> Both bar and baz should validate their metadata and it should be pretty
> apparent that one of them has a smaller size.  If the one that is smaller
> fits perfectly into the one that is bigger, the taste should pass to the
> latter first.

This is how it's currently done with "native" GEOM classes like
gmirror - if gmirror is put where it and something else can taste the
metadata, gmirror will decide by checking the size - usually +/- 1
sector. But we can't embed this logic into "foreign" classes like GPT.

GPT checks the first sector (and the last sector for its backup), while
gmirror checks the last sector, and GPT metadata (AFAIK) doesn't
contain media size.
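
The size check described above can be sketched roughly as follows. This is a minimal Python model, not the actual gmirror code: the names Provider, claims and tasters_accepting, and the layout (da0, da0s1 and da0c on a 1000-sector disk), are assumptions made up for illustration. It assumes metadata that lives in a provider's last sector and records the size of the provider it was written for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """A run of sectors on one underlying disk (hypothetical model)."""
    name: str
    first: int  # first sector, counted from the start of the disk
    size: int   # length in sectors

    @property
    def last(self) -> int:
        return self.first + self.size - 1


def claims(recorded_size: int, provider: Provider) -> bool:
    """Size-based taste: accept only if the size recorded in the
    metadata equals the size of the provider being tasted."""
    return recorded_size == provider.size


def tasters_accepting(metadata_sector, recorded_size, providers):
    """Names of providers that see the metadata in their own last
    sector and also pass the size check."""
    return [p.name for p in providers
            if p.last == metadata_sector and claims(recorded_size, p)]


disk = Provider("da0", first=0, size=1000)
tail = Provider("da0s1", first=63, size=937)   # ends on the disk's last sector
whole = Provider("da0c", first=0, size=1000)   # covers the whole disk

# Metadata written for the whole disk: stored in sector 999, recorded size 1000.
print(tasters_accepting(999, 1000, [disk, tail]))   # ['da0']
print(tasters_accepting(999, 1000, [disk, whole]))  # ['da0', 'da0c']
```

In this model the recorded size separates the disk from a smaller provider that merely shares its last sector, but it cannot separate two providers that cover exactly the same sectors. A format that records no media size has nothing to compare at all, which is the limitation raised for GPT.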
From owner-freebsd-geom@FreeBSD.ORG Tue Jun 30 22:25:41 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 80A311065675 for ; Tue, 30 Jun 2009 22:25:41 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: from hamlet.setfilepointer.com (hamlet.SetFilePointer.com [63.224.10.2]) by mx1.freebsd.org (Postfix) with SMTP id 251B38FC1D for ; Tue, 30 Jun 2009 22:25:41 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: (qmail 49440 invoked from network); 30 Jun 2009 17:25:40 -0500 Received: from keira.kiwi-computer.com (HELO kiwi-computer.com) (63.224.10.3) by hamlet.setfilepointer.com with SMTP; 30 Jun 2009 17:25:40 -0500 Received: (qmail 34700 invoked by uid 2001); 30 Jun 2009 22:25:40 -0000 Date: Tue, 30 Jun 2009 17:25:40 -0500 From: "Rick C. Petty" To: Ivan Voras Message-ID: <20090630222540.GA34541@keira.kiwi-computer.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090630215345.GC33849@keira.kiwi-computer.com> <9bbcef730906301508l6f2ae344tff8f7495e870049e@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <9bbcef730906301508l6f2ae344tff8f7495e870049e@mail.gmail.com> User-Agent: Mutt/1.4.2.3i Cc: Marcel Moolenaar , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: rick-freebsd2008@kiwi-computer.com List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 30 Jun 2009 22:25:41 -0000 On Wed, Jul 01, 
2009 at 12:08:25AM +0200, Ivan Voras wrote:
> 2009/6/30 Rick C. Petty :
> > On Tue, Jun 30, 2009 at 02:37:55PM -0700, Marcel Moolenaar wrote:
>
> > Both bar and baz should validate their metadata and it should be pretty
> > apparent that one of them has a smaller size.  If the one that is smaller
> > fits perfectly into the one that is bigger, the taste should pass to the
> > latter first.
>
> This is how it's currently done with "native" GEOM classes like
> gmirror - if gmirror is put where it and something else can taste the
> metadata, gmirror will decide by checking the size - usually +/- 1
> sector. But we can't embed this logic into "foreign" classes like GPT.

Then those foreign classes should be given the last opportunity to taste,
not the first.

> GPT checks the first sector (and the last sector for its backup), while
> gmirror checks the last sector, and GPT metadata (AFAIK) doesn't
> contain media size.

According to wikipedia, the GPT header contains:
 - (offset 40) First usable LBA for partitions
 - (offset 48) Last usable LBA

-- Rick C.
Petty From owner-freebsd-geom@FreeBSD.ORG Wed Jul 1 03:43:28 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 860C81065670; Wed, 1 Jul 2009 03:43:28 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout028.mac.com (asmtpout028.mac.com [17.148.16.103]) by mx1.freebsd.org (Postfix) with ESMTP id 6C63E8FC0A; Wed, 1 Jul 2009 03:43:28 +0000 (UTC) (envelope-from xcllnt@mac.com) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii; format=flowed Received: from macbook-pro.lan.xcllnt.net ([75.101.29.67]) by asmtp028.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0KM300AMV3NL1O40@asmtp028.mac.com>; Tue, 30 Jun 2009 20:42:58 -0700 (PDT) From: Marcel Moolenaar In-reply-to: <20090630222540.GA34541@keira.kiwi-computer.com> Date: Tue, 30 Jun 2009 20:42:57 -0700 Message-id: <06F4B172-3A59-49EA-A271-CCFC74B2B52A@mac.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090630215345.GC33849@keira.kiwi-computer.com> <9bbcef730906301508l6f2ae344tff8f7495e870049e@mail.gmail.com> <20090630222540.GA34541@keira.kiwi-computer.com> To: rick-freebsd2008@kiwi-computer.com X-Mailer: Apple Mail (2.1068) Cc: Ivan Voras , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Jul 2009 03:43:28 -0000 On Jun 30, 2009, at 3:25 PM, Rick C. 
Petty wrote: > > According to wikipedia, the GPT header contains: > - (offset 40) First usable LBA for partitions > - (offset 48) Last usable LBA These do not represent the media size. They relate to the region of the disk that can be assigned to partitions. -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-geom@FreeBSD.ORG Wed Jul 1 13:53:36 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C211D106564A for ; Wed, 1 Jul 2009 13:53:36 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (chello087206192061.chello.pl [87.206.192.61]) by mx1.freebsd.org (Postfix) with ESMTP id 0A0418FC1B for ; Wed, 1 Jul 2009 13:53:35 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 5B2DE45C8C; Wed, 1 Jul 2009 15:53:33 +0200 (CEST) Received: from localhost (pjd.wheel.pl [10.0.1.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 6D872456B1; Wed, 1 Jul 2009 15:53:28 +0200 (CEST) Date: Wed, 1 Jul 2009 15:53:38 +0200 From: Pawel Jakub Dawidek To: Marcel Moolenaar Message-ID: <20090701135338.GE4372@garage.freebsd.pl> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="gr/z0/N6AeWAPJVB" Content-Disposition: inline In-Reply-To: <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 8.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on 
mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-5.9 required=4.5 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.0.4 Cc: rick-freebsd2008@kiwi-computer.com, freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Jul 2009 13:53:37 -0000 --gr/z0/N6AeWAPJVB Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Jun 30, 2009 at 02:37:55PM -0700, Marcel Moolenaar wrote:
>
> On Jun 29, 2009, at 2:00 PM, Rick C. Petty wrote:
>
> >[[ Removing the double cross-post, since this is GEOM-specific ]]
> >
> >On Sat, Jun 27, 2009 at 06:20:49PM -0700, Marcel Moolenaar wrote:
> >>
> >>Using the last sector is not only flawed because it creates a race
> >>condition,
> >
> >It shouldn't create a race condition.
>
> It does.
>
> Answer the following:
>
> foo0 is a provider with 3 sectors.
> bar is a geom class that puts meta-data in the first sector.
> baz is a geom class that puts meta-data in the last sector.
>
> Both bar and baz get to taste foo0. Which one should go first?

Marcel, I don't think you expect that the entire world will agree on one
place where metadata should be stored?

A provider can contain metadata of a few independent GEOM classes, and it's
each class's responsibility to detect its providers correctly.

Even for my classes, where I store the provider size in metadata, there are
configurations I can't cope with cleanly, like the 'c' partition.
The workaround I implemented is to store the provider name in metadata, but
of course that's problematic if your disk name changes.

All in all, there is nothing wrong with gmirror. In your example you want
all metadata formats to be the exact same size and stored in the exact same
place...
--=20 Pawel Jakub Dawidek http://www.wheel.pl pjd@FreeBSD.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! --gr/z0/N6AeWAPJVB Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFKS2piForvXbEpPzQRAkj4AJ9NaaqGxeBUng6CxtcLK2immVHt3ACfQKOg BhtPeBma/nRIevbiyQlsBxg= =WImV -----END PGP SIGNATURE----- --gr/z0/N6AeWAPJVB-- From owner-freebsd-geom@FreeBSD.ORG Wed Jul 1 14:38:33 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F2C131065679 for ; Wed, 1 Jul 2009 14:38:33 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: from hamlet.setfilepointer.com (hamlet.SetFilePointer.com [63.224.10.2]) by mx1.freebsd.org (Postfix) with SMTP id 928698FC24 for ; Wed, 1 Jul 2009 14:38:33 +0000 (UTC) (envelope-from rick@kiwi-computer.com) Received: (qmail 25178 invoked from network); 1 Jul 2009 09:38:33 -0500 Received: from keira.kiwi-computer.com (HELO kiwi-computer.com) (63.224.10.3) by hamlet.setfilepointer.com with SMTP; 1 Jul 2009 09:38:33 -0500 Received: (qmail 41948 invoked by uid 2001); 1 Jul 2009 14:38:32 -0000 Date: Wed, 1 Jul 2009 09:38:32 -0500 From: "Rick C. 
Petty" To: Marcel Moolenaar Message-ID: <20090701143832.GA41858@keira.kiwi-computer.com> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090630215345.GC33849@keira.kiwi-computer.com> <9bbcef730906301508l6f2ae344tff8f7495e870049e@mail.gmail.com> <20090630222540.GA34541@keira.kiwi-computer.com> <06F4B172-3A59-49EA-A271-CCFC74B2B52A@mac.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <06F4B172-3A59-49EA-A271-CCFC74B2B52A@mac.com> User-Agent: Mutt/1.4.2.3i Cc: Ivan Voras , freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: rick-freebsd2008@kiwi-computer.com List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Jul 2009 14:38:34 -0000 On Tue, Jun 30, 2009 at 08:42:57PM -0700, Marcel Moolenaar wrote: > > On Jun 30, 2009, at 3:25 PM, Rick C. Petty wrote: > > > >According to wikipedia, the GPT header contains: > > - (offset 40) First usable LBA for partitions > > - (offset 48) Last usable LBA > > These do not represent the media size. They relate to > the region of the disk that can be assigned to partitions. According to wikipedia: "The values for current and backup LBAs of the primary header should be the second sector of the disk (1) and the last sector of the disk, respectively." 
And:

	offset	contents
	------	--------
	24	Current LBA (location of this header copy)
	32	Backup LBA (location of the other header copy)
	40	First usable LBA for partitions (primary partition table last LBA + 1)
	48	Last usable LBA (secondary partition table first LBA - 1)

So that the media is from relative LBA 0 (the protective MBR) to LBA N-1,
the secondary GPT header, which is described in offset 32. Offset 48
should contain LBA N-2. Therefore the media size N is the value of offset
32 minus the value of offset 24, plus 1 (for the MBR). It seems pretty
clear cut to me.

-- Rick C. Petty

From owner-freebsd-geom@FreeBSD.ORG Wed Jul 1 15:29:40 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8099B1065673; Wed, 1 Jul 2009 15:29:40 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout023.mac.com (asmtpout023.mac.com [17.148.16.98]) by mx1.freebsd.org (Postfix) with ESMTP id 695E88FC13; Wed, 1 Jul 2009 15:29:40 +0000 (UTC) (envelope-from xcllnt@mac.com) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii; format=flowed Received: from macbook-pro.lan.xcllnt.net ([75.101.29.67]) by asmtp023.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0KM400B4S0CZ4O50@asmtp023.mac.com>; Wed, 01 Jul 2009 08:29:30 -0700 (PDT) From: Marcel Moolenaar In-reply-to: <20090701135338.GE4372@garage.freebsd.pl> Date: Wed, 01 Jul 2009 08:29:23 -0700 Message-id: References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090701135338.GE4372@garage.freebsd.pl> To: Pawel Jakub Dawidek X-Mailer: Apple Mail (2.1068) Cc: rick-freebsd2008@kiwi-computer.com, freebsd-geom@freebsd.org
Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Jul 2009 15:29:40 -0000 On Jul 1, 2009, at 6:53 AM, Pawel Jakub Dawidek wrote: >> Answer the following: >> >> foo0 is a provider with 3 sectors. >> bar is a geom class that puts meta-data in the first sector. >> baz is a geom class that puts meta-data in the last sector. >> >> Both bar and baz get to taste foo0. Which one should go first? > > Marcel, I don't think you expect than entire world will agree on one > place where metadata should be stored? No, I don't expect it. But we do need to realize that there is a race and unless we keep track of the ordering (outside of GEOM), we will always run into some scenarios where the tasting results in warnings or errors... -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-geom@FreeBSD.ORG Sat Jul 4 09:15:38 2009 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B8202106566C for ; Sat, 4 Jul 2009 09:15:38 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (chello087206192061.chello.pl [87.206.192.61]) by mx1.freebsd.org (Postfix) with ESMTP id D39CC8FC14 for ; Sat, 4 Jul 2009 09:15:36 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id DFA9145C98; Sat, 4 Jul 2009 11:15:34 +0200 (CEST) Received: from localhost (abia29.neoplus.adsl.tpnet.pl [83.7.116.29]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id A224E45683; Sat, 4 Jul 2009 11:15:28 +0200 (CEST) Date: Sat, 4 Jul 2009 11:15:38 +0200 From: Pawel Jakub Dawidek 
To: Marcel Moolenaar Message-ID: <20090704091538.GA2891@garage.freebsd.pl> References: <20090625110253.GA31443@mech-cluster238.men.bris.ac.uk> <10FCC74D-6D46-4112-AD89-BBB4C5933957@mac.com> <2FFFB36F-EFA3-4D92-98A3-692BA2D6F63E@mac.com> <20090629210003.GA24038@keira.kiwi-computer.com> <704EE47D-F0C4-4C63-AA3C-3ADF92CC8379@mac.com> <20090701135338.GE4372@garage.freebsd.pl> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="sdtB3X0nJg68CQEu" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 8.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: ** X-Spam-Status: No, score=2.5 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL, RCVD_IN_XBL autolearn=no version=3.0.4 Cc: rick-freebsd2008@kiwi-computer.com, freebsd-geom@freebsd.org Subject: Re: gmirror gm0 destroyed on shutdown; GPT corrupt X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 04 Jul 2009 09:15:39 -0000 --sdtB3X0nJg68CQEu Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Jul 01, 2009 at 08:29:23AM -0700, Marcel Moolenaar wrote: >=20 > On Jul 1, 2009, at 6:53 AM, Pawel Jakub Dawidek wrote: >=20 > >>Answer the following: > >> > >>foo0 is a provider with 3 sectors. > >>bar is a geom class that puts meta-data in the first sector. > >>baz is a geom class that puts meta-data in the last sector. > >> > >>Both bar and baz get to taste foo0. Which one should go first? > > > >Marcel, I don't think you expect than entire world will agree on one > >place where metadata should be stored? >=20 > No, I don't expect it. 
But we do need to realize that there
> is a race and unless we keep track of the ordering (outside
> of GEOM), we will always run into some scenarios where the
> tasting results in warnings or errors...

This is not really a race, and the ordering is not important. Let's do the following:

    # gmirror create test da0
    # gpt create /dev/mirror/test

Let's assume GPT is given providers for tasting before MIRROR on boot:

    da0 arrives
    GEOM: GPT->taste(da0)
    GPT: Report GPT corrupted (da0 is not the size we expect)
    GEOM: MIRROR->taste(da0)
    MIRROR: g_new_providerf(mirror/test)
    GEOM: GPT->taste(mirror/test)
    GPT: GPT is ok, configure partitions, etc.

Now let's reverse the order: MIRROR goes first, then GPT:

    da0 arrives
    GEOM: MIRROR->taste(da0)
    MIRROR: g_new_providerf(mirror/test)
    GEOM: GPT->taste(da0)
    GPT: Report GPT corrupted (da0 is not the size we expect)
    GEOM: GPT->taste(mirror/test)
    GPT: GPT is ok, configure partitions, etc.

The result is the same, because GEOM still presents da0 to GPT for tasting even if MIRROR decides to use it.

I do agree that it is hard to cope with, especially for metadata formats that are given and that we cannot extend. The real problem here is that in some situations (for some metadata formats) a class cannot auto-discover its providers reliably.

GPT is not alone here. There is a similar issue with UFS labels. You have a 500GB disk da0, and you also have a 200GB partition da0a starting at sector 0. You create a UFS file system on da0a:

    # newfs -L foo /dev/da0a

The LABEL class is given disk da0 for tasting. How can it tell whether the file system was created on da0 or da0a? What we do now is look inside the UFS metadata and get the file system size from there. If the file system size is equal to the provider's size, this is our provider. In this case the file system size is 200GB and the size of da0 is 500GB, so we skip it.
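The size comparison described above can be sketched as follows. This is only an illustration in Python, not the kernel's C code: the function name `label_matches_provider` and the provider table are hypothetical, and the real LABEL class reads the file system size from the UFS metadata on disk.

```python
GIB = 1024 ** 3

def label_matches_provider(fs_size: int, provider_size: int) -> bool:
    """Accept a provider only if the file system fills it exactly."""
    return fs_size == provider_size

# Hypothetical layout from the example: da0 is 500GB and da0a is a 200GB
# partition that starts at sector 0, so both expose the same UFS metadata.
providers = {"da0": 500 * GIB, "da0a": 200 * GIB}
fs_size = 200 * GIB  # size recorded when newfs was run on da0a

# Keep only the providers whose size equals the recorded file system size.
matches = [name for name, size in providers.items()
           if label_matches_provider(fs_size, size)]
print(matches)
```

With these numbers only da0a passes the check, which is the behaviour described for the da0/da0a case.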
This check is not perfect, because one can create a UFS file system that is smaller than the provider:

    # newfs -s 419430400 -L foo /dev/da0

We created a 200GB file system on the 500GB da0. Now the LABEL class will incorrectly skip da0 during tasting because of the size mismatch.

The problem is similar to GPT: neither can work reliably in auto-discovery mode.

It is also problematic that a provider can have multiple consumers attached, but the solution I use in some classes (which is really a side effect) is to open the provider for writing and exclusively during tasting. Even if the MIRROR provider isn't mounted, it keeps its components open for writing and exclusively all the time (the main reason was to allow synchronization). Once MIRROR opens a provider for writing, every consumer attached to this provider gets a spoiled event (at least those that depend on metadata).

Going back to our example: even if GPT configures partitions on da0, it should remove them on the spoiled event once MIRROR opens this provider for writing. In the end GPT will configure partitions on mirror/test. This is of course not perfect, but it reduces the mess in /dev/ a bit.

--=20
Pawel Jakub Dawidek http://www.wheel.pl
pjd@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

--sdtB3X0nJg68CQEu
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFKTx26ForvXbEpPzQRAqNqAKCIOf1YZAiE8ct1Z63/qdTnkBNjFQCgylmY
02uu/aS9yrvyUbp6I5D5Aw8=
=6Rqk
-----END PGP SIGNATURE-----

--sdtB3X0nJg68CQEu--
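The two tasting sequences traced in the message above can be mimicked with a toy model. This is a rough sketch in Python under stated assumptions, not GEOM code: the sector count, the `taste_all` helper, and the idea that the GPT "expects" exactly one sector fewer than da0 (because gmirror keeps its metadata in the last sector) are all simplifications made up for illustration.

```python
from itertools import permutations

# Assumed numbers: da0 has 1000 sectors, and mirror/test is one sector
# smaller because the mirror metadata occupies the last sector of da0.
DA0_SECTORS = 1000
GPT_EXPECTED_SECTORS = DA0_SECTORS - 1  # the GPT was created on mirror/test

def taste_all(class_order):
    """Offer every provider to every class, in the given class order."""
    sizes = {"da0": DA0_SECTORS}
    queue = ["da0"]
    events = []
    while queue:
        provider = queue.pop(0)
        for cls in class_order:
            if cls == "MIRROR" and provider == "da0":
                # MIRROR finds its metadata and publishes a new provider.
                sizes["mirror/test"] = DA0_SECTORS - 1
                queue.append("mirror/test")
                events.append(("MIRROR", provider, "new provider mirror/test"))
            elif cls == "GPT":
                # GPT accepts a provider only if it has the expected size.
                ok = sizes[provider] == GPT_EXPECTED_SECTORS
                events.append(("GPT", provider, "ok" if ok else "corrupt"))
    return events

results = {order: taste_all(order) for order in permutations(["GPT", "MIRROR"])}
for order, events in results.items():
    print(order, events)

# The sequence of events differs, but the set of outcomes does not.
outcomes = {frozenset(events) for events in results.values()}
```

In this model both orderings produce the same three events: a "corrupt" report for da0, a new mirror/test provider, and a successful taste of mirror/test. The model does not cover the spoiled event that makes GPT withdraw anything it configured on da0.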