Date: Thu, 18 Dec 2014 17:52:10 -0800
From: John-Mark Gurney
To: "Pokala, Ravi"
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
Message-ID: <20141219015210.GY25139@funkthat.com>
Cc: "freebsd-geom@freebsd.org"

Ravi Pokala wrote this message on Thu, Dec 18, 2014 at 23:11 +0000:
> When you issue a BIO, the requested byte offset (bio_offset) gets
> transformed by each layer of the GEOM stack as needed. If the bottom of
> the stack is a physical disk, g_disk_start() transforms the final offset
> to a device block address (bio_pblkno), which the disk device driver uses
> as the LBA.
>
> My question is this - is there a way to go in the other direction, from an
> LBA to a byte offset? For example, let's say I have a set of four drives
> which are configured as a RAID10:
>
> STRIPE: /dev/ada0p2 && /dev/ada1p2 => /dev/stripe/gs0
> STRIPE: /dev/ada2p2 && /dev/ada3p2 => /dev/stripe/gs1
> MIRROR: /dev/stripe/gs0 && /dev/stripe/gs1 => /dev/stripe/gm0
>
> I kick off a media scrub of the drive devices, to look for unreadable
> sectors. For the sake of saving bandwidth, I use the ATA_READ_VERIFY /
> ATA_READ_VERIFY48 commands (which read from the media, set the status and
> error bits, but don't transfer the data to the host). That requires
> talking directly to the drive, not the higher-level GEOMs, so I have to
> work in terms of LBAs.

Hmm, that'd be nice to be able to expose via GEOM too...

> If I find an unreadable sector on one of the drives, I'd like to re-write
> the sector to heal it. I can do that by reading from the mirror; that will
> either pick the good side of the mirror in the first place, or will try
> and fail from the bad side, then failover to read from the good side.
> Either way, I end up with the proper data, and can re-write unreadable
> sector. The problem is, how do I calculate the byte offset in the mirror
> to read from?
>
> In the example above, since it's a relatively straightforward stack, I
> could do some math taking into account the LBA offsets for the GPT
> partitions, and the stripesize of the stripes, etc. That would work for
> this example, but it gets ugly fast if there are more complex transforms
> in the stack.
>
> It's easy enough to look at the partition table and say "LBA 12345 is in
> the range 1024 - 1048576, which is part of ada0p2". Going from there to
> "ada0p2 is part of gs0, which has a stripe interleave of 256KB" is more
> complicated. If there's something like GEOM_RAID3 in the mix - which has
> parity sectors which are not visible to the higher layers of the stack -
> then it gets uglier still.
>
> Is there a generic, supported way for doing this mapping? Or can someone
> point me in the right direction so, I can *create* a generic way for doing
> this, and submit it? :-)

I've only done this manually... It isn't too hard, as all the partitioning
schemes are simple offsets, and the stripe should be regular...

The funny thing is that I hit a similar issue today myself!

The issue w/ manually mapping is that you might lose a race w/ the mirror
when writing out new data...

I was thinking it would be good for gmirror to grow a mode that, when it
detects a pending sector or offline sector, figures out via some mapping
what data needs to be fixed, and attempts to read/write the data back...

It would be interesting to add to GEOM the ability to notify the upper
layers about possible bad sectors, though this will take more work to
add...

My work to fix it today somewhat failed, in that I read from the gmirror
device, and when it hit the broken drive, it returned a read error and
kicked the drive out of the mirror instead of fixing it (which I would
have preferred)...

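
For the simple stack in this thread, the manual mapping mentioned above
(partition start offsets plus a regular stripe) might look roughly like
the sketch below. It is an illustration only, not GEOM code. It assumes
512-byte sectors, a plain round-robin stripe layout, and a mirror that
passes byte offsets through unchanged. All function names are
hypothetical.

```python
SECTOR_SIZE = 512  # assumption: logical sector size in bytes


def lba_to_partition_offset(lba, part_start_lba, part_end_lba,
                            sector_size=SECTOR_SIZE):
    """Byte offset of a disk LBA relative to the start of its partition."""
    if not part_start_lba <= lba <= part_end_lba:
        raise ValueError("LBA is outside the partition")
    return (lba - part_start_lba) * sector_size


def component_to_stripe_offset(comp_index, comp_offset, ncomponents,
                               stripe_size):
    """Map a byte offset on one stripe component to a byte offset on the
    striped volume, assuming stripes are dealt out round-robin."""
    row, within = divmod(comp_offset, stripe_size)
    return (row * ncomponents + comp_index) * stripe_size + within


def stripe_to_component_offset(vol_offset, ncomponents, stripe_size):
    """Forward mapping (volume -> component), useful as a round-trip check."""
    stripe_no, within = divmod(vol_offset, stripe_size)
    row, comp_index = divmod(stripe_no, ncomponents)
    return comp_index, row * stripe_size + within


if __name__ == "__main__":
    # Numbers from the thread: LBA 12345 inside ada0p2 (LBAs 1024-1048576),
    # ada0p2 taken to be component 0 of a two-disk stripe, 256KB interleave.
    part_off = lba_to_partition_offset(12345, 1024, 1048576)
    vol_off = component_to_stripe_offset(0, part_off, 2, 256 * 1024)
    print("offset to read from the stripe/mirror:", vol_off)
    # Sanity check: the forward mapping should lead back to the same place.
    assert stripe_to_component_offset(vol_off, 2, 256 * 1024) == (0, part_off)
```

This only covers offset arithmetic. It does nothing about the race with
concurrent writes to the mirror, and it does not apply to layouts with
hidden parity such as GEOM_RAID3.
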
Even having a simple mode that, upon read error, would read from the other
drive and write back would be good... We'd need to have a way to say that
this drive is FAILING, but still usable...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."