From owner-freebsd-geom@FreeBSD.ORG Thu Dec 18 23:11:56 2014
From: "Pokala, Ravi"
To: "freebsd-geom@freebsd.org"
Subject: Converting LBAs to byte offsets through the GEOM stack
Date: Thu, 18 Dec 2014 23:11:46 +0000
List-Id: GEOM-specific discussions and implementations

Hi folks,

When you issue a BIO, the requested byte offset (bio_offset) gets transformed by each layer of the GEOM stack as needed. If the bottom of the stack is a physical disk, g_disk_start() transforms the final offset to a device block address (bio_pblkno), which the disk device driver uses as the LBA.

My question is this - is there a way to go in the other direction, from an LBA to a byte offset? For example, let's say I have a set of four drives which are configured as a RAID10:

    STRIPE: /dev/ada0p2 && /dev/ada1p2 => /dev/stripe/gs0
    STRIPE: /dev/ada2p2 && /dev/ada3p2 => /dev/stripe/gs1
    MIRROR: /dev/stripe/gs0 && /dev/stripe/gs1 => /dev/mirror/gm0

I kick off a media scrub of the drive devices, to look for unreadable sectors.
For the sake of saving bandwidth, I use the ATA_READ_VERIFY / ATA_READ_VERIFY48 commands (which read from the media and set the status and error bits, but don't transfer the data to the host). That requires talking directly to the drive, not the higher-level GEOMs, so I have to work in terms of LBAs.

If I find an unreadable sector on one of the drives, I'd like to re-write the sector to heal it. I can do that by reading from the mirror; that will either pick the good side of the mirror in the first place, or will try and fail from the bad side, then fail over to read from the good side. Either way, I end up with the proper data, and can re-write the unreadable sector. The problem is, how do I calculate the byte offset in the mirror to read from?

In the example above, since it's a relatively straightforward stack, I could do some math taking into account the LBA offsets for the GPT partitions, the stripesize of the stripes, etc. That would work for this example, but it gets ugly fast if there are more complex transforms in the stack.

It's easy enough to look at the partition table and say "LBA 12345 is in the range 1024 - 1048576, which is part of ada0p2". Going from there to "ada0p2 is part of gs0, which has a stripe interleave of 256KB" is more complicated. If there's something like GEOM_RAID3 in the mix - which has parity sectors which are not visible to the higher layers of the stack - then it gets uglier still.

Is there a generic, supported way of doing this mapping? Or can someone point me in the right direction, so I can *create* a generic way of doing this, and submit it?
:-)

Thanks,

Ravi

From owner-freebsd-geom@FreeBSD.ORG Fri Dec 19 01:52:13 2014
Date: Thu, 18 Dec 2014 17:52:10 -0800
From: John-Mark Gurney
To: "Pokala, Ravi"
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
Message-ID: <20141219015210.GY25139@funkthat.com>
Cc: "freebsd-geom@freebsd.org"

Ravi Pokala wrote this message on Thu, Dec 18, 2014 at 23:11 +0000:
> When you issue a BIO, the requested byte offset (bio_offset) gets
> transformed by each layer of the GEOM stack as needed. If the bottom of
> the stack is a physical disk, g_disk_start() transforms the final offset
> to a device block address (bio_pblkno), which the disk device driver uses
> as the LBA.
>
> My question is this - is there a way to go in the other direction, from an
> LBA to a byte offset? For example, let's say I have a set of four drives
> which are configured as a RAID10:
>
> STRIPE: /dev/ada0p2 && /dev/ada1p2 => /dev/stripe/gs0
> STRIPE: /dev/ada2p2 && /dev/ada3p2 => /dev/stripe/gs1
> MIRROR: /dev/stripe/gs0 && /dev/stripe/gs1 => /dev/mirror/gm0
>
> I kick off a media scrub of the drive devices, to look for unreadable
> sectors. For the sake of saving bandwidth, I use the ATA_READ_VERIFY /
> ATA_READ_VERIFY48 commands (which read from the media, set the status and
> error bits, but don't transfer the data to the host). That requires
> talking directly to the drive, not the higher-level GEOMs, so I have to
> work in terms of LBAs.

Hmm. That'd be nice to be able to expose via GEOM too...

> If I find an unreadable sector on one of the drives, I'd like to re-write
> the sector to heal it. I can do that by reading from the mirror; that will
> either pick the good side of the mirror in the first place, or will try
> and fail from the bad side, then fail over to read from the good side.
> Either way, I end up with the proper data, and can re-write the unreadable
> sector. The problem is, how do I calculate the byte offset in the mirror
> to read from?
>
> In the example above, since it's a relatively straightforward stack, I
> could do some math taking into account the LBA offsets for the GPT
> partitions, and the stripesize of the stripes, etc. That would work for
> this example, but it gets ugly fast if there are more complex transforms
> in the stack.
>
> It's easy enough to look at the partition table and say "LBA 12345 is in
> the range 1024 - 1048576, which is part of ada0p2". Going from there to
> "ada0p2 is part of gs0, which has a stripe interleave of 256KB" is more
> complicated. If there's something like GEOM_RAID3 in the mix - which has
> parity sectors which are not visible to the higher layers of the stack -
> then it gets uglier still.
>
> Is there a generic, supported way for doing this mapping? Or can someone
> point me in the right direction so, I can *create* a generic way for doing
> this, and submit it? :-)

I've only done this manually... It isn't too hard, as all the partitioning schemes are simple offsets, and the stripe should be regular...

The funny thing is that I hit a similar issue today myself!

The issue w/ manually mapping is that you might lose a race w/ the mirror when writing out new data...

I was thinking it would be good for gmirror to grow a mode that, when it detects a pending sector or offline sector, figures out via some mapping what data needs to be fixed, and attempts to read/write the data back...

It would be interesting to add to GEOM the ability to notify the upper layers about possible bad sectors, though this will take more work to add...

My work to fix it today somewhat failed, in that I read from the gmirror device, and when it hit the broken drive, it returned a read error and kicked the drive out of the mirror instead of fixing it (which I would have preferred)...
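
For the simple RAID10 stack in the original example, the manual mapping amounts to a partition offset followed by stripe arithmetic. What follows is only a rough Python sketch of that arithmetic, not a supported interface. The function name `lba_to_mirror_offset` is hypothetical, and the sketch assumes that the stripe and mirror classes keep their metadata at the end of each component (so data starts at offset 0), that the mirror passes offsets through unchanged, and that sectors are 512 bytes:

```python
def lba_to_mirror_offset(lba, part_start_lba, component_index,
                         n_components=2, stripe_size=256 * 1024,
                         sector_size=512):
    """Map a disk LBA to a byte offset on the mirror (hypothetical helper).

    Assumptions, not taken from the GEOM sources: no data offset is added
    by the stripe or mirror layers, and the mirror offset equals the
    stripe offset.
    """
    if lba < part_start_lba:
        raise ValueError("LBA lies before the start of the partition")

    # Partition layer: a simple offset from the partition's first LBA.
    part_offset = (lba - part_start_lba) * sector_size

    # Stripe layer: each component holds every n-th stripe-sized chunk.
    row, within = divmod(part_offset, stripe_size)
    stripe_offset = (row * n_components + component_index) * stripe_size + within

    # Mirror layer: assumed to be an identity mapping.
    return stripe_offset
```

With the numbers from the original message (LBA 12345 in a partition starting at LBA 1024, 256KB interleave, two components), `lba_to_mirror_offset(12345, 1024, 0)` gives the offset when the sector lives on the first stripe component, and passing `component_index=1` gives the offset for the second. None of this protects against the race mentioned above; it only does the arithmetic.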
Even having a simple mode that, upon read error, would read from the other drive and write back would be good...

We'd need to have a way to say that this drive is FAILING, but still usable...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."

From owner-freebsd-geom@FreeBSD.ORG Fri Dec 19 02:47:17 2014
From: "Pokala, Ravi"
To: John-Mark Gurney
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
Date: Fri, 19 Dec 2014 02:13:19 +0000
References: <20141219015210.GY25139@funkthat.com>
In-Reply-To: <20141219015210.GY25139@funkthat.com>
Cc: "freebsd-geom@freebsd.org"

> I've only done this manually... It isn't too hard, as all the
> partitioning schemes are simple offsets, and the stripe should be
> regular...

The *partitioning schemes*, yes. But once you start building up more layers, it gets complicated.

> The issue w/ manually mapping, is that you might lose a race w/ the
> mirror when writing out new data...

Yeah, I described it the way I did because I'm trying to avoid discussing some proprietary details. :-P It was "close enough" for you to get the idea of what I was talking about.
> I was thinking it would be good for gmirror to grow a mode that when it
> detects a pending sector or offline sector, to figure out via some
> mapping, what data needed to be fixed, and attempt to read/write the data
> back...

What you're talking about would be called "sector resilvering", or perhaps "on-the-fly resilvering" - rebuilding the mirror for just sectors that we know are bad, without having to re-mirror the entire device. Panasas actually implemented that in gmirror on our old 7.2-RELEASE-based system, and we are planning on porting it forward to a 10.1-RELEASE-based (or better still -CURRENT-based) system in the near future; when we do that, we'll submit it.

> Even having a simple mode that upon read error, would read from the
> other drive and write back would be good...

Yes, that's exactly what we (Panasas) implemented.

-Ravi

From owner-freebsd-geom@FreeBSD.ORG Sat Dec 20 06:01:00 2014
References: <20141219015210.GY25139@funkthat.com>
Date: Fri, 19 Dec 2014 22:00:58 -0800
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
From: Adrian Chadd
To: "Pokala, Ravi"
Cc: John-Mark Gurney, "freebsd-geom@freebsd.org"

Hi,

So when I did stuff like this back in the day, I also had to deal with some layers doing not just straight static translations, but things like dynamic sector remapping for what was effectively software error correction. Reaching "around" the layers with some mapping from virtual -> physical disk device and blocks ended up being problematic, as between the time you did the lookup and the time you did the IO, the mapping could change.

So when doing stuff like this, I ended up piggybacking commands through the translation layers, so stuff (a) was done in line with the rest of IO processing, and (b) wouldn't suffer from stale data.

It doesn't matter as long as the translation stays static, but there's nothing in GEOM that requires you to have a static translation layer.
:)

-adrian

From owner-freebsd-geom@FreeBSD.ORG Sat Dec 20 20:09:18 2014
From: "Pokala, Ravi"
To: Adrian Chadd
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
Date: Sat, 20 Dec 2014 19:54:06 +0000
References: <20141219015210.GY25139@funkthat.com>
Cc: John-Mark Gurney, "freebsd-geom@freebsd.org"

Hi Adrian,

> So when doing stuff like this, I ended up piggybacking commands through
> the translation layers, so stuff was done (a) in line with the rest of IO
> processing, and (b) wouldn't suffer from
> stale data.

Could you expand on that a little?

Thanks,

Ravi

From owner-freebsd-geom@FreeBSD.ORG Sat Dec 20 20:56:51 2014
References: <20141219015210.GY25139@funkthat.com>
Date: Sat, 20 Dec 2014 12:56:49 -0800
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
From: Adrian Chadd
To: "Pokala, Ravi"
Cc: John-Mark Gurney, "freebsd-geom@freebsd.org"

On 20 December 2014 at 11:54, Pokala, Ravi wrote:
> Hi Adrian,
>
>>So when doing stuff like this, I ended up piggybacking commands through
>>the translation layers, so stuff was done (a) in line with the rest of IO
>>processing, and (b) wouldn't suffer from stale data.
>
> Could you expand on that a little?

So say you had a GEOM layer that was doing bad block remapping. It's a black box with a queue (and now it'd be a black box with locks protecting the state, since there's direct dispatch GEOM, but ..) where you push in IO requests to some particular offsets, and the black box figures out which real disk / real offsets those requests are for.

So to start with, you issue a request for block 0 from your GEOM black box, and it maps it to block 0 on disk 0. At some point it decides that it should map it to block 100 on disk 0 (or block 0 on disk 1, etc.)

The only thing that knows about the current state of the mapping is that black box. And it's up to that black box to make sure that the IO requests that are coming in get mapped to the right places. If you have multiple dispatch threads that are sending the black box requests, it's up to the black box to ensure that some ordering/consistency for where things are mapped to occurs.

So, imagine then you want to do a reverse lookup. You ask through the layer for what disk/block backs "block 0." It tells you, "block 0, disk 0." Now, that's valid as long as the remapping layer doesn't change that underneath you. If it decides to, you don't know - so when you send your direct-to-disk request as you said, it may be right for the time you did the reverse lookup, but it's certainly not right "now."
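
The difference between looking a mapping up and then acting on it, versus pushing the command through the layer that owns the mapping, can be shown with a toy model. This is only an illustrative Python sketch: `RemapLayer` and its methods are hypothetical names, not GEOM APIs, and a single lock stands in for whatever ordering the real layer would enforce.

```python
import threading


class RemapLayer:
    """Toy remapping layer: the only owner of the logical -> physical map."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}  # logical block -> (disk, physical block)

    def set_mapping(self, logical, disk, physical):
        # The layer may change a mapping at any time, e.g. after an error.
        with self._lock:
            self._map[logical] = (disk, physical)

    def lookup(self, logical):
        # Returns a snapshot; it can be stale as soon as the lock is dropped.
        with self._lock:
            return self._map[logical]

    def run_on_backing(self, logical, command):
        # "Piggybacked" command: resolved and executed while the mapping is
        # held stable, so it always sees the current location.
        with self._lock:
            disk, physical = self._map[logical]
            return command(disk, physical)
```

A caller that does `where = layer.lookup(0)` and later talks to the disk directly is using whatever was true at lookup time; if `set_mapping(0, 0, 100)` ran in between, `where` still says `(0, 0)`. A caller that uses `run_on_backing(0, cmd)` has `cmd` invoked with the mapping that is current when the command actually runs.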
When I was doing this stuff, it was a kind of bad block remapping and disk mirroring thing for caching disk blocks. So when you issued a request for "block 0 from this provider", (a) it would map to some arbitrary disk and arbitrary offset, (b) that mapping could change at any point and your information would be stale, and (c) it may have mapped to multiple backend disks, so what you really needed to do was send that command to "all" the disks that backed that particular block.

So I had a thing that I attached commands to that would funnel down to the GEOM layer that did this mirroring/caching/remapping thing, and it would handle scheduling the commands to whatever block(s) on whatever disk(s) actually represented that particular logical offset. I actually had something that'd let me issue commands that would map to a single command to a single disk, or could be replicated to multiple commands to multiple disks (and then I'd just get the completion from them all in the reply message, as the bio didn't have enough space to write multiple block reads into, and mostly I was issuing status check commands like you are. :)

Is that making more sense? I can whiteboard it up next time we're in the same place.
-adrian

From owner-freebsd-geom@FreeBSD.ORG Sat Dec 20 21:21:12 2014
To: Adrian Chadd
Subject: Re: Converting LBAs to byte offsets through the GEOM stack
From: "Poul-Henning Kamp"
References: <20141219015210.GY25139@funkthat.com>
Date: Sat, 20 Dec 2014 21:21:01 +0000
Message-ID: <13564.1419110461@critter.freebsd.dk>
Cc: John-Mark Gurney, "Pokala, Ravi", "freebsd-geom@freebsd.org"

--------
In message , Adrian Chadd writes:

>The only thing that knows about the current state of the mapping is
>that black box.

In the original sketches for GEOM, almost 20 years ago (!) there was a maintenance facility to ask: "where does this byte range on this provider end up?"

That turned out to become a major headache to implement.
First off, there is not a 1:1 correspondence in sight anywhere.

A single provider can have multiple open consumers and you have to ask them all how they feel about it.

Next, something like GBDE or RAID5 will turn your single sector into a range of sectors, and that is the simple case.

Imagine an MBR label, with "extended partitions" which are effectively a linked list, and your query interval is one of these pseudo-linked-list-MBR-sectors; suddenly the answer becomes "a large fraction of the disk".

Once you start to think about this, it can get really icky: There is no guarantee that the mapping is one-interval onto another-interval; it could return N intervals. Now, please design a sensible data structure to capture that...

And second: All this happens the wrong way around: It starts at the bottom and works its way up the GEOM stack, which means that lock inversions are the dish of the day every single place.

In the end I simply dropped it: The complexity would in no way justify putting the necessary code in the kernel.

If this were important, the geom(8) tool could probably do it, based on the exported XML state of the geom-mesh and support modules mirroring actual logic for all relevant geoms; that way it would at least live in userland.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
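
To make the "N intervals" point concrete, even a plain stripe turns one byte range on the upper provider into several ranges on the components. The following is a small Python sketch under stated assumptions: `stripe_map_range` is a hypothetical helper, the layout is a simple round-robin stripe with no metadata offset, and adjacent pieces are deliberately not merged.

```python
def stripe_map_range(offset, length, stripe_size, n_components):
    """Split a byte range on a striped provider into per-component pieces.

    Returns a list of (component_index, component_offset, length) tuples.
    Hypothetical helper for illustration; assumes round-robin striping and
    no leading metadata on the components.
    """
    intervals = []
    while length > 0:
        # Which stripe-sized chunk are we in, and how far into it?
        stripe_no, within = divmod(offset, stripe_size)
        # Chunks rotate across components; 'row' counts full rotations.
        row, component = divmod(stripe_no, n_components)
        # Never cross a chunk boundary in a single piece.
        chunk = min(stripe_size - within, length)
        intervals.append((component, row * stripe_size + within, chunk))
        offset += chunk
        length -= chunk
    return intervals
```

A range that fits inside one chunk yields one tuple, while a range spanning three chunks on a two-component stripe yields three tuples, two of them on the same component. A general facility would need to return such lists at every layer and compose them, which is the data structure problem described in the message above.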