Date: Sun, 1 Aug 2004 20:15:45 -0400
From: Brian Fundakowski Feldman
To: Nate Lawson
Cc: current@freebsd.org, sos@deepcore.dk
Subject: Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt")
Message-ID: <20040802001545.GA91621@green.homeunix.org>
References: <410AD054.8070202@root.org> <20040731064433.GD33220@green.homeunix.org> <410D853F.6080704@cryptography.com>
In-Reply-To: <410D853F.6080704@cryptography.com>

On Sun, Aug 01, 2004 at 05:05:19PM -0700, Nate Lawson wrote:
> Brian Fundakowski Feldman wrote:
> >On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote:
> >>I've tracked down the source of the memory corruption in -current that
> >>results when booting with various CD and DVD drives (especially the ones
> >>that come with Thinkpads including T23, R32, T41, etc.) The panic is
> >>obvious when running with INVARIANTS ("memory modified after free") but
> >>not so obvious in other configurations.
> >>For instance, without INVARIANTS, part of the rt_info structure is
> >>corrupted on my wireless card, resulting in a panic during ifconfig on
> >>boot. This is likely the source of other problems, including phk's ACPI
> >>panic (again, only triggered when booting with the CD drive in the bay.)
> >>
> >>The root problem is that ata_timeout() fires and calls ata_pio_read()
> >>which overwrites 512 bytes random memory. There are actually two bugs
> >>here that overwrite memory. The code path is as follows:
> >
> >Good job identifying it more exactly. I decided it should just
> >fundamentally be using GEOM primitives everywhere to move the solutions
> >to all these side cases into where they're already handled generically...
> >still think that's probably the right solution, but I'm glad to see this
> >specific problem fixed.
>
> I'm not sure if this is a troll or not but I'll answer it seriously.
> GEOM and other upper layers are never the right place to handle error
> recovery for transactions initiated at the lower layers (like this
> device scan).
>
> In every system I've seen, error recovery is the hardest part of storage
> code to get right and is seldom well-tested. It's a very difficult
> problem that involves a lot of careful fault injection/testing.
> Divergence in hardware fault handling behavior only complicates things.

What would make it a troll? If GEOM were used so that all transactions
were centralized, and there were one timeout mechanism used to run the
request queues for ATA, it wouldn't be racing and crashing when a device
reset occurs (and it would be a net reduction in code).

-- 
Brian Fundakowski Feldman                    \'[ FreeBSD ]''''''''''\
  <> green@FreeBSD.org                        \  The Power to Serve! \
 Opinions expressed are my own.                \,,,,,,,,,,,,,,,,,,,,,,\
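
As a rough illustration of the "one timeout mechanism running the request
queue" idea above, here is a minimal user-space sketch in C. It is not code
from the ATA driver or from GEOM: struct request, struct req_queue,
queue_submit(), queue_complete(), queue_tick() and
demo_expired_after_completion() are all hypothetical names made up for this
example, and it is single-threaded, so a real kernel version would also need
locking. The point it tries to show is only that when completion and timeout
both go through the same queue, whichever path unlinks a request first owns
it, and the other path never writes into its 512-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

enum req_state { REQ_QUEUED, REQ_DONE, REQ_TIMED_OUT };

/* Hypothetical request record; NOT the real struct ata_request. */
struct request {
    enum req_state state;
    int deadline;                    /* tick after which the request is stale */
    unsigned char buf[SECTOR_SIZE];  /* data buffer owned by the request */
    struct request *next;
};

/* One queue, one timeout path: the only list the tick handler walks. */
struct req_queue {
    struct request *head;
};

static void queue_submit(struct req_queue *q, struct request *r, int deadline)
{
    assert(q != NULL && r != NULL);
    r->state = REQ_QUEUED;
    r->deadline = deadline;
    r->next = q->head;
    q->head = r;
}

/* Unlink r from the queue; returns 1 if it was queued, 0 otherwise. */
static int queue_unlink(struct req_queue *q, struct request *r)
{
    struct request **pp;

    for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == r) {
            *pp = r->next;
            r->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Completion path: copy data in only if the request is still queued. */
static int queue_complete(struct req_queue *q, struct request *r,
                          const unsigned char *data)
{
    if (!queue_unlink(q, r))
        return 0;                    /* already expired; leave buf alone */
    memcpy(r->buf, data, SECTOR_SIZE);
    r->state = REQ_DONE;
    return 1;
}

/* Timeout path: expire queued requests whose deadline has passed. */
static int queue_tick(struct req_queue *q, int now)
{
    int expired = 0;
    struct request **pp = &q->head;

    while (*pp != NULL) {
        struct request *r = *pp;

        if (now > r->deadline) {
            *pp = r->next;           /* unlink; the completion path can no longer find it */
            r->next = NULL;
            r->state = REQ_TIMED_OUT;
            expired++;
        } else {
            pp = &r->next;
        }
    }
    return expired;
}

/* Usage: a completes before its deadline, so the later tick expires nothing. */
static int demo_expired_after_completion(void)
{
    static struct request a, b;
    struct req_queue q = { NULL };
    unsigned char sector[SECTOR_SIZE] = { 0 };

    queue_submit(&q, &a, 5);
    queue_submit(&q, &b, 10);
    queue_complete(&q, &a, sector);
    return queue_tick(&q, 7);        /* a is already unlinked, b is not yet due */
}
```

In this sketch a late completion for a request that has already timed out is
simply refused by queue_complete(), rather than being allowed to copy a
sector into memory the request no longer owns.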