Date: Sun, 01 Aug 2004 17:05:19 -0700
From: Nate Lawson <nate@cryptography.com>
To: Brian Fundakowski Feldman <green@freebsd.org>
Cc: sos@deepcore.dk
Subject: Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt")
Message-ID: <410D853F.6080704@cryptography.com>
In-Reply-To: <20040731064433.GD33220@green.homeunix.org>
References: <410AD054.8070202@root.org> <20040731064433.GD33220@green.homeunix.org>

Brian Fundakowski Feldman wrote:
> On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote:
>>I've tracked down the source of the memory corruption in -current that
>>results when booting with various CD and DVD drives (especially the ones
>>that come with Thinkpads including T23, R32, T41, etc.) The panic is
>>obvious when running with INVARIANTS ("memory modified after free") but
>>not so obvious in other configurations. For instance, without
>>INVARIANTS, part of the rt_info structure is corrupted on my wireless
>>card, resulting in a panic during ifconfig on boot. This is likely the
>>source of other problems, including phk's ACPI panic (again, only
>>triggered when booting with the CD drive in the bay.)
>>
>>The root problem is that ata_timeout() fires and calls ata_pio_read()
>>which overwrites 512 bytes random memory. There are actually two bugs
>>here that overwrite memory. The code path is as follows:
>
> Good job identifying it more exactly. I decided it should just fundamentally
> be using GEOM primitives everywhere to move the solutions to all these
> side cases into where they're already handled generically... still think
> that's probably the right solution, but I'm glad to see this specific
> problem fixed.

I'm not sure if this is a troll or not but I'll answer it seriously.
GEOM and other upper layers are never the right place to handle error
recovery for transactions initiated at the lower layers (like this
device scan). In every system I've seen, error recovery is the hardest
part of storage code to get right and is seldom well-tested. It's a
very difficult problem that involves a lot of careful fault
injection/testing. Divergence in hardware fault handling behavior only
complicates things.

-Nate