From owner-freebsd-current@FreeBSD.ORG Mon Aug 2 04:06:22 2004 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4238E16A4CE; Mon, 2 Aug 2004 04:06:22 +0000 (GMT) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.FreeBSD.org (Postfix) with ESMTP id CE86F43D60; Mon, 2 Aug 2004 04:06:21 +0000 (GMT) (envelope-from scottl@samsco.org) Received: from [192.168.0.201] ([192.168.0.201]) (authenticated bits=0) by pooker.samsco.org (8.12.11/8.12.10) with ESMTP id i724DTe9032790; Sun, 1 Aug 2004 22:13:30 -0600 (MDT) (envelope-from scottl@samsco.org) Message-ID: <410DBD35.3030107@samsco.org> Date: Sun, 01 Aug 2004 22:04:05 -0600 From: Scott Long User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.1) Gecko/20040801 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Brian Fundakowski Feldman References: <410AD054.8070202@root.org> <20040731064433.GD33220@green.homeunix.org> <410D853F.6080704@cryptography.com> <20040802001545.GA91621@green.homeunix.org> In-Reply-To: <20040802001545.GA91621@green.homeunix.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=0.0 required=3.8 tests=none autolearn=no version=2.63 X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on pooker.samsco.org cc: Nate Lawson cc: current@freebsd.org cc: sos@deepcore.dk Subject: Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt") X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 02 Aug 2004 04:06:22 -0000 Brian Fundakowski Feldman wrote: > On Sun, Aug 01, 2004 at 05:05:19PM -0700, Nate Lawson wrote: > >>Brian Fundakowski Feldman wrote: >> >>>On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote: >>> >>>>I've tracked down the source of the memory corruption in -current that >>>>results when booting with various CD and DVD drives (especially the ones >>>>that come with Thinkpads including T23, R32, T41, etc.) The panic is >>>>obvious when running with INVARIANTS ("memory modified after free") but >>>>not so obvious in other configurations. For instance, without >>>>INVARIANTS, part of the rt_info structure is corrupted on my wireless >>>>card, resulting in a panic during ifconfig on boot. This is likely the >>>>source of other problems, including phk's ACPI panic (again, only >>>>triggered when booting with the CD drive in the bay.) >>>> >>>>The root problem is that ata_timeout() fires and calls ata_pio_read() >>>>which overwrites 512 bytes random memory. There are actually two bugs >>>>here that overwrite memory. The code path is as follows: >>> >>>Good job identifying it more exactly. I decided it should just >>>fundamentally >>>be using GEOM primitives everywhere to move the solutions to all these >>>side cases into where they're already handled generically... still think >>>that's probably the right solution, but I'm glad to see this specific >>>problem fixed. >> >>I'm not sure if this is a troll or not but I'll answer it seriously. >>GEOM and other upper layers are never the right place to handle error >>recovery for transactions initiated at the lower layers (like this >>device scan). >> >>In every system I've seen, error recovery is the hardest part of storage >>code to get right and is seldom well-tested. It's a very difficult >>problem that involves a lot of careful fault injection/testing. >>Divergence in hardware fault handling behavior only complicates things. > > > What would make it a troll? If GEOM were used so that all transactions > were centrallized, and there were one timeout mechanism used to run the > request queues for ATA, it wouldn't be racing and crashing when a device > reset occurs (and it would be a net reduction in code). > Nate is absolutely correct. Error recovery is something that belongs in the device drivers, not in the common block layer. Also, your proposal is entirely ATA centric and ignores the myriad of other block drivers in the tree, which again all have different recovery requirements. GEOM doesn't need to know the gory details of command timeouts, retries, etc. That's why we have an ATA framework and a SCSI framework. In fact if you look at SCSI, CAM doesn't even track command timeouts because the actual hardware driver still has better access to dealing with those kinds of problems than the upper layer. Error recovery belongs where it is. If the ATA driver has bugs, then the ATA driver needs to be dealt with. Scott