Message-ID: <20050329234318.A3883@denninger.net>
Date: Tue, 29 Mar 2005 23:43:18 -0600
From: Karl Denninger
To: "Matthew N. Dodd"
cc: freebsd-stable@freebsd.org
Subject: Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
References: <20050329200841.A772@denninger.net> <20050329233843.L328@sasami.jurai.net> <20050329230830.A3222@denninger.net>
In-Reply-To: <20050329230830.A3222@denninger.net>; from Karl Denninger on Tue, Mar 29, 2005 at 11:08:30PM -0600

Here's the diff and some thoughts....

Fs:/usr/src/sys/dev/ata> cvs diff -r 1.32.2.5 ata-queue.c
Index: ata-queue.c
===================================================================
RCS file: /usr/cvs/src/sys/dev/ata/ata-queue.c,v
retrieving revision 1.32.2.5
retrieving revision 1.32.2.6
diff -r1.32.2.5 -r1.32.2.6
30c30
< __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.5 2004/10/24 09:27:37 sos Exp $");
---
> __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.6 2005/03/23 04:50:26 mdodd Exp $");
218a219,221
>     if (!dumping)
>         callout_reset(&request->callout, request->timeout * hz,
>                       (timeout_t*)ata_timeout, request);
241,243c244,249
<
<     /* if reinit succeeded and retries still permit, reinject request */
<     if (ata_reinit(ch) && request->retries-- > 0 && request->device->param){
---
>     /*
>      * if reinit succeeds, retries still permit and device didn't
>      * get removed by the reinit, reinject request
>      */
>     if (!ata_reinit(ch) && request->retries-- > 0
>         && request->device->param){
245a252
>         request->donecount = 0;

The second diff is really just a formatting and comment change; you're
certainly correct that the changes are small! :-)

Without the last delta the requeue doesn't happen at all. I remember pulling
that when I ran into this originally towards the first of the month, and
without it the requeue just didn't happen (i.e. it reverted back to the old
behavior of just detaching the disk on the original error) - so it appears
that the reset of donecount was what was "missing" in the original
implementation in terms of the retry actually happening.

The difficulty is figuring out whether requeueing is broken in general (and
the effective disabling of it by not resetting "donecount" masked the
brokenness) or whether it's something else (e.g. perhaps the reset of the
callout?)

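For readers following the 241,243c244,249 hunk: besides the reflowed comment, the hunk as shown also negates the ata_reinit(ch) test. The following is a minimal user-space sketch of what that negation does to the retry decision. It is not the driver code. It assumes ata_reinit() returns 0 on success, which is inferred from the new comment ("if reinit succeeds") sitting above the !ata_reinit(ch) test; toy_reinit, struct toy_request, old_should_requeue and new_should_requeue are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for ata_reinit().
 * ASSUMPTION: 0 means the reinit succeeded, nonzero means it failed.
 */
static int toy_reinit(bool hw_ok)
{
    return hw_ok ? 0 : 1;
}

/* Hypothetical, stripped-down request: only the fields the test reads. */
struct toy_request {
    int  retries;         /* retries left, like request->retries */
    bool device_present;  /* stands in for request->device->param != NULL */
};

/* Shape of the condition in 1.32.2.5: requeue when reinit returns nonzero. */
static bool old_should_requeue(struct toy_request *r, bool hw_ok)
{
    return toy_reinit(hw_ok) && r->retries-- > 0 && r->device_present;
}

/* Shape of the condition in 1.32.2.6: requeue when reinit returns zero. */
static bool new_should_requeue(struct toy_request *r, bool hw_ok)
{
    return !toy_reinit(hw_ok) && r->retries-- > 0 && r->device_present;
}
```

Under that assumption, the old form never requeues after a successful reinit (short-circuit evaluation leaves retries untouched), while the new form requeues and consumes one retry as long as retries remain and the device is still present. This only models the boolean test; it says nothing about the donecount reset or the callout.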
The destabilization that happens is bizarre - the system gets VERY strange,
with interrupt-driven things "disappearing" - like serial port input - and
if left alone eventually (within a half-hour or so) you'll get to the point
where the network and console are completely unresponsive.

BTW after reboot the filesystem was damaged enough that it refused to check
in the background, forcing me to sit through the entire fsck sequence
(~30 minutes) before it would come back up.

Will see if I can get my sandbox machine into a configuration that will make
looking into this further possible - in the meantime, until it's sorted out,
you might want to think about rolling back this one for RELENG_5.

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com	Musings Of A Sentient Mind

On Tue, Mar 29, 2005 at 11:08:30PM -0600, Karl Denninger wrote:
> On Tue, Mar 29, 2005 at 11:40:48PM -0500, Matthew N. Dodd wrote:
> > On Tue, 29 Mar 2005, Karl Denninger wrote:
> > > 1.42: When resubmitting a timed out request, reset donecount.
> > > 1.41: Reset timeout when we are back from interrupt.
> > > 1.40: Correct logical error, result was that retries wasn't always made but
> > > failure reported instead.
> > > 1.39: Do not retry on requests that have lost their device during reinit.
> > >
> > > This change is EXTREMELY DANGEROUS.
> > >
> > > This change needs to be backed out immediately until it can be determined
> > > why a requeued request destabilizes the system.
> >
> > The changes in question are very small.  Could you attempt to isolate
> > which one is the cause?
> >
> > Thanks.
>
> Pretty sure it's the requeue (e.g.
> 1.40 and 1.42); I attempted to put this
> patch in the system back before it was MFC'd (when it originally showed up in
> -HEAD) and it failed in exactly the same way.  The first time it created a
> LOT of head-scratching ("how come my serial board has suddenly gone deaf?!")
> and it wasn't until it got to where the console wouldn't respond that the
> light went on and I said "oh, so THAT's what that patch really does!" :->
>
> That got backed out FAST :-)
>
> I believe the previous version of that file in -STABLE was 1.38 - that has
> the 'errors don't actually get retried' problem that results in immediate
> detaches - the reason for the update was that I noted the commit and
> figured that the problem from my last attempt with including this had
> either been fixed or I had missed some dependency in my earlier attempt.
>
> I have an open PR on the underlying problem (SATA drives on a number of
> common configurations returning false errors and detaching when part of a
> geom mirror) which I've marked as "serious".  It's at
> http://www.freebsd.org/cgi/query-pr.cgi?pr=77643
>
> There is a comment attached to the PR from another user who has duplicated
> the underlying problem.
>
> Note that back on 3/2/05 I attempted to apply the 1.42 version of this file
> to -STABLE and got the same failure, and added that fact to the PR.  I also
> reported it here.  It appears that both reports were either missed or ignored
> and this change was committed to -RELENG_5.
>
> I'm not sure if I can cobble up a test machine with the right configuration
> of hardware to go through each of the above changes in turn to see if I can
> isolate which of the three it is, but I'll give it a shot over the next
> couple of days.  I'm 1 SATA disk short of what I need to do this in my
> sandbox.
>
> If I do not trigger the requeue all appears to be fine.
>
> This is one that IMHO has to either be found and fixed or backed out for the
> impending -RELEASE.