Date:      Tue, 29 Mar 2005 23:43:18 -0600
From:      Karl Denninger <karl@denninger.net>
To:        "Matthew N. Dodd" <mdodd@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
Message-ID:  <20050329234318.A3883@denninger.net>
In-Reply-To: <20050329230830.A3222@denninger.net>; from Karl Denninger on Tue, Mar 29, 2005 at 11:08:30PM -0600
References:  <20050329200841.A772@denninger.net> <20050329233843.L328@sasami.jurai.net> <20050329230830.A3222@denninger.net>

Here's the diff and some thoughts....

Fs:/usr/src/sys/dev/ata> cvs diff -r 1.32.2.5 ata-queue.c
Index: ata-queue.c
===================================================================
RCS file: /usr/cvs/src/sys/dev/ata/ata-queue.c,v
retrieving revision 1.32.2.5
retrieving revision 1.32.2.6
diff -r1.32.2.5 -r1.32.2.6
30c30
< __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.5 2004/10/24 09:27:37 sos Exp $");
---
> __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.6 2005/03/23 04:50:26 mdodd Exp $");
218a219,221
>       if (!dumping)
>           callout_reset(&request->callout, request->timeout * hz,
>                         (timeout_t*)ata_timeout, request);
241,243c244,249
< 
<       /* if reinit succeeded and retries still permit, reinject request */
<       if (ata_reinit(ch) && request->retries-- > 0 && request->device->param){
---
>       /*
>        * if reinit succeeds, retries still permit and device didn't
>        * get removed by the reinit, reinject request
>        */
>       if (!ata_reinit(ch) && request->retries-- > 0
>           && request->device->param){
245a252
>           request->donecount = 0;

The second diff is mostly a formatting and comment change (aside from the
negated ata_reinit() test); you're certainly correct that the changes are
small! :-)

Without the last delta the requeue doesn't happen at all.  I remember
pulling that change when I first ran into this near the start of the
month; without it the system reverted to the old behavior of just
detaching the disk on the original error.  So it appears that the reset
of donecount was what was "missing" from the original implementation for
the retry to actually happen.
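
To make the donecount point concrete, here is a minimal sketch in C.  It is
a simplified model, not the ata-queue.c code: struct model_request,
model_retry_bytes() and the reset_donecount flag are hypothetical names, and
it ASSUMES that a requeued request only transfers bytecount minus donecount
bytes.  Only the shape of the if-condition is taken from the diff above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an ATA request (not the kernel struct). */
struct model_request {
    int    retries;        /* retries still permitted */
    size_t bytecount;      /* total bytes the request should transfer */
    size_t donecount;      /* bytes counted as transferred so far */
    bool   device_present; /* stands in for request->device->param */
};

/*
 * Model of the timeout path.  Returns how many bytes a requeued request
 * would still attempt; 0 means no effective retry happens.
 * ASSUMPTION: a requeued request only transfers bytecount - donecount.
 */
size_t model_retry_bytes(struct model_request *r, bool reinit_failed,
                         bool reset_donecount)
{
    /* Same shape as the diff: reinit ok, retries left, device still there. */
    if (!reinit_failed && r->retries-- > 0 && r->device_present) {
        if (reset_donecount)
            r->donecount = 0;   /* models the delta added in 1.32.2.6 */
        /* Guard against a stale count larger than the request size. */
        return r->donecount < r->bytecount ? r->bytecount - r->donecount : 0;
    }
    return 0;   /* no requeue; the caller would report the error */
}
```

Under that assumption, a request whose stale donecount already equals
bytecount gets no effective retry unless reset_donecount is set, which is
consistent with the "requeue just didn't happen" behavior described above.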

The difficulty is figuring out whether requeueing is broken in general (and 
the effective disabling of it by not resetting "donecount" masked the
brokenness) or whether it's something else (e.g. perhaps the reset of 
the callout?).

The destabilization that happens is bizarre - the system gets VERY strange, 
with interrupt-driven things "disappearing" - like serial port input - and 
if left alone eventually (within a half-hour or so) you'll get to the point 
where the network and console are completely unresponsive.

BTW after reboot the filesystem was damaged enough that it refused to check
in the background, forcing me to sit through the entire fsck sequence (~30
minutes) before it would come back up.

Will see if I can get my sandbox machine into a configuration that will make
looking into this further possible - in the meantime, until it's sorted out,
you might want to think about rolling back this one for RELENG_5.

--
-- 
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net	My home on the net - links to everything I do!
http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com	Musings Of A Sentient Mind


On Tue, Mar 29, 2005 at 11:08:30PM -0600, Karl Denninger wrote:
> On Tue, Mar 29, 2005 at 11:40:48PM -0500, Matthew N. Dodd wrote:
> > On Tue, 29 Mar 2005, Karl Denninger wrote:
> > >  1.42: When resubmitting a timed out request, reset donecount.
> > >  1.41: Reset timeout when we are back from interrupt.
> > >  1.40: Correct logical error, result was that retries wasn't always made but
> > >        failure reported instead.
> > >  1.39: Do not retry on requests that have lost their device during reinit.
> > >
> > > This change is EXTREMELY DANGEROUS.
> > >
> > > This change needs to be backed out immediately until it can be determined
> > > why a requeued request destabilizes the system.
> > 
> > The changes in question are very small.  Could you attempt to isolate 
> > which one is the cause?
> > 
> > Thanks.
> 
> Pretty sure it's the requeue (e.g. 1.40 and 1.42); I attempted to put this
> patch in the system back before it was MFC'd (when it originally showed up in
> -HEAD) and it failed in exactly the same way.  The first time it created a
> LOT of head-scratching ("how come my serial board has suddenly gone deaf?!")
> and it wasn't until it got to where the console wouldn't respond that the
> light went on and I said "oh, so THAT's what that patch really does!" :->
> 
> That got backed out FAST :-)
> 
> I believe the previous version of that file in -STABLE was 1.38, which has 
> the 'errors don't actually get retried' problem that results in immediate 
> detaches.  I updated because I noted the commit and figured that the 
> problem from my last attempt at including this had either been fixed or 
> that I had missed some dependency in my earlier attempt.
> 
> I have an open PR on the underlying problem (SATA drives on a number of
> common configurations returning false errors and detaching when part of a
> geom mirror) which I've marked as "serious".  It's at 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=77643
> 
> There is a comment attached to the PR from another user who has duplicated 
> the underlying problem.
> 
> Note that back on 3/2/05 I attempted to apply the 1.42 version of this file
> to -STABLE and got the same failure, and added that fact to the PR.  I also
> reported it here.  It appears that both reports were either missed or ignored 
> and this change was committed to -RELENG_5.
> 
> I'm not sure if I can cobble up a test machine with the right configuration
> of hardware to go through each of the above changes in turn to see if I can
> isolate which of the three it is, but I'll give it a shot over the next
> couple of days.  I'm 1 SATA disk short of what I need to do this in my 
> sandbox.
> 
> If I do not trigger the requeue all appears to be fine.
> 
> This is one that IMHO has to either be found and fixed or backed out for the
> impending -RELEASE.
> 
> --
> -- 
> Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
> http://www.denninger.net	My home on the net - links to everything I do!
> http://scubaforum.org		Your UNCENSORED place to talk about DIVING!
> http://www.spamcuda.net		SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
> http://genesis3.blogspot.com	Musings Of A Sentient Mind
> 
> 
