Date: Wed, 30 Mar 2005 23:00:46 -0600
From: Karl Denninger <karl@denninger.net>
To: "Matthew N. Dodd" <mdodd@freebsd.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
Message-ID: <20050330230046.A68235@denninger.net>
In-Reply-To: <20050330210830.A46956@denninger.net>; from Karl Denninger on Wed, Mar 30, 2005 at 09:08:30PM -0600
References: <20050329200841.A772@denninger.net> <20050329233843.L328@sasami.jurai.net>
    <20050329230830.A3222@denninger.net> <20050329234318.A3883@denninger.net>
    <20050330210830.A46956@denninger.net>
Ok, here's what I've got so far.

Pulling the SECOND delta gets rid of both the stability problem AND the
requeue fix (i.e. removing it defeats the essential purpose of the deltas
in the first place.)

Removing the FIRST delta, which is:

218a219,221
    if (!dumping)
        callout_reset(&request->callout, request->timeout * hz,
                      (timeout_t*)ata_timeout, request);

appears to get rid of the crashes while not harming data integrity OR the
requeueing.

With this one out the errors (I was able to generate over a dozen retries in
less than 10 minutes doing a large file copy with a 3-disk RAID 1 array
comprised of 2 SATA disks, 1 UDMA100) still occur, BUT they are retried
(apparently successfully.)

I copied the source tree to /usr/src2 and took the errors. I am now
attempting to "buildworld" off it - so far, so good (about 1/4 of the way
through - if there was data corruption it should have failed by now.)

Also, the sandbox system is still up. That also is a major improvement.

I will let this buildworld complete, and if it is successful (proving that
the retried errors didn't actually result in corrupted files!), I will put
this same change (pulling the first delta only) on the production system,
rebuild the other RAID disks (I had to pull the cartridges from there to use
them on the sandbox) and see if intentionally provoking the same error there
allows the system to remain stable once the errors start showing up.

Again, I will not have a "final" determination on this until late tomorrow,
but at first blush pulling the first delta appears to fix the stability
issue.

Further update tomorrow as soon as I have it....

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com Musings Of A Sentient Mind

On Wed, Mar 30, 2005 at 09:08:30PM -0600, Karl Denninger wrote:
> On Tue, Mar 29, 2005 at 11:43:18PM -0600, Karl Denninger wrote:
> > Here's the diff and some thoughts....
> >
> > Fs:/usr/src/sys/dev/ata> cvs diff -r 1.32.2.5 ata-queue.c
> > Index: ata-queue.c
> > ===================================================================
> > RCS file: /usr/cvs/src/sys/dev/ata/ata-queue.c,v
> > retrieving revision 1.32.2.5
> > retrieving revision 1.32.2.6
> > diff -r1.32.2.5 -r1.32.2.6
> > 30c30
> > < __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.5 2004/10/24 09:27:37 sos Exp $");
> > ---
> > > __FBSDID("$FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.32.2.6 2005/03/23 04:50:26 mdodd Exp $");
> > 218a219,221
> > >     if (!dumping)
> > >         callout_reset(&request->callout, request->timeout * hz,
> > >                       (timeout_t*)ata_timeout, request);
> > 241,243c244,249
> > <
> > <     /* if reinit succeeded and retries still permit, reinject request */
> > <     if (ata_reinit(ch) && request->retries-- > 0 && request->device->param){
> > ---
> > >     /*
> > >      * if reinit succeeds, retries still permit and device didn't
> > >      * get removed by the reinit, reinject request
> > >      */
> > >     if (!ata_reinit(ch) && request->retries-- > 0
> > >         && request->device->param){
> > 245a252
> > >         request->donecount = 0;
>
> Removing the second change (changing the test on the "ata_reinit") appears to
> prevent both the destabilization and the actual requeue from taking place
> (that is, you get the immediate disconnect from the array when the error
> occurs; therefore whatever is causing the destabilization doesn't happen.)
>
> I will attempt to remove the first delta alone (and put back the second), but
> from a quick perusal of the code I doubt this will make a material change.
>
> --
> --
> Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
> http://www.denninger.net My home on the net - links to everything I do!
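
The effect of the second delta's inverted test can be sketched in a few
lines of stand-alone C. This is only a model, not the driver code: the
struct, the function names and fake_reinit() are hypothetical, and it
assumes (going by the comment in the 1.32.2.6 hunk) that the reinit call
returns 0 when it succeeds. Under that assumption the 1.32.2.5 test never
requeues after a successful reinit, which would be consistent with the
immediate disconnect from the array described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an ATA request: only what the test reads. */
struct fake_request {
    int  retries;         /* remaining retry budget */
    bool device_present;  /* models request->device->param being non-NULL */
    int  donecount;       /* bytes already transferred */
};

/* ASSUMPTION: 0 means the channel reinit succeeded, non-zero means failure. */
static int fake_reinit(int result)
{
    return result;
}

/* Shape of the 1.32.2.5 test: a zero (success) return short-circuits it. */
bool requeue_old(struct fake_request *r, int reinit_result)
{
    return fake_reinit(reinit_result) && r->retries-- > 0 && r->device_present;
}

/* Shape of the 1.32.2.6 test: requeue on success and restart the transfer. */
bool requeue_new(struct fake_request *r, int reinit_result)
{
    if (!fake_reinit(reinit_result) && r->retries-- > 0 && r->device_present) {
        r->donecount = 0;   /* mirrors the added "request->donecount = 0;" */
        return true;
    }
    return false;
}

/* Small self-check showing how the two tests differ on a successful reinit. */
void self_check(void)
{
    struct fake_request a = { 2, true, 512 };
    struct fake_request b = { 2, true, 512 };

    assert(!requeue_old(&a, 0));  /* old test: no requeue, budget untouched */
    assert(a.retries == 2);
    assert(requeue_new(&b, 0));   /* new test: requeue, one retry consumed */
    assert(b.retries == 1 && b.donecount == 0);
}
```

Note that because of short-circuit evaluation the retry counter is only
decremented when the reinit check passes, so in this model the old test
leaves the budget untouched while the new one spends a retry on every
requeue attempt.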