From: Thomas David Rivers
Date: Thu, 6 Mar 1997 17:15:18 -0500 (EST)
Message-Id: <199703062215.RAA08636@lakes.water.net>
To: ponds!lakes.water.net!rivers, ponds!lambert.org!terry
Cc: ponds!freebsd.org!hackers
Subject: Re: "dup alloc" - nope - kern/2875 wasn't it.

> > > I guess it would be worth while to take out the printf's until you can
> > > isolate the printf's that "fix" the problem. Then analyze the effects of
> > > the printfs serializing writes.
> >
> > My thinking exactly - I've now gone back to just a pristine kernel and
> > I'm trying to find a missing splbio()/splx(), or something along those
> > lines... so far, no luck...
>
>
> I am, of course, unable to duplicate your panics.

 If you have a spare disk lying around, you can try the steps below.
 Others have demonstrated the problem with MFS as well, so you may be
 able to reproduce it there.

	trash the disk (i.e. copy a large file, as large as the
	    partition, to the partition - or write a program that
	    simply writes n 0xff's...)
	newfs the disk
	fsck the disk

 If you get any fsck errors, you've run into the problem...

 But - it appears to be extremely timing dependent!
 (As you point out.)

>
> I suggest you buckle down and do it the hard way; I'd help if I could
> duplicate the problem, or if my changes would not be seen as gratuitous,
> but I can't. Without a problem fix resulting, there's no way I can
> prove that eliminating all possible race conditions is a Good Thing(tm)
> to those people who aren't getting bitten.

 Well, it is difficult to suggest to people that "oh yes, that system
 that's been running fine for over a year does, in fact, have a bug in
 it; you've just been lucky..."

 I have a certain empathy for that, especially when I was the only
 person in-the-entire-world reporting the problem. It's very easy to
 dismiss me as a nut with bad hardware :-) Now that other people have
 reported it, I'm hoping to get more.

 [I should quickly add here that I'm delighted with, and grateful for,
 the response I have gotten, and I'm not complaining, I'm just saying
 I could be easily seen as a "nut"...]

>
> Here is what I suggest; effectively, you will be required to perform
> a full branch-path analysis of much of the code, by hand. If you
> have a copy of BattleMap, you could use it some places, but since
> most kernel routines are not single-entry/single-exit, I would not
> recommend spending the $4000 or so for the software just for this
> problem, since it won't help much.

 Wow! I was hoping not to have to do that for all (well, most) of the
 kernel....

 My approach will likely be to try and find items that appear to cause
 a difference here; finding several such changes could help triangulate
 on the problem... That is: if I change "this" the problem goes away,
 if I change "that" the problem goes away; now, what's common in the
 effects of "this" and "that"?

 Unfortunately, the only changes thus far that actually affect the
 problem are the printf()s I added to determine what the problem is,
 and the only common effect is that they (presumably) alter timings in
 such a way as to avoid the problem... not very useful.
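
 To be concrete about the kind of slip I mean by a missing or
 mis-matched splbio()/splx(), here is a rough userland sketch. It is
 not kernel code: splbio() and splx() below are stand-ins I wrote so
 the thing compiles outside the kernel, the single numeric "level" is a
 simplification of whatever the real implementation does, and every
 mock_/demo_ name is made up. The only part taken from the real
 interface is the shape: splbio() returns the previous level, and
 splx() restores a saved one.

```c
/*
 * Userland stand-ins, for illustration only.  Everything named
 * MOCK_, mock_ or demo_ is hypothetical and appears in no kernel source.
 */
#define MOCK_SPL0   0   /* assumed: nothing blocked */
#define MOCK_SPLBIO 5   /* assumed: disk I/O interrupts blocked */

int mock_cur_spl = MOCK_SPL0;

/* Raise the mock level to "bio" and return the level we had before. */
int
splbio(void)
{
    int old = mock_cur_spl;

    if (mock_cur_spl < MOCK_SPLBIO)
        mock_cur_spl = MOCK_SPLBIO;
    return old;
}

/* Restore a level previously returned by splbio(). */
void
splx(int s)
{
    mock_cur_spl = s;
}

struct demo_buf {
    int busy;
};

/* Matched: every exit path restores the saved level. */
int
demo_claim_ok(struct demo_buf *bp)
{
    int s = splbio();

    if (bp->busy) {
        splx(s);
        return -1;
    }
    bp->busy = 1;
    splx(s);
    return 0;
}

/* Mis-matched: the early return leaves the level raised. */
int
demo_claim_leaky(struct demo_buf *bp)
{
    int s = splbio();

    if (bp->busy)
        return -1;      /* BUG: no splx(s) on this path */
    bp->busy = 1;
    splx(s);
    return 0;
}
```

 Calling demo_claim_leaky() on a busy buffer leaves mock_cur_spl stuck
 at MOCK_SPLBIO. The other flavor, touching the buffer with no splbio()
 at all, can't be shown in a mock like this, since it only misbehaves
 when an interrupt happens to land in the window.
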
 I'm trying to read through some of the code now, looking for
 mis-matched splbio()/splx(). Or, something like that... I'm just not
 (yet) educated enough to catch everything.

 I've also noted that some of these have been corrected in 2.2-GAMMA
 (i.e. vfs_subr.c has splbio()/splx()'s in 2.2-GAMMA that it doesn't
 have in 2.1.6.1). I'm guessing now that a missing one of these is the
 culprit....

 If someone were to detail exactly when you can futz with a struct buf
 without being at splbio(), it would help my reading....

	- Dave R. -
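
 P.S. For the "write n 0xff's" step in the recipe above, something
 along these lines should do. It's only a sketch: fill_ff() is a name
 I made up, the 8192-byte block size is arbitrary, and it goes through
 stdio on an ordinary file in the partition rather than the raw device.

```c
#include <stdio.h>
#include <string.h>

/*
 * fill_ff() is a hypothetical helper, not from any FreeBSD source.
 * Writes `count` bytes of 0xff to `path`.  Returns the number of bytes
 * written (which may be short if the partition fills up), or -1 if the
 * file could not be opened or closed cleanly.
 */
long
fill_ff(const char *path, long count)
{
    unsigned char block[8192];
    size_t want, got;
    long written = 0;
    FILE *fp;

    fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    memset(block, 0xff, sizeof(block));
    while (written < count) {
        /* Write a full block, or whatever remainder is left. */
        want = sizeof(block);
        if ((long)want > count - written)
            want = (size_t)(count - written);
        got = fwrite(block, 1, want, fp);
        written += (long)got;
        if (got < want)
            break;      /* short write: partition full or I/O error */
    }

    if (fclose(fp) != 0)
        return -1;
    return written;
}
```

 Wrap it in a main() that passes a file on the scratch partition and a
 count about the size of the partition, then newfs and fsck as above.
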