From owner-freebsd-hackers Thu Mar 6 10:13:17 1997
Return-Path:
Received: (from root@localhost)
	by freefall.freebsd.org (8.8.5/8.8.5) id KAA02054
	for hackers-outgoing; Thu, 6 Mar 1997 10:13:17 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.50])
	by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id KAA02047
	for ; Thu, 6 Mar 1997 10:13:14 -0800 (PST)
Received: (from terry@localhost)
	by phaeton.artisoft.com (8.6.11/8.6.9) id LAA13715;
	Thu, 6 Mar 1997 11:07:27 -0700
From: Terry Lambert
Message-Id: <199703061807.LAA13715@phaeton.artisoft.com>
Subject: Re: "dup alloc" - nope - kern/2875 wasn't it.
To: ponds!rivers@dg-rtp.dg.com (Thomas David Rivers)
Date: Thu, 6 Mar 1997 11:07:27 -0700 (MST)
Cc: hackers@freebsd.org
In-Reply-To: <199703061133.GAA06021@lakes.water.net> from "Thomas David Rivers" at Mar 6, 97 06:33:04 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > I guess it would be worth while to take out the printf's until you can
> > isolate the printf's that "fix" the problem. Then analyze the effects of
> > the printfs serializing writes.
>
> My thinking exactly - I've now gone back to just a pristine kernel and
> I'm trying to find a missing splbio()/splx(), or something along those
> lines... so far, no luck...

I am, of course, unable to duplicate your panics.

I suggest you buckle down and do it the hard way; I'd help if I could
duplicate the problem, or if my changes would not be seen as gratuitous,
but I can't. Without a problem fix resulting, there's no way I can
prove that eliminating all possible race conditions is a Good Thing(tm)
to those people who aren't getting bitten.

Here is what I suggest; effectively, you will be required to perform
a full branch-path analysis of much of the code, by hand.
If you have a copy of BattleMap, you could use it in some places, but
since most kernel routines are not single-entry/single-exit, I would
not recommend spending the $4000 or so for the software just for this
problem, since it won't help much.

Get a full call path for a single operation mapped out, using whatever
epicycles are necessary in the graph to represent concurrency of the
operations. You must produce a branch map for each routine involved.

A concurrency point occurs wherever:

o	Interrupts are enabled
o	A page fault may occur during processing
o	An operation is queued
o	An operation is dequeued
o	A queue element is allocated
o	A queue element is freed
o	A queue element is potentially reused
o	A sleep occurs
o	A wakeup occurs
o	An operation is queued to a bus master device
o	A bus master device completes an operation
o	A bus master device *cancels* an operation
o	A bus master device *restarts* an operation

Then redzone your maps for all possible "context switches" (quoted
to account for fault based or interrupt based processing path
reentrancy).

Then bluezone any shared datum in the code path for every possible
cycle.

Whatever is simultaneously in a redzone and a bluezone is a possible
problem. One of them is *the* problem.

Adjust the redzones to add reentrancy protection (probably via spl)
so that they do not overlap the bluezones. The problem should go
away.

This would be a lot easier if the code were datum-prime instead of
procedure-prime, but no one respects dataflow any more but us old
theorists. 8-(.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
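
The redzone/bluezone overlap check described in this message can be
sketched in a few lines of Python, as a bookkeeping aid for the hand
analysis rather than a replacement for it. Everything in it is an
assumption made for illustration: the Step record, the suspects()
helper, and the example routines and data names (queue_op, complete_op,
free_list, queue_head) are hypothetical and are not taken from the
FreeBSD source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node of a hand-built branch map (hypothetical structure)."""
    routine: str                      # routine the step belongs to
    label: str                        # what the step does
    reentrant: bool                   # redzone: a "context switch" can occur here
    shared: frozenset = frozenset()   # bluezone: shared data touched here


def suspects(path):
    """Return the steps that are in a redzone and a bluezone at once."""
    return [s for s in path if s.reentrant and s.shared]


# Invented call path for a single operation; not real kernel code.
path = [
    Step("queue_op", "allocate queue element", True, frozenset({"free_list"})),
    Step("queue_op", "fill in local request", True),
    Step("queue_op", "link element onto queue", False, frozenset({"queue_head"})),
    Step("complete_op", "unlink element from queue", True, frozenset({"queue_head"})),
]

for s in suspects(path):
    print(f"{s.routine}: {s.label} touches {sorted(s.shared)} "
          "while reentrancy is possible")
```

Each step it reports is only a candidate. Following the procedure above,
the fix is to shrink the redzone around that step (for example with spl
protection) and rerun the check until nothing is reported.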