From owner-freebsd-arch@FreeBSD.ORG Tue Oct 5 18:32:18 2004
Message-ID: <4162E8B1.90803@elischer.org>
Date: Tue, 05 Oct 2004 11:32:17 -0700
From: Julian Elischer
To: Peter Holm
cc: Stephan Uphoff
cc: "freebsd-arch@freebsd.org"
Subject: Re: scheduler (sched_4bsd) questions
In-Reply-To: <20041005130308.GA2586@peter.osted.lan>
List-Id: Discussion related to FreeBSD architecture

Peter Holm wrote:

>On Mon, Oct 04, 2004 at 12:42:53PM -0700, Julian Elischer wrote:
>
>OK, I got a crash dump now, after a few modifications to kern_shutdown.c
>
>There are however a few strange things worth noticing:
>
>1) There is no panic string:
>
>Mounted root from ufs:/dev/ad0s1a.
>pid 1146: corrected slot count (2->1)
>[thread 100796]
>Stopped at sched_add+0x13: movl 0x14c(%esi),%ebx
>
>2) The gdb stack trace gets a bit weird at:
>
>#8 0xc07812da in calltrap () at ../../../i386/i386/exception.s:140
>#9 0xc05f0018 in flock (td=0x0, uap=0x0) at ../../../kern/kern_descrip.c:2138
>#10 0xc0619fd1 in setrunqueue (td=0xc2319180, flags=0x0) at kern_switch.c:521
>#11 0xc061921f in sched_wakeup (td=0xc2319180) at ../../../kern/sched_4bsd.c:859
>
>Where did flock() come from?
>

Probably just a partially initialised frame. ddb seems to have a good
trace, starting at setrunqueue().

There are two things to notice. Firstly, the "corrected slot count (2->1)"
message is still there (grumble). This happens when a threaded process
moves back to being an unthreaded process. For some reason, the number of
openings is not being set back to 1 but rather to 2. I believe this is
because, while in threaded mode, it is already too high by some amount
(sometimes equivalent to NTHREAD), but I cannot see why. Hopefully it is
not a fatal problem (as it would be if it were too LOW), but I hope to
figure it out soon (maybe another one for Stephan :-)

On the topic of the crash: the ktr shows no unexpected activity in the
time before the crash, no preemption or similar. It is possible that
there was an interrupt, but the ktr mask used shows nothing of the kind.
Maybe you could compile in and use a few more bits in the ktr masks to
show process events and interrupts.

In the absence of unexpected happenings we must assume the kseg runq is
in an odd state before it gets used in setrunqueue, leading to the panic.

I think I will check in some debug and cleanup stuff I have here. Maybe
it will shake out something.

>
>The full console output is at http://www.holm.cc/stress/log/cons82.html
>
>- Peter
>
>>ok, then if it happens again, from ddb, run
>>show ktr
>>after you've done the 'ps' and go back a couple of hundred events..
>>
>>thanks.
>>
>>Peter Holm wrote:
>>
>>>On Mon, Oct 04, 2004 at 11:57:45AM -0700, Julian Elischer wrote:
>>>
>>>>can you run ktrdump against the corefile and get the ktr output?
>>>>(you do have it enabled right?)
>>>>
>>>No, that's one of the problems: doadump() fails with this specific panic.
>>>
>>>- Peter
>>>
>>>>Peter Holm wrote:
>>>>
>>>>>On Mon, Oct 04, 2004 at 01:34:38PM -0400, Stephan Uphoff wrote:
>>>>>
>>>>>>On Mon, 2004-10-04 at 11:31, John Baldwin wrote:
>>>>>>
>>>>>>>On Friday 01 October 2004 12:13 am, Stephan Uphoff wrote:
>>>>>>>
>>>>>>>>On Wed, 2004-09-29 at 18:14, Stephan Uphoff wrote:
>>>>>>>>
>>>>>>>>>I was looking at the MUTEX_WAKE_ALL undefined case when I used the
>>>>>>>>>critical section for turnstile_claim().
>>>>>>>>>However there are bigger problems with MUTEX_WAKE_ALL undefined
>>>>>>>>>so you are right - the critical section for turnstile_claim is pretty
>>>>>>>>>useless.
>>>>>>>>>
>>>>>>>>Arghhh !!!
>>>>>>>>
>>>>>>>>MUTEX_WAKE_ALL is NOT an option in GENERIC.
>>>>>>>>I recall verifying that it is defined twice. Guess I must have looked
>>>>>>>>at the wrong source tree :-(
>>>>>>>>This means yes - we have bigger problems!
>>>>>>>>
>>>>>>>>Example:
>>>>>>>>
>>>>>>>>Thread A holds a mutex x contested by Thread B and C and has priority
>>>>>>>>pri(A).
>>>>>>>>
>>>>>>>>Thread C holds a mutex y and pri(B) < pri(C)
>>>>>>>>
>>>>>>>>Thread A releases the lock, wakes thread B, but leaves C on the
>>>>>>>>turnstile wait queue.
>>>>>>>>
>>>>>>>>An interrupt thread I tries to lock mutex y owned by C.
>>>>>>>>
>>>>>>>>However priority inheritance does not work since B needs to run first
>>>>>>>>to take ownership of the lock.
>>>>>>>>
>>>>>>>>I is blocked :-(
>>>>>>>>
>>>>>>>Ermm, if the interrupt happens after x is released then I's priority
>>>>>>>should propagate from I to C to B.
>>>>>>>
>>>>>>There is a hole after the mutex x is released by A - but before B can
>>>>>>claim the mutex. The turnstile for mutex x is unowned and interrupt
>>>>>>thread I when trying to donate its priority will run into:
>>>>>>
>>>>>>        if (td == NULL) {
>>>>>>                /*
>>>>>>                 * This really isn't quite right. Really
>>>>>>                 * ought to bump priority of thread that
>>>>>>                 * next acquires the lock.
>>>>>>                 */
>>>>>>                return;
>>>>>>        }
>>>>>>
>>>>>>So B needs to run and acquire the mutex before priority inheritance
>>>>>>works again and does not get a priority boost to do so.
>>>>>>
>>>>>>This is easy to fix and MUTEX_WAKE_ALL can be removed again at that time
>>>>>>- but my time budget is limited and Peter has an interesting bug left
>>>>>>that has priority.
>>>>>>
>>>>>I'm not closer to being able to create this panic in a controlled way.
>>>>>After a whole day of different tests I finally got this panic:
>>>>>http://www.holm.cc/stress/log/cons81.html. The trigger seems to be one
>>>>>particular Java applet, but it is not easily reproducible.
>>>>>
>>>>>- Peter
>>>>>
>>>>>>>If the interrupt happens before x is released,
>>>>>>>then the final bit of propagate_priority() should handle it since it
>>>>>>>resorts the turnstile's thread queue so that C will be awakened rather
>>>>>>>than B.
>>>>>>>
>>>>>>Agreed.
>>>>>>
>>>>>> Stephan
>>>>>>
>
>------------------------------------------------------------------------
>
>Index: kern_shutdown.c
>===================================================================
>RCS file: /home/ncvs/src/sys/kern/kern_shutdown.c,v
>retrieving revision 1.166
>diff -u -r1.166 kern_shutdown.c
>--- kern_shutdown.c 2 Sep 2004 18:59:15 -0000 1.166
>+++ kern_shutdown.c 5 Oct 2004 12:23:45 -0000
>@@ -230,10 +230,14 @@
>                 return;
>         }
>
>+        if (panicstr == NULL)
>+                panicstr = "In doadump()"; /* Major hack XXX pho */
>         savectx(&dumppcb);
>         dumptid = curthread->td_tid;
>         dumping++;
>         dumpsys(&dumper);
>+        if (!strcmp(panicstr, "In doadump()"))
>+                panicstr = NULL; /* Major hack XXX pho */
> }
>
> /*
>@@ -519,6 +523,8 @@
> #endif
>
> #ifdef KDB
>+        if (panicstr == NULL)
>+                panicstr = "(NULL)"; /* XXX pho */
>         if (newpanic && trace_on_panic)
>                 kdb_backtrace();
>         if (debugger_on_panic)
>
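
For anyone following the quoted priority-inheritance example, here is a rough user-space sketch of the hole Stephan describes. It is only a toy model under stated assumptions: the sim_* structures and functions are hypothetical stand-ins invented for illustration, not the kernel's turnstile code, and the priority values are arbitrary (lower number = more urgent). It shows the walk down the chain of lock owners stopping at the unowned mutex x, so C is boosted but B, which still has to run to claim x, is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-ins for threads and mutexes. */
struct sim_lock;

struct sim_thread {
	int priority;                /* lower number = more urgent */
	struct sim_lock *blocked_on; /* lock this thread waits for, or NULL */
};

struct sim_lock {
	struct sim_thread *owner;    /* NULL: released but not yet claimed */
};

/*
 * Lend the waiter's priority to each lock owner down the chain.
 * Returns the number of threads that were boosted.
 */
static int
sim_propagate_priority(struct sim_thread *waiter)
{
	struct sim_lock *lock;
	int boosted = 0;
	int prio;

	assert(waiter != NULL);
	prio = waiter->priority;
	lock = waiter->blocked_on;

	while (lock != NULL) {
		struct sim_thread *td = lock->owner;

		if (td == NULL) {
			/*
			 * The hole: the lock has no owner, so there is
			 * nobody to boost and the walk simply ends.
			 */
			return (boosted);
		}
		if (td->priority > prio) {
			td->priority = prio;
			boosted++;
		}
		lock = td->blocked_on;
	}
	return (boosted);
}

/*
 * Builds the situation from the thread: A has released x, B has been
 * woken but has not claimed x yet, C still waits on x and owns y, and
 * interrupt thread I blocks on y. Returns B's priority afterwards.
 */
static int
sim_hole_b_priority(void)
{
	struct sim_lock x = { NULL };
	struct sim_thread b = { 100, NULL }; /* runnable, owns nothing yet */
	struct sim_thread c = { 120, &x };   /* still queued on x */
	struct sim_lock y = { &c };
	struct sim_thread intr = { 10, &y }; /* interrupt thread I */

	(void)sim_propagate_priority(&intr);
	return (b.priority);
}
```

In this model sim_hole_b_priority() leaves B at its original priority, because nothing links the unowned lock x to the thread that will claim it next. That missing link is what the quoted comment ("ought to bump priority of thread that next acquires the lock") is about.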