Date: Wed, 6 Dec 2006 12:09:39 -0500 From: John Baldwin <jhb@freebsd.org> To: freebsd-stable@freebsd.org Cc: Gleb Smirnoff <glebius@freebsd.org> Subject: Re: RELENG_6 panic under heavy load Message-ID: <200612061209.40253.jhb@freebsd.org> In-Reply-To: <20061116160900.GQ32700@FreeBSD.org> References: <20061113084430.GE59604@dimma.mow.oilspace.com> <20061116111525.GO32700@FreeBSD.org> <20061116160900.GQ32700@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday 16 November 2006 11:09, Gleb Smirnoff wrote: > On Thu, Nov 16, 2006 at 02:15:25PM +0300, Gleb Smirnoff wrote: > T> On Thu, Nov 16, 2006 at 01:24:36PM +0300, Gleb Smirnoff wrote: > T> T> I wonder why UMA was suspected to be the problem. Dima gave > T> T> me access to the core. Here are more details from the trace: > > And even more: > > (kgdb) thread 133 > [Switching to thread 133 (Thread 100147)]#0 sched_switch (td=0xd745c900, newtd=0xd51f7a80, flags=2) at /usr/src/sys/kern/sched_4bsd.c:980 > 980 sched_lock.mtx_lock = (uintptr_t)td; > (kgdb) frame 9 > #9 0xd07a6e16 in syscall (frame= > {tf_fs = 134938683, tf_es = 59, tf_ds = -809566149, tf_edi = 134997504, tf_esi = 134998528, tf_ebp = -813707944, tf_isp = -170046108, tf_ebx = 672261300, tf_edx = 0, tf_ecx = 134969072, tf_eax = 1, tf_trapno = 0, tf_err = 2, tf_eip = 672832335, tf_cs = 51, tf_eflags = 646, tf_esp = -813707972, tf_ss = 59}) > at /usr/src/sys/i386/i386/trap.c:1034 > 1034 userret(td, &frame, sticks); > (kgdb) p *callp > $92 = {sy_narg = 65539, sy_call = 0xd0630550 <poll>, sy_auevent = 43012} > > (kgdb) set $poll = (struct thread *)0xd745c900 > (kgdb) set $fork = (struct thread *)0xd59aad80 > > (kgdb) p $poll->td_state > $93 = TDS_INHIBITED > (kgdb) p $poll->td_inhibitors > $94 = 1 == TDI_SUSPENDED > (kgdb) p/x $poll->td_flags > $96 = 0x1010c01 == TDF_BORROWING | TDF_BOUNDARY | TDF_ASTPENDING | TDF_NEEDRESCHED | TDF_SCHED0 > (kgdb) p $fork->td_state > $97 = TDS_INHIBITED > (kgdb) p $fork->td_inhibitors > $98 = 8 == TDI_LOCK > (kgdb) p/x $fork->td_flags > $99 = 0x1000000 == TDF_SCHED0 > > Not everything clear yet, but looks like: > > 1) $fork thread obtains proc lock > 2) $poll thread blocks on proc lock > 3) $fork thread has suspended the $poll thread in thread_single() > 4) $fork thread temporarily unlocks proc lock (line 821) and is > preempted by $poll thread > 5) $poll thread obtains proc lock, and starts doing its poll job > 6) $fork thread blocks on proc lock, and is added to its turnstile > 7) $poll thread drops the proc lock, but isn't preempted by $fork > 8) $poll thread exits and is preempted by $fork > > ...) and here is something difficult to understand, when $poll tries to > make $fork runnable, while $fork is trying to put itself in the turnstile > that is owned by $poll Hmm. I'm guessing the problem is the $poll thread is suspended (not exited) while holding the proc lock? That would appear to be the problem. That thread can't run again to release the lock. Ah, yes, I see the bug. Something like this should fix it: Index: kern_thread.c =================================================================== RCS file: /usr/cvs/src/sys/kern/kern_thread.c,v retrieving revision 1.216.2.6 diff -u -r1.216.2.6 kern_thread.c --- kern_thread.c 2 Sep 2006 17:29:57 -0000 1.216.2.6 +++ kern_thread.c 6 Dec 2006 17:06:26 -0000 @@ -969,7 +969,9 @@ TAILQ_REMOVE(&p->p_suspended, td, td_runq); TD_CLR_SUSPENDED(td); p->p_suspcount--; + critical_enter(); setrunnable(td); + critical_exit(); } /* What this does is force setrunnable() to be in a nested critical section so we won't preempt during setrunnable() until either the caller of thread_unsuspend_one() eventually releases sched_lock, or, in the case you ran into, the thread does a PROC_UNLOCK() and calls mi_switch(). -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200612061209.40253.jhb>