From owner-freebsd-current@FreeBSD.ORG Fri Jan 18 16:56:33 2013 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id D7387C42 for ; Fri, 18 Jan 2013 16:56:33 +0000 (UTC) (envelope-from rysto32@gmail.com) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 964557B4 for ; Fri, 18 Jan 2013 16:56:33 +0000 (UTC) Received: by mail-ob0-f182.google.com with SMTP id 16so3864593obc.41 for ; Fri, 18 Jan 2013 08:56:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=sSKQrlO/Sg0RcB3RXS4D1DLT9V2vo6ycA9tc3fNBhhs=; b=zewwNG0Nf0vjQKdw0SSKkvZDmsii0Cnck/DWKegqHJzUrNRAti5xyIHitt8WlwvQBt 1PIjH/XLZ1MMREBWaEN8qP+Z0+rN4i1dgOYx8vr/thYmpaKiv+lF9M0PKErRFZ31w8jZ KwFqpesYp0PTZpQGBsYSNwOdNH+mXcHPqKqd3gA/ODTY9QiFWLrAa/NUT+fUGrWLqygr nekAbGWU4Ay7BbsjNo9X4LC46L2dGBtjyttuo9WgFz7giyWdNbDn9ShCtyrmL/3PIsnC MLpQ2T6Qb4XrUPCkXxmLd/iyB9FT8pg8twfrCizvTz3L3ta4ehlvrzQWLYHAE6K8Sg/u f1YQ== MIME-Version: 1.0 X-Received: by 10.60.12.72 with SMTP id w8mr7646956oeb.83.1358528187620; Fri, 18 Jan 2013 08:56:27 -0800 (PST) Received: by 10.76.128.68 with HTTP; Fri, 18 Jan 2013 08:56:27 -0800 (PST) Date: Fri, 18 Jan 2013 11:56:27 -0500 Message-ID: Subject: ULE can leak TDQ_LOCK() if statclock() called outside of critical_enter() From: Ryan Stone To: FreeBSD Current Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 18 Jan 2013 16:56:33 -0000 I have been experiencing occasional deadlocks on FreeBSD 8.2 systems using the ULE scheduler. The root cause in every case has been that ULE's TDQ_LOCK for cpu 0 is owned by a thread that is not running. I have been investigating the issue, and I believe that I see the issue. The problem occurs if the interrupt that drives statclock does not call critical_enter() upon calling into statclock(). The lapic timer does use critical_enter(), so default configurations would not see this. I have local patches to use the RTC to drive statclock, and from a quick reading of the eventtimer code in -CURRENT the same thing is possible there. The RTC code does not call statclock within a critical section. So here's the bug: 1) A thread with interrupts enabled, running on CPU 0, with td_owepreempt=1 and td_critnest=1 calls critical_exit(): void critical_exit(void) { // ... if (td->td_critnest == 1) { td->td_critnest = 0; if (td->td_owepreempt && !kdb_active) { // Irrelevant bits snipped 2) td_critnest is set to 0, and then the RTC interrupt fires. 3) rtcintr calls into statclock (8.2) or statclock_cnt(head) with td_critnest still 0 (on head it goes through the eventtimer code, but it ends up in statclock eventually). 4) statclock takes the thread_lock on curthread, then calls sched_clock(). sched_clock calls sched_balance(); static void sched_balance(void) { // snip... tdq = TDQ_SELF(); TDQ_UNLOCK(tdq); sched_balance_group(cpu_top); TDQ_LOCK(tdq); } TDQ_UNLOCK does a spinlock_exit which does a critical_exit. td_critnest will be decremented back to 0 and td_owepreempt is still 1, so this triggers a preemption. Note that this TDQ_UNLOCK is (intentionally) unlocking the thread_lock done by statclock. 5) thread migrates to any other CPU, call it CPU n. tdq is now stale. TDQ_LOCK takes the lock for CPU 0 (but really it's intending to re-take the thread_lock, although a thread_lock() here would be equally incorrect -- sched_balance's caller is going to be mucking around with the TDQ when sched_balance returns). 6) The thread returns to statclock. statclock does a thread_unlock(). The td_lock is TDQ_LOCK(n), which we don't hold. We mangle the stat of TDQ_LOCK(n) and leave TDQ_LOCK(0) held. The simplest solution would be to do a critical_enter() in sched_balance, although that would be superfluous in the normal case where the lapic timer is driving statclock. I'm not sure if there's other code in the eventtimers infrastructure that's assuming that a preemption or migration is impossible while handling an event. A quick look at kern_clocksource.c turns up worrying comments like "Handle all events for specified time on this CPU" and uses of curcpu, so there may well be other issues lurking here. It looks to me that the safest thing to do would be to push the critical_enter() into the eventtimer code or even all the way back to the interrupt handlers (mirroring what the lapic code already does).