From owner-freebsd-current@FreeBSD.ORG  Fri Jan 18 16:56:33 2013
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id D7387C42
 for <freebsd-current@freebsd.org>; Fri, 18 Jan 2013 16:56:33 +0000 (UTC)
 (envelope-from rysto32@gmail.com)
Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com
 [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 964557B4
 for <freebsd-current@freebsd.org>; Fri, 18 Jan 2013 16:56:33 +0000 (UTC)
Received: by mail-ob0-f182.google.com with SMTP id 16so3864593obc.41
 for <freebsd-current@freebsd.org>; Fri, 18 Jan 2013 08:56:27 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:date:message-id:subject:from:to
 :content-type; bh=sSKQrlO/Sg0RcB3RXS4D1DLT9V2vo6ycA9tc3fNBhhs=;
 b=zewwNG0Nf0vjQKdw0SSKkvZDmsii0Cnck/DWKegqHJzUrNRAti5xyIHitt8WlwvQBt
 1PIjH/XLZ1MMREBWaEN8qP+Z0+rN4i1dgOYx8vr/thYmpaKiv+lF9M0PKErRFZ31w8jZ
 KwFqpesYp0PTZpQGBsYSNwOdNH+mXcHPqKqd3gA/ODTY9QiFWLrAa/NUT+fUGrWLqygr
 nekAbGWU4Ay7BbsjNo9X4LC46L2dGBtjyttuo9WgFz7giyWdNbDn9ShCtyrmL/3PIsnC
 MLpQ2T6Qb4XrUPCkXxmLd/iyB9FT8pg8twfrCizvTz3L3ta4ehlvrzQWLYHAE6K8Sg/u
 f1YQ==
MIME-Version: 1.0
X-Received: by 10.60.12.72 with SMTP id w8mr7646956oeb.83.1358528187620; Fri,
 18 Jan 2013 08:56:27 -0800 (PST)
Received: by 10.76.128.68 with HTTP; Fri, 18 Jan 2013 08:56:27 -0800 (PST)
Date: Fri, 18 Jan 2013 11:56:27 -0500
Message-ID: <CAFMmRNz7AQMRr2+-59702k_mr=3=DsFP9c7ejPEBC7d=4w0bTA@mail.gmail.com>
Subject: ULE can leak TDQ_LOCK() if statclock() called outside of
 critical_enter()
From: Ryan Stone <rysto32@gmail.com>
To: FreeBSD Current <freebsd-current@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Jan 2013 16:56:33 -0000

I have been experiencing occasional deadlocks on FreeBSD 8.2 systems using
the ULE scheduler.  The root cause in every case has been that ULE's
TDQ_LOCK for cpu 0 is owned by a thread that is not running.  I have been
investigating the issue, and I believe that I see the issue.  The problem
occurs if the interrupt that drives statclock does not call
critical_enter() upon calling into statclock().  The lapic timer does use
critical_enter(), so default configurations would not see this.  I have
local patches to use the RTC to drive statclock, and from a quick reading
of the eventtimer code in -CURRENT the same thing is possible there.  The
RTC code does not call statclock within a critical section.  So here's the
bug:

1) A thread with interrupts enabled, running on CPU 0, with td_owepreempt=1
and td_critnest=1 calls critical_exit():

void
critical_exit(void)
{
   // ...
    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt && !kdb_active) {
// Irrelevant bits snipped

2) td_critnest is set to 0, and then the RTC interrupt fires.

3) rtcintr calls into statclock (8.2) or statclock_cnt(head) with
td_critnest still 0 (on head it goes through the eventtimer code, but it
ends up in statclock eventually).

4) statclock takes the thread_lock on curthread, then calls sched_clock().
sched_clock calls sched_balance();

static void
sched_balance(void)
{
    // snip...
    tdq = TDQ_SELF();
    TDQ_UNLOCK(tdq);
    sched_balance_group(cpu_top);
    TDQ_LOCK(tdq);
}

TDQ_UNLOCK does a spinlock_exit which does a critical_exit.  td_critnest
will be decremented back to 0 and td_owepreempt is still 1, so this
triggers a preemption.  Note that this TDQ_UNLOCK is (intentionally)
unlocking the thread_lock done by statclock.

5) thread migrates to any other CPU, call it CPU n.  tdq is now stale.
TDQ_LOCK takes the lock for CPU 0 (but really it's intending to re-take the
thread_lock, although a thread_lock() here would be equally incorrect --
sched_balance's caller is going to be mucking around with the TDQ when
sched_balance returns).

6) The thread returns to statclock.  statclock does a thread_unlock(). The
td_lock is TDQ_LOCK(n), which we don't hold.  We mangle the stat of
TDQ_LOCK(n) and leave TDQ_LOCK(0) held.


The simplest solution would be to do a critical_enter() in sched_balance,
although that would be superfluous in the normal case where the lapic timer
is driving statclock.  I'm not sure if there's other code in the
eventtimers infrastructure that's assuming that a preemption or migration
is impossible while handling an event.  A quick look at kern_clocksource.c
turns up worrying comments like "Handle all events for specified time on
this CPU" and uses of curcpu, so there may well be other issues lurking
here.

It looks to me that the safest thing to do would be to push the
critical_enter() into the eventtimer code or even all the way back to the
interrupt handlers (mirroring what the lapic code already does).