From owner-freebsd-current@FreeBSD.ORG Wed Sep 1 16:54:15 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E2A1D10656A8; Wed, 1 Sep 2010 16:54:15 +0000 (UTC) (envelope-from mdf356@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 3E8868FC0A; Wed, 1 Sep 2010 16:54:14 +0000 (UTC) Received: by eyx24 with SMTP id 24so5053944eyx.13 for ; Wed, 01 Sep 2010 09:54:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=gPf3Z7FxlmMZCZUQ311kowMd4vygcefj4IaiQVyhSno=; b=VMsMxMhF1g23PoRjwocf3hFaGa8IxFdzbv/b30qc9L8t1p6Cu/zGUA75lgfl5CfnG6 LW0rt4TM6W6ER6sx4ZevMdZbdf86hOI5pdYPNXTL32ChAXVlB7LxPatB523hVNdbtTMH 9bq7Hfkjt9paLReJdiw1Prm7lbVMe3wwHYX8I= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=lJ++GvtXPs562kCB7JnkdoF7qGdwiVxDVhawpIH6bejhvHOnh+PEbPt+MPdqqcQuyy SwybQP5mekLJDZxqy069ESDzA8boyOIjGg8X4Rabodt6NonAUPyUTUdD0xquoJDbY9YV rsHJIWNCws3ob1WClyc1WgsSgYljpTK2C75S8= MIME-Version: 1.0 Received: by 10.213.34.4 with SMTP id j4mr641998ebd.72.1283360053992; Wed, 01 Sep 2010 09:54:13 -0700 (PDT) Sender: mdf356@gmail.com Received: by 10.213.30.13 with HTTP; Wed, 1 Sep 2010 09:54:13 -0700 (PDT) In-Reply-To: <201009010950.00422.jhb@freebsd.org> References: <201009010950.00422.jhb@freebsd.org> Date: Wed, 1 Sep 2010 09:54:13 -0700 X-Google-Sender-Auth: LeSFxI5uuyg8bOvC_BfHzyqmlO0 Message-ID: From: mdf@FreeBSD.org To: John Baldwin Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-current@freebsd.org, Jeff Roberson Subject: Re: sched_pin() bug in SCHED_ULE X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Sep 2010 16:54:16 -0000 On Wed, Sep 1, 2010 at 6:49 AM, John Baldwin wrote: > On Tuesday, August 31, 2010 2:53:12 pm mdf@freebsd.org wrote: >> On Tue, Aug 31, 2010 at 10:16 AM, =A0 wrote: >> > I recorded the stack any time ts->ts_cpu was set and when a thread was >> > migrated by sched_switch() I printed out the recorded info. =A0Here's >> > what I found: >> > >> > >> > XXX bug 67957: moving 0xffffff003ff9b800 from 3 to 1 >> > [1]: pin 0 state 4 move 3 -> 1 done by 0xffffff000cc44000: >> > #0 0xffffffff802b36b4 at bug67957+0x84 >> > #1 0xffffffff802b5dd4 at sched_affinity+0xd4 >> > #2 0xffffffff8024a707 at cpuset_setthread+0x137 >> > #3 0xffffffff8024aeae at cpuset_setaffinity+0x21e >> > #4 0xffffffff804a82df at freebsd32_cpuset_setaffinity+0x4f >> > #5 0xffffffff80295f49 at isi_syscall+0x99 >> > #6 0xffffffff804a630e at ia32_syscall+0x1ce >> > #7 0xffffffff8046dc60 at Xint0x80_syscall+0x60 >> > [0]: pin 0 state 2 move 0 -> 3 done by 0xffffff000cc44000: >> > #0 0xffffffff802b36b4 at bug67957+0x84 >> > #1 0xffffffff802b4ad8 at sched_add+0xe8 >> > #2 0xffffffff8029b96a at create_thread+0x34a >> > #3 0xffffffff8029badc at kern_thr_new+0x8c >> > #4 0xffffffff804a8912 at freebsd32_thr_new+0x122 >> > #5 0xffffffff80295f49 at isi_syscall+0x99 >> > #6 0xffffffff804a630e at ia32_syscall+0x1ce >> > #7 0xffffffff8046dc60 at Xint0x80_syscall+0x60 >> > >> > So one thread in the process called cpuset_setaffinity(2), and another >> > thread in the process was forcibly migrated by the IPI while returning >> > from a syscall, while it had td_pinned set. >> > >> > Given this path, it seems reasonable to me to skip the migrate if we >> > notice THREAD_CAN_MIGRATE is false. >> > >> > Opinions? =A0My debug code is below. =A0I'll try to write a short test= case >> > that exhibits this bug. >> >> Just a few more thoughts on this. =A0The check in sched_affinity for >> THREAD_CAN_MIGRATE is racy. =A0Since witness uses sched_pin, it's not >> simple to take the THREAD lock around an increment of td_pinned. =A0So >> I'm looking for suggestions on the best way to fix this issue. =A0My >> thoughts: >> >> 1) add a check in sched_switch() for THREAD_CAN_MIGRATE >> 2) have WITNESS not use sched_pin, and take the THREAD lock when >> modifying td_pinned >> 3) have the IPI_PREEMPT handler notice the thread is pinned (and not >> trying to bind) and postpone the mi_switch, just like it postpones >> when a thread is in a critical section. >> >> Except for the potential complexity of implementation, I think (2) is >> the best solution. > > I don't like 2) only because you'd really need to fix all the other place= s > that use sched_pin() to grab the thread lock. =A0That would be a bit expe= nsive. > I think the problem is code that checks THREAD_CAN_MIGRATE() for running > threads that aren't curthread. > > The sched_affinity() bug is probably my fault FWIW. =A0I think something = along > the lines of 1) is the best approach. =A0Maybe something like the patch b= elow. > It moves the CPU assignment out of sched_affinity() and moves it into > sched_switch() instead so that it takes effect on the next context switch= for > this thread. =A0That is the point at which the CPU binding actually takes= effect > anyway since sched_affinity() doesn't force an immediate context switch. = =A0This > patch is relative to 7. =A0The patch for 9 is nearly identical (just a fi= xup for > ipi_cpu() vs ipi_selected()). > > Index: sched_ule.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sched_ule.c (revision 212065) > +++ sched_ule.c (working copy) > @@ -1885,6 +1885,8 @@ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0srqflag =3D (flags & SW_PREEMPT) ? > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0SRQ_OURSELF|SRQ_YIELDING|SRQ_PREEM= PTED : > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0SRQ_OURSELF|SRQ_YIELDING; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (THREAD_CAN_MIGRATE(td) && !THREAD_CAN_S= CHED(td, ts->tscpu)) > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ts->ts_cpu =3D sched_pickcp= u(td, 0); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (ts->ts_cpu =3D=3D cpuid) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0tdq_add(tdq, td, srqflag); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0else > @@ -2536,7 +2538,6 @@ > =A0{ > =A0#ifdef SMP > =A0 =A0 =A0 =A0struct td_sched *ts; > - =A0 =A0 =A0 int cpu; > > =A0 =A0 =A0 =A0THREAD_LOCK_ASSERT(td, MA_OWNED); > =A0 =A0 =A0 =A0ts =3D td->td_sched; > @@ -2550,17 +2551,13 @@ > =A0 =A0 =A0 =A0if (!TD_IS_RUNNING(td)) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return; > =A0 =A0 =A0 =A0td->td_flags |=3D TDF_NEEDRESCHED; > - =A0 =A0 =A0 if (!THREAD_CAN_MIGRATE(td)) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 return; > =A0 =A0 =A0 =A0/* > - =A0 =A0 =A0 =A0* Assign the new cpu and force a switch before returning= to > - =A0 =A0 =A0 =A0* userspace. =A0If the target thread is not running loca= lly send > - =A0 =A0 =A0 =A0* an ipi to force the issue. > + =A0 =A0 =A0 =A0* Force a switch before returning to userspace. =A0If th= e > + =A0 =A0 =A0 =A0* target thread is not running locally send an ipi to fo= rce > + =A0 =A0 =A0 =A0* the issue. > =A0 =A0 =A0 =A0 */ > - =A0 =A0 =A0 cpu =3D ts->ts_cpu; > - =A0 =A0 =A0 ts->ts_cpu =3D sched_pickcpu(td, 0); > - =A0 =A0 =A0 if (cpu !=3D PCPU_GET(cpuid)) > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 ipi_selected(1 << cpu, IPI_PREEMPT); > + =A0 =A0 =A0 if (td !=3D curthread) > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 ipi_selected(1 << ts->ts_cpu, IPI_PREEMPT); > =A0#endif > =A0} I will test this patch out; thanks for the help! Two questions: 1) How does a thread get moved between CPUs when it's not running? I see that we change the runqueue for non-running threads that are on a runqueue. Does the code always check for THREAD_CAN_SCHED when adding a sleeping thread to a runqueue on wakeup? 2) I assume sched_switch() runs a lot more often than sched_pin() or sched_affinity(), so it would make sense to add any performance penalty of increased work in either of those places than in the scheduler. I suppose the two memory references for THREAD_CAN_MIGRATE and THREAD_CAN_SCHED won't be that expensive. Thanks, matthew