Date:      Wed, 1 Sep 2010 09:54:13 -0700
From:      mdf@FreeBSD.org
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-current@freebsd.org, Jeff Roberson <jroberson@jroberson.net>
Subject:   Re: sched_pin() bug in SCHED_ULE
Message-ID:  <AANLkTinA66XZ%2BjS11B3evNhWnFnyrxr0Lxa3onQYKVUu@mail.gmail.com>
In-Reply-To: <201009010950.00422.jhb@freebsd.org>
References:  <AANLkTin6V9pc3d7ifyOmR2V-H5-g_AQFji-LbZzWfAKM@mail.gmail.com> <AANLkTim9qSn_g=1dM6P0xfV=%2B616jMygPDbmM7RkyWGO@mail.gmail.com> <AANLkTinu9MW5Ckd%2BmXE-s8t8QvbYtsAHC9FNwtfOK%2BAF@mail.gmail.com> <201009010950.00422.jhb@freebsd.org>

On Wed, Sep 1, 2010 at 6:49 AM, John Baldwin <jhb@freebsd.org> wrote:
> On Tuesday, August 31, 2010 2:53:12 pm mdf@freebsd.org wrote:
>> On Tue, Aug 31, 2010 at 10:16 AM,  <mdf@freebsd.org> wrote:
>> > I recorded the stack any time ts->ts_cpu was set and when a thread was
>> > migrated by sched_switch() I printed out the recorded info.  Here's
>> > what I found:
>> >
>> >
>> > XXX bug 67957: moving 0xffffff003ff9b800 from 3 to 1
>> > [1]: pin 0 state 4 move 3 -> 1 done by 0xffffff000cc44000:
>> > #0 0xffffffff802b36b4 at bug67957+0x84
>> > #1 0xffffffff802b5dd4 at sched_affinity+0xd4
>> > #2 0xffffffff8024a707 at cpuset_setthread+0x137
>> > #3 0xffffffff8024aeae at cpuset_setaffinity+0x21e
>> > #4 0xffffffff804a82df at freebsd32_cpuset_setaffinity+0x4f
>> > #5 0xffffffff80295f49 at isi_syscall+0x99
>> > #6 0xffffffff804a630e at ia32_syscall+0x1ce
>> > #7 0xffffffff8046dc60 at Xint0x80_syscall+0x60
>> > [0]: pin 0 state 2 move 0 -> 3 done by 0xffffff000cc44000:
>> > #0 0xffffffff802b36b4 at bug67957+0x84
>> > #1 0xffffffff802b4ad8 at sched_add+0xe8
>> > #2 0xffffffff8029b96a at create_thread+0x34a
>> > #3 0xffffffff8029badc at kern_thr_new+0x8c
>> > #4 0xffffffff804a8912 at freebsd32_thr_new+0x122
>> > #5 0xffffffff80295f49 at isi_syscall+0x99
>> > #6 0xffffffff804a630e at ia32_syscall+0x1ce
>> > #7 0xffffffff8046dc60 at Xint0x80_syscall+0x60
>> >
>> > So one thread in the process called cpuset_setaffinity(2), and another
>> > thread in the process was forcibly migrated by the IPI while returning
>> > from a syscall, while it had td_pinned set.
>> >
>> > Given this path, it seems reasonable to me to skip the migrate if we
>> > notice THREAD_CAN_MIGRATE is false.
>> >
>> > Opinions?  My debug code is below.  I'll try to write a short test case
>> > that exhibits this bug.
>>
>> Just a few more thoughts on this.  The check in sched_affinity for
>> THREAD_CAN_MIGRATE is racy.  Since witness uses sched_pin, it's not
>> simple to take the THREAD lock around an increment of td_pinned.  So
>> I'm looking for suggestions on the best way to fix this issue.  My
>> thoughts:
>>
>> 1) add a check in sched_switch() for THREAD_CAN_MIGRATE
>> 2) have WITNESS not use sched_pin, and take the THREAD lock when
>> modifying td_pinned
>> 3) have the IPI_PREEMPT handler notice the thread is pinned (and not
>> trying to bind) and postpone the mi_switch, just like it postpones
>> when a thread is in a critical section.
>>
>> Except for the potential complexity of implementation, I think (2) is
>> the best solution.
>
> I don't like 2) only because you'd really need to fix all the other places
> that use sched_pin() to grab the thread lock.  That would be a bit expensive.
> I think the problem is code that checks THREAD_CAN_MIGRATE() for running
> threads that aren't curthread.
>
> The sched_affinity() bug is probably my fault FWIW.  I think something along
> the lines of 1) is the best approach.  Maybe something like the patch below.
> It moves the CPU assignment out of sched_affinity() and into
> sched_switch() instead so that it takes effect on the next context switch for
> this thread.  That is the point at which the CPU binding actually takes effect
> anyway since sched_affinity() doesn't force an immediate context switch.  This
> patch is relative to 7.  The patch for 9 is nearly identical (just a fixup for
> ipi_cpu() vs ipi_selected()).
>
> Index: sched_ule.c
> ===================================================================
> --- sched_ule.c (revision 212065)
> +++ sched_ule.c (working copy)
> @@ -1885,6 +1885,8 @@
>                 srqflag = (flags & SW_PREEMPT) ?
>                     SRQ_OURSELF|SRQ_YIELDING|SRQ_PREEMPTED :
>                     SRQ_OURSELF|SRQ_YIELDING;
> +               if (THREAD_CAN_MIGRATE(td) && !THREAD_CAN_SCHED(td, ts->ts_cpu))
> +                       ts->ts_cpu = sched_pickcpu(td, 0);
>                 if (ts->ts_cpu == cpuid)
>                         tdq_add(tdq, td, srqflag);
>                 else
> @@ -2536,7 +2538,6 @@
>  {
>  #ifdef SMP
>         struct td_sched *ts;
> -       int cpu;
>
>         THREAD_LOCK_ASSERT(td, MA_OWNED);
>         ts = td->td_sched;
> @@ -2550,17 +2551,13 @@
>         if (!TD_IS_RUNNING(td))
>                 return;
>         td->td_flags |= TDF_NEEDRESCHED;
> -       if (!THREAD_CAN_MIGRATE(td))
> -               return;
>         /*
> -        * Assign the new cpu and force a switch before returning to
> -        * userspace.  If the target thread is not running locally send
> -        * an ipi to force the issue.
> +        * Force a switch before returning to userspace.  If the
> +        * target thread is not running locally send an ipi to force
> +        * the issue.
>          */
> -       cpu = ts->ts_cpu;
> -       ts->ts_cpu = sched_pickcpu(td, 0);
> -       if (cpu != PCPU_GET(cpuid))
> -               ipi_selected(1 << cpu, IPI_PREEMPT);
> +       if (td != curthread)
> +               ipi_selected(1 << ts->ts_cpu, IPI_PREEMPT);
>  #endif
>  }

I will test this patch out; thanks for the help!

Two questions:

1) How does a thread get moved between CPUs when it's not running?  I
see that we change the runqueue for non-running threads that are on a
runqueue.  Does the code always check for THREAD_CAN_SCHED when adding
a sleeping thread to a runqueue on wakeup?

2) I assume sched_switch() runs a lot more often than sched_pin() or
sched_affinity(), so it would make sense to put any performance
penalty of increased work in either of those places rather than in the
scheduler.  That said, I suppose the two extra memory references for
THREAD_CAN_MIGRATE and THREAD_CAN_SCHED won't be that expensive.

Thanks,
matthew


