From owner-freebsd-hackers@FreeBSD.ORG Wed Nov  7 07:45:52 2012
Message-ID: <509A11B4.7030605@freebsd.org>
Date: Wed, 07 Nov 2012 15:45:56 +0800
From: David Xu <davidxu@freebsd.org>
To: Jeff Roberson
Cc: attilio@freebsd.org, freebsd-hackers, Jeff Roberson, Andriy Gapon
Subject: Re: ule+smp: small optimization for turnstile priority lending

On 2012/11/07 14:17, Jeff Roberson wrote:
> On Wed, 7 Nov 2012, David Xu wrote:
>
>> On 2012/11/06 19:03, Attilio Rao wrote:
>>> On 9/20/12, David Xu wrote:
>>>> On 2012/09/18 22:05, Andriy Gapon wrote:
>>>>>
>>>>> Here is a snippet that demonstrates the issue on a supposedly
>>>>> fully loaded 2-processor system:
>>>>>
>>>>> 136794 0 3670427870244462 KTRGRAPH group:"thread", id:"Xorg tid
>>>>> 102818", state:"running", attributes: prio:122
>>>>>
>>>>> 136793 0 3670427870241000 KTRGRAPH group:"thread", id:"cc1plus tid
>>>>> 111916", state:"yielding", attributes: prio:183, wmesg:"(null)",
>>>>> lockname:"(null)"
>>>>>
>>>>> 136792 1 3670427870240829 KTRGRAPH group:"thread", id:"idle: cpu1
>>>>> tid 100004", state:"running", attributes: prio:255
>>>>>
>>>>> 136791 1 3670427870239520 KTRGRAPH group:"load", id:"CPU 1 load",
>>>>> counter:0, attributes: none
>>>>>
>>>>> 136790 1 3670427870239248 KTRGRAPH group:"thread", id:"firefox tid
>>>>> 113473", state:"blocked", attributes: prio:122, wmesg:"(null)",
>>>>> lockname:"unp_mtx"
>>>>>
>>>>> 136789 1 3670427870237697 KTRGRAPH group:"load", id:"CPU 0 load",
>>>>> counter:2, attributes: none
>>>>>
>>>>> 136788 1 3670427870236394 KTRGRAPH group:"thread", id:"firefox tid
>>>>> 113473", point:"wokeup", attributes: linkedto:"Xorg tid 102818"
>>>>>
>>>>> 136787 1 3670427870236145 KTRGRAPH group:"thread", id:"Xorg tid
>>>>> 102818", state:"runq add", attributes: prio:122,
>>>>> linkedto:"firefox tid 113473"
>>>>>
>>>>> 136786 1 3670427870235981 KTRGRAPH group:"load", id:"CPU 1 load",
>>>>> counter:1, attributes: none
>>>>>
>>>>> 136785 1 3670427870235707 KTRGRAPH group:"thread", id:"Xorg tid
>>>>> 102818", state:"runq rem", attributes: prio:176
>>>>>
>>>>> 136784 1 3670427870235423 KTRGRAPH group:"thread", id:"Xorg tid
>>>>> 102818", point:"prio", attributes: prio:176, new prio:122,
>>>>> linkedto:"firefox tid 113473"
>>>>>
>>>>> 136783 1 3670427870202392 KTRGRAPH group:"thread", id:"firefox tid
>>>>> 113473", state:"running", attributes: prio:104
>>>>>
>>>>> See how the Xorg thread was forced from CPU 1 to CPU 0, where it
>>>>> preempted the cc1plus thread (I do have preemption enabled), only
>>>>> to leave CPU 1 with zero load.
>>>>>
>>>>> Here is a proposed solution:
>>>>>
>>>>>     turnstile_wait: optimize priority lending to a thread on a
>>>>>     runqueue
>>>>>
>>>>>     As the current thread is definitely going into mi_switch, it
>>>>>     now removes its load before doing priority propagation, which
>>>>>     can potentially result in sched_add.  In the SMP && ULE case
>>>>>     the latter searches for the least loaded CPU to place a
>>>>>     boosted thread, which is supposedly about to run.
>>>>>
>>>>> diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
>>>>> index 8e466cd..3299cae 100644
>>>>> --- a/sys/kern/sched_ule.c
>>>>> +++ b/sys/kern/sched_ule.c
>>>>> @@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int flags)
>>>>>  		/* This thread must be going to sleep. */
>>>>>  		TDQ_LOCK(tdq);
>>>>>  		mtx = thread_lock_block(td);
>>>>> -		tdq_load_rem(tdq, td);
>>>>> +#if defined(SMP)
>>>>> +		if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
>>>>> +#endif
>>>>> +			tdq_load_rem(tdq, td);
>>>>>  	}
>>>>>  	/*
>>>>>  	 * We enter here with the thread blocked and assigned to the
>>>>> @@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
>>>>>  	tdq_setlowpri(tdq, NULL);
>>>>>  }
>>>>>
>>>>> +void
>>>>> +sched_load_rem(struct thread *td)
>>>>> +{
>>>>> +	struct tdq *tdq;
>>>>> +
>>>>> +	KASSERT(td == curthread,
>>>>> +	    ("sched_rem_load: only curthread is supported"));
>>>>> +	KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
>>>>> +	    ("thread running on cpu different from ts_cpu"));
>>>>> +	tdq = TDQ_CPU(td->td_sched->ts_cpu);
>>>>> +	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>>>> +	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
>>>>> +	tdq_load_rem(tdq, td);
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * Fetch cpu utilization information.  Updates on demand.
>>>>>   */
>>>>> diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
>>>>> index 31d16fe..d1d68e9 100644
>>>>> --- a/sys/kern/subr_turnstile.c
>>>>> +++ b/sys/kern/subr_turnstile.c
>>>>> @@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue)
>>>>>  		LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
>>>>>  	}
>>>>>  	thread_lock(td);
>>>>> +#if defined(SCHED_ULE) && defined(SMP)
>>>>> +	/*
>>>>> +	 * Remove load earlier so that it does not affect cpu selection
>>>>> +	 * for a thread woken up due to priority lending, if any.
>>>>> +	 */
>>>>> +	sched_load_rem(td);
>>>>> +#endif
>>>>>  	thread_lock_set(td, &ts->ts_lock);
>>>>>  	td->td_turnstile = NULL;
>>>>>
>>>>> diff --git a/sys/sys/sched.h b/sys/sys/sched.h
>>>>> index 4b8387c..b1ead1b 100644
>>>>> --- a/sys/sys/sched.h
>>>>> +++ b/sys/sys/sched.h
>>>>> @@ -110,6 +110,9 @@ void sched_preempt(struct thread *td);
>>>>>  void	sched_add(struct thread *td, int flags);
>>>>>  void	sched_clock(struct thread *td);
>>>>>  void	sched_rem(struct thread *td);
>>>>> +#if defined(SCHED_ULE) && defined(SMP)
>>>>> +void	sched_load_rem(struct thread *td);
>>>>> +#endif
>>>>>  void	sched_tick(int cnt);
>>>>>  void	sched_relinquish(struct thread *td);
>>>>>  struct thread *sched_choose(void);
>>>>>
>>>>
>>>> I found another scenario in taskqueue: in the function
>>>> taskqueue_terminate, the current thread wakes another thread up and
>>>> then sleeps immediately, and the tq_mutex is sometimes a spin lock.
>>>> So if you remove the current thread's load from its cpu before the
>>>> wakeup, the resumed thread may be placed on that same cpu, which
>>>> would optimize the cpu scheduling there too.
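
(For reference, the wake-then-sleep pattern in question looks roughly
like the following; this is a condensed sketch from memory of the
taskqueue shutdown path, not the exact code:)

	/*
	 * The terminating caller repeatedly wakes the queue's service
	 * threads and then immediately sleeps on the same channel until
	 * they have all exited.  tq_mutex may be either a sleep mutex
	 * or a spin mutex depending on the queue type, so TQ_SLEEP
	 * expands to msleep() or msleep_spin() accordingly.
	 */
	while (tq->tq_tcount > 0 || tq->tq_callouts > 0) {
		wakeup(tq);
		TQ_SLEEP(tq, pp, &tq->tq_mutex, PWAIT,
		    "taskqueue_destroy", 0);
	}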
>>>
>>> I think that in order to fit with the sched_add() modifications I
>>> have in mind (see other patches within this thread), wakeup() should
>>> grow a new argument, or maybe we can add a new wakeup_flags() KPI.
>>> If the latter is the case, I would also propose to let wakeup_one()
>>> be absorbed into wakeup_flags() with its own flag.
>>>
>>
>> Yes, I like the idea.
>
> From ~2007:
>
> http://people.freebsd.org/~jeff/wakeupflags.diff
>
> This has some different optimizations that can be done when you have a
> hinted wakeup.
>
> Thanks,
> Jeff

wakeup_flags() is a good starting point, but what flags should be
supported?  I think WAKEUP_WILLSLEEP may be needed for the case where
the current thread will switch out as soon as possible.

I see you have added WAKEUP_LOCAL and WAKEUP_TAIL.  Are they for cache
optimization?  Both of them are arguable.

WAKEUP_LOCAL increases cpu migration, which is not good: one benefit of
ULE is keeping a thread on its old cpu, and WAKEUP_LOCAL violates that
design.  As used in the pipe code, WAKEUP_LOCAL may not be a win if the
pipe only has a few bytes to be transferred, or if the receiver only
eats a few bytes each time; why bother to move the other thread to the
local cpu?  If the other thread has many bytes cached on its old cpu,
the migration is expensive.

WAKEUP_TAIL is a good idea; I guess you want to wake up the hot thread
whose code and data are still in its old cpu's cache.  But this will
lead to unfairness.  I'd like the sleep queue to implement a policy
like the one I did in libthr: it picks the thread at the tail of the
queue a fixed percentage of the time.  That does have some level of
unfairness, but it is not fatal at all.
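
Concretely, the policy could look roughly like the sketch below.  This
is hypothetical: sleepq_pick() and its queue argument are illustrative
names, not the real sleepqueue(9) internals, and the percentage is just
an example.

	/*
	 * Bounded tail-pick policy, modeled on what I did in libthr.
	 * Most of the time the oldest (head) waiter is chosen,
	 * preserving FIFO fairness; a fixed fraction of the time the
	 * newest (tail) waiter is chosen instead, since it is the most
	 * likely to still be cache-hot on its old cpu.
	 */
	static struct thread *
	sleepq_pick(struct threadqueue *blocked)
	{

		/* Favor the cache-hot tail roughly 25% of the time. */
		if ((random() % 4) == 0)
			return (TAILQ_LAST(blocked, threadqueue));
		return (TAILQ_FIRST(blocked));
	}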
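
To sum up, the combined KPI I am imagining might have a shape like the
following.  This is only a sketch: WAKEUP_LOCAL and WAKEUP_TAIL are
taken from your diff, but their semantics here are my reading of it,
while WAKEUP_ONE and WAKEUP_WILLSLEEP (and all the flag values) are
hypothetical.

	#define	WAKEUP_ONE		0x01	/* wake a single waiter;
						   absorbs wakeup_one() */
	#define	WAKEUP_WILLSLEEP	0x02	/* caller will mi_switch()
						   right away, so its load
						   should not affect cpu
						   selection */
	#define	WAKEUP_LOCAL		0x04	/* hint: prefer the
						   caller's cpu */
	#define	WAKEUP_TAIL		0x08	/* hint: wake the most
						   recently queued, likely
						   cache-hot waiter */

	void	wakeup_flags(void *chan, int flags);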