Date: Fri, 13 Aug 2004 20:16:42 -0700 (PDT)
From: Don Lewis
To: rwatson@FreeBSD.org
cc: jroberson@chesapeake.net
cc: freebsd-current@FreeBSD.org
Subject: Re: nice handling in ULE (was: Re: SCHEDULE and high load situations)

On 13 Aug, To: rwatson@freebsd.org wrote:
> On 12 Aug, Don Lewis wrote:
>
>> I did some experimentation, and the problem I'm seeing appears to just
>> be related to how nice values are handled by ULE.  I'm running two
>> copies of the following program, one at nice +15, and the other not
>> niced:
>>
>> hairball:~ 102>cat sponge.c
>> int
>> main(int argc, char **argv)
>> {
>> 	while (1)
>> 		;
>> }
>>
>> The niced process was started second, but it has accumulated more CPU
>> time and is getting a larger percentage of the CPU time according to
>> top.
>>
>> last pid:   662;  load averages:  2.00,  1.95,  1.45   up 0+00:22:35  15:14:27
>> 31 processes:  3 running, 28 sleeping
>> CPU states: 45.3% user, 53.1% nice,  1.2% system,  0.4% interrupt,  0.0% idle
>> Mem: 22M Active, 19M Inact, 44M Wired, 28K Cache, 28M Buf, 408M Free
>> Swap: 1024M Total, 1024M Free
>> Seconds to delay:
>>   PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
>>   599 dl       139   15  1180K   448K RUN      8:34 53.91% 53.91% sponge
>>   598 dl       139    0  1180K   448K RUN      7:22 42.97% 42.97% sponge
>>   587 dl        76    0  2288K  1580K RUN      0:03  0.00%  0.00% top
>>   462 root      76    0 56656K 46200K select   0:02  0.00%  0.00% Xorg
>>   519 gdm       76    0 11252K  8564K select   0:01  0.00%  0.00% gdmlogin
>>   579 dl        76    0  6088K  2968K select   0:00  0.00%  0.00% sshd
>>
>> I thought it might have something to do with grouping by niceness,
>> which would group the un-niced process with a bunch of other processes
>> that wake up every now and then for a little bit of CPU time, so I
>> tried the experiment again with nice +1 and nice +15.  This gave a
>> rather interesting result.  Top reports the nice +15 process as
>> getting a higher %CPU, but the nice +1 process has slowly accumulated
>> a bit more total CPU time.  The difference in total CPU time was
>> initially seven seconds or less.
>>
>> last pid:   745;  load averages:  2.00,  1.99,  1.84   up 0+00:43:30  15:35:22
>> 31 processes:  3 running, 28 sleeping
>> CPU states:  0.0% user, 99.6% nice,  0.4% system,  0.0% interrupt,  0.0% idle
>> Mem: 22M Active, 19M Inact, 44M Wired, 28K Cache, 28M Buf, 408M Free
>> Swap: 1024M Total, 1024M Free
>> Seconds to delay:
>>   PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
>>   675 dl       139   15  1180K   448K RUN      9:48 52.34% 52.34% sponge
>>   674 dl       139    1  1180K   448K RUN     10:03 44.53% 44.53% sponge
>>   587 dl        76    0  2288K  1580K RUN      0:06  0.00%  0.00% top
>>   462 root      76    0 56656K 46200K select   0:03  0.00%  0.00% Xorg
>>   519 gdm       76    0 11252K  8564K select   0:02  0.00%  0.00% gdmlogin
>>   579 dl        76    0  6088K  2968K select   0:00  0.00%  0.00% sshd
>
> I compiled a kernel with the KTR stuff and ran this last experiment
> again.  It looks like the two niced processes get the appropriate slice
> values assigned by ULE, and they both have the same priority.  Where
> things seem to be going wrong is that the two processes are being run
> in a round-robin fashion, alternating execution once every tick or two.
> The less-nice process gets preempted multiple times by the more-nice
> process before the less-nice process has exhausted its slice.

I managed to figure out a bit more of what is going on.  The following
code in sched_choose() pulls the selected kse off the run queue:

	ke = kseq_choose(kseq);
	if (ke) {
#ifdef SMP
		if (ke->ke_ksegrp->kg_pri_class == PRI_IDLE)
			if (kseq_idled(kseq) == 0)
				goto restart;
#endif
		kseq_runq_rem(kseq, ke);
		ke->ke_state = KES_THREAD;

		if (ke->ke_ksegrp->kg_pri_class == PRI_TIMESHARE) {
			CTR4(KTR_ULE, "Run kse %p from %p (slice: %d, pri: %d)",
			    ke, ke->ke_runq, ke->ke_slice,
			    ke->ke_thread->td_priority);
		}
		return (ke);
	}

At some later time, setrunqueue() gets called for this thread, which
calls sched_add(), which calls sched_add_internal(), which executes the
following code fragment:

	class = PRI_BASE(kg->kg_pri_class);
	switch (class) {
	case PRI_ITHD:
	case PRI_REALTIME:
		ke->ke_runq = kseq->ksq_curr;
		ke->ke_slice = SCHED_SLICE_MAX;
		ke->ke_cpu = PCPU_GET(cpuid);
		break;
	case PRI_TIMESHARE:
		if (SCHED_CURR(kg, ke))
			ke->ke_runq = kseq->ksq_curr;
		else
			ke->ke_runq = kseq->ksq_next;
		break;
[snip]
	kseq_runq_add(kseq, ke);
	kseq_load_add(kseq, ke);

Because the thread is a CPU hog, it gets put on the next run queue, even
though it hasn't exhausted its current slice, which means that it has to
wait for all the other CPU hogs to get a turn at the CPU before it can
execute again.

I don't know how to fix this problem.  I think the desired behaviour
would be for the kse to be restored to its previous location on the run
queue.
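
To make that a bit more concrete, here is a rough, untested sketch (not
a real patch) of what the PRI_TIMESHARE case in sched_add_internal()
might look like if a kse that still has part of its slice left were kept
on the current queue.  The ke_slice > 0 test is only my guess at how to
detect an unexpired slice:

	case PRI_TIMESHARE:
		/*
		 * Untested sketch: also keep a kse on the current queue
		 * when it still has slice left, so that being preempted
		 * does not cost it a full trip through ksq_next.  The
		 * ke_slice > 0 test is only a guess; it does not
		 * distinguish a preempted kse from one that is waking
		 * up after a voluntary sleep.
		 */
		if (SCHED_CURR(kg, ke) || ke->ke_slice > 0)
			ke->ke_runq = kseq->ksq_curr;
		else
			ke->ke_runq = kseq->ksq_next;
		break;

This is probably too coarse, since it would also promote CPU hogs that
are merely waking up from a sleep, so something that actually remembers
which queue the kse was pulled from in sched_choose() would be closer to
the behaviour I described above.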