Date:      Tue, 18 Aug 1998 18:38:13 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        Lars.Koeller@post.uni-bielefeld.de (Lars Köller)
Cc:        tlambert@primenet.com, chuckr@glue.umd.edu, freebsd-smp@FreeBSD.ORG
Subject:   Re: Per processor load?
Message-ID:  <199808181838.LAA20956@usr06.primenet.com>
In-Reply-To: <199808180539.FAA26168@mitch.hrz.uni-bielefeld.de> from "Lars Köller" at Aug 18, 98 07:39:30 am

>  > For a symmetric system, if the load is 1.0 or above, both CPU's
>  > should be actively working.
>  > 
>  > I suppose that what you are asking for is a "processor not idle
>  > in the case of 1.0 >= load >= 0.0".
>  > 
>  > To get this, you would have to insert counters into the per CPU
>  > idle loops, probably using the Appendix H cycle counter before and
>  > after the per CPU HLT instruction, subtracting the count at exit
>  > of the last HLT from both, and then subtracting the entry from the
>  > exit, and dividing to get an "idle ratio".
>  > 
>  > Gathering this type of statistic could be actively harmful to CPU
>  > latency coming out of the HLT condition, and could be as high as 10%
>  > to 20% of the system's ability to do work.
> 
> The basic idea was to treat the CPU's as separate systems, each with
> its own load. This is well known from HPUX, Linux, Solaris, ...
> They display the following in, e.g. top:
> 
> System: share                                        Tue Aug 18 07:30:58 1998
> Load averages: 2.42, 2.29, 2.28
> 280 processes: 273 sleeping, 5 running, 2 zombies
> Cpu states:
> CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
>  0    2.62   0.4%  97.6%   2.0%   0.0%   0.0%   0.0%   0.0%   0.0%
>  1    2.22   0.8%  97.0%   2.2%   0.0%   0.0%   0.0%   0.0%   0.0%
> ---   ----  -----  -----  -----  -----  -----  -----  -----  -----
> avg   2.42   0.6%  97.2%   2.2%   0.0%   0.0%   0.0%   0.0%   0.0%


This basically implies a scheduler artifact: each CPU must have its own
ready-to-run queue for you to get this statistic.  I'm sure that on
Solaris, at least, you have to know how to grovel /dev/kmem for the
information.
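
For what it's worth, the sort of /dev/kmem groveling involved looks
roughly like this with libkvm.  This is only a sketch: it reads the one
global load average (FreeBSD has no per-CPU run queue symbol to read,
which is rather the point), and the symbol name may differ elsewhere.
Link with -lkvm and run it with enough privilege to open /dev/kmem:

    #include <sys/types.h>
    #include <sys/resource.h>           /* struct loadavg */
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <nlist.h>
    #include <stdio.h>

    static struct nlist nl[] = {
        { "_averunnable" },             /* global load average */
        { NULL },
    };

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        struct loadavg avg;
        kvm_t *kd;

        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL || kvm_nlist(kd, nl) != 0 || nl[0].n_value == 0)
            return (1);
        if (kvm_read(kd, nl[0].n_value, &avg, sizeof(avg)) != sizeof(avg))
            return (1);
        printf("load: %.2f\n", (double)avg.ldavg[0] / avg.fscale);
        kvm_close(kd);
        return (0);
    }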

FreeBSD is symmetric.  That is, there is only one ready-to-run queue
for all processors.  Anything else would risk job starvation, or at
least inequity: on one processor the jobs you are competing with might
use 75% of their quantum, being compute intensive, while on the other
they use only 10% of their quantum, being I/O intensive.

To combat this, the scheduler needs highly complex changes to ensure
CPU affinity and to migrate processes based on their average
behaviour.  This is an imperfect homeostasis, at best.  To my
knowledge, only Solaris attempts this, and it only started doing it
halfway well in 2.5.
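
Just to make the contrast concrete, a per-CPU-queue scheduler with a
simple affinity/migration heuristic would look schematically like the
following.  This is purely illustrative; FreeBSD keeps the single
global queue described above, and every name here is invented:

    #define NCPU            2
    #define MIGRATE_DELTA   4       /* imbalance tolerated before migrating */

    struct proc;                    /* opaque here */

    struct cpu_runq {
        struct proc     *rq_head;   /* this CPU's ready-to-run list */
        int              rq_len;    /* number of runnable processes */
    };

    struct cpu_runq runq[NCPU];

    /*
     * Pick a CPU for a newly runnable process: prefer the CPU it last
     * ran on (cache affinity), unless some other CPU's queue is enough
     * shorter that migrating is likely to be a net win.
     */
    int
    choose_cpu(int last_cpu)
    {
        int cpu, best;

        best = last_cpu;
        for (cpu = 0; cpu < NCPU; cpu++)
            if (runq[cpu].rq_len + MIGRATE_DELTA < runq[best].rq_len)
                best = cpu;
        return (best);
    }

The hard part is not the placement decision but keeping the per-CPU
load estimates honest, which is where the imperfect homeostasis comes
in.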

CPU affinity is a big win for non-cache-busting programs (it ensures
some of the cache will still be valid the next time the process is
run), but for most servers it's a NOP.  Depending on your application
mix, expect to lose (or perhaps gain) some total compute power from it.

As far as INTR time goes, I notice it's not reported.  This is not
surprising.  In symmetric (APIC) I/O mode, an interrupt is directed to
any available processor, lowest APIC ID first (see the Intel MP Spec,
version 1.4).  It's really not possible to determine which CPU is
actually getting a given interrupt unless you modify the ISR to record
the APIC ID and reverse-look it up (an expensive operation) on each
interrupt.
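
If you really wanted it, the ISR modification amounts to something
like this.  It's only a sketch; "lapic_base" and the apic_to_cpu[]
reverse map are stand-ins for whatever the kernel actually uses to
reach the local APIC and to map APIC IDs back to logical CPU numbers:

    /*
     * The local APIC ID register is at offset 0x20 from the local
     * APIC base, with the ID in bits 24-31; apic_to_cpu[] is the
     * reverse lookup mentioned above, shown here as a simple table
     * for brevity.
     */
    #define NCPU    2

    extern volatile unsigned int *lapic_base;   /* mapped local APIC */
    extern int apic_to_cpu[16];                 /* APIC ID -> logical CPU */

    unsigned long intr_count[NCPU];             /* per-CPU interrupt counts */

    static __inline int
    current_cpu(void)
    {
        unsigned int apic_id;

        apic_id = lapic_base[0x20 / sizeof(*lapic_base)] >> 24;
        return (apic_to_cpu[apic_id & 0xf]);
    }

    /* ...and at the top of every interrupt handler: */
    /*         intr_count[current_cpu()]++;          */

Even that only tells you which CPU fielded the interrupt, not what it
cost, and the bookkeeping itself adds a little latency to every
interrupt.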

I notice the other fields are not reported as well, probably for
similar reasons.


> Memory: 180344K (29336K) real, 256220K (66940K) virtual, 5160K free  Page# 1/26
> 
> CPU TTY   PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
>  0    ? 19703 mcfutz   251 25   632K   116K run      6:05 80.27 80.13 schlu
>  1    ? 19721 physik   251 25   632K   112K run      4:52 49.42 49.34 process
>  1    ?  5375 plond    251 25 34756K 15900K run   2173:38 46.66 46.58 l502.exe

Pretty obviously, there aren't two running processes on that one CPU.
A CPU can be in user space in only one process at a time.  8-).

I think what they are doing, since they can tell you the CPU, is either
recording which CPU the process last ran on, *or* reporting which of
the multiple run queues the process is on.

The way to tell this would be to dump this information, and then count
the number of processes on one CPU or the other.  If there isn't an
imbalance, then they are talking ready-to-run queue.

If there *is* an imbalance, then they *may* still be talking
ready-to-run queue, if they are doing cache-busting round-robin
placement.  This would be a design error, but may be what is happening
in an attempt to achieve load balance between the CPU's: if you last
ran on CPU M of N, you next run on CPU M + 1, and when M is the last
CPU, you wrap back around to CPU 0.  For programs that don't benefit
from L1 cache this is arguably a win; but again, they would be
special-casing the code for something that isn't a very general-purpose
use.
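
In code, that round-robin placement is just (a sketch):

    /* Cache-busting round-robin: ignore affinity entirely and hand
     * each newly runnable process to the next CPU in turn. */
    int
    choose_cpu_rr(int last_cpu, int ncpus)
    {
        return ((last_cpu + 1) % ncpus);    /* wraps back to CPU 0 */
    }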

In either case, the statistics that *I* would find interesting are
"process migration rate" and "cache miss rate"; the second would
be as hard to do as "idle time".  8-(.
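
(For completeness, the idle-ratio hack from the quoted text at the top
would look something like this in each CPU's idle loop.  It's untested,
Pentium-or-later only, carries the latency caveats already quoted, and
the counter array is invented for the example:)

    #define NCPU    2

    unsigned long long idle_cycles[NCPU];

    static __inline unsigned long long
    rdtsc(void)
    {
        unsigned long long tsc;

        __asm __volatile("rdtsc" : "=A" (tsc));
        return (tsc);
    }

    void
    cpu_idle(int cpu)
    {
        unsigned long long before;

        for (;;) {
            before = rdtsc();
            __asm __volatile("sti; hlt");   /* sleep until an interrupt */
            idle_cycles[cpu] += rdtsc() - before;
            /* (the real idle loop would go look at the run queue here) */
        }
    }

    /* idle ratio over an interval = delta(idle_cycles) / delta(rdtsc) */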


> So the idea was in a first step to display the load of each CPU in a 
> separate graph of xperfmon++ . Perhaps it's a better idea to display
> the other parameters like IO rate, interrupts, ... but I don't see a 
> way to get them CPU dependent.
> 
> Is there any CPU-private parameter in the kernel?

There are CPU private areas, certainly; these are memory regions mapped
to a single CPU.  The processor data area and processor stack are examples;
you can get all of them by looking in the locore code for the SMP case,
if you are interested.
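
(Schematically, a CPU-private area boils down to something like this.
The layout below is invented for illustration; the real one is in the
SMP locore and per-CPU data code:)

    struct proc;

    /*
     * The same virtual address is backed by a different physical page
     * on each CPU, so a plain load through "percpu" yields that CPU's
     * own copy with no locking.
     */
    struct percpu {
        int              pc_cpuid;      /* logical CPU number       */
        struct proc     *pc_curproc;    /* what this CPU is running */
        char             pc_idlestack[4096];
    };

    extern struct percpu *percpu;       /* same VA on every CPU */

    #define cpuid()     (percpu->pc_cpuid)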

I think the thing to do is to better understand the scheduler and the
model to determine which metrics are truly useful, and which statistics
are "too expensive".

Both Steve Passe and John Dyson would be good resources on this.

Note that the FreeBSD SMP scheduling algorithm is not really set in
stone yet; for example, there is experimental kernel threading code
and CPU affinity code (I'm not sure how complex this is; certainly
it's no Solaris) that would make some of what I said weigh differently
depending on the type of load expected.

Unfortunately, displaying this information is complicated, in that
you have to know what you are displaying...


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



