Date:      Fri, 31 Jan 1997 18:03:38 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        bag@sinbin.demos.su (Alex G. Bulushev)
Cc:        freebsd-smp@freebsd.org, mishania@demos.su
Subject:   Re: bytebench not correct for SMP kernel ?
Message-ID:  <199702010103.SAA03619@phaeton.artisoft.com>
In-Reply-To: <199701312020.XAA18178@sinbin.demos.su> from "Alex G. Bulushev" at Jan 31, 97 11:20:19 pm

> System Call Overhead Test         lps   68192.2  51738.2  38070.8

This is probably fair, but it's too high.  The kernel is not
processor reentrant in the system call gate at this time; any other
process running on the system will detract from the call time
available to the rest.  Even though two processes are both runnable,
one or the other will be stuck in the mutex grab in the trap code.
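
To make the effect concrete, the gate currently behaves roughly like
this -- a sketch only, with invented names and a userland mutex
standing in for whatever lock the trap code actually grabs:

/*
 * Sketch of the effect, not the real trap code: every system call
 * takes one global lock before it runs, so a second CPU that traps
 * into the kernel waits until the first one has left.
 */
#include <pthread.h>

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

struct trapframe;				/* opaque for the sketch */

static void real_syscall(struct trapframe *tf) { (void)tf; /* ... */ }

void
syscall_gate(struct trapframe *tf)
{
	pthread_mutex_lock(&kernel_lock);	/* all CPUs serialize here */
	real_syscall(tf);
	pthread_mutex_unlock(&kernel_lock);
}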

This is expected to be corrected once we can propagate locks down
the data flow through each subsystem (I personally want to work
on doing the VFS for this).

> Pipe Throughput Test              lps   92324.9  68053.5  57780.3

This is probably an artifact of the call mutex again.

> Pipe-based Context Switching Test lps   40542.8  20177.0   8785.4

This is because the benchmark is doing the wrong thing.  The context
switch test does not expect the processes being switched to operate
concurrently.  To model this correctly, you would have to model a
scarce resource that becomes available after a small percentage of
the run time of the process being switched out... that is, the
second CPU would be able to enter the shared resource on behalf of
the second process.  As written, the test is inherently serialized
by the way it uses the pipes in the context switch... so it measures
serial context switch overhead, not concurrency of resource access.
The context switch overhead is less important than the resource
access concurrency in any SMP or kernel multithreading case.  The
more kernel threads and/or processors that can reenter the resource,
the worse this test will be as an accurate model.
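
For reference, the shape of the test is basically a two-process
token ping-pong over a pair of pipes -- this is from memory, not the
bytebench source, but it shows why the two processes can never be
running at the same time:

/*
 * Two processes pass a one-byte token back and forth over two pipes.
 * Each read blocks until the other side has run, so the loop is
 * inherently serial no matter how many CPUs are present.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	int p1[2], p2[2];
	char tok = 'x';
	long i, iters = 100000;

	if (pipe(p1) < 0 || pipe(p2) < 0) {
		perror("pipe");
		exit(1);
	}
	switch (fork()) {
	case -1:
		perror("fork");
		exit(1);
	case 0:				/* child: read p1, write p2 */
		for (i = 0; i < iters; i++) {
			read(p1[0], &tok, 1);
			write(p2[1], &tok, 1);
		}
		exit(0);
	default:			/* parent: write p1, read p2 */
		for (i = 0; i < iters; i++) {
			write(p1[1], &tok, 1);
			read(p2[0], &tok, 1);
		}
	}
	return (0);
}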

> Process Creation Test             lps    3256.4   2739.2   1568.9

This is an effect of the fork call gate, and of the cache flush when
a process is started on a CPU other than the one the forking process
is running on (in a traditional UP environment, a cache flush is not
required on the processor when the child starts; here, in effect, it
is).  The problem derives from the child and the parent both being
immediately placed on the ready-to-run queue.  The proper way to fix
this is probably to establish a split scheduler queue model that
enforces an initial processor affinity in the child for the processor
the forking parent was running on.  If we scale this by the initial
call mutex reduction, we see that this is slightly worse than the
test case.  That amount of "slightly worse" is the cost of
establishing the child process mappings on the second CPU without
the usable cached data you would have in the UP case.

You would probably discard the initial affinity (if the user has not
forced an affinity) after the first context switch of the child,
allowing the process to migrate off the parent's CPU; that would
effect an increase in concurrency, assuming neither CPU was bound up
with work.
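
A very rough sketch of what I mean by the split queue with a soft
initial affinity -- all names invented, nothing like the real
scheduler code:

/*
 * One ready-to-run queue per CPU.  A forked child is queued on the
 * CPU its parent last ran on, and the soft affinity is dropped after
 * the child's first context switch so it can migrate.
 */
#define NCPU	2

struct proc {
	struct proc	*p_link;	/* run queue linkage */
	int		 p_lastcpu;	/* CPU the process last ran on */
	int		 p_softaffin;	/* initial affinity hint */
};

static struct proc *runq[NCPU];		/* per-CPU ready-to-run queues */

void
setrunqueue_child(struct proc *child, struct proc *parent)
{
	int cpu = parent->p_lastcpu;	/* parent's mappings are still
					 * (partly) cached there */

	child->p_lastcpu = cpu;
	child->p_softaffin = 1;
	child->p_link = runq[cpu];
	runq[cpu] = child;
}

void
first_switch_done(struct proc *child)
{
	child->p_softaffin = 0;		/* free to drift to the other CPU */
}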

> Execl Throughput Test             lps    1437.4   1206.6   1032.5

This is a truer measure of just the call gate overhead, since an
exec'ed process won't have a usable cache.  If we scale these numbers
by their relationship to the UP case, we can see that the effects I
predicted for CPU switching have about the scale we decided they
would have.

Again, this would benefit from deserializing the ready-to-run queue.

> File Read  (10 seconds)           KBps 254626.0 190873.0 151645.0
> File Read  (30 seconds)           KBps 255236.0 191890.0 152978.0

These are, again, the processor affinity issue, since the processor
you go into the kernel on is not necessarily the processor you come
out of the kernel on.  This is a scheduler problem unrelated to the
actual existence of SMP, per se...

> Dc: sqrt(2) to 99 decimal places  lpm    9533.5   8497.6   7406.6

I'm not sure about this one... there must be some call gate effects
for the controlling process, but they would be minimal in a CPU bound
environment.  More likely, this is related to poor FPU context
handling.  In the standard UP kernel, the FPU context is "lazy
bound"... that is, if a process uses the FPU, the FPU state will not
be flushed until another, *different* process also decides to use
the FPU -- at which point it will need to (potentially) signal
exception state for unprocessed exceptions (the FPU design is
"except-behind", probably an error on Intel's part, if you want an
SMP system).  This implies that if we did the FPU handling "right",
and did not tie the lazy flush to processor affinity changes (if
any), then we would expect a higher overhead on context switch out
of an FPU-using process.
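
For those who haven't looked at it, the lazy binding works roughly
like this on the UP kernel -- invented names, not the real FreeBSD
code:

/*
 * The context switch never saves the FPU; it just arranges for the
 * next FPU instruction to fault, and the fault handler saves and
 * restores state only when a *different* process wants the FPU.
 */
struct proc;
struct fpu_state { unsigned char image[108]; };	/* 80x87 save area, roughly */

static struct proc *fpu_owner;		/* whose state is live in the FPU */

/* primitives assumed to exist elsewhere */
extern void fpu_save(struct fpu_state *);	/* fnsave */
extern void fpu_restore(struct fpu_state *);	/* frstor */
extern void set_ts(void);		/* CR0.TS: next FPU insn faults */
extern void clear_ts(void);
extern struct fpu_state *proc_fpu_area(struct proc *);

void
switch_out_fpu(void)
{
	set_ts();			/* defer the save; maybe nobody
					 * else wants the FPU at all */
}

void
dna_fault(struct proc *curproc)		/* "device not available" fault */
{
	clear_ts();
	if (fpu_owner == curproc)
		return;			/* our state is still in the FPU */
	if (fpu_owner != NULL)
		fpu_save(proc_fpu_area(fpu_owner));	/* the lazy flush,
							 * pending exception
							 * state and all */
	fpu_restore(proc_fpu_area(curproc));
	fpu_owner = curproc;
}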

> is it bytebench bug?

There are a couple of bugs... there is also a lot of overemphasis
on issues related solely to the scheduler, and less on what the
tests purport to benchmark.  You get big effects from things the
benchmark designers felt would be "noise", and which, in the SMP
case, generally aren't.

> what bench tests SMP corectly?

One benchmark is as good as another, as long as you know what you
are comparing, and compare only similar things.  This particular
benchmark doesn't compare things that show SMP vs. non-SMP very
well; instead it shows up scheduler differences (which is useful
too, but doesn't really match the labels they've used to describe
what they are trying to test).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


