Date:      Mon, 24 Apr 95 14:08:11 MDT
From:      terry@cs.weber.edu (Terry Lambert)
To:        toor@jsdinc.root.com (John S. Dyson)
Cc:        geli.com!rcarter@implode.root.com, hackers@FreeBSD.org, jkh@violet.berkeley.edu
Subject:   Re: benchmark hell..
Message-ID:  <9504242008.AA19390@cs.weber.edu>
In-Reply-To: <199504231256.MAA02465@jsdinc.root.com> from "John S. Dyson" at Apr 23, 95 12:56:23 pm

> > Well, the real measure would be Linux 1.2++ and FreeBSD from
> > mid March on.  And file copy on the Barracuda isn't so hot,
> > I only get ~3.5 MB/s on bonnie out of mine, so I suspect
> > that the advantage there is real.  That leaves just the execl
> > and pipe based context switch to check out.  It's been
> > years since I ran these, aren't there a lot more tests?
> > 
> > Russell
> > 
> Part of the execl problem is the Sun-OS style shared libs that we
> use.  Without shared libs, our fork/exec times kick-butt.  I think
> that we-all have made some improvements in the times now so that
> my lmbench fork/exec tests (with our shared libs) on my 486/66 are fairly
> close to the example Linux Pentium results.  But, our pipe performance could
> be better though.
> 
> John

First, this is a bogus benchmark set.  Unless the benchmarks are run
on identical hardware, they are totally non-referential and are thus
meaningless.

The correct way to run comparative benchmarks is to boot a DOS disk
and fdisk/mbr the same machine and install on the same machine over
and over with the different OS's.  Not "identical hardware", the same
machine.


That out of the way, there are several areas where Linux does outperform
BSD, mostly because people haven't been paying attention to them (and
there's no reason for recriminations over this fact, since there's really
no reason that they should have been required to.  On the other hand,
there's nothing preventing them from being considered now).


The first is context switch.  There are several significant differences
in the way context switch takes place in BSD and Linux.  The BSD model
for the actual switch itself is very close to the UnixWare/Solaris model,
but is missing delayed storage of the FPU registers on a switch.  This is
because BSD really doesn't have its act together regarding the FPU, and
can't really be corrected until it does.  On hardware that does proper
exception handling (like the Pentiums tested), the save of the FPU
context can be delayed over several context switches and deferred until
the process that actually owns it runs again, on the basis of a "uses
FPU" flag being set in the process or not, plus a soft interrupt on the
first FPU reference in each process, as if trapping to an emulator.
Pretty much all the UNIX implementations and Linux do this, but BSD
does not.
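
To make the idea concrete, here is a rough sketch of a lazy FPU switch;
the flag, trap and save/restore names below are placeholders, not the
actual BSD or Linux symbols:

/*
 * Lazy FPU switch sketch -- placeholder names, not the actual BSD or
 * Linux symbols.  On a context switch the FPU registers are NOT saved;
 * the FPU is simply marked unavailable (the TS bit in CR0 on the
 * i386/i486/Pentium).  The first FPU instruction the new process
 * executes then traps, and only at that point is the old owner's
 * context saved and the new owner's restored.
 */
struct proc {
        int     p_flags;
#define P_USEDFPU       0x0001          /* process has touched the FPU */
        char    p_fpstate[108];         /* fsave/frstor image (i387) */
};

/* assumed machine-dependent primitives */
extern void set_cr0_ts(void);
extern void clear_cr0_ts(void);
extern void fpu_save(void *state);
extern void fpu_restore(void *state);
extern void fpu_init(void);

static struct proc *fpu_owner;          /* whose state is in the FPU now */

void
cpu_switch_fpu(struct proc *next)
{
        (void)next;                     /* nothing saved or restored here */
        set_cr0_ts();                   /* next FPU insn -> DNA trap */
}

void
dna_trap(struct proc *curproc)          /* "device not available" handler */
{
        clear_cr0_ts();                 /* allow FPU use again */
        if (fpu_owner == curproc)
                return;                 /* live FPU state is already ours */
        if (fpu_owner != NULL)
                fpu_save(fpu_owner->p_fpstate);         /* deferred save */
        if (curproc->p_flags & P_USEDFPU)
                fpu_restore(curproc->p_fpstate);
        else
                fpu_init();             /* first FPU use by this process */
        curproc->p_flags |= P_USEDFPU;
        fpu_owner = curproc;
}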

It should be pretty obvious that for a benchmark where a single program
is doing FPU crap, the delayed FPU switchout means no FPU save actually
occurs during the running of the benchmark.  You can think of this as a
benchmark cheat, since it is, in effect, a large locality of reference
hack.

A second issue in context switch is the TLB flush on switchout.  I
have to say that personally, I'd be happy with the UnixWare level of
performance in this area, because I believe the Linux stuff to be
extremely processor dependent.  Nevertheless, it should be looked at.

The system call overhead in BSD is typically larger.  This is because
of address range checking for copyin/copyout operations.  Linux has
split this up into a separate check call and copy operations, which is
more prone to programmer error (leaving security holes) than an integral
copy/check, but because of this they have an advantage when it comes to
multiple-use memory regions (areas that are copied from several times,
or which are copied both in and out).
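
Roughly, the difference looks like this; "struct foo" and the wrapper
functions are invented for illustration, while copyin(), verify_area()
and memcpy_fromfs() are the real interfaces of the day, as best I
recall them:

struct foo { int a, b; };

/* BSD style: the user address check is integral to the copy. */
extern int copyin(const void *uaddr, void *kaddr, unsigned int len);

int
bsd_style(const struct foo *uarg, struct foo *karg)
{
        return (copyin(uarg, karg, sizeof(*karg)));     /* 0 or EFAULT */
}

/*
 * Linux 1.2 style: check once, then copy -- possibly several times, or
 * both in and out -- without rechecking.  Faster for multiple-use
 * regions, but forgetting the verify_area() is a security hole.
 */
#define VERIFY_READ 0
extern int  verify_area(int type, const void *addr, unsigned long size);
extern void memcpy_fromfs(void *to, const void *from, unsigned long n);

int
linux_style(const struct foo *uarg, struct foo *k1, struct foo *k2)
{
        int err;

        err = verify_area(VERIFY_READ, uarg, sizeof(*uarg));
        if (err)
                return (err);
        memcpy_fromfs(k1, uarg, sizeof(*k1));
        memcpy_fromfs(k2, uarg, sizeof(*k2));   /* no recheck needed */
        return (0);
}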

Linux, as part of this, has no copyinstr.  Instead, they use a routine
called "getpathname".  This not only allows them to special case the
code, it also allows them greater flexibility than traditional copyinstr
implementations when it comes to internationalization.  Since the only
strings allowed into the kernel from user space are path names, and
since there is a single string routine for this, a simple replacement
of that routine allows a quick change to 16 bit character strings.
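
Something along these lines -- the routine below is only a sketch
showing how isolating the path copy behind one function and one
character typedef makes the 16 bit switch a local change;
fetch_user_pathchar() is an assumed primitive, not a real kernel call:

/*
 * Sketch only: all user -> kernel path string copying goes through one
 * routine and one character typedef, so a change to 16 bit characters
 * is a change to this function, not to every caller of a general
 * copyinstr().  fetch_user_pathchar() is assumed to copy a single
 * character from user space, returning nonzero on a bad address.
 */
#include <errno.h>

typedef char pathchar_t;                /* flip to a 16 bit type later */

extern int fetch_user_pathchar(const pathchar_t *uaddr, pathchar_t *out);

int
getpathname(const pathchar_t *upath, pathchar_t *kbuf, int buflen)
{
        int i;

        for (i = 0; i < buflen; i++) {
                if (fetch_user_pathchar(upath + i, &kbuf[i]))
                        return (EFAULT);
                if (kbuf[i] == 0)
                        return (0);     /* NUL-terminated: done */
        }
        return (ENAMETOOLONG);          /* no terminator within buflen */
}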


Part of the checking is to allow address faulting instead of precheck
comparison -- in other words, if the processor honors write protect in
protected (supervisor) mode, the check becomes a null operation and a
bad address simply faults.  The magic here is that you then only
actually perform the check on i386 processors, which ignore page level
write protection in supervisor mode, and not on the i486/i586, where
setting the WP bit in CR0 makes the protection apply to the kernel as
well.  The memory mapping is adjusted so this works.
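
A sketch of how the selection might be done at boot (the names here are
illustrative, not the actual kernel symbols): on the i386 the check is
a real comparison, while on the i486/i586 it collapses to nothing and a
bad address is caught by the write fault handler instead:

#include <errno.h>

extern int  cpu_class;                  /* assumed: set by the boot code */
extern void set_cr0_wp(void);           /* assumed MD primitive: set CR0.WP */
#define CPUCLASS_386    3               /* illustrative value */

#define USER_LIMIT      0xF0000000UL    /* placeholder user VA limit */

static int (*uaccess_check)(const void *uaddr, unsigned long len);

static int
check_precompare(const void *uaddr, unsigned long len)  /* i386 path */
{
        unsigned long a = (unsigned long)uaddr;

        if (a + len < a || a + len > USER_LIMIT)
                return (EFAULT);
        return (0);
}

static int
check_null(const void *uaddr, unsigned long len)        /* i486/i586 path */
{
        (void)uaddr; (void)len;
        return (0);                     /* rely on the write fault */
}

void
uaccess_init(void)
{
        if (cpu_class == CPUCLASS_386) {
                uaccess_check = check_precompare;
        } else {
                set_cr0_wp();           /* make ring 0 honor write protect */
                uaccess_check = check_null;
        }
}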


The final major advantage is the kernel premapping that results from
where they locate their processes in memory relative to the kernel,
which allows them to change the page mappings only partially on a
switch instead of fully.  The tradeoff is that they keep more
per-process information in core than BSD does.
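
The flavor of the idea, in a sketch with invented names: keep the
kernel's page directory entries identical in every process's page
directory, so the kernel is always mapped and only the user portion
of the mappings ever needs to change.

#define NPDE            1024            /* i386: 1024 4MB page dir slots */
#define KERNBASE_PDE    768             /* assumed: kernel mapped above 3GB */

typedef unsigned long pde_t;

extern pde_t kernel_pgdir[NPDE];

void
pmap_new_process(pde_t *new_pgdir)
{
        int i;

        /* user portion: starts out empty, filled in by fork/exec */
        for (i = 0; i < KERNBASE_PDE; i++)
                new_pgdir[i] = 0;

        /* kernel portion: identical in (shared by) every process */
        for (i = KERNBASE_PDE; i < NPDE; i++)
                new_pgdir[i] = kernel_pgdir[i];
}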


The file system overhead is based in large part on function call
overhead relative to path component decomposition.  Linus Torvalds
has correctly implemented a two stage caching algorithm that I see
as being about 33% faster than the BSD implementation, but about 10%
slower than the USL implementation (in addition, both the BSD and the
Linux schemes implement negative caching; supporting this in the USL
code requires an additional 3 compares).
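
For illustration, a minimal name cache with negative entries might look
like this; the structures are invented, and this is not the actual BSD,
Linux or USL code:

#include <string.h>

struct ncentry {
        struct ncentry  *nc_next;       /* hash chain */
        unsigned long    nc_dirid;      /* identity of the directory */
        char             nc_name[64];
        void            *nc_obj;        /* looked-up object, or ... */
        int              nc_negative;   /* ... a recorded "not there" */
};

#define NHASH 256
static struct ncentry *nchash[NHASH];

static unsigned
nc_hash(unsigned long dirid, const char *name)
{
        unsigned h = (unsigned)dirid;

        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return (h & (NHASH - 1));
}

/*
 * Returns 1 with *objp set on a positive hit, 1 with *objp == NULL on
 * a negative hit (the name is known not to exist), 0 on a miss (the
 * caller must go to the file system).
 */
int
nc_lookup(unsigned long dirid, const char *name, void **objp)
{
        struct ncentry *ncp;

        for (ncp = nchash[nc_hash(dirid, name)]; ncp; ncp = ncp->nc_next) {
                if (ncp->nc_dirid == dirid &&
                    strcmp(ncp->nc_name, name) == 0) {
                        *objp = ncp->nc_negative ? NULL : ncp->nc_obj;
                        return (1);
                }
        }
        return (0);
}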

The main problem with the path component decomposition in the BSD model
is that it returns to the lookup routine and requires iteration back
down into the file system across a function call boundary.  This could
be avoided with some changes to the lookup mechanism itself, and it
would incidentally fix the context lookup problem for a devfs with
a directory depth greater than 2 (the current limit exists because a
component can not cause context to be inherited from lookup to lookup,
so the only context you have is the previous object: the directory in
which the current object is being looked up).

Part of the fix would involve cleaning up the symlink following code
(which causes a lot of grunge even when a link is not being followed),
fixing the special casing for the trailing slash problem, and separating
out the "//" POSIX file system escape prefix processing.

The second part of the fix is to avoid actual recursion by establishing
a directory depth limit and using a stack variable array instead of a
function call to recurse.
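
A sketch of what that might look like, with invented types and helpers;
each component's context is kept in an explicit frame array rather than
in nested call frames, so it stays available to later lookups:

#define LOOKUP_MAXDEPTH 32              /* arbitrary directory depth limit */

struct lookup_frame {
        void    *lf_dir;                /* directory searched at this depth */
        char     lf_name[256];          /* component looked up at this depth */
};

/* assumed helpers: split off the next component / ask the FS for a name */
extern int   next_component(const char **pathp, char *name, int namelen);
extern void *dir_lookup(void *dir, const char *name);

void *
path_lookup(void *rootdir, const char *path)
{
        /* in a real kernel this would be allocated, not on the stack */
        struct lookup_frame frames[LOOKUP_MAXDEPTH];
        void *cur = rootdir;
        int depth = 0;

        while (next_component(&path, frames[depth].lf_name,
            sizeof(frames[depth].lf_name))) {
                if (depth >= LOOKUP_MAXDEPTH - 1)
                        return (NULL);  /* too deep; fail the lookup */
                frames[depth].lf_dir = cur;     /* context kept, not lost
                                                   across a function return */
                cur = dir_lookup(cur, frames[depth].lf_name);
                if (cur == NULL)
                        return (NULL);  /* component not found */
                depth++;
        }
        return (cur);
}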


On the shared libraries themselves, there should be preestablished
memory maps that can be copied instead of being reestablished for each
fork and for each exec that results in a shared library being loaded.

This can be considered as a cache, and would result in reduced startup
time, at the cost of either replacing the mmap() call, or making mmap()
a library wrapper for a call that takes a command subfunction, then using
a different command subfunction for library mapping (and potentially for
dlopen as well) than is used for user access to mmap() as a libc call.
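
A sketch of the preestablished map idea (all names invented): the
entries describing a library's segments are built once into a template
and copied into each new address space, instead of being rebuilt
through mmap() on every exec.

struct map_entry {
        unsigned long   me_start;       /* virtual address */
        unsigned long   me_len;
        int             me_prot;
        void           *me_object;      /* backing file/object */
        unsigned long   me_offset;
};

struct lib_template {
        int              lt_nentries;
        struct map_entry lt_entries[8]; /* text, data, bss, ... */
};

/* assumed primitive: splice one entry into a process's address space map */
extern int map_insert(void *addrspace, const struct map_entry *me);

int
map_shared_lib(void *addrspace, const struct lib_template *lt)
{
        int i, error;

        for (i = 0; i < lt->lt_nentries; i++) {
                error = map_insert(addrspace, &lt->lt_entries[i]);
                if (error)
                        return (error);
        }
        return (0);
}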


To speed up the forks themselves, one could prespawn uncommitted
processes, or pregenerate an uncommitted process that could be cloned,
with the number of these outstanding being high and low watermarked
(and exiting processes being reclaimed to the pool while below the high
watermark instead of being totally discarded).
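
A sketch of such a watermarked pool (all names invented):

#define POOL_LOWAT       4      /* a refill thread would top up below this */
#define POOL_HIWAT      16      /* discard rather than pool above this */

struct proc_skel {
        struct proc_skel *ps_next;
        /* ... preallocated proc, u-area, page tables, etc. ... */
};

/* assumed helpers for the slow paths */
extern struct proc_skel *build_proc_skeleton(void);
extern void scrub_proc_skeleton(struct proc_skel *);
extern void destroy_proc_skeleton(struct proc_skel *);

static struct proc_skel *pool_head;
static int pool_count;

struct proc_skel *
pool_get(void)                          /* fast path for fork() */
{
        struct proc_skel *ps;

        if ((ps = pool_head) != NULL) {
                pool_head = ps->ps_next;
                pool_count--;
                return (ps);            /* already prespawned: cheap */
        }
        return (build_proc_skeleton()); /* pool empty: slow path */
}

void
pool_put(struct proc_skel *ps)          /* called at process teardown */
{
        if (pool_count >= POOL_HIWAT) {
                destroy_proc_skeleton(ps);      /* pool is full enough */
                return;
        }
        scrub_proc_skeleton(ps);        /* reclaim instead of discard */
        ps->ps_next = pool_head;
        pool_head = ps;
        pool_count++;
}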


Finally, the pipe overhead is traceable to system call overhead, the pipe
implementation itself, and the file system stack coalescing being a
little less than desirable.


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.


