Date: Mon, 24 Apr 95 14:08:11 MDT
From: terry@cs.weber.edu (Terry Lambert)
To: toor@jsdinc.root.com (John S. Dyson)
Cc: geli.com!rcarter@implode.root.com, hackers@FreeBSD.org, jkh@violet.berkeley.edu
Subject: Re: benchmark hell..
Message-ID: <9504242008.AA19390@cs.weber.edu>
In-Reply-To: <199504231256.MAA02465@jsdinc.root.com> from "John S. Dyson" at Apr 23, 95 12:56:23 pm
> > Well, the real measure would be Linux 1.2++ and FreeBSD from
> > mid March on.  And file copy on the Barracuda isn't so hot,
> > I only get ~3.5 MB/s on bonnie out of mine, so I suspect
> > that the advantage there is real.  That leaves just the execl
> > and pipe based context switch to check out.  It's been
> > years since I ran these, aren't there a lot more tests?
> >
> > 					Russell
>
> Part of the execl problem is the Sun-OS style shared libs that we
> use.  Without shared libs, our fork/exec times kick-butt.  I think
> that we-all have made some improvements in the times now so that
> my lmbench fork/exec tests (with our shared libs) on my 486/66 are fairly
> close to the example Linux Pentium results.  But, our pipe performance could
> be better though.
>
> 					John

First, this is a bogus benchmark set.  Unless the benchmarks are run
on identical hardware, they are totally non-referential and are thus
meaningless.  The correct way to run comparative benchmarks is to boot
a DOS disk, "fdisk /mbr" the machine, and install the different OS's
over and over again.  Not "identical hardware": the same machine.

That out of the way, there are several areas where Linux does
outperform BSD, mostly because people haven't been paying attention
to them (and there's no reason for recriminations over this fact,
since there's really no reason that anyone should have been required
to pay attention.  On the other hand, there's nothing preventing these
areas from being considered now).

The first is context switch.  There are several significant
differences in the way context switch takes place in BSD and Linux.
The BSD model for the actual switch itself is very close to the
UnixWare/Solaris model, but is missing delayed storage of the FPU
registers on a switch.  This is because BSD really doesn't have its
act together regarding the FPU, and this can't really be corrected
until it does.
On hardware that does proper exception handling (like the Pentiums
tested), the FPU context can be thrown out to the process it belongs
to after being delayed over several intervening context switches, on
the basis of a "uses FPU" flag being set in the process or not, with a
soft interrupt on FPU use, as if trapping to an emulator, tagging the
first reference in each process.  Pretty much all the commercial UNIX
implementations and Linux do this, but BSD does not.  It should be
pretty obvious that for a benchmark where a single program is doing
FPU work, the delayed FPU switchout means no FPU context switch
actually occurs during the running of the benchmark.  You can think of
this as a benchmark cheat, since it is, in effect, a large locality of
reference hack.

A second issue in context switch is the TLB switchout.  I have to say
that personally, I'd be happy with the UnixWare level of performance
in this area, because I believe the Linux code to be extremely
processor dependent.  Nevertheless, it should be looked at.

The system call overhead in BSD is typically larger.  This is because
of address range checking for copyin/copyout operations.  Linux has
split this up into a separate check call and plain copy operations.
That is more prone to programmer error leaving security holes than an
integral check-and-copy, but it gives them an advantage when it comes
to multiply used memory regions (areas that are copied from several
times, or which are copied both in and out), since the check need only
be done once.

Linux, as part of this, has no copyinstr.  Instead, they use a routine
called "getpathname".  This not only allows them to special case the
code, it also allows them greater flexibility than traditional
copyinstr implementations when it comes to internationalization.
Since the only strings allowed into the kernel from user space are
path names, and since there is a single string routine for this, a
simple replacement of this routine allows a quick change to 16 bit
character strings.
Part of the checking is done by address faulting instead of precheck
comparison; in other words, if the processor honors write protect in
protected mode, the precheck becomes a null op.  The magic here is
that you then only actually perform the check on i386 processors and
not on i486/i586.  The memory mapping is adjusted so this works.

The final major advantage is the kernel premapping that follows from
where they locate their processes in memory relative to the kernel,
which allows them to only partially change the page mappings instead
of fully changing them.  This means that they keep more per process
info in core than BSD does.

The file system overhead is based in large part on function call
overhead relative to path component decomposition.  Linus Torvalds has
correctly implemented a two stage caching algorithm that I see as
being about 33% faster than the BSD implementation, but about 10%
slower than the USL implementation (in addition, both the BSD and the
Linux schemes implement negative caching, an addition of 3 compares
being required to support this in the USL code).

The main problem with the path component decomposition in the BSD
model is that it returns to the lookup routine and requires iteration
back down into the file system across a function call boundary.  This
could be avoided with some changes to the lookup mechanism itself, and
it would coincidentally fix the context lookup problem for a devfs
with a directory depth greater than 2 (the current limit exists
because a component can not cause context to be inherited from lookup
to lookup; thus the only context you have is the previous object: the
directory in which the current object is being looked up).  Part of
the fix would involve cleaning up the symlink following code (which
causes a lot of grunge even when a link is not being followed), fixing
the special casing for the trailing slash problem, and separating out
the "//" POSIX file system escape prefix processing.
The second part of the fix is to avoid actual recursion by
establishing a directory depth limit and using a stack variable array
instead of a function call to recurse.

On the shared libraries themselves, there should be preestablished
memory maps that can be copied instead of being reestablished for each
fork and for each exec that results in a shared library being loaded.
This can be considered a cache, and would result in reduced startup
time, at the cost of either replacing the mmap() call, or making
mmap() a library wrapper to a call that takes a command subfunction,
then using a different command subfunction for library mapping (and
potentially for dlopen as well) than is used for user access to mmap()
as a libc call.

To speed up the forks themselves: prespawning of uncommitted
processes, or pregeneration of an uncommitted process that could be
cloned, with the number of these outstanding being high and low
watermarked (and killed processes being reclaimed to the pool below
the high watermark instead of being totally discarded).

Finally, the pipe overhead is traceable to system call overhead, the
pipe implementation itself, and the file system stack coalescing being
a little less than desirable.


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.