From owner-freebsd-smp Fri Jan 31 17:05:10 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5) id RAA25979 for smp-outgoing; Fri, 31 Jan 1997 17:05:10 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id RAA25974 for ; Fri, 31 Jan 1997 17:05:05 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id SAA03619; Fri, 31 Jan 1997 18:03:38 -0700
From: Terry Lambert
Message-Id: <199702010103.SAA03619@phaeton.artisoft.com>
Subject: Re: bytebench not correct for SMP kernel ?
To: bag@sinbin.demos.su (Alex G. Bulushev)
Date: Fri, 31 Jan 1997 18:03:38 -0700 (MST)
Cc: freebsd-smp@freebsd.org, mishania@demos.su
In-Reply-To: <199701312020.XAA18178@sinbin.demos.su> from "Alex G. Bulushev" at Jan 31, 97 11:20:19 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> System Call Overhead Test            lps  68192.2  51738.2  38070.8

This is probably fair, but it's too high.  The kernel is not processor
reentrant in the system call gate at this time; any process running on
the system will detract from the call time available to any other.
Even though both are runnable, one or the other will be stuck in the
mutex grab in the trap call.  This is expected to be corrected once we
can propagate locks down the data flow through each subsystem (I
personally want to work on doing the VFS for this).

> Pipe Throughput Test                 lps  92324.9  68053.5  57780.3

This is probably an artifact of the call mutex again.

> Pipe-based Context Switching Test    lps  40542.8  20177.0   8785.4

This is because the benchmark is doing the wrong thing.  The context
switch test does not expect the processes so switched to operate
concurrently.  To model this correctly would require modelling a
scarce resource becoming available after a small percentage of the
run time of the process being switched... that is, the second CPU
becomes capable of entering the shared resource on behalf of the
second process.  This is inherently serialized by the way the pipes
are used in the context switch... so it measures serial context
switch overhead, not concurrency of resource access.  The context
switch overhead is less important than the resource access
concurrency in any SMP or kernel multithreading case.  The more
kernel threads and/or processors that can reenter the resource, the
worse this benchmark will be as an accurate model.

> Process Creation Test                lps   3256.4   2739.2   1568.9

This is an effect of the fork call gate, and of the cache flush when
a process is started on a CPU other than the one the process that
forked it is running on (in a traditional UP environment, a cache
flush is not required when the child starts, and here, in effect, it
will be).  The problem derives from the child and the parent both
being immediately placed on the ready-to-run queue.

The proper way of fixing this is probably to establish a split
scheduler queue model to enforce an initial processor affinity in the
child for the processor the forking parent was running on.  If we
scale this by the reduction already attributable to the call mutex,
we see that this test comes out slightly worse than that alone would
account for; the amount of "slightly worse" is the cost of
establishing the child process mappings on the second CPU without the
usable cached data you would have in the UP case.
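To make the idea concrete, here is a rough user-space sketch of a
split ready-to-run queue with initial fork affinity.  All of the names
(the struct proc fields, runq_enqueue(), pick_next(), and so on) are
invented purely for illustration; this is not the actual FreeBSD
scheduler code, just the shape of the queue handling I have in mind:

/*
 * Hypothetical sketch of per-CPU ready-to-run queues with an initial
 * fork affinity.  All names are invented for illustration; this is
 * not the real FreeBSD scheduler.
 */
#include <stdio.h>
#include <stdlib.h>

#define NCPU 2

struct proc {
	int		pid;
	int		affinity;	/* preferred CPU, or -1 for none */
	int		switched_once;	/* switched out at least once?   */
	struct proc	*next;
};

/* One ready-to-run queue per CPU instead of a single shared queue. */
static struct proc *runq[NCPU];

static void runq_enqueue(int cpu, struct proc *p)
{
	p->next = runq[cpu];
	runq[cpu] = p;
}

/*
 * fork: enqueue the child on the parent's CPU so it starts with the
 * parent's (still warm) cache, rather than on whichever CPU happens
 * to be idle.
 */
static struct proc *fork_proc(const struct proc *parent, int parent_cpu)
{
	struct proc *child = calloc(1, sizeof(*child));

	if (child == NULL)
		abort();
	child->pid = parent->pid + 1;
	child->affinity = parent_cpu;		/* initial affinity */
	runq_enqueue(parent_cpu, child);
	return (child);
}

/*
 * Scheduler for 'cpu': pick from its own queue first, and steal from
 * another CPU's queue only if the process there has no affinity.
 */
static struct proc *pick_next(int cpu)
{
	int i;

	for (i = 0; i < NCPU; i++) {
		int q = (cpu + i) % NCPU;
		struct proc *p = runq[q];

		if (p == NULL)
			continue;
		if (q != cpu && p->affinity != -1)
			continue;	/* don't steal a pinned process */
		runq[q] = p->next;
		return (p);
	}
	return (NULL);
}

/*
 * After the first time the child is switched out, drop the initial
 * affinity so it becomes free to migrate to any idle CPU.
 */
static void context_switch_out(struct proc *p)
{
	if (!p->switched_once) {
		p->switched_once = 1;
		p->affinity = -1;
	}
}

int main(void)
{
	struct proc parent = { 100, 0, 1, NULL };
	struct proc *child = fork_proc(&parent, 0);
	struct proc *p = pick_next(0);	/* CPU 0 runs the child first */

	printf("cpu0 runs pid %d (affinity %d)\n", p->pid, p->affinity);
	context_switch_out(p);
	printf("after first switch, affinity is %d\n", p->affinity);
	free(child);
	return (0);
}

The only point of the sketch is where the child lands at fork time
and when it becomes free to move; how the per-CPU queues themselves
would be locked is a separate problem.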
You would probably discard the initial affinity (if the user has not
forced an affinity) after the first context switch of the child,
allowing the process to migrate off the parent's CPU; that would
effect an increase in concurrency, assuming neither CPU was bound up
with work.

> Execl Throughput Test                lps   1437.4   1206.6   1032.5

This is a truer measure of just the call gate overhead, since an
exec'ed process won't have a usable cache.  We can scale these by
their relationship to the UP case to see that the effects I predicted
for CPU switching have about the scale we expected them to have.
Again, this would benefit from deserializing the ready-to-run queue.

> File Read (10 seconds)              KBps 254626.0 190873.0 151645.0
> File Read (30 seconds)              KBps 255236.0 191890.0 152978.0

These are, again, the processor affinity issue, since the processor
you go into the kernel on is not necessarily the processor you come
out of the kernel on.  This is a scheduler problem unrelated to the
actual existence of SMP, per se...

> Dc: sqrt(2) to 99 decimal places     lpm   9533.5   8497.6   7406.6

I'm not sure about this one... there must be some call gate effects
for the controlling process, but they would be minimal in a CPU-bound
environment.  More likely, this is related to poor FPU context
handling.  In the standard UP kernel, the FPU context is "lazy
bound"... that is, if a process uses the FPU, the FPU state will not
be flushed until another, *different* process also decides to use the
FPU -- at which point it may need to signal exception state for
unprocessed exceptions (the FPU design is "except-behind", probably an
error on Intel's part, if you want an SMP system).  This implies that
if we did the FPU "right" and did not tie the lazy flush to processor
affinity changes (if any), then we would expect a higher overhead on
context switch out of an FPU-using process.

> is it bytebench bug?

There are a couple of bugs... there is also a lot of overemphasis on
issues related solely to the scheduler, and less on what the tests
purport to benchmark: big effects from things the benchmark designers
felt would be "noise", generally, and which aren't, in the SMP case.

> what bench tests SMP correctly?

One benchmark is as good as another, as long as you know what you are
comparing, and compare only similar things.  This particular benchmark
doesn't compare very good things for showing SMP vs. non-SMP; instead
it shows up scheduler differences (which is good too, but really
doesn't match the labels they've used to describe what they are trying
to test).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.