From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 18:53:36 2010
Date: Wed, 20 Jan 2010 05:53:32 +1100 (EST)
From: Bruce Evans
To: John Baldwin
Cc: Attilio Rao, FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC
In-Reply-To: <201001191144.23299.jhb@freebsd.org>
Message-ID: <20100120042822.L4223@besplex.bde.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>
 <20100116205752.J64514@delplex.bde.org>
 <3bbf2fe11001160409w1dfdbb9j36458c52d596c92a@mail.gmail.com>
 <201001191144.23299.jhb@freebsd.org>

On Tue, 19 Jan 2010, John Baldwin wrote:

> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>>
>> Well, the primary things I wanted to fix is not the hiding of
>> malicious programs but the clock aliasing created when handling all
>> the clocks by the same source.

I probably misdiagnosed the aliasing in a previous reply (sent after
the one being replied to here; please reply to the latest version):
the problem for malicious programs seems to be roughly the opposite of
the one fixed by using a separate hardware clock for the statclock.
It is the near-aliasing of the separate statclock that gets short-lived
timeout processes accounted for at all (though not enough if there are
many such processes).  A non-separate statclock won't see these
processes excessively as I first thought, even when the statclock()
call immediately follows the hardclock() call, since hardclock()
doesn't start any new processes; a statclock() at the same time as a
hardclock() is therefore the same as a statclock() 1/hz - epsilon after
the previous hardclock() arranged to start a few timeouts, and those
timeouts will usually have finished by then.  A separate statclock() is
little better at seeing short-lived timeout processes, since it has to
sweep nearly uniformly over the entire interval between hardclock()
interrupts, so it cannot spend long nearly in sync.

However, to fix the problem with malicious programs (except for
short-lived, or rather short-active, ones started by a timeout, which
hopefully don't matter because they are short-lived), statclock() just
needs to sweep less uniformly over the entire interval, and this
doesn't need a separate statclock(): interrupting at points randomly
distributed at distances of a large fraction of 1/hz should do.  This
depends on other system activity not being in sync with hardclock().
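To illustrate the point numerically (an untested userland sketch, not
kernel code or the patch under discussion; all constants and names
below are made up): a process that runs only between the instants
where a hardclock-synchronized statclock fires is never charged, while
sampling at random offsets charges it roughly in proportion to its
actual CPU use.

/*
 * Untested userland sketch: a "cheater" runs only in the middle of
 * each hardclock period, where it knows a statclock synchronized with
 * hardclock() never fires.  Compare the samples it is charged by
 * (a) a statclock in lockstep with hardclock() and (b) a statclock
 * fired at a random offset into each period.
 */
#include <stdio.h>
#include <stdlib.h>

#define HZ              100                     /* hardclock frequency */
#define PERIOD_US       (1000000 / HZ)
#define NPERIODS        100000

/* The cheater runs from 10% to 90% of each hardclock period. */
static int
cheater_running(long t_us)
{
        long phase = t_us % PERIOD_US;

        return (phase > PERIOD_US / 10 && phase < PERIOD_US * 9 / 10);
}

int
main(void)
{
        long i, hits_sync = 0, hits_random = 0;

        for (i = 0; i < NPERIODS; i++) {
                long t = i * PERIOD_US;

                /* (a) statclock fires at the same instant as hardclock. */
                hits_sync += cheater_running(t);
                /* (b) statclock fires at a random offset into the period. */
                hits_random += cheater_running(t + random() % PERIOD_US);
        }
        printf("lockstep statclock charged the cheater %ld/%d samples\n",
            hits_sync, NPERIODS);
        printf("random   statclock charged the cheater %ld/%d samples\n",
            hits_random, NPERIODS);
        return (0);
}

With the numbers above, the lockstep clock charges the cheater nothing,
while the random one charges it about 80% of the samples, matching its
actual 80% duty cycle.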
>> What I mean, then is: I see your points, I'm not arguing that at all,
>> but the old code has other problems that gets fixed with this patch
>> (having different sources make the whole system more flexible) while
>> the new things it does introduce are secondarilly (but still: I'm fine
>> with whatever second source is picked up for statclock, profclock) if
>> you really see a concern wrt atrtc slowness.
>
> You can't use the i8254 reliable with APIC enabled.  Some motherboards
> don't actually hook up IRQ 0 to pin 2.  We used to support this by
> enabling IRQ 0 in the atpic and enabling the ExtINT pin to use both
> sets of PICs in tandem.  However, this was very gross and had its own
> set of issues, so we removed the support for "mixed mode" a while ago.
> Also, the ACPI specification specifically forbids an OS from using
> "mixed mode".

I thought that recent changes reenabled some of this.  And what's to
stop some motherboards breaking the RTC too?

> My feeling, btw, is that the real solution is to not use a sampling
> clock for per-process stats, but to just use the cycle counter and
> keep separate user, system, and interrupt cycle counts (like the
> rux_runtime we have now).

The total runtime info is already available (in rux_runtime).  It is
the main thing that we use to see that scheduling is broken :-) -- we
see that the runtime is too large or too small relative to %CPU.  I
think using this and never using ticks for scheduling would work OK.
Schedulers shouldn't care about the difference between user and sys
time.  Something like this is also needed for tickless kernels.  With
schedulers still wanting ticks, perhaps the total runtime could be
distributed as fake ticks for schedulers only to see, so that if the
tick count is broken schedulers would still get feedback from the
runtime.  And/or processes started by a timeout could be charged a
fake tick so that they can't run for free.

Interrupt cycle counts are mostly already kept too, since most
interrupt handlers are heavyweight and take a full context switch to
get to.  However, counting cycles to separate user from sys time would
probably be too inefficient.  A minimal syscall now should take about
200 cycles.  rdtsc on Athlon1 takes 12 cycles; rdtsc on Core2 and
Phenom takes 40+ cycles.  Two of these would be needed for every
syscall.  They would not be too inefficient only if they ran mostly in
parallel.  They are non-serializing, but if they actually ran mostly
in parallel then they might also be off by 40+ cycles per call.  (A
rough userland estimate of this cost is sketched below.)

> This makes calcru() trivial and eliminates many of the weird "going
> backwards", etc. problems.  The only issue with this approach is that
> not all platforms have a cheap cycle counter (many embedded platforms
> lack one I think), so you would almost need to support both modes of
> operation and maybe have an #define in to choose between the two
> modes.

Not the only problem.  This also doesn't work for things like vm
statistics gathered in statclock().  You still need statclock() for
these, and if you want the statistics to be reasonably accurate then
you need a sufficiently non-aliased and non-random statclock().

> Even in that mode you still need a sampling clock I think for
> cp_time[] and cp_times[], but individual threads can no longer "hide"
> as we would be keeping precise timing stats.

Not so much a problem as the vm stats -- most time-related statistics
could be handled by adding up per-thread components, if we had them
all.
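For reference, the cost of back-to-back rdtsc is easy to estimate from
userland.  A rough, untested sketch using the compiler's __rdtsc()
intrinsic (gcc/clang on x86 only; the figure includes loop overhead,
and since rdtsc is non-serializing the reads overlap with each other
and with the loop):

/*
 * Untested sketch: estimate the cost of back-to-back rdtsc, which is
 * roughly what per-syscall user/sys cycle accounting would add (two
 * reads per syscall).
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int
main(void)
{
        uint64_t start, end, sum = 0;
        int i, n = 10000000;

        start = __rdtsc();
        for (i = 0; i < n; i++)
                sum += __rdtsc();       /* keep the reads from being elided */
        end = __rdtsc();
        printf("~%.1f cycles per rdtsc (plus loop overhead); checksum %ju\n",
            (double)(end - start) / n, (uintmax_t)sum);
        return (0);
}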
If we had fine-grained programmability of a single timer, then
accounting for threads started by a timeout would probably be best
implemented, for almost perfect correctness and slowness, as follows:

- a statclock() interrupt a few usec after starting a timeout
- then periodic statclock() interrupts every few tens or hundreds of
  usec, a few times
- then back to normal periodic statclock() interrupts, hopefully not
  so often.

All statistics, including tick counts, become a weighted sum depending
on the current stathz (an integral over time, like now for the
non-tick-count stats, except with the time deltas varying).  This
would be slow, but it seems to be the only way to correctly account
for short-lived processes started by a timeout -- in a limiting case,
all system activity would be run as timeouts and on fast machines
finish in a few usec.

Maintaining the total runtime, which should be enough for scheduling,
doesn't need this, but other statistics do.  Other system activity
probably doesn't need this, because it is probably started by other
interrupts that aren't in sync with hardclock() -- only hardclock()
combined with timeout/callout semantics gives a huge bias towards
starting processes at particular times.  Probably nothing needs this,
since we don't really care about the other statistics.  Probably
completely tickless kernels can't support the other statistics.
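An untested sketch of the weighted-sum accounting described above
(made-up types, not kernel code): each statclock sample is weighted by
the time since the previous sample instead of being counted as a fixed
1/stathz tick.

/*
 * Untested sketch: with a variable statclock rate, charge each sample
 * with the time since the previous sample rather than a fixed tick.
 */
#include <stdio.h>
#include <stdint.h>

struct sample {
        uint64_t when_us;       /* time of this statclock firing */
        int     running;        /* was the thread of interest running? */
};

/* Time (in us) charged to the thread over a set of samples. */
static uint64_t
charge(const struct sample *s, int n)
{
        uint64_t total = 0;
        int i;

        for (i = 1; i < n; i++)
                if (s[i].running)
                        total += s[i].when_us - s[i - 1].when_us;
        return (total);
}

int
main(void)
{
        /*
         * Dense samples just after a timeout fires at t = 0 (a few
         * usec, then a few tens/hundreds of usec), then normal-rate
         * samples.
         */
        struct sample s[] = {
                { 0, 0 }, { 5, 1 }, { 50, 1 }, { 200, 1 },
                { 10200, 0 }, { 20200, 0 },
        };

        printf("charged %ju us\n",
            (uintmax_t)charge(s, (int)(sizeof(s) / sizeof(s[0]))));
        return (0);
}

Bruce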