From owner-freebsd-arch@FreeBSD.ORG Tue Dec 1 15:30:15 2009 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EB465106568B; Tue, 1 Dec 2009 15:30:14 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id 25C098FC18; Tue, 1 Dec 2009 15:30:13 +0000 (UTC) Received: from besplex.bde.org (c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id nB1FU9g6019186 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 2 Dec 2009 02:30:11 +1100 Date: Wed, 2 Dec 2009 02:30:09 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <200911301305.30572.jhb@freebsd.org> Message-ID: <20091201233938.K1089@besplex.bde.org> References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <200911301305.30572.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , FreeBSD Arch , Ed Maste Subject: Re: [PATCH] Statclock aliasing by LAPIC X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Dec 2009 15:30:15 -0000 On Mon, 30 Nov 2009, John Baldwin wrote: > On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote: >> Handling all the three clocks (hardclock, softclock, profclock) within >> the LAPIC can lead to aliasing for the softclock and profclock because >> hz is sized to fit mainly hardclock. The fashion to handle all of them >> within the LAPIC was introduced in 2005 and before than the softclock >> and profclock were supposed to be handled in the rtc. Right now, too, >> there is the necessary support to handle profclock and statclock in >> atrtc which gets enabled if the LAPIC signals it can't take in charge >> the three clocks. >> The proposed patch reverts the situation preferring the atrtc to >> handle the statclock and profclock (then a different source from the >> LAPIC) and then avoid the aliasing problem: This would defeat most of the point of using the lapic timer. RTC interrupts are too slow to use for anything if there is an alternative like the lapic timer. i8254 interrupts are not so bad, and in fact are just as efficient as lapic timer interrupts iff they are controlled by the APIC and not by the ATPIC. This is because RTC interrupts must be acked and tested for in RTC registers, and the RTC is on the ISA bus so accessing it is very slow, while the i8254 is programmed for its interrupts to not need any acking or testing. RTC and i8254 interrupts may also be be controlled by the ATPIC, and then the ATPIC must be acked on the ISA bus too. This gives the following number of ISA bus accesses for most interrupts: device read write ------ ---- ----- lapic_timer 0 0 i8254 0 0+0/1 RTC (current) 1 0+0/2 RTC (old) 3 1+0/2 Here "+0/1" and "+0/2" are for the ATPIC ack, if any. RTC (old) is before I optimized rtcin() to not write the index register in the usual case where it has not changed (writing the index register takes 1 extra write and uses 2 dummy reads in an attempt to satisfy timing requirements). However, there is apparently broken (or just incompatible) hardware that fails with this optimization. There would probably be more reports of this brokenness if using the RTC became the default again. The 4-6 ISA accesses for RTC (old) take about 4-9 usec, so using the RTC at stathz = 128 Hz takes only 0.05-0.12% of 1 CPU, which is acceptable. Using the RTC at profhz = 1024 Hz takes 0.4-0.9% of 1 CPU, which may also acceptable, but profhz = 1024 was too slow even for a 386/20 in 1993; it should be 200-1000 times larger now, but the RTC just can't support that, and one reason it was never increased is that the RTC is too inefficient. Profiling can now be implemented better using the lapic timer, but using the lapic timer currently implements profiling slightly worse than using the RTC. > http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff >> >> In this patch, lapic_setup_clock() has been changed in order to return >> a three-states variable which identified if the LAPIC got in charge >> all the three clocks, just the hardclock or none of them (the current >> situation is just none/all) and the rtc handling runs subsequently. >> A tunable as been added to force LAPI to get in charge all the three >> clocks if needed. >> In ia32 atrtc compiling is linked to atpic compiling, so a compile >> time flag has been added to check if atpic is not present and in case >> force LAPIC to take in charge all the three clocks (which is still >> better than the 'safe belt values' still present in the rtc code). I don't like tunables, especially to switch from one bug to another. This can be done better using sysctls only, since it is not needed for booting. The sysctls would need to be runnable at any time, but reprogramming the lapic timer at any time is already needed to fix profiling (cpu_start/stopprofclock() are missing support for the lapic timer; instead, the default lapic_timer_hz is set excessively large but not large enough for a good profhz). sysctls also let you test this stuff without rebooting. >> Please note that statclock and profclock are widely used in our kernel >> (rusage is, for example, statclock driven) and fixing this would >> result in specific improvements (as a several-reported wrong CPU usage >> statistic in top). >> This bug has been found by Sandvine Incorporated. What bug exactly? Bugs like this must have been found before 1993, since statclock() in 4.4BSD was supposed to fix them. See "A Randomized Sampling Clock for CPU Utilization Estimation and Code Profiling" (ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z). FreeBSD never implemented the "Randomized" part, but its statclock() used to sort of work, since by default stathz was > hz and was not nearly a multiple of hz. Someone broke the former by increasing the default hz to 1000. This allowed malicious programs to easily hide themself from statclock() while consuming a large fraction of CPU cycles (when stathz was > hz, it was not so easy to hide, and very difficult to both hide and consume significant CPU, since timeout granularity makes it hard to control wakeups). Then using the single lapic timer to generate all periodic timer interrupts increased the synchonization of these interrupts, thus moving further from a randomized statclock(). However, the defaults with the lapic timer give an even larger beat frequency than before, so I don't see how using the lapic timer can increase the problem much. (The beat frequency of (1000, 128) is 16000. The beat frequency of (1000, 133) is 133000. The latter means that, with defaults, statclock and hardclock() ticks are only perfectly synced once in every 133 seconds. Misconfiguring hz to a multiple of 128 can give perfect synchronization, which may be a more of a problem, or a fix -- see below). >> >> Reviews, comments and testing are welcome. Review of the part of the patch visible in the mail: . > Presumably in the RTC case lapic_timer_hz should always be hz and not some > multiple of hz. Sure. Except the allocation of the timers is backwards at best. You need profhz on the most efficient timer so that it can be very large (other changes are required for large profhz to actually work). You want stathz on the next most efficient timer so that it can be larger than hz (see above) (other changes are needed for a stathz much larger than 128 to actually work. Note that at least SCHED_4BSD wants a scheduling clock frequency of much less than 128 -- it essentially divides stathz by 8 to get this. Scaling in calcru() is currently broken after several hundred days of runtime, and would break sooner with larger stathz). Perhaps your recent changes (that removed the literal constant dividers) made the synchronization problem worse. But these changes make it easy to implement any number of independent timers with optional randomness using the lapic timer. E.g., to randomize statclock(), just add a small random value (+-) as well as stathz. Note that statistics utilities won't like this -- some like systat(1) use statistics ticks as a timebase so they want statclock() to be perfectly periodic. I don't worry about the synchronization or broken profiling, and use lapic_timer_hz = profhz = stathz = hz = 100 whenever the lapic timer is used. I haven't noticed any problems caused by this (mostly using SCHED_4BSD), except the unavoidable one that hz = 100 gives less accurate usr/sys decomposition than does hz = 1000. I have noticed that this fixed the cosmetic problem that systat(1) shows glitches in the lapic timer interrupt rates: Although using the lapic timer for all timer interrupts makes them all perfectly periodic, systat cannot see this because stathz = 133 is too small a sampling rate and is not an exact divisor of lapic_timer_hz -- it caused a glitch every lapic_timer_hz/stathz seconds. For other interrupts, we wouldn't expect the rates to be constant, but we know that the lapic timer interrupt rate is constant so we know that the oscillation of its displayed value is a bug. Right now on ref8-i386.freebsd.org, I see the values not oscillating much but being weird: for cpu0-1, they are near 1973; for cpu2-3, they are near 1981; for cpu4-5, they are near 2043, and for cpu6-7 they are near 2003. A tickless kernel would need to at least consider running the scheduler and statistics gathering on most context switches (unless it keeps using ticks when not idle). The scheduler parts of this would also fix timer synchronization problems for !tickless kernels, but I don't see how they can be as efficient as only considering scheduling at infrequent tick intervals. > Also, did you check to make sure all the lock is present? I > think at one point I changed the locking for the RTC and/or ISA timer to just > use critical_enter/exit so that UP machines running an SMP kernel wouldn't pay > the locking overhead since the code was only used on UP machines. It may very > well be that I only changed that in a development branch though and not in > HEAD. I don't remember any locking changes for RTC ever being committed. rtcin() still uses mtx_lock_spin(&clock_lock). clock_lock is the i8254 clock's lock, and is still abused for the RTC. This abuse was convenient when the RTC driver was implemented in the same file as the i8254 driver, but now the RTC driver is in its own file. The i8254's private variable `clock_lock' is even declared in the RTC driver's public header, with other style bugs of course. Bruce