Date: Fri, 1 Jul 2016 18:54:47 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r302252 - head/sys/kern
In-Reply-To: <20160630180106.GU38613@kib.kiev.ua>
Message-ID: <20160701160438.U1233@besplex.bde.org>

On Thu, 30 Jun 2016, Konstantin Belousov wrote:

> On Fri, Jul 01, 2016 at 03:30:41AM +1000, Bruce Evans wrote:
>> On Thu, 30 Jun 2016, Konstantin Belousov wrote:
>>> On Thu, Jun 30, 2016 at 05:47:39AM +1000, Bruce Evans wrote:
>>>> On Wed, 29 Jun 2016, Konstantin Belousov wrote:
>>>>> On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
>>>>>> As an additional, but probably excessive measure, invocation of
>>>>>> tc_windup() from hardclock could ask the setclock() to re-execute
>>>>>> tc_windup() after its call is done.
>>>>
>>>> I don't quite understand this.  hardclock -> tc_windup() normally
>>>> doesn't have interference from tc_setclock().
>>> In what sense does it not have interference?  tc_windup() may be
>>> executed from hardclock/hardclock_cnt (low half) and concurrently
>>> from setclock() (top half).  And this is exactly what we want to
>>> avoid, isn't it?
>>
>> Yes, but you seemed to be saying that hardclock should schedule or
>> call tc_setclock().  That makes no sense.  So I thought that you meant
>> tc_setclock() calling or scheduling tc_windup() in a different way
>> than it does now (it just calls it now).
> More precisely, I mean that hardclock, when not able to execute
> tc_windup() due to the top half currently executing tc_windup(), may
> ask the top half to re-execute tc_windup() (and not tc_setclock())
> once the current invocation finishes.
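For concreteness, a minimal sketch of that scheme, using a hypothetical
tc_windup_pending flag (the flag, the re-run loop and the acq/rel
spelling are illustration only, not part of the patch, and the sketch
ignores the small window where a request made just before the release
is missed until the next tick):

#include <sys/param.h>
#include <machine/atomic.h>

static void tc_windup_locked(void);	/* the real windup work, as in the patch */

static volatile u_int tc_windup_lock;
static volatile u_int tc_windup_pending;	/* hypothetical */

static void
tc_windup(int top_call)
{

	for (;;) {
		if (atomic_cmpset_acq_int(&tc_windup_lock, 0, 1)) {
			do {
				/* Clear any pending request, then do the work. */
				tc_windup_pending = 0;
				tc_windup_locked();
				/* Re-run if hardclock asked for it meanwhile. */
			} while (tc_windup_pending != 0);
			atomic_store_rel_int(&tc_windup_lock, 0);
			break;
		} else if (!top_call) {
			/* hardclock: don't spin; ask the owner to re-run. */
			tc_windup_pending = 1;
			break;
		}
	}
}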
I see.  I think that is unnecessary.

tc_windup() from the quasi-periodic hardclock() just needs to be called
often enough to keep the timecounters from overflowing, but not so often
as to expose races or inefficiencies by cycling through the timehands
too rapidly.  So if there are problems with the extra calls from
tc_setclock(), then it is the case where the lock is contended that is
least harmful!  In that case, the call from tc_setclock() replaces a
call from hardclock() with the same timing to within a few nanoseconds.
We might have problems in other cases:

- malicious/stress-testing calls to settime() cause rapid cycling of the
  timehands.  Hopefully we fixed the races exposed by this last year.
  You are now reducing the number of timehands to 2.  This runs the
  races more often.  But if the fixes work, then even 1 timehands should
  work.  The multiple timehands are just an optimization that works by
  increasing the chance that binuptime() and friends don't have to wait
  for a usable timehands.

- on idle systems, tickless kernels are close to not calling tc_windup()
  enough to keep the timecounters from overflowing.  I use HZ = 100 and
  this gives only 20-30 timer interrupts/second.  The TSC timecounter at
  4 GHz would overflow after about 1 second, and the bogus TSC-low
  timecounter at 2 GHz overflows after about 2 seconds (the arithmetic
  is sketched just after this list).  Other timecounters might overflow
  in less than 1 second, or tickless kernels might get closer to
  actually tickless and give only 1 timer interrupt every few seconds on
  idle systems.  Suspended systems give no timer interrupts, and the
  resume code for restarting the timecounter isn't quite right.

- on idle systems, after coming out of idle, IIRC the timer code
  delivers virtual ticks to the timecounter and other subsystems.  This
  also isn't quite right.  It risks rapidly cycling through the
  timehands, but it is better than the handling of restarting for
  resume.  In both cases, whenever the sleep interval is so long that
  timecounter state is lost by stopping or wrapping, the timecounter
  should be reset.  The length of the sleep interval must be determined
  using another timer if the timecounter stops or wraps.
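The overflow arithmetic in the second item is just the counter width
over the frequency.  A throwaway sketch (the helper name is mine;
tc_counter_mask and tc_frequency are the real struct timecounter
fields):

#include <sys/param.h>
#include <sys/timetc.h>

/*
 * Longest interval between tc_windup() calls before the hardware
 * counter wraps undetectably, in microseconds.  mask = 2^32 - 1 at
 * 4 GHz gives ~1.07e6 us (about 1 second); the 2 GHz TSC-low counter
 * gives about 2.1 seconds.
 */
static uint64_t
tc_wrap_interval_us(const struct timecounter *tc)
{

	return (((uint64_t)tc->tc_counter_mask + 1) * 1000000 /
	    tc->tc_frequency);
}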
>>> *...
>>>>> +	for (;;) {
>>>>> +		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
>>>>> +			atomic_thread_fence_acq();
>>>>> +			tc_windup_locked();
>>>>> +			atomic_store_rel_int(&tc_windup_lock, 0);
>>>>> +			break;
>>>>> +		} else if (!top_call) {
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> }
>>>>
>>>> I like to write optimized locking like that, but don't see why you
>>>> want it here.
>>> Because I do not want the fast clock interrupt handler to wait on the
>>> lock, which necessarily becomes a spinlock.  The ntp_lock is bad
>>> enough already.  What I coded above is not a real lock.  In the
>>> contention case it bails out, so there is no blocking.  In
>>> particular, witness is not relevant to it.
>>
>> I think this is too complicated.  Spinlocks aren't very expensive, and
>> hardclock() already has some.  I think it has several.  It has at
>> least one silly one -- in hardclock_cpu(), thread locking just to lock
>> a single flags update.  hardclock() and tc_windup() just aren't called
>> enough for the locking overhead or contention to matter, even at
>> too-large hz.
> Spinlocks are quite expensive.  They delay interrupt delivery.  I do
> not want the fast interrupt handler to be delayed due to the top half
> of the kernel executing settimeofday(2) in a loop.

I still think it is too complicated.  Malicious/stress-testing users can
easily find many other denial of service attacks involving mutexes,
without needing root privilege like settimeofday() does.  Just using
time syscalls, there is a thread_lock() (not on curthread(?)) in
get_thread_cputime().  Thread locking doesn't have much contention, but
it does use spinlocks, so it delays fast interrupts much the same as a
general spinlock does, especially in the UP case.

I counted spinlocks and instructions in the "fast" clock interrupt
handler.  There aren't many spinlocks, but there are too many
instructions to be fast:

                      -current    my-5.2
    Xtimerint          10k-50k         -
    clkintr                  -       896
    hardclock             1091       842
    (later counts are included in callers)
    tc_windup               987       693
    __udivdi3               498       422
    timehands_update        163         -

-current is on i386 with an SMP kernel, but on a 4x2 core system reduced
to 1x1 since tracing is too broken to work with multiple cores (the
multiple cores are hard to control, and bugs cause crashes).  my-5.2 is
on i386 with a UP kernel on a 1x1 core system.  Xtimerint is a "fast"
interrupt handler.  clkintr is a normal interrupt handler scheduled by a
fast interrupt handler, since the locking for a fast timer interrupt
handler is too hard to get right.  So the times for all the timer
interrupt handling that is now done by Xtimerint are unavailable in
~5.2.  They are even larger -- much more for context switches.

tc_windup takes 30% more instructions in -current for not completely
obvious reasons.  50-60% of the overhead is for a pessimized division,
and somehow the compiler pessimizes this even better in -current than in
5.2 (using old gcc instead of older gcc).  The division is 64/64 -> 64
bits.  On amd64, this would be a single instruction, but on i386 the
libcall is used.  The pessimizations are:

- the divisor (tc_frequency) is 64 bits, but for all supported
  timecounters it only needs 32 bits.  My CPU can be overclocked to need
  33 bits for the TSC timecounter, but only the bogus TSC-low
  timecounter is available, partly to avoid needing the 33rd bit.

- i386 has a 64/32 -> 32 bit division instruction which can handle all
  the 32-bit divisors that can occur, but no one ever bothered to
  optimize __udivdi3 to use this, so the full 64-bit division is always
  done.

__udivdi3 shows up in other instruction traces.  This is annoying but
doesn't really matter, since it is not called very often.

>>>  	cpu_tick_calibrate(1);
>>>  	nanotime(&tbef);
>>>  	timespec2bintime(ts, &bt);
>>> @@ -1247,8 +1252,10 @@ tc_setclock(struct timespec *ts)
>>>  	bintime2timeval(&bt, &boottime);
>>
>> The leap second update to boottimebin in tc_windup() races with the
>> update to boottimebin here.  This can be fixed partially by using the
>> same locking there.  Then the locking would have to be a spin mutex
>> (or special).  This still doesn't fix non-atomic accesses to
>> boottimebin in nanotime() and friends.  Putting it in the timehands
>> seems to be best for fixing that.
> Yes, timehands for boottimebin should be the solution, but not in the
> scope of this patch.  I will work on this right after the current
> changeset lands in svn.

You wrote the patch faster than I can reply :-).

>> boottime is easier to fix (and doesn't need to be in the timehands).
>> It is only used to copy it out in a historical sysctl.  The locking
>> for that is buggy.  We convert boottimebin to boottime here (now with
>> locking), but in the sysctl we copy out boottime non-atomically.
> Do we need boottime at all ?  Can't it be calculated from boottimebin
> on the sysctl invocation ?

Later calculation is cleaner and probably easier to do with correct
locking.  boottime should be invariant after initialization.  The
invariant copy doesn't need any locking, except to initialize it.
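Roughly, the later calculation would just be the following (the handler
name and the lack of a proper snapshot of boottimebin are illustration
only, not a patch):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sysctl.h>
#include <sys/time.h>

extern struct bintime boottimebin;	/* from kern_tc.c */

/* Compute the legacy kern.boottime timeval on demand. */
static int
sysctl_kern_boottime_sketch(SYSCTL_HANDLER_ARGS)
{
	struct bintime bt;
	struct timeval tv;

	bt = boottimebin;		/* XXX wants an atomic snapshot */
	bintime2timeval(&bt, &tv);
	return (SYSCTL_OUT(req, &tv, sizeof(tv)));
}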
With our fuzzy/broken boottime, boottime has an error of about 1 second,
both initially and later.  Errors from races in not locking it are at
most 1 second (except with 32-bit time they are 2**32 seconds during a
1-second race in 2038).  So null locking is good enough.  The first
write to the RTC in the ntp callout is a good place to freeze boottime
so that it is invariant.

I think that only applications doing buggy calculations of the current
real time, like 'now = boottime + CLOCK_UPTIME_time', would be broken
further by fixing boottime.  uptime(1) used to do the reverse of this
calculation.  It now uses CLOCK_UPTIME_time directly.  This would work
if CLOCK_UPTIME worked.  Both CLOCK_UPTIME and CLOCK_MONOTONIC fail to
count the time while the system is suspended, and have relatively minor
errors for some clock drifts.

Bruce