Date: Fri, 1 Jul 2016 18:54:47 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r302252 - head/sys/kern
In-Reply-To: <20160630180106.GU38613@kib.kiev.ua>
Message-ID: <20160701160438.U1233@besplex.bde.org>

On Thu, 30 Jun 2016, Konstantin Belousov wrote:

> On Fri, Jul 01, 2016 at 03:30:41AM +1000, Bruce Evans wrote:
>> On Thu, 30 Jun 2016, Konstantin Belousov wrote:
>>> On Thu, Jun 30, 2016 at 05:47:39AM +1000, Bruce Evans wrote:
>>>> On Wed, 29 Jun 2016, Konstantin Belousov wrote:
>>>>> On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
>>>>>> As an additional, but probably excessive measure, invocation of
>>>>>> tc_windup() from hardclock could ask the setclock() to re-execute
>>>>>> tc_windup() after its call is done.
>>>>
>>>> I don't quite understand this.  hardclock -> tc_windup() normally
>>>> doesn't have interference from tc_setclock().
>>> In what sense does it not have interference?  tc_windup() may be
>>> executed from hardclock/hardclock_cnt (low half) and concurrently
>>> from setclock() (top half).  And this is exactly what we want to
>>> avoid, isn't it?
>>
>> Yes, but you seemed to be saying that hardclock should schedule or
>> call tc_setclock().  That makes no sense.  So I thought that you meant
>> tc_setclock() calling or scheduling tc_windup() in a different way
>> than it does now (it just calls it now).
> More precisely, I mean that hardclock, when not able to execute
> tc_windup() due to the top half currently executing tc_windup(), may
> ask the top half to re-execute tc_windup() (and not tc_setclock())
> once the current invocation finishes.
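For concreteness, a minimal sketch of that scheme, using a hypothetical
tc_windup_pending flag (the flag, the re-run loop and the acq/rel
spelling are illustration only, not part of the patch, and the sketch
ignores the small window where a request made just before the release
is missed until the next tick):

#include <sys/param.h>
#include <machine/atomic.h>

static void tc_windup_locked(void);	/* the real windup work, as in the patch */

static volatile u_int tc_windup_lock;
static volatile u_int tc_windup_pending;	/* hypothetical */

static void
tc_windup(int top_call)
{

	for (;;) {
		if (atomic_cmpset_acq_int(&tc_windup_lock, 0, 1)) {
			do {
				/* Clear any pending request, then do the work. */
				tc_windup_pending = 0;
				tc_windup_locked();
				/* Re-run if hardclock asked for it meanwhile. */
			} while (tc_windup_pending != 0);
			atomic_store_rel_int(&tc_windup_lock, 0);
			break;
		} else if (!top_call) {
			/* hardclock: don't spin; ask the owner to re-run. */
			tc_windup_pending = 1;
			break;
		}
	}
}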
I see.  I think that is unnecessary.

tc_windup() from the quasi-periodic hardclock() just needs to be called
often enough to keep the timecounters from overflowing, but not so often
as to expose races or inefficiencies by cycling through the timehands
too rapidly.  So if there are problems with the extra calls from
tc_setclock(), then it is the case where the lock is contended that is
least harmful!  In that case, the call from tc_setclock() replaces a
call from hardclock() with the same timing to within a few nanoseconds.
We might have problems in other cases:

- malicious/stress-testing calls to settime() cause rapid cycling of the
  timehands.  Hopefully we fixed the races exposed by this last year.
  You are now reducing the number of timehands to 2.  This runs the
  races more often.  But if the fixes work, then even 1 timehands should
  work.  The multiple timehands are just an optimization that works by
  increasing the chance that binuptime() and friends don't have to wait
  for a usable timehands.

- on idle systems, tickless kernels are close to not calling tc_windup()
  enough to keep the timecounters from overflowing.  I use HZ = 100 and
  this gives only 20-30 timer interrupts/second.  The TSC timecounter at
  4 GHz would overflow after about 1 second, and the bogus TSC-low
  timecounter at 2 GHz overflows after about 2 seconds (the arithmetic
  is sketched just after this list).  Other timecounters might overflow
  in less than 1 second, or tickless kernels might get closer to
  actually tickless and give only 1 timer interrupt every few seconds on
  idle systems.  Suspended systems give no timer interrupts, and the
  resume code for restarting the timecounter isn't quite right.

- on idle systems, after coming out of idle, IIRC the timer code
  delivers virtual ticks to the timecounter and other subsystems.  This
  also isn't quite right.  It risks rapidly cycling through the
  timehands, but it is better than the handling of restarting for
  resume.  In both cases, whenever the sleep interval is so long that
  timecounter state is lost by stopping or wrapping, the timecounter
  should be reset.  The length of the sleep interval must be determined
  using another timer if the timecounter stops or wraps.
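The overflow arithmetic in the second item is just the counter width
over the frequency.  A throwaway sketch (the helper name is mine;
tc_counter_mask and tc_frequency are the real struct timecounter
fields):

#include <sys/param.h>
#include <sys/timetc.h>

/*
 * Longest interval between tc_windup() calls before the hardware
 * counter wraps undetectably, in microseconds.  mask = 2^32 - 1 at
 * 4 GHz gives ~1.07e6 us (about 1 second); the 2 GHz TSC-low counter
 * gives about 2.1 seconds.
 */
static uint64_t
tc_wrap_interval_us(const struct timecounter *tc)
{

	return (((uint64_t)tc->tc_counter_mask + 1) * 1000000 /
	    tc->tc_frequency);
}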
>>> *...
>>>>> +	for (;;) {
>>>>> +		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
>>>>> +			atomic_thread_fence_acq();
>>>>> +			tc_windup_locked();
>>>>> +			atomic_store_rel_int(&tc_windup_lock, 0);
>>>>> +			break;
>>>>> +		} else if (!top_call) {
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> }
>>>>
>>>> I like to write optimized locking like that, but don't see why you
>>>> want it here.
>>> Because I do not want the fast clock interrupt handler to wait on the
>>> lock, which necessarily becomes a spinlock.  The ntp_lock is bad
>>> enough already.  What I coded above is not a real lock.  In the
>>> contention case it bails out, so there is no blocking.  In
>>> particular, witness is not relevant to it.
>>
>> I think this is too complicated.  Spinlocks aren't very expensive, and
>> hardclock() already has some.  I think it has several.  It has at
>> least one silly one -- in hardclock_cpu(), thread locking just to lock
>> a single flags update.  hardclock() and tc_windup() just aren't called
>> enough for the locking overhead or contention to matter, even at
>> too-large hz.
> Spinlocks are quite expensive.  They delay interrupt delivery.  I do
> not want the fast interrupt handler to be delayed due to the top half
> of the kernel executing settimeofday(2) in a loop.

I still think it is too complicated.  Malicious/stress-testing users can
easily find many other denial of service attacks involving mutexes,
without needing root privilege like settimeofday() does.  Just using
time syscalls, there is a thread_lock() (not on curthread(?)) in
get_thread_cputime().  Thread locking doesn't have much contention, but
it does use spinlocks, so it delays fast interrupts much the same as a
general spinlock does, especially in the UP case.

I counted spinlocks and instructions in the "fast" clock interrupt
handler.  There aren't many spinlocks, but there are too many
instructions to be fast:

                      -current    my-5.2
    Xtimerint          10k-50k         -
    clkintr                  -       896
    hardclock             1091       842
    (later counts are included in callers)
    tc_windup               987       693
    __udivdi3               498       422
    timehands_update        163         -

-current is on i386 with an SMP kernel, but on a 4x2 core system reduced
to 1x1 since tracing is too broken to work with multiple cores (the
multiple cores are hard to control, and bugs cause crashes).  my-5.2 is
on i386 with a UP kernel on a 1x1 core system.  Xtimerint is a "fast"
interrupt handler.  clkintr is a normal interrupt handler scheduled by a
fast interrupt handler, since the locking for a fast timer interrupt
handler is too hard to get right.  So the times for all the timer
interrupt handling that is now done by Xtimerint are unavailable in
~5.2.  They are even larger -- much more for context switches.

tc_windup takes 30% more instructions in -current for not completely
obvious reasons.  50-60% of the overhead is for a pessimized division,
and somehow the compiler pessimizes this even better in -current than in
5.2 (using old gcc instead of older gcc).  The division is 64/64 -> 64
bits.  On amd64, this would be a single instruction, but on i386 the
libcall is used.  The pessimizations are:

- the divisor (tc_frequency) is 64 bits, but for all supported
  timecounters it only needs 32 bits.  My CPU can be overclocked to need
  33 bits for the TSC timecounter, but only the bogus TSC-low
  timecounter is available, partly to avoid needing the 33rd bit.

- i386 has a 64/32 -> 32 bit division instruction which can handle all
  the 32-bit divisors that can occur, but no one ever bothered to
  optimize __udivdi3 to use this, so the full 64-bit division is always
  done.

__udivdi3 shows up in other instruction traces.  This is annoying but
doesn't really matter, since it is not called very often.

>>>  	cpu_tick_calibrate(1);
>>>  	nanotime(&tbef);
>>>  	timespec2bintime(ts, &bt);
>>> @@ -1247,8 +1252,10 @@ tc_setclock(struct timespec *ts)
>>>  	bintime2timeval(&bt, &boottime);
>>
>> The leap second update to boottimebin in tc_windup() races with the
>> update to boottimebin here.  This can be fixed partially by using the
>> same locking there.  Then the locking would have to be a spin mutex
>> (or special).  This still doesn't fix non-atomic accesses to
>> boottimebin in nanotime() and friends.  Putting it in the timehands
>> seems to be best for fixing that.
> Yes, timehands for boottimebin should be the solution, but not in the
> scope of this patch.  I will work on this right after the current
> changeset lands in svn.

You wrote the patch faster than I can reply :-).

>> boottime is easier to fix (and doesn't need to be in the timehands).
>> It is only used to copy it out in a historical sysctl.  The locking
>> for that is buggy.  We convert boottimebin to boottime here (now with
>> locking), but in the sysctl we copy out boottime non-atomically.
> Do we need boottime at all ?  Can't it be calculated from boottimebin
> on the sysctl invocation ?

Later calculation is cleaner and probably easier to do with correct
locking.  boottime should be invariant after initialization.  The
invariant copy doesn't need any locking, except to initialize it.
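Roughly, the later calculation would just be the following (the handler
name and the lack of a proper snapshot of boottimebin are illustration
only, not a patch):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sysctl.h>
#include <sys/time.h>

extern struct bintime boottimebin;	/* from kern_tc.c */

/* Compute the legacy kern.boottime timeval on demand. */
static int
sysctl_kern_boottime_sketch(SYSCTL_HANDLER_ARGS)
{
	struct bintime bt;
	struct timeval tv;

	bt = boottimebin;		/* XXX wants an atomic snapshot */
	bintime2timeval(&bt, &tv);
	return (SYSCTL_OUT(req, &tv, sizeof(tv)));
}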
With our fuzzy/broken boottime, boottime has an error of about 1 second,
both initially and later.  Errors from races in not locking it are at
most 1 second (except with 32-bit time they are 2**32 seconds during a
1-second race in 2038).  So null locking is good enough.  The first
write to the RTC in the ntp callout is a good place to freeze boottime
so that it is invariant.

I think that only applications doing buggy calculations of the current
real time, like 'now = boottime + CLOCK_UPTIME_time', would be broken
further by fixing boottime.  uptime(1) used to do the reverse of this
calculation.  It now uses CLOCK_UPTIME_time directly.  This would work
if CLOCK_UPTIME worked.  Both CLOCK_UPTIME and CLOCK_MONOTONIC fail to
count the time while the system is suspended, and have relatively minor
errors for some clock drifts.

Bruce