Date:      Fri, 1 Jul 2016 03:30:41 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r302252 - head/sys/kern
Message-ID:  <20160701005401.Q1084@besplex.bde.org>
In-Reply-To: <20160629211953.GK38613@kib.kiev.ua>
References:  <201606281643.u5SGhNsi061606@repo.freebsd.org> <20160629175917.O968@besplex.bde.org> <20160629145443.GG38613@kib.kiev.ua> <20160629153233.GI38613@kib.kiev.ua> <20160630040123.F791@besplex.bde.org> <20160629211953.GK38613@kib.kiev.ua>

On Thu, 30 Jun 2016, Konstantin Belousov wrote:

> On Thu, Jun 30, 2016 at 05:47:39AM +1000, Bruce Evans wrote:
>> On Wed, 29 Jun 2016, Konstantin Belousov wrote:
>>
>>> On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
>>>> This is a reply to two mails in one, both for r302251 and r302252 threads.
>
>> After removing the foot-shooting tz_minuteswest variable from utc_offset(),
>> all of the variables in it are ones set by adjkerntz sysctls.  The settings
>> are normally ordered by not running more than 1 instance of adjkerntz.
>> When adjkerntz starts, this instance shouldn't have any contention with
>> other initialization doing resettodr().  There is contention every 6
>> months in the wall_cmos_clock case, when adjkerntz changes the offset
>> for daylight saving time.  The chances of this used to be tiny since
>> resettodr() was
>> almost never called, but now there is the periodic resettodr() in
>> kern_ntptime.c.
> I do not see how your text above changes anything I said about consistency.
> Each sysctl sets or fetches an integer variable, so individually it is atomic.
> Inter-dependencies between the wall_cmos_clock and adjkerntz settings are
> completely up to the user; no locking can change that while keeping separate
> sysctls to set and fetch each variable individually.
> 
> The only side-effects in resettodr() are possible in CLOCK_SETTIME(),
> which is protected against parallel invocation by resettodr_lock.

My text says that the users are sufficiently limited and careful about
ordering for the problem to be small.

>>>>> ...
>>>>> The change in the other part of the diff is related to this, and doesn't
>>>>> seem quite right.  utc_offset() uses the variables tz_minuteswest,
>>>>> wall_cmos_clock and adjkerntz:
>>>>> - tz_minuteswest is set with no locking in kern_settimeofday(), just
>>>>>    after the settime() call.  This last had a chance of being correct
>>>>>    when the kernel was UP and not preemptible.  But it is broken in
>>>>>    a more fundamental way back to at least FreeBSD-1: resettodr() is
>>>>>    called _before_ updating the timezone, so it uses an out-of-date
>>>>>    tz_minuteswest.
>>>> First, the man page is not correct, saying that 'this information is
>>>> kept outside the kernel'.
>>
>> We deprecate the timezone support in settimeofday(), but use wrong wording
>> for this here.  The info set by this sysctl is of course kept in the kernel,
>> and I think the man page is trying to say that this info is never used
>> outside of the kernel.
> How could the latter statement be true, since that information is
> returned by gettimeofday(), and we do not control what programs a user
> might run on his machine.

Indeed.  The man page is only trying to say that this info is not used by
libc (and therefore should not be used by any program).

> ...
>>>> Second, why does settime(), logically, need the updated timezone?
>>>> I understand that resettodr() is called on successful update of
>>>> the wall clock.  But shouldn't the need to reset not-GMT RTC clock
>>
>> The timezone updates just give foot-shooting, but we need an up to
>> date adjkerntz variable initially and on every daylight saving change
>> in the wall_cmos_clock case.  This is the normal way to keep the rtc
>> on wall time after every daylight saving changes -- adjkerntz() runs,
>> and it should change the adjkerntz variable first and then call
>> settimeofday() with a "null" change to get the side effect of calling
>> resettodr() which does a large change of the rtc.  It is impossible
>> to make a really null change of the software clock using settimeofday(),
>> but I don't know of any better API to do this.  I think the locking that
>> you just added is enough -- adjkerntz() just has to arrange a call to
>> resettodr() strictly after it updates the variable, and that happens
>> almost automatically.
> I think that the way it is done is just by doing sysctl machdep.adjkerntz,
> which calls resettodr().

I probably knew that once, but had forgotten it.

> Old wisdom to re-sync RTC to kernel-maintained
> time was to run the sysctl from cron, but Andrey added the callout
> several years ago.

I never knew that old wisdom, and just avoided re-syncing the RTC.

>>>> I do not think that inittodr() actually needs that locking.  This code
>>>> is executed at the initialization stage, or at the resume stage.
>>
>> Perhaps, but resume runs more often than every 6 months for the daylight
>> savings time race, and it isn't clear if anything stops resettodr() running
>> concurrently then.
>>
>> In my version, something like inittodr() runs 3 seconds after every exit
>> from ddb.  Stopping in ddb is a bit like a short suspend-resume -- it
>> stops or breaks some timers, and inittodr() is a way to fix up the
>> software real time using an unstopped timer.
> Let's postpone this.  Maybe resettodr_lock can be applied to
> inittodr() as a whole, and additionally merged with some other lock.
> But I think inittodr() is not very important right now.

I see some more things that need the common lock:
- the RTC interrupt handler.  resettodr() begins by disabling RTC
   interrupts.  This is to prevent interference from the interrupt
   handler, but it only works for UP systems.  For SMP systems, the
   interrupt handler might already be running, or scheduled to run,
   on another CPU.  The interrupt handler doesn't do many RTC accesses
   (except in my version), but it does delicate interrupt re-enabling
   by reading the volatile status register.  resettodr() ends by
   re-enabling RTC interrupts and reading the volatile status register.
   I think all orderings work, but it would be simpler to not allow the
   contention or maybe to allow it intentionally by not disabling RTC
   interrupts.
- atrtc_restore().  This runs mainly for resume.  It resets the same
   status register as resettodr() and another status register.  It
   must not be allowed to run concurrently with resettodr().  Resume
   calls it before allowing any callouts.  It is unclear if this is
   guaranteed.
> ...
>>>> So I postponed looking at tc_windup(), and it seems that it was
>>>> useful.  After your note above, it is mostly clear that we can skip some
>>>> tc_windup() calls which conflict with currently active windup calls.
>>>> E.g., the hardclock update, which is done simultaneously with the
>>>> setclock(), could be just skipped.  But we must not skip the tc_windup()
>>>> call from setclock().
>>
>> Several things could be deferred, but I think that would just be more
>> complicated.
>>
>>>> As an additional, but probably excessive measure, invocation of
>>>> tc_windup() from hardclock could ask the setclock() to re-execute
>>>> tc_windup() after its call is done.
>>
>> I don't quite understand this.  hardclock -> tc_windup() normally doesn't
>> have interference from tc_setclock().
> In what sense does it not have interference?  tc_windup() may be executed
> from hardclock/hardclock_cnt (low half) and concurrently from setclock()
> (top half).  And this is exactly what we want to avoid, isn't it ?

Yes, but you seemed to be saying that hardclock should schedule or call
tc_setclock().  That makes no sense.  So I thought that you meant
tc_setclock() calling or scheduling tc_windup() in a different way than
it does now (it just calls it now).

>>> ...
>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 0f015b3..a5776d2 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> ...
>>> @@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts)
>>> 	struct timespec tbef, taft;
>>> 	struct bintime bt, bt2;
>>>
>>> +	critical_enter();
>>> 	cpu_tick_calibrate(1);
>>> 	nanotime(&tbef);
>>> 	timespec2bintime(ts, &bt);
>>> @@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts)
>>> 	bintime2timeval(&bt, &boottime);
>>
>> boottime is one of the critical variables.  Before this, there are 2
>> accesses to boottimebin.  The Giant locking that you just removed gave
>> only some protection here.  It prevented concurrent access by
>> malicious/stress-testing code doing settime().  It didn't prevent
>> nanotime() and friends seeing boottime in an in-between state.  Oops,
>> mutex locking won't fix that either.  I think the update of these
>> variables needs to be moved into tc_windup() where it will hopefully
>> automatically be protected by the generation count, like the timehands
>> offsets already are.
> I do not understand how that would help.  Do you mean to move the vars
> into current timehands ?

I hoped that it would just work without putting it in timehands, but
now I don't see a better way than putting it in timehands.  The space
wastage is not too bad, and we might need it in timehands for other
reasons.  One is that the actual boot time is invariant, but FreeBSD
varies it; a global is needed for the actual boot time and then it is
natural to put the fudged boot time in timehands (it is not a boot
time, but an offset which is sometimes close to the boot time).

Note that boottimebin is also updated in tc_windup() (for leap seconds).
The new mutex doesn't protect it there.  This is another of the reasons
to put it in timehands.

> ...
>>> +	for (;;) {
>>> +		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
>>> +			atomic_thread_fence_acq();
>>> +			tc_windup_locked();
>>> +			atomic_store_rel_int(&tc_windup_lock, 0);
>>> +			break;
>>> +		} else if (!top_call) {
>>> +			break;
>>> +		}
>>> +	}
>>> }
>>
>> I like to write optimized locking like that, but don't see why you
>> want it here.
> Because I do not want the fast clock interrupt handler to wait on a lock,
> which necessarily becomes a spinlock.  The ntp_lock is bad enough already.
> What I coded above is not a real lock.  In the contention case it bails out,
> so there is no blocking.  In particular, witness is not relevant to it.

I think this is too complicated.  Spinlocks aren't very expensive, and
hardclock() already has some.  I think it has several.  It has at least
one silly one -- in hardclock_cpu(), thread locking just to lock a single
flags update.  hardclock() and tc_windup() just aren't called enough
for the locking overhead or contention to matter, even at too-large hz.

> ...
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 0f015b3..7bb767c 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
>...
> +static struct mtx tc_setclock_mtx;
> +MTX_SYSINIT(tc_setclock_init, &tc_setclock_mtx, "tcsetc", MTX_DEF);

I thought it would be a spinlock.  The sleep lock is better if you can
make it work, but this is complicated.

> +
> /*
>  * Step our concept of UTC.  This is done by modifying our estimate of
>  * when we booted.
> - * XXX: not locked.
>  */

Now "not properly locked" :-).

> void
> tc_setclock(struct timespec *ts)
> @@ -1237,6 +1240,8 @@ tc_setclock(struct timespec *ts)
> 	struct timespec tbef, taft;
> 	struct bintime bt, bt2;
>
> +	mtx_lock(&tc_setclock_mtx);
> +	critical_enter();

I thought at first that this critical section was redundant, since a
spinlock would already have it.

> 	cpu_tick_calibrate(1);
> 	nanotime(&tbef);
> 	timespec2bintime(ts, &bt);
> @@ -1247,8 +1252,10 @@ tc_setclock(struct timespec *ts)
> 	bintime2timeval(&bt, &boottime);

The leap second update to boottimebin in tc_windup() races with the
update to boottimebin here.  This can be fixed partially by using the
same locking there.  Then the locking would have to be a spin mutex
(or special).  This still doesn't fix non-atomic accesses to boottimebin
in nanotime() and friends.  Putting it in timehands seems to be best for
fixing that.

boottime is easier to fix (and doesn't need to be in timehands).  It
is only used to copy it out in a historical sysctl.  The locking for
that is buggy.  We convert boottimebin to boottime here (now with locking)
but in the sysctl we copy out boottime non-atomically.


Bruce


