Date:      Thu, 30 Jun 2016 05:47:39 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r302252 - head/sys/kern
Message-ID:  <20160630040123.F791@besplex.bde.org>
In-Reply-To: <20160629153233.GI38613@kib.kiev.ua>
References:  <201606281643.u5SGhNsi061606@repo.freebsd.org> <20160629175917.O968@besplex.bde.org> <20160629145443.GG38613@kib.kiev.ua> <20160629153233.GI38613@kib.kiev.ua>

On Wed, 29 Jun 2016, Konstantin Belousov wrote:

> On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
>> This is a reply to two mails in one, both for r302251 and r302252 threads.

Reply to your second reply.

>> On Wed, Jun 29, 2016 at 05:58:05PM +1000, Bruce Evans wrote:
>>> On Tue, 28 Jun 2016, Konstantin Belousov wrote:

>> ...
>>> - the locking should be stronger than mutexes (or splclock()) to
>>>    minimise the delay between deciding to set the time and actually
>>>    setting it.
>> ...
>> This is a separate issue.  Maybe a critical section around the code
>> is the right approximation for reducing delays there.

Spinlocks give a critical section automatically, but we might not want
to put either around the whole clock_settime() syscall, and if we don't
do that then neither works for reducing delays (preemption is especially
likely when the critical region is left).

>>>...
>>>> 	timespecadd(&ts, &clock_adj);
>>>> 	ts.tv_sec -= utc_offset();
>>>
>>> Who knows what utc_offset() is locked by?  Certainly not this new locking,
>>> since it is set in another file but the new locking is only in this file.
>>> It still seems to be most under Giant :-(.
>> utc_offset() uses several variables; updates to each of the variables are
>> atomic.  There is not even an attempt to provide userspace with an interface
>> which would make updates to the components used by utc_offset() consistent.
>> Thus I do not see a need for any locking there.

After removing the foot-shooting tz_minuteswest variable from utc_offset(),
all of the variables in it are ones set by adjkerntz sysctls.  The settings
are normally ordered by not running more than 1 instance of adjkerntz.
When adjkerntz starts, this instance shouldn't have any contention with
other initialization doing resettodr().  There is contention every 6
months, in the wall_cmos_clock case, when adjkerntz changes the offset for
daylight saving time.  The chances of this used to be tiny since resettodr() was
almost never called, but now there is the periodic resettodr() in
kern_ntptime.c.


>>> ...
>>> The change in the other part of the diff is related to this, and doesn't
>>> seem quite right.  utc_offset() uses the variables tz_minuteswest,
>>> wall_cmos_clock and adjkerntz:
>>> - tz_minuteswest is set with no locking in kern_settimeofday(), just
>>>    after the settime() call.  This last had a chance of being correct
>>>    when the kernel was UP and not preemptible.  But it is broken in
>>>    a more fundamental way back to at least FreeBSD-1: resettodr() is
>>>    called _before_ updating the timezone, so it uses an out-of-date
>>>    tz_minuteswest.
>> First, the man page is not correct, saying that 'this information is
>> kept outside the kernel'.

We deprecate the timezone support in settimeofday(), but the wording used
here is wrong.  The info set by this syscall is of course kept in the kernel,
and I think the man page is trying to say that this info is never used
outside of the kernel.

The man page doesn't say enough about deprecation, or that we only use this
info for foot-shooting in the kernel, only (?) in utc_offset().  utc_offset()
does:

 	return (tz_minuteswest * 60 + (wall_cmos_clock ? adjkerntz : 0));

If !wall_cmos_clock, then the offset should be 0, so any setting of
tz_minuteswest using the timezone info breaks the offset, and this breaks
the case where the rtc is _not_ on wall time.

If wall_cmos_clock, then adjkerntz(8) normally maintains the offset in
the adjkerntz variable, so again tz_minuteswest must be 0 to work and
any setting of it to nonzero using the timezone info breaks the offset.
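
Here is a rough sketch of utc_offset() with the tz_minuteswest term
removed, as argued above (illustrative only, not a committed change):

	int
	utc_offset(void)
	{

		return (wall_cmos_clock ? adjkerntz : 0);
	}

Then the offset is nonzero only in the wall_cmos_clock case, where
adjkerntz(8) is responsible for keeping the adjkerntz variable current.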

>> Second, why does settime(), logically, need the updated timezone ?
>> I understand that resettodr() is called on successful update of
>> the wall clock.  But shouldn't the need to reset not-GMT RTC clock
c

The timezone updates just give foot-shooting, but we need an up-to-date
adjkerntz variable initially and on every daylight saving change
in the wall_cmos_clock case.  This is the normal way to keep the rtc
on wall time after each daylight saving change -- adjkerntz() runs,
and it should change the adjkerntz variable first and then call
settimeofday() with a "null" change to get the side effect of calling
resettodr() which does a large change of the rtc.  It is impossible
to make a really null change of the software clock using settimeofday(),
but I don't know of any better API to do this.  I think the locking that
you just added is enough -- adjkerntz() just has to arrange a call to
resettodr() strictly after it updates the variable, and that happens
almost automatically.
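
To illustrate the sequence (a hypothetical userland sketch, not the
actual adjkerntz(8) source; it assumes the machdep.adjkerntz sysctl and
an ordinary gettimeofday()/settimeofday() pair):

	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <sys/time.h>
	#include <err.h>

	static void
	push_new_offset(int new_offset)	/* rtc-to-UTC offset in seconds */
	{
		struct timeval tv;

		/* 1. Update the in-kernel adjkerntz variable first. */
		if (sysctlbyname("machdep.adjkerntz", NULL, NULL,
		    &new_offset, sizeof(new_offset)) != 0)
			err(1, "sysctl machdep.adjkerntz");

		/*
		 * 2. "Null" change of the software clock: read the time and
		 * write it straight back.  The change itself is as small as
		 * the API allows; the point is the side effect of calling
		 * resettodr(), which rewrites the rtc with the new offset.
		 */
		if (gettimeofday(&tv, NULL) != 0)
			err(1, "gettimeofday");
		if (settimeofday(&tv, NULL) != 0)
			err(1, "settimeofday");
	}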

>>>> 	/* XXX: We should really set all registered RTCs */
>>>> -	if ((error = CLOCK_SETTIME(clock_dev, &ts)) != 0)
>>>> +	error = CLOCK_SETTIME(clock_dev, &ts);
>>>> +	mtx_unlock(&resettodr_lock);
>>>> +	if (error != 0)
>>>> 		printf("warning: clock_settime failed (%d), time-of-day clock "
>>>> 		    "not adjusted to system time\n", error);
>>>> }
>>>
>>> The lock needs to cover all high-level accesses to the rtc and related
>>> variables to avoid races, and a bit more to minimise delays.  In particular,
>>> inittodr() should be under this lock, and there is a problem with
>>> "show rtc" in ddb.
>> I do not think that inittodr() actually needs that locking.  This code
>> is executed at the initialization stage, or at the resume stage.

Perhaps, but resume runs more often than every 6 months for the daylight
saving time race, and it isn't clear that anything stops resettodr() running
concurrently then.

In my version, something like inittodr() runs 3 seconds after every exit
from ddb.  Stopping in ddb is a bit like a short suspend-resume -- it
stops or breaks some timers, and inittodr() is a way to fix up the
software real time using an unstopped timer.

>> ...
>>>> +/*
>>>> + * When PPS_SYNC is defined, hardpps() function is provided which can
>>>> + * be legitimately called from interrupt filters.  Due to this, use
>>>> + * spinlock for ntptime state protection, otherwise sleepable mutex is
>>>> + * adequate.
>>>> + */
>>>> +#ifdef PPS_SYNC
>>>> +#define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
>>>> +#define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
>>>> +#else
>>>> +#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
>>>> +#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
>>>> +#endif
>>>> +#define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
>>>
>>> *ADJ* and *adj* are not good names, since much more than ntp_adj*() needs
>>> to be locked.
>> Do you have a suggestion for a better name ?

Just a general name like ntp_lock for ntp or t_lock if the same lock is
used for timecounters.  I also prefer not to use macros except for
development versions, and then name the lock as xxx_mtx, not xxx_lock.
We depend on special properties of (spin) mutexes.

>> ...
>>> Probably the locking should be shared with kern_tc.c.  tc_windup() still
>>> uses only fancy time-domain/atomic-op locking.  This mostly works, but
>>> is quite broken by calling tc_windup() from places other than hardclock().
>>> The most problematic other place is from tc_setclock().  That is unlocked,
>>> except accidentally by callers with only Giant locking, so it races with
>>> the fast interrupt handler.
>>>
>>> tc_windup() doesn't need the fancy locking for efficiency -- it only needs
>>> it to be consistent with binuptime() and friends.  So it may as well use
>>> the same lock as ntp.  Its lock must be around the whole function to
>>> protect it from tc_setclock().  Then the same lock works for
>>> ntp_update_second() and pps.  Any lock that is only acquired once per
>>> second is not worth optimizing.  Hopefully the same for ntp's other locks.
>>> ntp syscalls should only be frequent under unusual/malicious loads, so
>>> they shouldn't cause much contention with the tc_windup() lock.
>> So I postponed looking at tc_windup(), and it seems that it was
>> useful.  After your note above, it is mostly clear that we can skip some
>> tc_windup() calls which conflict with currently active windup calls.
>> E.g., the hardclock update, which is done simultaneously with the
>> setclock(), could be just skipped.  But we must not skip the tc_windup()
>> call from setclock().

Several things could be deferred, but I think that would just be more
complicated.

>> As an additional, but probably excessive measure, invocation of
>> tc_windup() from hardclock could ask the setclock() to re-execute
>> tc_windup() after its call is done.

I don't quite understand this.  hardclock -> tc_windup() normally doesn't
have interference from tc_setclock().


>> ...
>> Patch does the following:
>> - change the ntpadj_lock to spinlock.
>> - acquire the ntpadj_lock around ntp_update_second().
>> - prevent parallel tc_windup() invocations, while not allowing to
>>   skip calls coming from kernel top half.

tc_setclock() needs locking even more than tc_windup().

>> - move resettodr_lock acquisition in resettodr() later.
>
> I now think that the lock around tc_windup() should have more flavor of
> real spinlocks.  Ideally the spinlock_enter() should be taken around it,
> but it might be too extreme.  Just a critical section in tc_setclock()
> is good enough.

No, it must have a full lock since it does read-modify-write on critical
variables.
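
A minimal sketch of what I mean (illustrative names, with the details of
the real function elided):

	static struct mtx tc_setclock_mtx;
	MTX_SYSINIT(tc_setclock_init, &tc_setclock_mtx, "tcsetc", MTX_SPIN);

	void
	tc_setclock(struct timespec *ts)
	{
		struct bintime bt;

		timespec2bintime(ts, &bt);
		mtx_lock_spin(&tc_setclock_mtx);
		/*
		 * The read-modify-write of boottimebin/boottime and the
		 * windup of the timehands all happen under the one lock,
		 * so another tc_setclock() or tc_windup() cannot interleave.
		 */
		/* ... adjust boottimebin and boottime here ... */
		tc_windup();
		mtx_unlock_spin(&tc_setclock_mtx);
	}

The same mutex (or the ntp one) would then be taken in the windup path.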

> ...
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 0f015b3..a5776d2 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> ...
> @@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts)
> 	struct timespec tbef, taft;
> 	struct bintime bt, bt2;
>
> +	critical_enter();
> 	cpu_tick_calibrate(1);
> 	nanotime(&tbef);
> 	timespec2bintime(ts, &bt);
> @@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts)
> 	bintime2timeval(&bt, &boottime);

boottime is one of the critical variables.  Before this, there are 2
accesses to boottimebin.  The Giant locking that you just removed gave
only some protection here.  It prevented concurrent acccess by
malicious/stress-testing code doing settime().  It didn't prevent
nanotime() and friends seeing boottime in an in-between state.  Oops, 
mutex locking won't fix that either.  I think the update of these
variables needs to be moved into tc_windup() where it will hopefully
automatically be protected by the generation count, like the timehands
offsets already are.
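
For reference, the pattern I mean is the generation loop that the
readers already use (a sketch modelled on binuptime(); names as in
kern_tc.c):

	static void
	sample_uptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = atomic_load_acq_int(&th->th_generation);
			*bt = th->th_offset;
			bintime_addx(bt, th->th_scale * tc_delta(th));
			/* A boottime copy in the timehands would be read here too. */
			atomic_thread_fence_acq();
		} while (gen == 0 || gen != th->th_generation);
	}

If the boottime update were done in tc_windup(), a per-timehands copy
would be published under the same generation, so readers could never see
it half-updated.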

boottime is also accessed with no locking in the kern.boottime sysctl.

There is also ffclock_boottime.  I don't like the duplication for
ffclock, but it doesn't have much duplication for initializing
ffclock_boottime.

>
> 	/* XXX fiddle all the little crinkly bits around the fiords... */
> -	tc_windup();
> +	tc_windup(true);
> 	nanotime(&taft);
> +	cpu_tick_calibrate(1);

The 2 calls to cpu_tick_calibrate() in this function are excessive.  These
calls are just a hack to recalibrate after suspend/resume as a side effect
of calling here.  The first call suffices, since cpu ticks have nothing to
do with any time variable that is set here.

> +	critical_exit();
> 	if (timestepwarnings) {
> 		log(LOG_INFO,
> 		    "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n",
> @@ -1256,7 +1260,23 @@ tc_setclock(struct timespec *ts)
> 		    (intmax_t)taft.tv_sec, taft.tv_nsec,
> 		    (intmax_t)ts->tv_sec, ts->tv_nsec);
> 	}
> -	cpu_tick_calibrate(1);
> +}
> +
> +static volatile int tc_windup_lock;
> +static void
> +tc_windup(bool top_call)
> +{
> +
> +	for (;;) {
> +		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
> +			atomic_thread_fence_acq();
> +			tc_windup_locked();
> +			atomic_store_rel_int(&tc_windup_lock, 0);
> +			break;
> +		} else if (!top_call) {
> +			break;
> +		}
> +	}
> }

I like to write optimized locking like that, but don't see why you
want it here.

Also, how does WITNESS work for almost any new lock that we don't add
to the witness list?  WITNESS of course doesn't work for special locking
like this.  MTX_NOWITNESS gives mutexes that are almost as efficient as
the above, but with no witnessing of course.  I prefer to witness
almost everything so that it can be debugged.
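
Something like the following would be nearly as cheap as the cmpset loop
and can be witnessed (a sketch; it assumes a trylock operation on spin
mutexes is available for the skip-if-busy case):

	static struct mtx tc_windup_mtx;
	MTX_SYSINIT(tc_windup_mtx, &tc_windup_mtx, "tc_windup", MTX_SPIN);
	/* Adding MTX_NOWITNESS here would give the unwitnessed variant. */

	static void
	tc_windup(bool top_call)
	{

		if (top_call)
			mtx_lock_spin(&tc_windup_mtx);
		else if (!mtx_trylock_spin(&tc_windup_mtx))
			return;		/* a windup is already in progress */
		tc_windup_locked();
		mtx_unlock_spin(&tc_windup_mtx);
	}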

Bruce


