From owner-svn-src-all@freebsd.org  Wed Jun 29 10:08:17 2016
Return-Path: <owner-svn-src-all@freebsd.org>
Delivered-To: svn-src-all@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0E4CB86F8C;
 Wed, 29 Jun 2016 10:08:17 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42])
 by mx1.freebsd.org (Postfix) with ESMTP id A0F532752;
 Wed, 29 Jun 2016 10:08:17 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c110-21-100-149.carlnfd1.nsw.optusnet.com.au
 [110.21.100.149])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id A08DC3D031F;
 Wed, 29 Jun 2016 19:39:45 +1000 (AEST)
Date: Wed, 29 Jun 2016 19:39:45 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kib@freebsd.org>
cc: src-committers@freebsd.org, svn-src-all@freebsd.org, 
 svn-src-head@freebsd.org
Subject: Re: svn commit: r302252 - head/sys/kern
In-Reply-To: <201606281643.u5SGhNsi061606@repo.freebsd.org>
Message-ID: <20160629175917.O968@besplex.bde.org>
References: <201606281643.u5SGhNsi061606@repo.freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0
 a=XDAe9YG+7EcdVXYrgT+/UQ==:117 a=XDAe9YG+7EcdVXYrgT+/UQ==:17
 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10
 a=YBTR0DOa03FoEZgt01gA:9 a=To-spq41Ae8Nz7We:21 a=wCAWHQOgn0QvNm88:21
 a=CjuIK1q_8ugA:10
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all/>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Jun 2016 10:08:18 -0000

On Tue, 28 Jun 2016, Konstantin Belousov wrote:

> Log:
>  Currently the ntptime code and resettodr() are Giant-locked. In
>  particular, the Giant is supposed to protect against parallel
>  ntp_adjtime(2) invocations.  But, for instance, sys_ntp_adjtime() does
>  copyout(9) under Giant and then examines time_status to return syscall
>  result.  Since copyout(9) could sleep, the syscall result might be
>  inconsistent.

Thanks.

As you know, there are still many errors near here.  This is more complicated
than for settime() vs resettodr() so I won't try to remember them all.
Some are:

- kern_ntptime.c still says that "all routines must run priority splclock
   or higher".  Since spls are null, this just points to missing locking.
   The pointers are quite weak since it only calls splclock() once
   internally and explicit splclock() calls have been removed from all
   callers.  There was only this one call in kern_ntptime.c as far back
   as FreeBSD-3.  Then the syscalls didn't have any locking, so I think
   there were lots of races even with UP non-preemptible kernels.

>  Another and more important issue is that if PPS is configured,
>  hardpps(9) is executed without any protection against the parallel
>  top-level code invocation. Potentially, this may result in the
>  inconsistent state of the ntptime state variables, but I cannot say
>  how serious such distortion is. The non-functional splclock() call in

It is surprising how unserious it is.

>  sys_ntp_adjtime() protected against clock interrupts calling hardpps()
>  in the pre-SMP era.
>
>  Modernize the locking. A mutex protects ntptime data.  Due to the
>  hardpps() KPI legitimately serving from the interrupt filters (and
>  e.g. uart(4) does call it from filter), the lock cannot be sleepable
>  mutex if PPS_SYNC is defined.  Otherwise, use normal sleepable mutex
>  to reduce interrupt latency.

The largest locking errors that I knew about were for pps and
ntp_update_second().  I still don't trust calling time code from fast
interrupt handlers (now misnamed filters) and don't allow it in my
version.  So my hardclock(), statclock() and profclock() interrupt handlers
are not fast, although this adds large overheads (nearly 1% at low hz and
np profileing).  I have a pps handler in the RTC interrupt handler.  I
didn't get around to fixing the pps locking, but since the handler is not
fast the fixes using sleep locks would be easier.  I have a pps handler
in sio.  sio normally uses fast interrupts, but I never use pps for it
so I didn't need to fix this.

> Modified: head/sys/kern/kern_ntptime.c
> ==============================================================================
> --- head/sys/kern/kern_ntptime.c	Tue Jun 28 16:42:40 2016	(r302251)
> +++ head/sys/kern/kern_ntptime.c	Tue Jun 28 16:43:23 2016	(r302252)
> @@ -162,6 +162,30 @@ static l_fp time_adj;			/* tick adjust (
>
> static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
>
> +static struct mtx ntpadj_lock;
> +MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
> +#ifdef PPS_SYNC
> +    MTX_SPIN
> +#else
> +    MTX_DEF
> +#endif
> +);

This needs to be a spinlock in all cases, since ntp_update_second() needs
to be locked and ntp_update_second() is called from tc_windup() is usually
called from a fast interrupt handler.

ntp_update_second() is still unlocked.  It accesses lots of variables, so
it obviously needs locking.  E.g., its very first access is
time_maxerror += MAXFREQ / 1000.  This has the usual non-atomicity for
a read-modify write.  This obiously races with the locked access on the
28th linke of sys_ntp_adjtime().  Only the latter is locked now.

> +
> +/*
> + * When PPS_SYNC is defined, hardpps() function is provided which can
> + * be legitimately called from interrupt filters.  Due to this, use
> + * spinlock for ntptime state protection, otherwise sleepable mutex is
> + * adequate.
> + */
> +#ifdef PPS_SYNC
> +#define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
> +#define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
> +#else
> +#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
> +#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
> +#endif
> +#define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)

*ADJ* and *adj* are not good names, since much more than ntp_adj*() needs
to be locked.

Probably the locking should be shared with kern_tc.c.  tc_windup() still
uses only fancy time-domaind/atomic-op locking.  This mostly works, but
is quite broken by calling tc_windup() from places other than hardclock().
The most problematic other place is from tc_setclock().  That is unlocked,
except accidentally by callers with only Giant locking, so it races with
the fast interrupt handler.

tc_windup() doesn't need the fancy locking for efficiency -- it only needs
it to be consistent with binuptime() and friends.  So it may as well use
the same lock as ntp.  Its lock must be around the whole function to
protect it from tc_setclock().  Then the same lock works for
ntp_update_second() and pps.  Any lock that is only acquired once per
second is not worth optimizing.  Hopefully the same for ntp's other locks.
ntp syscalls should only be frequent under unusual/malicious loads, so
they shouldn't cause much contention with the tc_windup() lock.

> @@ -203,11 +227,12 @@ static long pps_errcnt;			/* calibration
> static void ntp_init(void);
> static void hardupdate(long offset);
> static void ntp_gettime1(struct ntptimeval *ntvp);
> -static int ntp_is_time_error(void);
> +static bool ntp_is_time_error(int tsl);
>
> -static int
> -ntp_is_time_error(void)
> +static bool
> +ntp_is_time_error(int tsl)
> {
> +

Is it worth changing this to make a single variable less volatile?

bool shoudn't be used in old portable code.

> ...
> @@ -291,14 +315,17 @@ ntp_sysctl(SYSCTL_HANDLER_ARGS)
> {
> 	struct ntptimeval ntv;	/* temporary structure */
>
> +	NTPADJ_LOCK();
> 	ntp_gettime1(&ntv);
> +	NTPADJ_UNLOCK();
>
> 	return (sysctl_handle_opaque(oidp, &ntv, sizeof(ntv), req));
> }

> #ifdef PPS_SYNC
> SYSCTL_INT(_kern_ntp_pll, OID_AUTO, pps_shiftmax, CTLFLAG_RW,
> @@ -308,10 +335,12 @@ SYSCTL_INT(_kern_ntp_pll, OID_AUTO, pps_
> SYSCTL_LONG(_kern_ntp_pll, OID_AUTO, time_monitor, CTLFLAG_RD,
>     &time_monitor, 0, "Last time offset scaled (ns)");
>
> -SYSCTL_OPAQUE(_kern_ntp_pll, OID_AUTO, pps_freq, CTLFLAG_RD,
> -    &pps_freq, sizeof(pps_freq), "I", "Scaled frequency offset (ns/sec)");
> -SYSCTL_OPAQUE(_kern_ntp_pll, OID_AUTO, time_freq, CTLFLAG_RD,
> -    &time_freq, sizeof(time_freq), "I", "Frequency offset (ns/sec)");
> +SYSCTL_S64(_kern_ntp_pll, OID_AUTO, pps_freq, CTLFLAG_RD | CTLFLAG_MPSAFE,
> +    &pps_freq, 0,
> +    "Scaled frequency offset (ns/sec)");
> +SYSCTL_S64(_kern_ntp_pll, OID_AUTO, time_freq, CTLFLAG_RD | CTLFLAG_MPSAFE,
> +    &time_freq, 0,
> +    "Frequency offset (ns/sec)");
> #endif

These are not really MPSAFE, and OPAQUE was better than S64 since it
doesn't hard-code the size.  None of the simple sysctl handlers actually
gives atomic/coherent data.  They just attempt it, and have comments
ad nauseum saying this even for the int8_t case where the access is
surely (?) atomic.  For OPAQUE 64 bits and S64, the accesses are equally
non-atomic for 32-bit arches.

pps_freq and time_freq are volatile, so it is technically necessary to
use a SYSCTL_PROC() to copy them to a local variable with locking as above.

S64 does give better printing that OPAQUE.  My version adds sysctls for
some more variables (mainly time_offset and time_adj).  I still use
OPAQUE, and with this, at least i386 the variables are printed confusingly
as 2 decimal intergers.

Bruce