From owner-svn-src-head@freebsd.org  Wed Jun 29 15:32:46 2016
Return-Path: <owner-svn-src-head@freebsd.org>
Delivered-To: svn-src-head@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 028CEB866C2;
 Wed, 29 Jun 2016 15:32:46 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 4901C24C5;
 Wed, 29 Jun 2016 15:32:45 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kib@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u5TFWX28072367
 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Wed, 29 Jun 2016 18:32:33 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u5TFWX28072367
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id u5TFWXPM072366;
 Wed, 29 Jun 2016 18:32:33 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Wed, 29 Jun 2016 18:32:33 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org,
 svn-src-head@freebsd.org
Subject: Re: svn commit: r302252 - head/sys/kern
Message-ID: <20160629153233.GI38613@kib.kiev.ua>
References: <201606281643.u5SGhNsi061606@repo.freebsd.org>
 <20160629175917.O968@besplex.bde.org>
 <20160629145443.GG38613@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160629145443.GG38613@kib.kiev.ua>
User-Agent: Mutt/1.6.1 (2016-04-27)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
 <svn-src-head.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head/>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Jun 2016 15:32:46 -0000

On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
> This is a reply to two mails in one, both for r302251 and r302252 threads.
> 
> On Wed, Jun 29, 2016 at 05:58:05PM +1000, Bruce Evans wrote:
> > On Tue, 28 Jun 2016, Konstantin Belousov wrote:
> > 
> > > Log:
> > >  Do not use Giant to prevent parallel calls to CLOCK_SETTIME().  Use
> > >  private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
> > >  allowed to sleep.
> > 
> > Thanks.
> > 
> > As you know, clock locking is still quite broken.  The most obvious
> > bugs near here are now:
> > - settime() still begins with splclock().  This gave fewer bugs in
> >    clock locking when it last worked in FreeBSD-4.  Now it is just
> >    a hint that the clock locking has still not been converted to use
> >    mutexes.
> > - bitrot near the end of settime() removed the splx() that should
> >    be paired with the splclock().
> Yes, I am aware of both splclock() and Giant braces in the setclock().
> I did not wanted to even look at tc_setclock() at all, for that change.
> See below.
> 
> > - the locking should be stronger than mutexes (or splclock()) to
> >    minimise the delay between deciding to set the time and actually
> >    setting it.
> ...
> This is separate issue.  Might be a critical section around the code
> is the right approximation for reducing delays there.
> 
> > > +	mtx_lock(&resettodr_lock);
> > 
> > This could be later (only around the hardware parts) since we are sloppy
> > with the delays.
> Done.
> 
> > 
> > > 	getnanotime(&ts);
> > 
> > This doesn't even try to get nanoseconds accuracy.
> > 
> > > 	timespecadd(&ts, &clock_adj);
> > > 	ts.tv_sec -= utc_offset();
> > 
> > Who knows what utc_offset() is locked by?  Certainly not this new locking,
> > since it is set in another file but the new locking is only in this file.
> > It still seems to be most under Giant :-(.
> utc_offset() uses several variables, updates to each of the variable is
> atomic.  There is no even an attempt to provide userspace with an interface
> which would makes updates to the components used by utc_offset() consistent.
> Thus I do not see a need in any locking there.
> 
> > 
> > The change in the other part of the diff is related to this, and doesn't
> > seem quite right.  utc_offset() uses the variables tz_minuteswest,
> > wall_cmos_clock and adjkerntz:
> > - tz_minuteswest is set with no locking in kern_settimeofday(), just
> >    after the settime() call.  This last had a chance of being correct
> >    when the kernel was UP and not preemptible.  But it is broken in
> >    a more fundamental way back to at least FreeBSD-1: resettodr() is
> >    called _before_ updating the timezone, so it uses an out-of-date
> >    tz_minuteswest.
> First, the man page is not correct, saying that 'this information is
> kept outside the kernel'.
> 
> Second, why does settime(), logicall, need the updated timezone ?
> I understand that resettodr() is called on successful update of
> the wall clock.  But shouldn't the need to reset not-GMT RTC clock
> in kern_settimeofday() satisfied by yet another call to resettodr()
> if tzp != NULL, after the tz_* vars update ?
> 
> > - wall_cmos_clock is still under a simple (not MPSAFE) sysctl
> > - adjkerntz is under a PROC sysctl, and you just changed this sysctl to
> >    be MPSAFE.  The sysctl calls here, so we don't want all of it
> >    giant-locked.  But it now does unlocked accesses to the adjkerntz
> >    variable, and the access to the disable_rtc_set variable is unlocked
> >    because we do it before acquiring the lock in this function (this
> >    variable is still under a simple (not MPSAFE) sysctl.
> > 
> > Most of the variables are ints, so access to them are atomic, but without
> > locking the correct order only occurs accidentally and is hard to
> > understand.
> I think there is no any global consistency, mostly caused by existence
> of separate interfaces to control all of them.  IMO it is not worth to
> try to ensure any.
> 
> > 
> > > 	/* XXX: We should really set all registered RTCs */
> > > -	if ((error = CLOCK_SETTIME(clock_dev, &ts)) != 0)
> > > +	error = CLOCK_SETTIME(clock_dev, &ts);
> > > +	mtx_unlock(&resettodr_lock);
> > > +	if (error != 0)
> > > 		printf("warning: clock_settime failed (%d), time-of-day clock "
> > > 		    "not adjusted to system time\n", error);
> > > }
> > 
> > The lock needs to cover all high-level accesses to the rtc and related
> > variables to avoid races, and a bit more to minimise delays.  In particular,
> > inittodr() should be under this lock, and there is a problem with
> > "show rtc" in ddb.
> I do not think that inittodr() actually needs that locking.  This code
> is executed at the initialization stage, or at the resume stage.
> 
> Do you mean that CLOCK_GETTIME() hardware access should be made exclusive
> with CLOCK_SETTIME() ?  Is there a hardware that rely on exclusivity and
> not enforce that internally, how x86 rtc does ?
> 
> DDB access to rtc is separate problem.
> 
> On Wed, Jun 29, 2016 at 07:39:45PM +1000, Bruce Evans wrote:
> > On Tue, 28 Jun 2016, Konstantin Belousov wrote:
> > > static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
> > >
> > > +static struct mtx ntpadj_lock;
> > > +MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
> > > +#ifdef PPS_SYNC
> > > +    MTX_SPIN
> > > +#else
> > > +    MTX_DEF
> > > +#endif
> > > +);
> > 
> > This needs to be a spinlock in all cases, since ntp_update_second() needs
> > to be locked and ntp_update_second() is called from tc_windup() is usually
> > called from a fast interrupt handler.
> Oh, I see.
> 
> > 
> > ntp_update_second() is still unlocked.  It accesses lots of variables, so
> > it obviously needs locking.  E.g., its very first access is
> > time_maxerror += MAXFREQ / 1000.  This has the usual non-atomicity for
> > a read-modify write.  This obiously races with the locked access on the
> > 28th linke of sys_ntp_adjtime().  Only the latter is locked now.
> > 
> > > +
> > > +/*
> > > + * When PPS_SYNC is defined, hardpps() function is provided which can
> > > + * be legitimately called from interrupt filters.  Due to this, use
> > > + * spinlock for ntptime state protection, otherwise sleepable mutex is
> > > + * adequate.
> > > + */
> > > +#ifdef PPS_SYNC
> > > +#define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
> > > +#define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
> > > +#else
> > > +#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
> > > +#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
> > > +#endif
> > > +#define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
> > 
> > *ADJ* and *adj* are not good names, since much more than ntp_adj*() needs
> > to be locked.
> Do you have a suggestion for a better name ?
> 
> > 
> > Probably the locking should be shared with kern_tc.c.  tc_windup() still
> > uses only fancy time-domaind/atomic-op locking.  This mostly works, but
> > is quite broken by calling tc_windup() from places other than hardclock().
> > The most problematic other place is from tc_setclock().  That is unlocked,
> > except accidentally by callers with only Giant locking, so it races with
> > the fast interrupt handler.
> > 
> > tc_windup() doesn't need the fancy locking for efficiency -- it only needs
> > it to be consistent with binuptime() and friends.  So it may as well use
> > the same lock as ntp.  Its lock must be around the whole function to
> > protect it from tc_setclock().  Then the same lock works for
> > ntp_update_second() and pps.  Any lock that is only acquired once per
> > second is not worth optimizing.  Hopefully the same for ntp's other locks.
> > ntp syscalls should only be frequent under unusual/malicious loads, so
> > they shouldn't cause much contention with the tc_windup() lock.
> So I postponed looking at tc_windup(), and it seems that it was
> useful. After you note above, it is mostly clear that we can skip some
> tc_windup() calls which conflicts with currently active windup calls.
> E.g., the hardclock update, which is done simultaneously with the
> setclock(), could be just skipped.  But we must not skip the tc_windup()
> call from setclock().
> 
> As an additional, but probably excessive measure, invocation of
> tc_windup() from hardclock could ask the setclock() to re-execute
> tc_windup() after its call is done.
> 
> > > -static int
> > > -ntp_is_time_error(void)
> > > +static bool
> > > +ntp_is_time_error(int tsl)
> > > {
> > > +
> > 
> > Is it worth changing this to make a single variable less volatile?
> Yes, to rely on single-variable read being atomic and avoiding locking
> in callout.
> 
> [Skipped the S64/OPAQUE stuff]
> 
> Patch does the following:
> - change the ntpadj_lock to spinlock.
> - acquire the ntpadj_lock around ntp_update_second().
> - prevent parallel tc_windup() invocations, while not allowing to
>   skip calls coming from kernel top half.
> - move resettodr_lock acquisition in resettodr() later.

I now think that the lock around tc_windup() should have more flavor of
real spinlocks.  Ideally the spinlock_enter() should be taken around it,
but it might be too extreme.  Just a critical section in tc_setclock()
is good enough.

diff --git a/sys/kern/kern_ntptime.c b/sys/kern/kern_ntptime.c
index d352ee7..34b29d9 100644
--- a/sys/kern/kern_ntptime.c
+++ b/sys/kern/kern_ntptime.c
@@ -163,27 +163,10 @@ static l_fp time_adj;			/* tick adjust (ns/s) */
 static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
 
 static struct mtx ntpadj_lock;
-MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
-#ifdef PPS_SYNC
-    MTX_SPIN
-#else
-    MTX_DEF
-#endif
-);
+MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj", MTX_SPIN);
 
-/*
- * When PPS_SYNC is defined, hardpps() function is provided which can
- * be legitimately called from interrupt filters.  Due to this, use
- * spinlock for ntptime state protection, otherwise sleepable mutex is
- * adequate.
- */
-#ifdef PPS_SYNC
 #define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
 #define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
-#else
-#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
-#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
-#endif
 #define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
 
 #ifdef PPS_SYNC
@@ -506,6 +489,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	int tickrate;
 	l_fp ftemp;		/* 32/64-bit temporary */
 
+	NTPADJ_LOCK();
+
 	/*
 	 * On rollover of the second both the nanosecond and microsecond
 	 * clocks are updated and the state machine cranked as
@@ -627,6 +612,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	else
 		time_status &= ~STA_PPSSIGNAL;
 #endif /* PPS_SYNC */
+
+	NTPADJ_UNLOCK();
 }
 
 /*
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 0f015b3..a5776d2 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -135,7 +135,8 @@ SYSCTL_PROC(_kern_timecounter, OID_AUTO, alloweddeviation,
 
 static int tc_chosen;	/* Non-zero if a specific tc was chosen via sysctl. */
 
-static void tc_windup(void);
+static void tc_windup(bool top_call);
+static void tc_windup_locked(void);
 static void cpu_tick_calibrate(int);
 
 void dtrace_getnanotime(struct timespec *tsp);
@@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts)
 	struct timespec tbef, taft;
 	struct bintime bt, bt2;
 
+	critical_enter();
 	cpu_tick_calibrate(1);
 	nanotime(&tbef);
 	timespec2bintime(ts, &bt);
@@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts)
 	bintime2timeval(&bt, &boottime);
 
 	/* XXX fiddle all the little crinkly bits around the fiords... */
-	tc_windup();
+	tc_windup(true);
 	nanotime(&taft);
+	cpu_tick_calibrate(1);
+	critical_exit();
 	if (timestepwarnings) {
 		log(LOG_INFO,
 		    "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n",
@@ -1256,7 +1260,23 @@ tc_setclock(struct timespec *ts)
 		    (intmax_t)taft.tv_sec, taft.tv_nsec,
 		    (intmax_t)ts->tv_sec, ts->tv_nsec);
 	}
-	cpu_tick_calibrate(1);
+}
+
+static volatile int tc_windup_lock;
+static void
+tc_windup(bool top_call)
+{
+
+	for (;;) {
+		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
+			atomic_thread_fence_acq();
+			tc_windup_locked();
+			atomic_store_rel_int(&tc_windup_lock, 0);
+			break;
+		} else if (!top_call) {
+			break;
+		}
+	}
 }
 
 /*
@@ -1265,7 +1285,7 @@ tc_setclock(struct timespec *ts)
  * timecounter and/or do seconds processing in NTP.  Slightly magic.
  */
 static void
-tc_windup(void)
+tc_windup_locked(void)
 {
 	struct bintime bt;
 	struct timehands *th, *tho;
@@ -1846,7 +1866,7 @@ tc_ticktock(int cnt)
 	if (count < tc_tick)
 		return;
 	count = 0;
-	tc_windup();
+	tc_windup(false);
 }
 
 static void __inline
@@ -1921,7 +1941,7 @@ inittimecounter(void *dummy)
 	/* warm up new timecounter (again) and get rolling. */
 	(void)timecounter->tc_get_timecount(timecounter);
 	(void)timecounter->tc_get_timecount(timecounter);
-	tc_windup();
+	tc_windup(true);
 }
 
 SYSINIT(timecounter, SI_SUB_CLOCKS, SI_ORDER_SECOND, inittimecounter, NULL);
diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c
index 148da2b..82710f7 100644
--- a/sys/kern/kern_time.c
+++ b/sys/kern/kern_time.c
@@ -115,9 +115,7 @@ settime(struct thread *td, struct timeval *tv)
 	struct timeval delta, tv1, tv2;
 	static struct timeval maxtime, laststep;
 	struct timespec ts;
-	int s;
 
-	s = splclock();
 	microtime(&tv1);
 	delta = *tv;
 	timevalsub(&delta, &tv1);
@@ -147,10 +145,8 @@ settime(struct thread *td, struct timeval *tv)
 				printf("Time adjustment clamped to -1 second\n");
 			}
 		} else {
-			if (tv1.tv_sec == laststep.tv_sec) {
-				splx(s);
+			if (tv1.tv_sec == laststep.tv_sec)
 				return (EPERM);
-			}
 			if (delta.tv_sec > 1) {
 				tv->tv_sec = tv1.tv_sec + 1;
 				printf("Time adjustment clamped to +1 second\n");
@@ -161,10 +157,8 @@ settime(struct thread *td, struct timeval *tv)
 
 	ts.tv_sec = tv->tv_sec;
 	ts.tv_nsec = tv->tv_usec * 1000;
-	mtx_lock(&Giant);
 	tc_setclock(&ts);
 	resettodr();
-	mtx_unlock(&Giant);
 	return (0);
 }
 
diff --git a/sys/kern/subr_rtc.c b/sys/kern/subr_rtc.c
index dbad36d..4bac324 100644
--- a/sys/kern/subr_rtc.c
+++ b/sys/kern/subr_rtc.c
@@ -172,11 +172,11 @@ resettodr(void)
 	if (disable_rtc_set || clock_dev == NULL)
 		return;
 
-	mtx_lock(&resettodr_lock);
 	getnanotime(&ts);
 	timespecadd(&ts, &clock_adj);
 	ts.tv_sec -= utc_offset();
 	/* XXX: We should really set all registered RTCs */
+	mtx_lock(&resettodr_lock);
 	error = CLOCK_SETTIME(clock_dev, &ts);
 	mtx_unlock(&resettodr_lock);
 	if (error != 0)