From owner-svn-src-head@freebsd.org Wed Jun 29 15:32:46 2016 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 028CEB866C2; Wed, 29 Jun 2016 15:32:46 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4901C24C5; Wed, 29 Jun 2016 15:32:45 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u5TFWX28072367 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 29 Jun 2016 18:32:33 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u5TFWX28072367 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u5TFWXPM072366; Wed, 29 Jun 2016 18:32:33 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 29 Jun 2016 18:32:33 +0300 From: Konstantin Belousov To: Bruce Evans Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r302252 - head/sys/kern Message-ID: <20160629153233.GI38613@kib.kiev.ua> References: <201606281643.u5SGhNsi061606@repo.freebsd.org> <20160629175917.O968@besplex.bde.org> <20160629145443.GG38613@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160629145443.GG38613@kib.kiev.ua> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Jun 2016 15:32:46 -0000 On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote: > This is a reply to two mails in one, both for r302251 and r302252 threads. > > On Wed, Jun 29, 2016 at 05:58:05PM +1000, Bruce Evans wrote: > > On Tue, 28 Jun 2016, Konstantin Belousov wrote: > > > > > Log: > > > Do not use Giant to prevent parallel calls to CLOCK_SETTIME(). Use > > > private mtx in resettodr(), no implementation of CLOCK_SETTIME() is > > > allowed to sleep. > > > > Thanks. > > > > As you know, clock locking is still quite broken. The most obvious > > bugs near here are now: > > - settime() still begins with splclock(). This gave fewer bugs in > > clock locking when it last worked in FreeBSD-4. Now it is just > > a hint that the clock locking has still not been converted to use > > mutexes. > > - bitrot near the end of settime() removed the splx() that should > > be paired with the splclock(). > Yes, I am aware of both splclock() and Giant braces in the setclock(). > I did not wanted to even look at tc_setclock() at all, for that change. > See below. > > > - the locking should be stronger than mutexes (or splclock()) to > > minimise the delay between deciding to set the time and actually > > setting it. > ... > This is separate issue. Might be a critical section around the code > is the right approximation for reducing delays there. > > > > + mtx_lock(&resettodr_lock); > > > > This could be later (only around the hardware parts) since we are sloppy > > with the delays. > Done. > > > > > > getnanotime(&ts); > > > > This doesn't even try to get nanoseconds accuracy. > > > > > timespecadd(&ts, &clock_adj); > > > ts.tv_sec -= utc_offset(); > > > > Who knows what utc_offset() is locked by? Certainly not this new locking, > > since it is set in another file but the new locking is only in this file. > > It still seems to be most under Giant :-(. > utc_offset() uses several variables, updates to each of the variable is > atomic. There is no even an attempt to provide userspace with an interface > which would makes updates to the components used by utc_offset() consistent. > Thus I do not see a need in any locking there. > > > > > The change in the other part of the diff is related to this, and doesn't > > seem quite right. utc_offset() uses the variables tz_minuteswest, > > wall_cmos_clock and adjkerntz: > > - tz_minuteswest is set with no locking in kern_settimeofday(), just > > after the settime() call. This last had a chance of being correct > > when the kernel was UP and not preemptible. But it is broken in > > a more fundamental way back to at least FreeBSD-1: resettodr() is > > called _before_ updating the timezone, so it uses an out-of-date > > tz_minuteswest. > First, the man page is not correct, saying that 'this information is > kept outside the kernel'. > > Second, why does settime(), logicall, need the updated timezone ? > I understand that resettodr() is called on successful update of > the wall clock. But shouldn't the need to reset not-GMT RTC clock > in kern_settimeofday() satisfied by yet another call to resettodr() > if tzp != NULL, after the tz_* vars update ? > > > - wall_cmos_clock is still under a simple (not MPSAFE) sysctl > > - adjkerntz is under a PROC sysctl, and you just changed this sysctl to > > be MPSAFE. The sysctl calls here, so we don't want all of it > > giant-locked. But it now does unlocked accesses to the adjkerntz > > variable, and the access to the disable_rtc_set variable is unlocked > > because we do it before acquiring the lock in this function (this > > variable is still under a simple (not MPSAFE) sysctl. > > > > Most of the variables are ints, so access to them are atomic, but without > > locking the correct order only occurs accidentally and is hard to > > understand. > I think there is no any global consistency, mostly caused by existence > of separate interfaces to control all of them. IMO it is not worth to > try to ensure any. > > > > > > /* XXX: We should really set all registered RTCs */ > > > - if ((error = CLOCK_SETTIME(clock_dev, &ts)) != 0) > > > + error = CLOCK_SETTIME(clock_dev, &ts); > > > + mtx_unlock(&resettodr_lock); > > > + if (error != 0) > > > printf("warning: clock_settime failed (%d), time-of-day clock " > > > "not adjusted to system time\n", error); > > > } > > > > The lock needs to cover all high-level accesses to the rtc and related > > variables to avoid races, and a bit more to minimise delays. In particular, > > inittodr() should be under this lock, and there is a problem with > > "show rtc" in ddb. > I do not think that inittodr() actually needs that locking. This code > is executed at the initialization stage, or at the resume stage. > > Do you mean that CLOCK_GETTIME() hardware access should be made exclusive > with CLOCK_SETTIME() ? Is there a hardware that rely on exclusivity and > not enforce that internally, how x86 rtc does ? > > DDB access to rtc is separate problem. > > On Wed, Jun 29, 2016 at 07:39:45PM +1000, Bruce Evans wrote: > > On Tue, 28 Jun 2016, Konstantin Belousov wrote: > > > static int64_t time_adjtime; /* correction from adjtime(2) (usec) */ > > > > > > +static struct mtx ntpadj_lock; > > > +MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj", > > > +#ifdef PPS_SYNC > > > + MTX_SPIN > > > +#else > > > + MTX_DEF > > > +#endif > > > +); > > > > This needs to be a spinlock in all cases, since ntp_update_second() needs > > to be locked and ntp_update_second() is called from tc_windup() is usually > > called from a fast interrupt handler. > Oh, I see. > > > > > ntp_update_second() is still unlocked. It accesses lots of variables, so > > it obviously needs locking. E.g., its very first access is > > time_maxerror += MAXFREQ / 1000. This has the usual non-atomicity for > > a read-modify write. This obiously races with the locked access on the > > 28th linke of sys_ntp_adjtime(). Only the latter is locked now. > > > > > + > > > +/* > > > + * When PPS_SYNC is defined, hardpps() function is provided which can > > > + * be legitimately called from interrupt filters. Due to this, use > > > + * spinlock for ntptime state protection, otherwise sleepable mutex is > > > + * adequate. > > > + */ > > > +#ifdef PPS_SYNC > > > +#define NTPADJ_LOCK() mtx_lock_spin(&ntpadj_lock) > > > +#define NTPADJ_UNLOCK() mtx_unlock_spin(&ntpadj_lock) > > > +#else > > > +#define NTPADJ_LOCK() mtx_lock(&ntpadj_lock) > > > +#define NTPADJ_UNLOCK() mtx_unlock(&ntpadj_lock) > > > +#endif > > > +#define NTPADJ_ASSERT_LOCKED() mtx_assert(&ntpadj_lock, MA_OWNED) > > > > *ADJ* and *adj* are not good names, since much more than ntp_adj*() needs > > to be locked. > Do you have a suggestion for a better name ? > > > > > Probably the locking should be shared with kern_tc.c. tc_windup() still > > uses only fancy time-domaind/atomic-op locking. This mostly works, but > > is quite broken by calling tc_windup() from places other than hardclock(). > > The most problematic other place is from tc_setclock(). That is unlocked, > > except accidentally by callers with only Giant locking, so it races with > > the fast interrupt handler. > > > > tc_windup() doesn't need the fancy locking for efficiency -- it only needs > > it to be consistent with binuptime() and friends. So it may as well use > > the same lock as ntp. Its lock must be around the whole function to > > protect it from tc_setclock(). Then the same lock works for > > ntp_update_second() and pps. Any lock that is only acquired once per > > second is not worth optimizing. Hopefully the same for ntp's other locks. > > ntp syscalls should only be frequent under unusual/malicious loads, so > > they shouldn't cause much contention with the tc_windup() lock. > So I postponed looking at tc_windup(), and it seems that it was > useful. After you note above, it is mostly clear that we can skip some > tc_windup() calls which conflicts with currently active windup calls. > E.g., the hardclock update, which is done simultaneously with the > setclock(), could be just skipped. But we must not skip the tc_windup() > call from setclock(). > > As an additional, but probably excessive measure, invocation of > tc_windup() from hardclock could ask the setclock() to re-execute > tc_windup() after its call is done. > > > > -static int > > > -ntp_is_time_error(void) > > > +static bool > > > +ntp_is_time_error(int tsl) > > > { > > > + > > > > Is it worth changing this to make a single variable less volatile? > Yes, to rely on single-variable read being atomic and avoiding locking > in callout. > > [Skipped the S64/OPAQUE stuff] > > Patch does the following: > - change the ntpadj_lock to spinlock. > - acquire the ntpadj_lock around ntp_update_second(). > - prevent parallel tc_windup() invocations, while not allowing to > skip calls coming from kernel top half. > - move resettodr_lock acquisition in resettodr() later. I now think that the lock around tc_windup() should have more flavor of real spinlocks. Ideally the spinlock_enter() should be taken around it, but it might be too extreme. Just a critical section in tc_setclock() is good enough. diff --git a/sys/kern/kern_ntptime.c b/sys/kern/kern_ntptime.c index d352ee7..34b29d9 100644 --- a/sys/kern/kern_ntptime.c +++ b/sys/kern/kern_ntptime.c @@ -163,27 +163,10 @@ static l_fp time_adj; /* tick adjust (ns/s) */ static int64_t time_adjtime; /* correction from adjtime(2) (usec) */ static struct mtx ntpadj_lock; -MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj", -#ifdef PPS_SYNC - MTX_SPIN -#else - MTX_DEF -#endif -); +MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj", MTX_SPIN); -/* - * When PPS_SYNC is defined, hardpps() function is provided which can - * be legitimately called from interrupt filters. Due to this, use - * spinlock for ntptime state protection, otherwise sleepable mutex is - * adequate. - */ -#ifdef PPS_SYNC #define NTPADJ_LOCK() mtx_lock_spin(&ntpadj_lock) #define NTPADJ_UNLOCK() mtx_unlock_spin(&ntpadj_lock) -#else -#define NTPADJ_LOCK() mtx_lock(&ntpadj_lock) -#define NTPADJ_UNLOCK() mtx_unlock(&ntpadj_lock) -#endif #define NTPADJ_ASSERT_LOCKED() mtx_assert(&ntpadj_lock, MA_OWNED) #ifdef PPS_SYNC @@ -506,6 +489,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec) int tickrate; l_fp ftemp; /* 32/64-bit temporary */ + NTPADJ_LOCK(); + /* * On rollover of the second both the nanosecond and microsecond * clocks are updated and the state machine cranked as @@ -627,6 +612,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec) else time_status &= ~STA_PPSSIGNAL; #endif /* PPS_SYNC */ + + NTPADJ_UNLOCK(); } /* diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c index 0f015b3..a5776d2 100644 --- a/sys/kern/kern_tc.c +++ b/sys/kern/kern_tc.c @@ -135,7 +135,8 @@ SYSCTL_PROC(_kern_timecounter, OID_AUTO, alloweddeviation, static int tc_chosen; /* Non-zero if a specific tc was chosen via sysctl. */ -static void tc_windup(void); +static void tc_windup(bool top_call); +static void tc_windup_locked(void); static void cpu_tick_calibrate(int); void dtrace_getnanotime(struct timespec *tsp); @@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts) struct timespec tbef, taft; struct bintime bt, bt2; + critical_enter(); cpu_tick_calibrate(1); nanotime(&tbef); timespec2bintime(ts, &bt); @@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts) bintime2timeval(&bt, &boottime); /* XXX fiddle all the little crinkly bits around the fiords... */ - tc_windup(); + tc_windup(true); nanotime(&taft); + cpu_tick_calibrate(1); + critical_exit(); if (timestepwarnings) { log(LOG_INFO, "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n", @@ -1256,7 +1260,23 @@ tc_setclock(struct timespec *ts) (intmax_t)taft.tv_sec, taft.tv_nsec, (intmax_t)ts->tv_sec, ts->tv_nsec); } - cpu_tick_calibrate(1); +} + +static volatile int tc_windup_lock; +static void +tc_windup(bool top_call) +{ + + for (;;) { + if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) { + atomic_thread_fence_acq(); + tc_windup_locked(); + atomic_store_rel_int(&tc_windup_lock, 0); + break; + } else if (!top_call) { + break; + } + } } /* @@ -1265,7 +1285,7 @@ tc_setclock(struct timespec *ts) * timecounter and/or do seconds processing in NTP. Slightly magic. */ static void -tc_windup(void) +tc_windup_locked(void) { struct bintime bt; struct timehands *th, *tho; @@ -1846,7 +1866,7 @@ tc_ticktock(int cnt) if (count < tc_tick) return; count = 0; - tc_windup(); + tc_windup(false); } static void __inline @@ -1921,7 +1941,7 @@ inittimecounter(void *dummy) /* warm up new timecounter (again) and get rolling. */ (void)timecounter->tc_get_timecount(timecounter); (void)timecounter->tc_get_timecount(timecounter); - tc_windup(); + tc_windup(true); } SYSINIT(timecounter, SI_SUB_CLOCKS, SI_ORDER_SECOND, inittimecounter, NULL); diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c index 148da2b..82710f7 100644 --- a/sys/kern/kern_time.c +++ b/sys/kern/kern_time.c @@ -115,9 +115,7 @@ settime(struct thread *td, struct timeval *tv) struct timeval delta, tv1, tv2; static struct timeval maxtime, laststep; struct timespec ts; - int s; - s = splclock(); microtime(&tv1); delta = *tv; timevalsub(&delta, &tv1); @@ -147,10 +145,8 @@ settime(struct thread *td, struct timeval *tv) printf("Time adjustment clamped to -1 second\n"); } } else { - if (tv1.tv_sec == laststep.tv_sec) { - splx(s); + if (tv1.tv_sec == laststep.tv_sec) return (EPERM); - } if (delta.tv_sec > 1) { tv->tv_sec = tv1.tv_sec + 1; printf("Time adjustment clamped to +1 second\n"); @@ -161,10 +157,8 @@ settime(struct thread *td, struct timeval *tv) ts.tv_sec = tv->tv_sec; ts.tv_nsec = tv->tv_usec * 1000; - mtx_lock(&Giant); tc_setclock(&ts); resettodr(); - mtx_unlock(&Giant); return (0); } diff --git a/sys/kern/subr_rtc.c b/sys/kern/subr_rtc.c index dbad36d..4bac324 100644 --- a/sys/kern/subr_rtc.c +++ b/sys/kern/subr_rtc.c @@ -172,11 +172,11 @@ resettodr(void) if (disable_rtc_set || clock_dev == NULL) return; - mtx_lock(&resettodr_lock); getnanotime(&ts); timespecadd(&ts, &clock_adj); ts.tv_sec -= utc_offset(); /* XXX: We should really set all registered RTCs */ + mtx_lock(&resettodr_lock); error = CLOCK_SETTIME(clock_dev, &ts); mtx_unlock(&resettodr_lock); if (error != 0)