From owner-svn-src-head@freebsd.org Wed Jun 29 21:20:08 2016
Date: Thu, 30 Jun 2016 00:19:53 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r302252 - head/sys/kern
Message-ID: <20160629211953.GK38613@kib.kiev.ua>
References: <201606281643.u5SGhNsi061606@repo.freebsd.org>
 <20160629175917.O968@besplex.bde.org> <20160629145443.GG38613@kib.kiev.ua>
 <20160629153233.GI38613@kib.kiev.ua> <20160630040123.F791@besplex.bde.org>
In-Reply-To: <20160630040123.F791@besplex.bde.org>

On Thu, Jun 30, 2016 at 05:47:39AM +1000, Bruce Evans wrote:
> On Wed, 29 Jun 2016, Konstantin Belousov wrote:
> 
> > On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
> >> This is a reply to two mails in one, both for the r302251 and r302252
> >> threads.
> 
> After removing the foot-shooting tz_minuteswest variable from utc_offset(),
> all of the variables in it are ones set by adjkerntz sysctls.  The settings
> are normally ordered by not running more than 1 instance of adjkerntz.
> When adjkerntz starts, this instance shouldn't have any contention with
> other initialization doing resettodr().  There is contention every 6
> months, if wall_cmos_clock is set, when adjkerntz changes the offset for
> daylight savings time.  The chances of this used to be tiny since
> resettodr() was almost never called, but now there is the periodic
> resettodr() in kern_ntptime.c.

I do not see how your text above changes anything I said about
consistency.  Each sysctl sets or fetches an integer variable, so
individually each access is atomic.  Inter-dependencies between the
wall_cmos_clock and adjkerntz settings are completely up to the user;
no locking can change that while we keep separate sysctls to set and
fetch each variable individually.

The only side-effects in resettodr() are possible in CLOCK_SETTIME(),
which is protected against parallel invocation by resettodr_lock.

> >>> ...
> >>> The change in the other part of the diff is related to this, and doesn't
> >>> seem quite right.  utc_offset() uses the variables tz_minuteswest,
> >>> wall_cmos_clock and adjkerntz:
> >>> - tz_minuteswest is set with no locking in kern_settimeofday(), just
> >>>   after the settime() call.  This last had a chance of being correct
> >>>   when the kernel was UP and not preemptible.  But it is broken in
> >>>   a more fundamental way back to at least FreeBSD-1: resettodr() is
> >>>   called _before_ updating the timezone, so it uses an out-of-date
> >>>   tz_minuteswest.
> >> First, the man page is not correct, saying that 'this information is
> >> kept outside the kernel'.
> 
> We deprecate the timezone support in settimeofday(), but use wrong wording
> for this here.  The info set by this sysctl is of course kept in the kernel,
> and I think the man page is trying to say that this info is never used
> outside of the kernel.

How could the latter statement be true, when that information is
returned by gettimeofday(), and we do not control what programs a user
might run on his machine?

> The man page doesn't say enough about deprecation, or that we only use this
> info for foot-shooting in the kernel, only (?) in utc_offset().  utc_offset()
> does:
> 
> 	return (tz_minuteswest * 60 + (wall_cmos_clock ? adjkerntz : 0));
> 
> If !wall_cmos_clock, then the offset should be 0, so any setting of
> tz_minuteswest using the timezone info breaks the offset, and this breaks
> the rtc _not_ being on wall time.
> 
> If wall_cmos_clock, then adjkerntz(8) normally maintains the offset in
> the adjkerntz variable, so again tz_minuteswest must be 0 to work, and
> any setting of it to nonzero using the timezone info breaks the offset.
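To spell out the two cases in code, an illustrative sketch of the
invariants (only the return statement is the real utc_offset() body
quoted above; the rest is scaffolding for the example):

/*
 * Illustrative sketch, not the kernel's actual definition.
 */
static int tz_minuteswest;	/* set via settimeofday(), deprecated */
static int wall_cmos_clock;	/* RTC keeps local (wall) time */
static int adjkerntz;		/* offset maintained by adjkerntz(8) */

static int
utc_offset_sketch(void)
{

	/*
	 * RTC on UTC (!wall_cmos_clock): correct only if
	 * tz_minuteswest == 0, giving offset 0.
	 * RTC on wall time (wall_cmos_clock): correct only if
	 * tz_minuteswest == 0, giving offset adjkerntz.
	 * Any nonzero tz_minuteswest skews the RTC in both cases.
	 */
	return (tz_minuteswest * 60 + (wall_cmos_clock ? adjkerntz : 0));
}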
> >> Second, why does settime(), logically, need the updated timezone?
> >> I understand that resettodr() is called on successful update of
> >> the wall clock.  But shouldn't the need to reset a not-GMT RTC clock
> >> [...]
> 
> The timezone updates just give foot-shooting, but we need an up to
> date adjkerntz variable initially and on every daylight saving change
> in the wall_cmos_clock case.  This is the normal way to keep the rtc
> on wall time after every daylight saving change -- adjkerntz() runs,
> and it should change the adjkerntz variable first and then call
> settimeofday() with a "null" change to get the side effect of calling
> resettodr(), which does a large change of the rtc.  It is impossible
> to make a really null change of the software clock using settimeofday(),
> but I don't know of any better API to do this.  I think the locking that
> you just added is enough -- adjkerntz() just has to arrange a call to
> resettodr() strictly after it updates the variable, and that happens
> almost automatically.

I think that the way it is done is just by setting the machdep.adjkerntz
sysctl, which calls resettodr().  The old wisdom for re-syncing the RTC
to the kernel-maintained time was to run that sysctl from cron, but
Andrey added the callout several years ago.
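For reference, the cron-era resync reduces to rewriting that sysctl from
userland.  A minimal sketch, assuming the i386-style machdep.adjkerntz
handler that calls resettodr() on every write (illustrative only, not
the adjkerntz(8) source):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>

int
main(void)
{
	int adj;
	size_t len = sizeof(adj);

	/* Read the current offset... */
	if (sysctlbyname("machdep.adjkerntz", &adj, &len, NULL, 0) == -1)
		err(1, "get machdep.adjkerntz");
	/* ...and write it back unchanged; the handler calls resettodr(). */
	if (sysctlbyname("machdep.adjkerntz", NULL, NULL, &adj,
	    sizeof(adj)) == -1)
		err(1, "set machdep.adjkerntz");
	return (0);
}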
> >> I do not think that inittodr() actually needs that locking.  This code
> >> is executed at the initialization stage, or at the resume stage.
> 
> Perhaps, but resume runs more often than every 6 months for the daylight
> savings time race, and it isn't clear if anything stops resettodr() running
> concurrently then.
> 
> In my version, something like inittodr() runs 3 seconds after every exit
> from ddb.  Stopping in ddb is a bit like a short suspend-resume -- it
> stops or breaks some timers, and inittodr() is a way to fix up the
> software real time using an unstopped timer.

Let's postpone this.  Maybe a resettodr_lock can be applied to
inittodr() as a whole, and additionally merged with some other lock.
But I think inittodr() is not very important right now.

> >>> *ADJ* and *adj* are not good names, since much more than ntp_adj*() needs
> >>> to be locked.
> >> Do you have a suggestion for a better name?
> 
> Just a general name like ntp_lock for ntp or t_lock if the same lock is
> used for timecounters.  I also prefer not to use macros except for
> development versions, and then name the lock as xxx_mtx, not xxx_lock.
> We depend on special properties of (spin) mutexes.

I want to keep this macroized.  Yes, it is at the development stage
right now.

> >> So I postponed looking at tc_windup(), and it seems that this was
> >> useful.  After your note above, it is mostly clear that we can skip
> >> some tc_windup() calls which conflict with currently active windup
> >> calls.  E.g., the hardclock update, which is done simultaneously with
> >> the setclock(), could be just skipped.  But we must not skip the
> >> tc_windup() call from setclock().
> 
> Several things could be deferred, but I think that would just be more
> complicated.
> 
> >> As an additional, but probably excessive measure, invocation of
> >> tc_windup() from hardclock could ask the setclock() to re-execute
> >> tc_windup() after its call is done.
> 
> I don't quite understand this.  hardclock -> tc_windup() normally doesn't
> have interference from tc_setclock().

In what sense does it not have interference?  tc_windup() may be
executed from hardclock/hardclock_cnt (bottom half) and concurrently
from setclock() (top half).  And this is exactly what we want to avoid,
isn't it?

> tc_setclock() needs locking even more than tc_windup().
> ...
> > I now think that the lock around tc_windup() should have more flavor of
> > real spinlocks.  Ideally the spinlock_enter() should be taken around it,
> > but it might be too extreme.  Just a critical section in tc_setclock()
> > is good enough.
> 
> No, it must have a full lock since it does read-write-modify on critical
> variables.

I talked only about tc_windup().  Parallel invocations of tc_setclock()
should be guarded, indeed.  I added a private mutex for it.

> > ...
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 0f015b3..a5776d2 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > ...
> > @@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts)
> >  	struct timespec tbef, taft;
> >  	struct bintime bt, bt2;
> > 
> > +	critical_enter();
> >  	cpu_tick_calibrate(1);
> >  	nanotime(&tbef);
> >  	timespec2bintime(ts, &bt);
> > @@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts)
> >  	bintime2timeval(&bt, &boottime);
> 
> boottime is one of the critical variables.  Before this, there are 2
> accesses to boottimebin.  The Giant locking that you just removed gave
> only some protection here.  It prevented concurrent access by
> malicious/stress-testing code doing settime().  It didn't prevent
> nanotime() and friends seeing boottime in an in-between state.  Oops,
> mutex locking won't fix that either.  I think the update of these
> variables needs to be moved into tc_windup() where it will hopefully
> automatically be protected by the generation count, like the timehands
> offsets already are.

I do not understand how that would help.  Do you mean moving the vars
into the current timehands?
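For context, the generation protection referred to is the timehands
read/update protocol of kern_tc.c.  A simplified sketch of the idea
(one field shown, memory-ordering details elided; not the actual kernel
source):

#include <sys/types.h>
#include <stdint.h>

struct th_sketch {
	uint64_t	th_offset;	/* stand-in for the bintime offset */
	volatile u_int	th_generation;	/* 0 means update in progress */
};

static struct th_sketch *th_current;

/* Reader: retry until the same non-zero generation brackets the copy. */
static uint64_t
th_read(void)
{
	struct th_sketch *th;
	uint64_t off;
	u_int gen;

	do {
		th = th_current;
		gen = th->th_generation;
		off = th->th_offset;
	} while (gen == 0 || gen != th->th_generation);
	return (off);
}

/* Writer (tc_windup): zero the generation around the update. */
static void
th_write(struct th_sketch *th, uint64_t off)
{
	u_int ogen;

	ogen = th->th_generation;
	th->th_generation = 0;
	th->th_offset = off;
	if (++ogen == 0)	/* 0 is reserved for "in progress" */
		ogen = 1;
	th->th_generation = ogen;
}

Moving boottimebin into the timehands would let its readers ride on the
same retry loop.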
> boottime is also accessed with no locking in the kern.boottime sysctl.
> 
> There is also ffclock_boottime.  I don't like the duplication for
> ffclock, but it doesn't have much duplication for initializing
> ffclock_boottime.
> 
> > 
> >  	/* XXX fiddle all the little crinkly bits around the fiords... */
> > -	tc_windup();
> > +	tc_windup(true);
> >  	nanotime(&taft);
> > +	cpu_tick_calibrate(1);
> 
> The 2 calls to cpu_tick_calibrate() in this function are excessive.  These
> calls are just a hack to recalibrate after suspend/resume as a side effect
> of calling here.  The first call suffices, since cpu ticks have nothing to
> do with any time variable that is set here.

Ok.

> > +static volatile int tc_windup_lock;
> > +static void
> > +tc_windup(bool top_call)
> > +{
> > +
> > +	for (;;) {
> > +		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
> > +			atomic_thread_fence_acq();
> > +			tc_windup_locked();
> > +			atomic_store_rel_int(&tc_windup_lock, 0);
> > +			break;
> > +		} else if (!top_call) {
> > +			break;
> > +		}
> > +	}
> > }
> 
> I like to write optimized locking like that, but don't see why you
> want it here.

Because I do not want the fast clock interrupt handler to wait on a
lock, which necessarily becomes a spinlock.  The ntp_lock is bad enough
already.  What I coded above is not a real lock.  In the contention
case it bails out, so there is no blocking.  In particular, witness is
not relevant to it.
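For readers more used to standard atomics, the gate above is roughly
the following in C11 (an illustrative translation with made-up names,
not the committed code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int windup_gate;

static void
windup_work(void)
{
	/* Stands in for tc_windup_locked(). */
}

static void
windup(bool top_call)
{
	int exp;

	for (;;) {
		exp = 0;
		if (atomic_compare_exchange_strong_explicit(&windup_gate,
		    &exp, 1, memory_order_acquire, memory_order_relaxed)) {
			windup_work();
			atomic_store_explicit(&windup_gate, 0,
			    memory_order_release);
			break;
		} else if (!top_call) {
			/* Contended from the tick path: just skip. */
			break;
		}
	}
}

The top half (tc_setclock()) spins until it gets the gate; the tick
path skips one windup instead of blocking.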
diff --git a/sys/kern/kern_ntptime.c b/sys/kern/kern_ntptime.c
index d352ee7..efc3713 100644
--- a/sys/kern/kern_ntptime.c
+++ b/sys/kern/kern_ntptime.c
@@ -162,29 +162,12 @@ static l_fp time_adj;			/* tick adjust (ns/s) */
 
 static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
 
-static struct mtx ntpadj_lock;
-MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
-#ifdef PPS_SYNC
-    MTX_SPIN
-#else
-    MTX_DEF
-#endif
-);
+static struct mtx ntp_lock;
+MTX_SYSINIT(ntp, &ntp_lock, "ntp", MTX_SPIN);
 
-/*
- * When PPS_SYNC is defined, hardpps() function is provided which can
- * be legitimately called from interrupt filters.  Due to this, use
- * spinlock for ntptime state protection, otherwise sleepable mutex is
- * adequate.
- */
-#ifdef PPS_SYNC
-#define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
-#define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
-#else
-#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
-#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
-#endif
-#define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
+#define	NTP_LOCK()		mtx_lock_spin(&ntp_lock)
+#define	NTP_UNLOCK()		mtx_unlock_spin(&ntp_lock)
+#define	NTP_ASSERT_LOCKED()	mtx_assert(&ntp_lock, MA_OWNED)
 
 #ifdef PPS_SYNC
 /*
@@ -271,7 +254,7 @@ ntp_gettime1(struct ntptimeval *ntvp)
 {
 	struct timespec atv;	/* nanosecond time */
 
-	NTPADJ_ASSERT_LOCKED();
+	NTP_ASSERT_LOCKED();
 
 	nanotime(&atv);
 	ntvp->time.tv_sec = atv.tv_sec;
@@ -302,9 +285,9 @@ sys_ntp_gettime(struct thread *td, struct ntp_gettime_args *uap)
 {
 	struct ntptimeval ntv;
 
-	NTPADJ_LOCK();
+	NTP_LOCK();
 	ntp_gettime1(&ntv);
-	NTPADJ_UNLOCK();
+	NTP_UNLOCK();
 
 	td->td_retval[0] = ntv.time_state;
 	return (copyout(&ntv, uap->ntvp, sizeof(ntv)));
@@ -315,9 +298,9 @@ ntp_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	struct ntptimeval ntv;	/* temporary structure */
 
-	NTPADJ_LOCK();
+	NTP_LOCK();
 	ntp_gettime1(&ntv);
-	NTPADJ_UNLOCK();
+	NTP_UNLOCK();
 
 	return (sysctl_handle_opaque(oidp, &ntv, sizeof(ntv), req));
 }
@@ -382,7 +365,7 @@ sys_ntp_adjtime(struct thread *td, struct ntp_adjtime_args *uap)
 	error = priv_check(td, PRIV_NTP_ADJTIME);
 	if (error != 0)
 		return (error);
-	NTPADJ_LOCK();
+	NTP_LOCK();
 	if (modes & MOD_MAXERROR)
 		time_maxerror = ntv.maxerror;
 	if (modes & MOD_ESTERROR)
@@ -484,7 +467,7 @@ sys_ntp_adjtime(struct thread *td, struct ntp_adjtime_args *uap)
 	ntv.stbcnt = pps_stbcnt;
 #endif /* PPS_SYNC */
 	retval = ntp_is_time_error(time_status) ? TIME_ERROR : time_state;
-	NTPADJ_UNLOCK();
+	NTP_UNLOCK();
 
 	error = copyout((caddr_t)&ntv, (caddr_t)uap->tp, sizeof(ntv));
 	if (error == 0)
@@ -506,6 +489,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	int tickrate;
 	l_fp ftemp;		/* 32/64-bit temporary */
 
+	NTP_LOCK();
+
 	/*
 	 * On rollover of the second both the nanosecond and microsecond
 	 * clocks are updated and the state machine cranked as
@@ -627,6 +612,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	else
 		time_status &= ~STA_PPSSIGNAL;
 #endif /* PPS_SYNC */
+
+	NTP_UNLOCK();
 }
 
 /*
@@ -690,7 +677,7 @@ hardupdate(offset)
 	long mtemp;
 	l_fp ftemp;
 
-	NTPADJ_ASSERT_LOCKED();
+	NTP_ASSERT_LOCKED();
 
 	/*
 	 * Select how the phase is to be controlled and from which
@@ -772,7 +759,7 @@ hardpps(tsp, nsec)
 	long u_sec, u_nsec, v_nsec;	/* temps */
 	l_fp ftemp;
 
-	NTPADJ_LOCK();
+	NTP_LOCK();
 
 	/*
 	 * The signal is first processed by a range gate and frequency
@@ -956,7 +943,7 @@ hardpps(tsp, nsec)
 	time_freq = pps_freq;
 
 out:
-	NTPADJ_UNLOCK();
+	NTP_UNLOCK();
 }
 #endif /* PPS_SYNC */
 
@@ -999,11 +986,11 @@ kern_adjtime(struct thread *td, struct timeval *delta, struct timeval *olddelta)
 			return (error);
 		ltw = (int64_t)delta->tv_sec * 1000000 + delta->tv_usec;
 	}
-	NTPADJ_LOCK();
+	NTP_LOCK();
 	ltr = time_adjtime;
 	if (delta != NULL)
 		time_adjtime = ltw;
-	NTPADJ_UNLOCK();
+	NTP_UNLOCK();
 	if (olddelta != NULL) {
 		atv.tv_sec = ltr / 1000000;
 		atv.tv_usec = ltr % 1000000;
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 0f015b3..7bb767c 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -135,7 +135,8 @@ SYSCTL_PROC(_kern_timecounter, OID_AUTO, alloweddeviation,
 
 static int tc_chosen;	/* Non-zero if a specific tc was chosen via sysctl. */
 
-static void tc_windup(void);
+static void tc_windup(bool top_call);
+static void tc_windup_locked(void);
 static void cpu_tick_calibrate(int);
 
 void dtrace_getnanotime(struct timespec *tsp);
@@ -1226,10 +1227,12 @@ tc_getfrequency(void)
 	return (timehands->th_counter->tc_frequency);
 }
 
+static struct mtx tc_setclock_mtx;
+MTX_SYSINIT(tc_setclock_init, &tc_setclock_mtx, "tcsetc", MTX_DEF);
+
 /*
  * Step our concept of UTC.  This is done by modifying our estimate of
  * when we booted.
- * XXX: not locked.
 */
 void
 tc_setclock(struct timespec *ts)
@@ -1237,6 +1240,8 @@ tc_setclock(struct timespec *ts)
 	struct timespec tbef, taft;
 	struct bintime bt, bt2;
 
+	mtx_lock(&tc_setclock_mtx);
+	critical_enter();
 	cpu_tick_calibrate(1);
 	nanotime(&tbef);
 	timespec2bintime(ts, &bt);
@@ -1247,8 +1252,10 @@ tc_setclock(struct timespec *ts)
 	bintime2timeval(&bt, &boottime);
 
 	/* XXX fiddle all the little crinkly bits around the fiords... */
-	tc_windup();
+	tc_windup(true);
 	nanotime(&taft);
+	critical_exit();
+	mtx_unlock(&tc_setclock_mtx);
 	if (timestepwarnings) {
 		log(LOG_INFO,
 		    "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n",
@@ -1256,7 +1263,23 @@ tc_setclock(struct timespec *ts)
 		    (intmax_t)tbef.tv_sec, tbef.tv_nsec,
 		    (intmax_t)taft.tv_sec, taft.tv_nsec,
 		    (intmax_t)ts->tv_sec, ts->tv_nsec);
 	}
-	cpu_tick_calibrate(1);
+}
+
+static volatile int tc_windup_lock;
+static void
+tc_windup(bool top_call)
+{
+
+	for (;;) {
+		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
+			atomic_thread_fence_acq();
+			tc_windup_locked();
+			atomic_store_rel_int(&tc_windup_lock, 0);
+			break;
+		} else if (!top_call) {
+			break;
+		}
+	}
 }
 
 /*
@@ -1265,7 +1288,7 @@ tc_setclock(struct timespec *ts)
  * timecounter and/or do seconds processing in NTP.  Slightly magic.
 */
 static void
-tc_windup(void)
+tc_windup_locked(void)
 {
 	struct bintime bt;
 	struct timehands *th, *tho;
@@ -1846,7 +1869,7 @@ tc_ticktock(int cnt)
 	if (count < tc_tick)
 		return;
 	count = 0;
-	tc_windup();
+	tc_windup(false);
 }
 
 static void __inline
@@ -1921,7 +1944,7 @@ inittimecounter(void *dummy)
 	/* warm up new timecounter (again) and get rolling. */
 	(void)timecounter->tc_get_timecount(timecounter);
 	(void)timecounter->tc_get_timecount(timecounter);
-	tc_windup();
+	tc_windup(true);
 }
 
 SYSINIT(timecounter, SI_SUB_CLOCKS, SI_ORDER_SECOND, inittimecounter, NULL);
diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c
index 148da2b..82710f7 100644
--- a/sys/kern/kern_time.c
+++ b/sys/kern/kern_time.c
@@ -115,9 +115,7 @@ settime(struct thread *td, struct timeval *tv)
 	struct timeval delta, tv1, tv2;
 	static struct timeval maxtime, laststep;
 	struct timespec ts;
-	int s;
 
-	s = splclock();
 	microtime(&tv1);
 	delta = *tv;
 	timevalsub(&delta, &tv1);
@@ -147,10 +145,8 @@ settime(struct thread *td, struct timeval *tv)
 				printf("Time adjustment clamped to -1 second\n");
 			}
 		} else {
-			if (tv1.tv_sec == laststep.tv_sec) {
-				splx(s);
+			if (tv1.tv_sec == laststep.tv_sec)
 				return (EPERM);
-			}
 			if (delta.tv_sec > 1) {
 				tv->tv_sec = tv1.tv_sec + 1;
 				printf("Time adjustment clamped to +1 second\n");
@@ -161,10 +157,8 @@ settime(struct thread *td, struct timeval *tv)
 
 	ts.tv_sec = tv->tv_sec;
 	ts.tv_nsec = tv->tv_usec * 1000;
-	mtx_lock(&Giant);
 	tc_setclock(&ts);
 	resettodr();
-	mtx_unlock(&Giant);
 	return (0);
 }
 
diff --git a/sys/kern/subr_rtc.c b/sys/kern/subr_rtc.c
index dbad36d..4bac324 100644
--- a/sys/kern/subr_rtc.c
+++ b/sys/kern/subr_rtc.c
@@ -172,11 +172,11 @@ resettodr(void)
 
 	if (disable_rtc_set || clock_dev == NULL)
 		return;
-	mtx_lock(&resettodr_lock);
 	getnanotime(&ts);
 	timespecadd(&ts, &clock_adj);
 	ts.tv_sec -= utc_offset();
 	/* XXX: We should really set all registered RTCs */
+	mtx_lock(&resettodr_lock);
 	error = CLOCK_SETTIME(clock_dev, &ts);
 	mtx_unlock(&resettodr_lock);
 	if (error != 0)