From owner-freebsd-arch@FreeBSD.ORG  Sat Aug 16 02:07:58 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 2C931F19;
 Sat, 16 Aug 2014 02:07:58 +0000 (UTC)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42])
 by mx1.freebsd.org (Postfix) with ESMTP id C993B2911;
 Sat, 16 Aug 2014 02:07:57 +0000 (UTC)
Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au
 (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 618043C1C46;
 Sat, 16 Aug 2014 11:35:00 +1000 (EST)
Date: Sat, 16 Aug 2014 11:34:59 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: John Baldwin <jhb@freebsd.org>
Subject: Re: [PATCH 0/2] plug capability races
In-Reply-To: <201408151031.45967.jhb@freebsd.org>
Message-ID: <20140816102840.V1007@besplex.bde.org>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <201408151031.45967.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=fvDlOjIf c=1 sm=1 tr=0
 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10
 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8
 a=vlCwttaAoV0VwkxwwJ0A:9 a=CjuIK1q_8ugA:10
Cc: Robert Watson <rwatson@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>,
 Konstantin Belousov <kib@freebsd.org>, Johan Schuijt <johan@transip.nl>,
 freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 16 Aug 2014 02:07:58 -0000

On Fri, 15 Aug 2014, John Baldwin wrote:

> One thing I would like to see is for the timecounter code to be adapted to use
> the seq API instead of doing it by hand (the timecounter code is also missing
> barriers due to doing it by hand).

Locking in the timecounter code is poor (1), but I fear a general mechanism
would be slower.  Also, the timecounter code now extends into userland,
so purely kernel locking cannot work for it.  The userland part is
more careful about locking than the kernel.  It has memory barriers and
other pessimizations which were intentionally left out of the kernel
locking for timecounters.  If these barriers are actually necessary, then
they give the silly situation that there are less races for userland
timecounting than kernel timecounting provided userland mostly does
direct accesses instead of syscalls and kernel uses of timecounters are
are infrequent enough to not race often with the userland accesses.

(1) The main bugs are:
- various calls to tc_setclock() with only Giant locking or worse.  One
   is in settime().  settime() used to be locked by spclock().  That
   stopped working when SMPng removed spls and replaced them by Giant
   locking, and is more broken now.  Giant locking is inadequate for
   most things related to timing.  Using it in settime() gives the
   problem that settime() might be interrupted for many milliseconds.
   It then sets a time significantly in the past, assuming that it
   started with the correct time.  Of course, it might be preempted
   before it aquires a sufficient lock.  It never tried to recover
   from that.  The extra breakage is that someone removed the splx()
   at the end, so settime() now has an splclock() with no corresponding
   splx().  This is only harmless since splclock() has no effect.
     settime()'s locking in FreeBSD-4, where it was last mostly
     correct, was:

     splclock() at start.  Run almost entirely with that.  But lower
     the spl using splsoftclock() too soon, before the end.  It was
     apparently necessary to call lease_updatetime() at a lower
     priority, but the code was badly ordered so that was done before
     updating the time in the hardware.  In fact, the hardware part
     was done woth no locking at all: from FreeBSD-4:

% 	ts.tv_sec = tv->tv_sec;
% 	ts.tv_nsec = tv->tv_usec * 1000;
% 	set_timecounter(&ts);

     FreeBSD-4 had timecounters, and set_timecounter() corresponds to
     tc_setclock().  The locking here was provided all timecounter
     updates are locked by the same lock (splclock()).  The most
     critical ones were -- timecounter updates are normally done
     from hardclock() and that was locked by splclock().

% 	(void) splsoftclock();
% 	lease_updatetime(delta.tv_sec);

     Reduce priority, but not to 0, to call lease_updatetime().

% 	splx(s);
% 	resettodr();

     Reduce priority to 0 to update the hardware (TODR, not timecounter).
     This is broken.  The lock should be held throughout to increase the
     chance of updating the hardware to a time that is nearly current.
     Worse, most resettodr() implementations had and still have null
     locking except possibly for individual register accesses.  They
     depend on locking from here and other callers.  splclock() locking
     would have been adequate.

% 	return (0);

     SMPng broke all this.  It still doesn't even aquire Giant until near
     the end, leaving parts unlocked:

% static int
% settime(struct thread *td, struct timeval *tv)
% {
% 	struct timeval delta, tv1, tv2;
% 	static struct timeval maxtime, laststep;
% 	struct timespec ts;
% 	int s;
% 
% 	s = splclock();
% 	microtime(&tv1);

     No locking yet, except internally in microtime().

% 	delta = *tv;
% 	timevalsub(&delta, &tv1);
% 
% 	/*
% 	 * If the system is secure, we do not allow the time to be 
% 	 * set to a value earlier than 1 second less than the highest
% 	 * ...
% 	 */
% 	if (securelevel_gt(td->td_ucred, 1) != 0) {

     Race with changes in the setting of securelevel.  Not important to see
     a stale value provided we don't act nonatomically on it.

% 		if (delta.tv_sec < 0 || delta.tv_usec < 0) {
% 			/*
% 			 * Update maxtime to latest time we've seen.
% 			 */
% 			if (tv1.tv_sec > maxtime.tv_sec)
% 				maxtime = tv1;
% 			tv2 = *tv;
% 			timevalsub(&tv2, &maxtime);
% 			if (tv2.tv_sec < -1) {
% 				tv->tv_sec = maxtime.tv_sec - 1;
% 				printf("Time adjustment clamped to -1 second\n");
% 			}

     Race to change our state (maxtime).  Giant locking for this would be
     plenty.

% 		} else {
% 			if (tv1.tv_sec == laststep.tv_sec) {
% 				splx(s);
% 				return (EPERM);

     The null spls were carefully maintained here when this return was added.

% 			}
% 			if (delta.tv_sec > 1) {
% 				tv->tv_sec = tv1.tv_sec + 1;
% 				printf("Time adjustment clamped to +1 second\n");
% 			}
% 			laststep = *tv;
% 		}
% 	}

     Race to change our state (laststep).

% 
% 	ts.tv_sec = tv->tv_sec;
% 	ts.tv_nsec = tv->tv_usec * 1000;
% 	mtx_lock(&Giant);
% 	tc_setclock(&ts);

     Giant locking at last, at a point where it is almost no use.  Giant
     locking is neither necessary or sufficient for tc_setclock().

% 	resettodr();

     The Giant locking extends to resettodr() where it has a chance of
     doing some good iff all other calls to resettodr() do it too.

% 	mtx_unlock(&Giant);
% 	return (0);
% }

     Someone removed lease_updatetime() and the 2 spls near it.  Only one of
     these was related to lease_updatetime().  So splclock() is bracketed
     with splx() for the unusal return above, but not for the main return.

- locking in kern_ntptime.c has similar nonsense involving Giant, but
   worse since there are callbacks from timecounter code into ntp code.
   This file still has a comment saying that everything must use splclock():

%  * Note that all routines must run at priority splclock or higher.

   but it only has 1 splclock() call itself.  It depends on all calls into
   it the be done at splclock() and for splclock() to actually work.  It
   has a sprinkling of Giant locking.  But when timecounter code calls into
   it, it has nothing like Giant locking.  Timecounter code has some time-
   domain locking and perhaps still sched_lock for the usual case where it
   is called from the clock interrupt (settime() bypasses this).  Timecounter
   code calls ntp code mainly for seconds update and pps.  These functions
   provide no additional locking, but access ntptime's most critical state.
   Giant locking would be a gross LOR of course.  The Giant locking for the
   syscalls isn't a LOR, but just cannot work.

   Another interesting comment:

% #ifdef PPS_SYNC
% /*
%  * hardpps() - discipline CPU clock oscillator to external PPS signal
%  * ...
%  * Note that, on some Unix systems this routine runs at an interrupt
%  * priority level higher than the timer interrupt routine hardclock().
%  * Therefore, the variables used are distinct from the hardclock()
%  * variables, except for the actual time and frequency variables, which
%  * are determined by this routine and updated atomically.
%  */

   I don't see any atomic updates.  Certainly no mutexes, atomic ops or
   memory barriers.  I think there is at best stores that are atomic on
   most UP systems combined with time-domain locking.  The time-domain
   locking might even work, as in timecounter code generally, but is
   difficult to understand.  Any variable that is only updated every
   second is unlikely to have a race in the update itself.  Then it can
   be arranged that readers work if they see stale values out of order.
   But ntptime is mostly old UP code that depends on splclock(), so it
   is unlikely to arrange this.

   Only core timecounter code tries to arrange this.  It may be too
   sloppy with memory barriers for the generation count, but this is
   unclear.  Userland doesn't trust this, and sprinkles memory barriers.
   All userland can do to get more problems is to run stupid code that
   reads the timecounter in a loop.  Time-domain locking obviously works
   better if the times between the reads are larger, and the stupid
   loop can hit problems sooner by making more calls.

Bruce