From owner-freebsd-arch@FreeBSD.ORG Sat Aug 16 02:07:58 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2C931F19; Sat, 16 Aug 2014 02:07:58 +0000 (UTC) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id C993B2911; Sat, 16 Aug 2014 02:07:57 +0000 (UTC) Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 618043C1C46; Sat, 16 Aug 2014 11:35:00 +1000 (EST) Date: Sat, 16 Aug 2014 11:34:59 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin Subject: Re: [PATCH 0/2] plug capability races In-Reply-To: <201408151031.45967.jhb@freebsd.org> Message-ID: <20140816102840.V1007@besplex.bde.org> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <201408151031.45967.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=fvDlOjIf c=1 sm=1 tr=0 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=vlCwttaAoV0VwkxwwJ0A:9 a=CjuIK1q_8ugA:10 Cc: Robert Watson , Mateusz Guzik , Konstantin Belousov , Johan Schuijt , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Aug 2014 02:07:58 -0000 On Fri, 15 Aug 2014, John Baldwin wrote: > One thing I would like to see is for the timecounter code to be adapted to use > the seq API instead of doing it by hand (the timecounter code is also missing > barriers due to doing it by hand). Locking in the timecounter code is poor (1), but I fear a general mechanism would be slower. Also, the timecounter code now extends into userland, so purely kernel locking cannot work for it. The userland part is more careful about locking than the kernel. It has memory barriers and other pessimizations which were intentionally left out of the kernel locking for timecounters. If these barriers are actually necessary, then they give the silly situation that there are less races for userland timecounting than kernel timecounting provided userland mostly does direct accesses instead of syscalls and kernel uses of timecounters are are infrequent enough to not race often with the userland accesses. (1) The main bugs are: - various calls to tc_setclock() with only Giant locking or worse. One is in settime(). settime() used to be locked by spclock(). That stopped working when SMPng removed spls and replaced them by Giant locking, and is more broken now. Giant locking is inadequate for most things related to timing. Using it in settime() gives the problem that settime() might be interrupted for many milliseconds. It then sets a time significantly in the past, assuming that it started with the correct time. Of course, it might be preempted before it aquires a sufficient lock. It never tried to recover from that. The extra breakage is that someone removed the splx() at the end, so settime() now has an splclock() with no corresponding splx(). This is only harmless since splclock() has no effect. settime()'s locking in FreeBSD-4, where it was last mostly correct, was: splclock() at start. Run almost entirely with that. But lower the spl using splsoftclock() too soon, before the end. It was apparently necessary to call lease_updatetime() at a lower priority, but the code was badly ordered so that was done before updating the time in the hardware. In fact, the hardware part was done woth no locking at all: from FreeBSD-4: % ts.tv_sec = tv->tv_sec; % ts.tv_nsec = tv->tv_usec * 1000; % set_timecounter(&ts); FreeBSD-4 had timecounters, and set_timecounter() corresponds to tc_setclock(). The locking here was provided all timecounter updates are locked by the same lock (splclock()). The most critical ones were -- timecounter updates are normally done from hardclock() and that was locked by splclock(). % (void) splsoftclock(); % lease_updatetime(delta.tv_sec); Reduce priority, but not to 0, to call lease_updatetime(). % splx(s); % resettodr(); Reduce priority to 0 to update the hardware (TODR, not timecounter). This is broken. The lock should be held throughout to increase the chance of updating the hardware to a time that is nearly current. Worse, most resettodr() implementations had and still have null locking except possibly for individual register accesses. They depend on locking from here and other callers. splclock() locking would have been adequate. % return (0); SMPng broke all this. It still doesn't even aquire Giant until near the end, leaving parts unlocked: % static int % settime(struct thread *td, struct timeval *tv) % { % struct timeval delta, tv1, tv2; % static struct timeval maxtime, laststep; % struct timespec ts; % int s; % % s = splclock(); % microtime(&tv1); No locking yet, except internally in microtime(). % delta = *tv; % timevalsub(&delta, &tv1); % % /* % * If the system is secure, we do not allow the time to be % * set to a value earlier than 1 second less than the highest % * ... % */ % if (securelevel_gt(td->td_ucred, 1) != 0) { Race with changes in the setting of securelevel. Not important to see a stale value provided we don't act nonatomically on it. % if (delta.tv_sec < 0 || delta.tv_usec < 0) { % /* % * Update maxtime to latest time we've seen. % */ % if (tv1.tv_sec > maxtime.tv_sec) % maxtime = tv1; % tv2 = *tv; % timevalsub(&tv2, &maxtime); % if (tv2.tv_sec < -1) { % tv->tv_sec = maxtime.tv_sec - 1; % printf("Time adjustment clamped to -1 second\n"); % } Race to change our state (maxtime). Giant locking for this would be plenty. % } else { % if (tv1.tv_sec == laststep.tv_sec) { % splx(s); % return (EPERM); The null spls were carefully maintained here when this return was added. % } % if (delta.tv_sec > 1) { % tv->tv_sec = tv1.tv_sec + 1; % printf("Time adjustment clamped to +1 second\n"); % } % laststep = *tv; % } % } Race to change our state (laststep). % % ts.tv_sec = tv->tv_sec; % ts.tv_nsec = tv->tv_usec * 1000; % mtx_lock(&Giant); % tc_setclock(&ts); Giant locking at last, at a point where it is almost no use. Giant locking is neither necessary or sufficient for tc_setclock(). % resettodr(); The Giant locking extends to resettodr() where it has a chance of doing some good iff all other calls to resettodr() do it too. % mtx_unlock(&Giant); % return (0); % } Someone removed lease_updatetime() and the 2 spls near it. Only one of these was related to lease_updatetime(). So splclock() is bracketed with splx() for the unusal return above, but not for the main return. - locking in kern_ntptime.c has similar nonsense involving Giant, but worse since there are callbacks from timecounter code into ntp code. This file still has a comment saying that everything must use splclock(): % * Note that all routines must run at priority splclock or higher. but it only has 1 splclock() call itself. It depends on all calls into it the be done at splclock() and for splclock() to actually work. It has a sprinkling of Giant locking. But when timecounter code calls into it, it has nothing like Giant locking. Timecounter code has some time- domain locking and perhaps still sched_lock for the usual case where it is called from the clock interrupt (settime() bypasses this). Timecounter code calls ntp code mainly for seconds update and pps. These functions provide no additional locking, but access ntptime's most critical state. Giant locking would be a gross LOR of course. The Giant locking for the syscalls isn't a LOR, but just cannot work. Another interesting comment: % #ifdef PPS_SYNC % /* % * hardpps() - discipline CPU clock oscillator to external PPS signal % * ... % * Note that, on some Unix systems this routine runs at an interrupt % * priority level higher than the timer interrupt routine hardclock(). % * Therefore, the variables used are distinct from the hardclock() % * variables, except for the actual time and frequency variables, which % * are determined by this routine and updated atomically. % */ I don't see any atomic updates. Certainly no mutexes, atomic ops or memory barriers. I think there is at best stores that are atomic on most UP systems combined with time-domain locking. The time-domain locking might even work, as in timecounter code generally, but is difficult to understand. Any variable that is only updated every second is unlikely to have a race in the update itself. Then it can be arranged that readers work if they see stale values out of order. But ntptime is mostly old UP code that depends on splclock(), so it is unlikely to arrange this. Only core timecounter code tries to arrange this. It may be too sloppy with memory barriers for the generation count, but this is unclear. Userland doesn't trust this, and sprinkles memory barriers. All userland can do to get more problems is to run stupid code that reads the timecounter in a loop. Time-domain locking obviously works better if the times between the reads are larger, and the stupid loop can hit problems sooner by making more calls. Bruce