Date: Thu, 21 Aug 2014 06:22:55 +1000 (EST)
From: Bruce Evans
To: John Baldwin
Cc: Mateusz Guzik, Robert Watson, Johan Schuijt, freebsd-arch@freebsd.org,
    Konstantin Belousov
Subject: Re: [PATCH 0/2] plug capability races
Message-ID: <20140821044234.H11472@besplex.bde.org>
In-Reply-To: <201408201111.47601.jhb@freebsd.org>

On Wed, 20 Aug 2014, John Baldwin wrote:

> On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote:
>> On Fri, 15 Aug 2014, John Baldwin wrote:
>>
>>> One thing I would like to see is for the timecounter code to be
>>> adapted to use the seq API instead of doing it by hand (the
>>> timecounter code is also missing barriers due to doing it by hand).
>>
>> Locking in the timecounter code is poor (1), but I fear a general
>> mechanism would be slower.  Also, the timecounter code now extends
>> into userland, so purely kernel locking cannot work for it.  The
>> userland part is more careful about locking than the kernel.  It has
>> memory barriers and other pessimizations which were intentionally
>> left out of the kernel locking for timecounters.  If these barriers
>> are actually necessary, then they give the silly situation that
>> there are fewer races for userland timecounting than for kernel
>> timecounting, provided userland mostly does direct accesses instead
>> of syscalls and kernel uses of timecounters are infrequent enough
>> not to race often with the userland accesses.
>
> Yes, the userland code is more correct here.  The barriers are indeed
> missing in the kernel part, and adding them should give something
> equivalent to a correctly working seq API, as it is doing the same
> thing.

Userland is technically correct, but this defeats the point of the
intended algorithm.

I now remember a bit more about the algorithm.  There are several
generations of timehands.  Each generation remains stable for several
clock ticks, which is tens of milliseconds at 100 Hz.
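For readers without kern_tc.c open, the arrangement is roughly the
following sketch.  The names are from the real code, but the real
struct has many more members and the real ring is built from about ten
statically linked elements, so take this as a model, not as the actual
declarations.

	#include <sys/types.h>
	#include <sys/time.h>		/* struct bintime, struct timeval */

	struct timehands {
		u_int64_t	th_scale;	/* counter->bintime multiplier */
		struct bintime	th_offset;	/* uptime at the last windup */
		struct timeval	th_microtime;	/* wall time at the last windup */
		volatile u_int	th_generation;	/* bumped per update; 0 = in flux */
		struct timehands *th_next;	/* next element of the ring */
	};

	/* A ring of generations; assume the links are set up at boot. */
	static struct timehands ths[10];
	static struct timehands *volatile timehands = &ths[0];

tc_windup() fills in the element after the current one and then
advances the timehands pointer to it, once per tick, so a published
element is not rewritten until the pointer has gone all the way around
the ring.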
Normally there is no problem with just using the old pointer read from
timehands (except there is no serialization for updating timehands
itself (*)).  However, the thread might be preempted for several clock
ticks.  That is enough time for the old generation to change.  The
generation count is used to detect such changes.  Again, it doesn't
matter if the generation count is out of date, unless it is out of
date by a few generations.  So the algorithm works unless the CPU
de-serializes things by more than a few clock ticks.  I think no real
CPUs do that.  Virtual CPUs can do that, but I think they aren't a
problem in practice.  Single stepping in ddb gives a sort of virtual
CPU and breaks the algorithm, since time runs much faster outside the
stepped process and may advance several generations per step.  The
generation count protects against using a changed timehands, but may
cause binuptime() to never terminate instead.  It takes much weirder
virtualization than that to break the generation count itself.  Any
normal preemption or abnormal stopping of CPUs uses locks galore,
which synchronize everything on at least x86.

Variable-tick kernels give another problem.  They sometimes issue
virtual clock interrupts to catch up.  I think they take some care
with tc_windup(), but perhaps not enough.  tc_windup() calls must be
separated so that the timehands don't cycle too fast or too slow, in
either real time or time related to other system operation (there are
hard real time requirements, mainly for reading real hardware
timecounters before they overflow).

(*):

% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
%
% 	do {
% 		th = timehands;

Since tc_windup() doesn't dream of memory ordering either, the
timehands pointer read here may be in the future of the data it points
to.  That is much worse than it being in the past.  Barriers would be
cheap in tc_windup(), but useless if they require barriers in
binuptime() to work.

tc_windup() is normally called from the clock interrupt handler.
There are several mutexes (or at least atomic ops that give
synchronization on at least x86 SMP) before and after it.  These give
serialization very soon after the changes.  The fix (without adding
any barrier instructions) is easy: simply run the timehands update 1
or 2 generations behind the update of what it points to.  This gives
even more than time-domain locking, since the accidental
synchronization from the interrupt handler gives ordering between the
update of the pointed-to data and the timehands pointer.

% 		gen = th->th_generation;

It doesn't matter if the generation count is in the future, but it
needs to be the one that was written together with the rest of the
timehands, whether that write is in the past or the future.

% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Now the timehands update code:

% 	/*
% 	 * Now that the struct timehands is again consistent, set the new
% 	 * generation number, making sure to not make it zero.
% 	 */

It is only sure to be consistent on in-order CPUs.

% 	if (++ogen == 0)
% 		ogen = 1;
% 	th->th_generation = ogen;
%
% 	/* Go live with the new struct timehands. */
% #ifdef FFCLOCK
% 	switch (sysclock_active) {
% 	case SYSCLOCK_FBCK:
% #endif

I don't like the FFCLOCK complications.  They interact with the
locking bugs a little here.

% 		time_second = th->th_microtime.tv_sec;
% 		time_uptime = th->th_offset.sec;

Old versions had only these 2 statements before setting timehands and
returning.  Those are racy enough, and using these variables is racier
still.  They have type time_t, so they might be 64 bits on 32-bit
arches, and then reading them might not be atomic.  In practice, very
strong time-domain locking applies: the races won't occur until the
top bits start actually being used, a mere 24 years from now, and then
there will be a race window of a few microseconds.  The generation
count should be used to make accesses to these variables technically
correct, and slow.
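Something like the following hypothetical helper would do it.  The
loop is the one from binuptime() above; the function name is made up.

	time_t
	time_second_gen(void)
	{
		struct timehands *th;
		time_t t;
		u_int gen;

		/* Retry until a stable generation brackets the read. */
		do {
			th = timehands;
			gen = th->th_generation;
			t = th->th_microtime.tv_sec;
		} while (gen == 0 || gen != th->th_generation);
		return (t);
	}

Of course it has the same missing-barrier problem as binuptime()
itself until the update side is fixed.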
% #ifdef FFCLOCK
% 		break;
% 	case SYSCLOCK_FFWD:
% 		time_second = fftimehands->tick_time_lerp.sec;
% 		time_uptime = fftimehands->tick_time_lerp.sec - ffclock_boottime.sec;
% 		break;

Perhaps more races, from the more complicated expressions.  Also a
style bug (a long line).

% 	}
% #endif
%
% 	timehands = th;
% 	timekeep_push_vdso();
% }

timekeep_push_vdso() has a couple of atomic stores in it.  Perhaps
these give perfect serialization for the user variables.  On some
arches they accidentally sync the kernel variables a little earlier
than the accidental sync from the interrupt handler, but still out of
order with the kernel variable updates.  Again, this shouldn't be
needed -- use a delayed pointer update for the user variables too.
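To be concrete, the delayed update could be something like the
following sketch, using the struct from the earlier sketch.
tc_fill_timehands() is a made-up name for the part of tc_windup() that
fills in the next ring element and bumps its generation; this
illustrates the idea and is not a patch.

	/* Made-up helper: fill the next ring element, bump its gen. */
	static struct timehands *tc_fill_timehands(void);

	static struct timehands *th_pending;	/* filled, not yet published */

	static void
	tc_windup_sketch(void)
	{
		struct timehands *th;

		/*
		 * Fill in the next ring element now, but only publish
		 * the element filled in on the previous tick.  By now
		 * the accidental serialization around the clock
		 * interrupt handler has made that element's stores
		 * visible to all CPUs, so no barrier instructions are
		 * needed on either the writer or the reader side.
		 */
		th = tc_fill_timehands();
		if (th_pending != NULL)
			timehands = th_pending;
		th_pending = th;
	}

The same trick would serve for the user variables: have
timekeep_push_vdso() push the previous generation's data instead of
this tick's.

Bruce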