From owner-freebsd-arch@FreeBSD.ORG  Thu Jun  7 03:00:44 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 90C21106566C;
	Thu,  7 Jun 2012 03:00:44 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au
	[211.29.132.187])
	by mx1.freebsd.org (Postfix) with ESMTP id 24DD78FC14;
	Thu,  7 Jun 2012 03:00:43 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q5730YqF018640
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 7 Jun 2012 13:00:36 +1000
Date: Thu, 7 Jun 2012 13:00:34 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120606205938.GS85127@deviant.kiev.zoral.com.ua>
Message-ID: <20120607130029.K1962@besplex.bde.org>
References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua>
	<201206061423.53179.jhb@freebsd.org>
	<20120606205938.GS85127@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: Fast gettimeofday(2) and clock_gettime(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 07 Jun 2012 03:00:44 -0000

On Wed, 6 Jun 2012, Konstantin Belousov wrote:

> On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
>> In general this looks good but I see a few nits / races:
>>
>> 1) You don't follow the model of clearing tk_current to 0 while you
>>    are updating the structure that the in-kernel timecounter code
>>    uses.  This also means you have to avoid using a tk_current of 0
>>    and that userland has to keep spinning as long as tk_current is 0.
>>    Without this I believe userland can read a partially updated
>>    structure.
> I changed the code to be much more similar to the kern_tc.c. I (re)added
> the generation field, which is set to 0 upon kernel touching timehands.

Seems necessary.

> I think this can only happen if tc_windups occurs quite close in
> succession, or usermode thread is suspended for long enough. BTW,
> even generation could loop back to the previous value if thread is
> stopped.

tc_windup()'s close in succession are bugs, since they cycle the timehands
faster than they were designed to be.  We already have too many of these
bugs (where tc_setclock() calls tc_windup().  I didn't notice this
particular problem with it before).  Now I will point out that version
2 of your patch adds more of these calls, apparently to get changes to
happen sooner.  But in sysctl_kern_timecounter_hardware(), such a call
was intentionaly left out since it is not needed.  Note that tc_tick
prevents calls to tc_windup() more often than about once per msec if
hz > 1000.

The generation count makes tc_windup()s close in succession harmless,
except they increase race possibilities by reducing the time-domain
locking.  The generation count is 32 bits, so it can only loop back to
a previous value after 2**32 tc_windup_calls.  This "can't happen".
What can happen is for the timehands to cycle after something is
preempted for 10-100 msec.  Then the generation count allows detection
of the cycling.  It only has an effect in this case.  Otherwise, the
a thread can be preempted for 10-100 seconds and start up using a
timehands pointer that it read into a register that long ago, and
safely use the old pointer unless its generation has changed.  Even
switching the timecounter works in that case.  This depends on the
hardware part of the timecounter not going away and the software
keeping most state per-timehands.

> There was apparently another issue with version 2. The bcopy() is not
> atomic, so potentially libc could read wrong tk_current. I redid
> the interface to write to the shared page to allow use of real atomics.

Timecounter code is supposed to be lock-free except for some time-domain
locking.  I only see 1 problem with this: where tc_windup() writes the
generation count and other things without asking for these writes to
be ordered.  In most cases, the time-domain locking prevents problems.
E.g., when the timehands pointer is read, it remains valid for 9+
generations of cycling timehands (9+ to 90+ msec).  It is only when
it sleeps for this long while holding and planning to use the old
pointer that it needs the generation count to actually work.  Another
case is if writes are out of order (can't happen on x86), so:

  	/*
  	 * The write to th_generation fails to protect users of th
  	 * via 10-100 msec old pointers if it becomes visible unordered
  	 * after any of the writes done by the bcopy().  Very rare to
  	 * lose here, but th_generation's point is to not lose here.
  	 */
  	th->th_generation = 0;
  	bcopy(tho, th, offsetof(struct timehands, th_generation));

  	// finish writing th except for th_generation
  	th->th_generation = ogen;
  	/*
  	 * The previous write to th_generation fails to protect users
  	 * of th via old pointers if becomes visible unordered before
  	 * all of the other writes (users see the generation change
  	 * via the old pointer, and now since it has become nonzero
  	 * they use the incompletely written data.  Again, only a problem
  	 * after 10-100 msec.
  	 */

  	timehands = th;
  	/*
  	 * Now users can grab th via timehands.  If timehands became visible
  	 * unordered before all of the other writes except th_generation,
  	 * then users use the incompletely written data.  Now the time
  	 * domain locking doesn't help.
  	 */

>> 2) You read tk->tk_boottime without the tk_current protection in your
>>    non-uptime routines.  This is racey as the kernel alters the
>>    boottime when it skews time for large adjustments from ntp, etc.
>>    To be really safe you need to read the boottime inside the loop
>>    into a local variable and perhaps use a boolean parameter to decide
>>    if you should add it to the computed uptime.
> I moved the bootime to timehands from timekeep, thank you for the
> clarification.

This isn't bug for bug compatible with the kernel.  The kernel has a
global boottimebin which affects uses of old timehands the instance
that it is changed (even before tc_windup() is called).

> Updated patch is at
> http://people.freebsd.org/~kib/misc/moronix.3.patch

I had better not be awed by looking at it :-).

Bruce