From owner-freebsd-ppc@freebsd.org  Fri Apr  5 11:39:23 2019
Date: Fri, 5 Apr 2019 14:39:12 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Michael Tuexen <tuexen@fh-muenster.de>,
 freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
 FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale *
 tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID: <20190405113912.GB1923@kib.kiev.ua>
References: <20190304114150.GM68879@kib.kiev.ua>
 <20190305031010.I4610@besplex.bde.org>
 <20190306172003.GD2492@kib.kiev.ua>
 <20190308001005.M2756@besplex.bde.org>
 <20190307222220.GK2492@kib.kiev.ua>
 <20190309144844.K1166@besplex.bde.org>
 <20190324110138.GR1923@kib.kiev.ua>
 <E0785613-2B6E-4BB3-95CD-03DD96902CD8@fh-muenster.de>
 <20190403070045.GW1923@kib.kiev.ua>
 <20190404011802.E2390@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20190404011802.E2390@besplex.bde.org>
User-Agent: Mutt/1.11.4 (2019-03-13)
List-Id: Porting FreeBSD to the PowerPC <freebsd-ppc.freebsd.org>

On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
> I noticed (or better realized) a general problem with multiple
> timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
> old version of it uses a rate of 50 ppm to fix up fairly small drifts
> in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
> every 1 msec reduces this to only 2000 cycles.
> 
> More details of ordering and timing for 1 thread:
> - thread N calls binuptime() and it loads timehands
> - another or even the same thread runs tc_windup().  This modifies timehands.
> - thread N is preempted for a long time, but less than the time for
>    <number of timehands> updates
> - thread N checks the generation count.  Since this is for the timehands
>    contents and not for the timehands pointer, it hasn't changed, so the
>    old timehands is used
> - an instant later, the same thread calls binuptime() again and uses the
>    new timehands
> - now suppose only 2 timehands (as in -current), the worst (?) case of a
>    slew of 500 ppm for the old timehands and -500 ppm for the new timehands,
>    and almost the worst case of 10 msec for the oldness of the old timehands
>    relative to the new timehands, with the new timehands about to be updated
>    after 10 msec (assume perfectly periodic updates every 10 msec).  The
>    calculated times are:
> 
>    old bintime = old_base + (20 msec) * (1 + 500e-6)
>    new_base = old_base + (10 msec) * (1 + 500e-6)  # calc by tc_windup()
>    new bintime = new_base + (10 msec) * (1 - 500e-6) + epsilon
> 
>    error = epsilon - (20 msec) * 500e-6 = epsilon - 10000 nsec
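
The arithmetic quoted above can be checked with a small integer sketch.
This is only an illustration under stated assumptions: slewed_ns() and
stale_timehands_error_ns() are hypothetical helper names, not kernel
functions, epsilon is taken as 0, and times are plain nanosecond counts
rather than bintimes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Nanoseconds that a clock slewed by ppm parts per million advances
 * while real_ns nanoseconds of real time pass.  (Hypothetical helper.)
 */
static int64_t
slewed_ns(int64_t real_ns, int64_t ppm)
{
	assert(real_ns >= 0);
	return (real_ns + real_ns * ppm / 1000000);
}

/*
 * New reading minus old reading for the scenario in the quoted text,
 * with old_base = 0 and epsilon = 0:
 *   old reading: stale timehands, 2 intervals of delta at +ppm
 *   new_base:    computed by the windup after 1 interval at +ppm
 *   new reading: new_base plus 1 interval of delta at -ppm
 */
static int64_t
stale_timehands_error_ns(int64_t interval_ns, int64_t ppm)
{
	int64_t old_reading = slewed_ns(2 * interval_ns, ppm);
	int64_t new_base = slewed_ns(interval_ns, ppm);
	int64_t new_reading = new_base + slewed_ns(interval_ns, -ppm);

	return (new_reading - old_reading);
}
```

With interval_ns = 10000000 (10 msec) and ppm = 500 this returns
-10000, matching the "epsilon - 10000 nsec" figure above.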
> 
> Errors in the negative direction are most harmful.  ntpd normally doesn't
> change the skew as much as that in one step, but it is easy for adjtime(2)
> to change the skew like that and there are no reasonable microadjustments
> that would accidentally work around this kernel bug (it seems unreasonable
> to limit the skew to 1 ppm and that still gives an error of epsilon + 20
> nsec).
> 
> phk didn't want to slow down timecounters using extra code to make
> them monotonic and coherent (getbinuptime() is incoherent with
> binuptime() since the former lags the latter by the update interval),
> but this wouldn't be very slow within a thread.
> 
> Monotonicity across threads is a larger problem and not helped by using
> a faked forced monotonic time within threads.
> 
> So it seems best to fix the above problem by moving the generation count
> from the timehands contents to the timehands pointer, and maybe also
> reduce the number of timehands to 1.  With 2 timehands, this gives a
> shorter race:
> 
> - thread N loads timehands
> - tc_windup()
> - thread N preempted
> - thread N uses old timehands
> - case tc_windup() completes first: no problem -- thread N checks the
>    generation count on the pointer and loops
> - case binuptime() completes first: lost a race -- binuptime() is off
>    by approx <tc_windup() execution time> * <difference in skews>.
> 
> The main point of having multiple timehands (after introducing the per-
> timehands generation count) is to avoid blocking thread N during the
> update, but this doesn't actually work, even for only 2 timehands and 
> a global generation count.

You are describing the generic race between reader and writer. The same
race would exist even with one timehand (and/or one global generation
counter), where an ntp adjustment might come earlier or later than some
consumer accessing the timehands. If a timehands instance was read before
tc_windup() ran but the code consumed the result after the windup, it might
appear as if time went backward, and this cannot be fixed without either
re-reading the time after the time-dependent calculations are done and
restarting, or some global lock ensuring serialization.
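
To illustrate, here is a minimal userland sketch with a single timehands
protected by a generation counter. It is not the kern_tc.c code: struct
th_model, model_init(), model_windup() and model_read() are hypothetical
names, the fields are plain nanoseconds plus a ppm slew, and default
(sequentially consistent) C11 atomics are assumed instead of the
kernel's barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct timehands. */
struct th_model {
	atomic_uint gen;		/* 0 while an update is in progress */
	_Atomic int64_t base_ns;	/* time at the last windup */
	_Atomic int64_t slew_ppm;	/* rate adjustment applied to deltas */
};

static void
model_init(struct th_model *th, int64_t base_ns, int64_t slew_ppm)
{
	atomic_init(&th->gen, 1);
	atomic_init(&th->base_ns, base_ns);
	atomic_init(&th->slew_ppm, slew_ppm);
}

/* Writer: invalidate the generation, update, publish a new generation. */
static void
model_windup(struct th_model *th, int64_t base_ns, int64_t slew_ppm)
{
	unsigned ogen = atomic_load(&th->gen);

	atomic_store(&th->gen, 0);
	atomic_store(&th->base_ns, base_ns);
	atomic_store(&th->slew_ppm, slew_ppm);
	if (++ogen == 0)		/* 0 is reserved for "updating" */
		ogen = 1;
	atomic_store(&th->gen, ogen);
}

/* Reader: retry until a stable, non-zero generation brackets the loads. */
static int64_t
model_read(struct th_model *th, int64_t delta_ns)
{
	unsigned gen;
	int64_t base, ppm;

	do {
		gen = atomic_load(&th->gen);
		base = atomic_load(&th->base_ns);
		ppm = atomic_load(&th->slew_ppm);
	} while (gen == 0 || gen != atomic_load(&th->gen));
	return (base + delta_ns + delta_ns * ppm / 1000000);
}
```

Each model_read() returns a self-consistent value, yet a value obtained
before model_windup() and only used after it can still be larger than a
value read afterwards. The retry loop covers only the read itself, not
the later use of the result.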