From owner-freebsd-hackers@freebsd.org  Fri Apr  5 19:35:09 2019
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 27099155BB9B
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Fri,  5 Apr 2019 19:35:09 +0000 (UTC)
 (envelope-from marklmi@yahoo.com)
Received: from sonic314-20.consmr.mail.gq1.yahoo.com
 (sonic314-20.consmr.mail.gq1.yahoo.com [98.137.69.83])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F3CB8712BD
 for <freebsd-hackers@freebsd.org>; Fri,  5 Apr 2019 19:35:07 +0000 (UTC)
 (envelope-from marklmi@yahoo.com)
X-YMail-OSG: uUzurf0VM1lxFYMoT_VBwfIKKg_s0eLih6LSYHMSNeXWXxiBkoP2rvVrWL0OxJD
 d8iJGhL1jake0_oyOkaNVOZmCy27aLpZYcM7gofxA3CuOfsqVndeXhVlcWxqhNkdNM1Yj2Cz6dPw
 wybRMPE2BjOlPvh3l10jl9ZdUuSJBCf4_MZhvroD6pKBdxavkTzMzWSVijZM8xfJ44v73EzoclQ0
 h4PJVrGvn9H4aH1WR8DCKi6L1AQcTXiTTSxvSpqrwWx_aypqsI7SOs6VZzly.exRnX8B4v0BLUuy
 _WLEECk3rs4xPadTY12V32_ljrmWrGBMGcyDBhcFEowc7rY5kGlDZEZCEA1gNqmaBPq3VsSuJsvV
 qlqUJGPCMiGMvCgI.FUltTTlK4hq_FXG9N.eL6hFmHRz0ILtfUCdUDM3.DXbD8izpDIoYPwfziS2
 WYWbyO8S4oTErJbZwXnHwZkDIghitYAWGhOfXXMVQ0r9SGn6iH9KWI14eucdXQxwtZ.had2oHv.L
 b83htDwmBZXc_XFul3qrF3lIYWBc2URPcFVdCFx1l0WBo.Zpj8IX3vdcpSS5JrNvsPobpp7y9pU.
 vS5RQplaBuKiRi5VaSmdvTPWieY3pILVoMlud70OKbbXvwpOFApWaPP0r36g2Ar3tDE3.t2SXwfP
 bvkBTSxPTZ4H1ZKC8sNr8p2aC9Rm6SuvdCvzvoGxVsQRlnu2tQRgOVGBh8xKCFuOLKpOkh.iEviJ
 uTQbaN92zku1Ca578H5Zqt4R8mNzedj6utmcE7wjoFi9o7XIUx45DNaZ4MDauMWr7UPPD2MdZacC
 8qouyPItMtGT0Ad3n4vxBgPp7q0UdLgx2v64bd2pyttqJqMHiyVqOJNzCd3zznyfLDdIIQac9oXf
 Pc5pDZujnGMyPEjKCezUBD0HjRdHLHCnzXXW1GaCIbUQxTvkmomk1Of_KStVIzRZTzFzlC1RWhLm
 MG_d33ruDYKLaox2h9wQc_f5RFssNJSSNYBiJ_AoMEdNwxwOPNpIxWcKBePqJTFFEbSzzsHVjlcl
 Ql30S_V860plIFcyCYqsFMP78XVSMIBQLsKDOBZlFEewacGzYDYlGvALX_YrWwICidLO88SpLZQA
 8zA--
Received: from sonic.gate.mail.ne1.yahoo.com by
 sonic314.consmr.mail.gq1.yahoo.com with HTTP; Fri, 5 Apr 2019 19:35:01 +0000
Received: from c-67-170-167-181.hsd1.or.comcast.net (EHLO [192.168.1.113])
 ([67.170.167.181])
 by smtp431.mail.gq1.yahoo.com (Oath Hermes SMTP Server) with ESMTPA ID
 86af237d517259e6241beadd5b873703; 
 Fri, 05 Apr 2019 19:35:00 +0000 (UTC)
Content-Type: text/plain;
	charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\))
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale *
 tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
From: Mark Millard <marklmi@yahoo.com>
In-Reply-To: <20190406014724.X4174@besplex.bde.org>
Date: Fri, 5 Apr 2019 12:34:59 -0700
Cc: Konstantin Belousov <kostikbel@gmail.com>,
 freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
 Michael Tuexen <tuexen@fh-muenster.de>,
 FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <48DB5A87-1681-47DE-969F-FA569EBF6DF5@yahoo.com>
References: <20190307222220.GK2492@kib.kiev.ua>
 <20190309144844.K1166@besplex.bde.org> <20190324110138.GR1923@kib.kiev.ua>
 <E0785613-2B6E-4BB3-95CD-03DD96902CD8@fh-muenster.de>
 <20190403070045.GW1923@kib.kiev.ua> <20190404011802.E2390@besplex.bde.org>
 <20190405113912.GB1923@kib.kiev.ua> <20190405230717.D3383@besplex.bde.org>
 <20190405132128.GD1923@kib.kiev.ua> <20190406003907.C3872@besplex.bde.org>
 <20190405143619.GF1923@kib.kiev.ua> <20190406014724.X4174@besplex.bde.org>
To: Bruce Evans <brde@optusnet.com.au>
X-Mailer: Apple Mail (2.3445.104.8)
X-Rspamd-Queue-Id: F3CB8712BD
X-Spamd-Bar: +++
X-Spamd-Result: default: False [3.14 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[];
 R_SPF_ALLOW(-0.20)[+ptr:yahoo.com]; MV_CASE(0.50)[];
 FREEMAIL_FROM(0.00)[yahoo.com]; RCPT_COUNT_FIVE(0.00)[5];
 RCVD_COUNT_THREE(0.00)[3]; TO_DN_ALL(0.00)[];
 DKIM_TRACE(0.00)[yahoo.com:+];
 MX_GOOD(-0.01)[cached: mta6.am0.yahoodns.net];
 DMARC_POLICY_ALLOW(-0.50)[yahoo.com,reject];
 FREEMAIL_TO(0.00)[optusnet.com.au]; FROM_EQ_ENVFROM(0.00)[];
 RCVD_TLS_LAST(0.00)[]; MIME_TRACE(0.00)[0:+];
 FREEMAIL_ENVFROM(0.00)[yahoo.com];
 ASN(0.00)[asn:36647, ipnet:98.137.64.0/21, country:US];
 MID_RHS_MATCH_FROM(0.00)[];
 DWL_DNSWL_NONE(0.00)[yahoo.com.dwl.dnswl.org : 127.0.5.0];
 ARC_NA(0.00)[]; R_DKIM_ALLOW(-0.20)[yahoo.com:s=s2048];
 FROM_HAS_DN(0.00)[]; NEURAL_SPAM_SHORT(0.27)[0.270,0];
 MIME_GOOD(-0.10)[text/plain];
 IP_SCORE(1.90)[ip: (7.87), ipnet: 98.137.64.0/21(0.95), asn: 36647(0.76),
 country: US(-0.06)]; NEURAL_SPAM_MEDIUM(0.95)[0.948,0];
 TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_SPAM_LONG(0.53)[0.533,0];
 RCVD_IN_DNSWL_NONE(0.00)[83.69.137.98.list.dnswl.org : 127.0.5.0];
 RWL_MAILSPIKE_POSSIBLE(0.00)[83.69.137.98.rep.mailspike.net : 127.0.0.17];
 FREEMAIL_CC(0.00)[gmail.com]
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 05 Apr 2019 19:35:09 -0000


On 2019-Apr-5, at 08:26, Bruce Evans <brde optusnet.com.au> wrote:

> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>=20
>> On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote:
>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>>=20
>>>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote:
>>>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>>>>=20
>>>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
>>>>>>> I noticed (or better realized) a general problem with multiple
>>>>>>> timehands.  ntpd can slew the clock at up to 500 ppm, and at =
least an
>>>>>>> old version of it uses a rate of 50 ppm to fix up fairly small =
drifts
>>>>>>> in the milliseconds range.  500 ppm is enormous in CPU cycles -- =
it is
>>>>>>> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the =
timecounter
>>>>>>> every 1 msec reduces this to only 2000 cycles.
>>>>>>> ...
>>>>>>> The main point of having multiple timehands (after introducing =
the per-
>>>>>>> timehands generation count) is to avoid blocking thread N during =
the
>>>>>>> update, but this doesn't actually work, even for only 2 =
timehands and
>>>>>>> a global generation count.
>>>>>>=20
>>>>>> You are describing the generic race between reader and writer. =
The same
>>>>>> race would exist even with one timehand (and/or one global =
generation
>>>>>> counter), where ntp adjustment might come earlier or later of =
some
>>>>>> consumer accessing the timehands. If timehand instance was read =
before
>>>>>> tc_windup() run but code consumed the result after the windup, it =
might
>>>>>> appear as if time went backward, and this cannot be fixed without =
either
>>>>>> re-reading the time after time-depended calculations were done =
and
>>>>>> restarting, or some globabl lock ensuring serialization.
>>>>>=20
>>>>> With 1 timehand, its generation count would be global.  I think =
its ordering
>>>>> is strong enough to ensure serialization.
>>>> Yes, single timehands result in global generation.  But it would =
not solve
>>>> the same race appearing in slightly different manner, as I =
described above.
>>>> If reader finished while generation number in th was not yet reset, =
but
>>>> caller uses the result after tc_windup(), the effect is same as if =
we
>>>> have two th's and reader used the outdated one.
>>>=20
>>> You described it too concisely for me to understand :-).
>>>=20
>>> I now see that a single generation count doesn't give serialization. =
 I
>>> thought that setting the generation to 0 at the start of tc_windup()
>>> serialized the reader and writer.  But all it does is prevent use of =
the
>>> results of the windup while only some of them are visible.  If the
>>> setting the generation count to 0 doesn't become before tc_windup() =
reads
>>> the hardware timecounter, then this read may be before other reads =
using
>>> the old timehands, but it needs to be after.
>> If we have either single th or global gen counter, current code would
>> become serialized, but this is not what I am about.  Lets assume, for
>=20
> No, generation counts used like they are now (or in any way that I can
> see) can't give serialization.
>=20
>> the sake of the discussion only, that all readers take the same =
spinlock
>> as tc_windup (i.e. tc_setclock_mtx).
>=20
> Spinlocks are far too heavyweight.  Most of the complications in =
timecounter
> locking are to avoid using them.  But OK for the discussion.
>=20
>> It is still possible that reader unlocked the mutex, tc_windup() =
kicked
>> in, locked the mutex and moved timehands (as you noted, this might
>> even happen on the same CPU), and only then the reader continues. For
>> consumer of bintime() or any other function' result, it looks exactly
>> the same as if we did not serialized with writer but used outdated
>> timehands.
>=20
> Not with full serialization.  The writer tc_windup() is also a reader, =
and
> serializing its read gives the necessary monotonicity (for a single =
thread):
> - normal reader locks the mutex, reads the timecounter and unlocks.  =
The
>  mutex makes visible all previous writes, so the reader doesn't use a
>  stale timehands.  Consumers of bintime(), etc., use this time in the =
past.
> - tc_windup() locks the mutex, reads the timecounter hardware and =
writes the
>  timecounter soft state.  The new offset is after all previous times =
read,
>  since this is serialized.
> - normal reader as above sees the new state, so it reads times after =
the
>  time of the windup, so also after the time of previous normal reads.
>=20
>> Let me formulate this diffeently: as far as consumer of the bintime()
>> result does not serialize itself with tc_windup(), serializing =
bintime()
>> itself against tc_windup() does not close the race, but it is not
>> obvious that the race matters.
>=20
> Readers can easily see times far in the past, but the times increase =
in
> program order.
>=20
>> Either we should just accept the race as
>> we currently do, or readers must take the spinlock where the exact =
value
>> of the current time is important,
>=20
> Disabling interrupts should be enough.  In my version of 5.2, =
spinlocks
> don't disable hardware interrupts and may be preempted by fast =
interrupt
> handlers which may be not so fast and take hundreds of usec.  =
Actually,
> even disabling interrupts might not be enough.  A single isa bus read
> can take at least 138 usec (when it is behind a DMA queue or =
something).
> There are also NMI's and SMI's.
>=20
>> or readers must re-read the time after
>> doing something important, and redo if the new measuremedtime is =
outside
>> the acceptable range.
>=20
> This method seems to be essential for robustness.
>=20
> But I don't see any race (for a single thread and no timecounter skew
> across CPUs).  Sloppy readers just see times an unknown but usually =
small
> time in the past.  Non-sloppy readers can also defend against =
timecounter
> skew by binding to 1 CPU.
>=20
> Mutex locking of the timecounter doesn't give monotonic times across =
threads.
> It gives some order, but you don't know which.  Another mutex or =
rendezvous
> is needed to control the order.
>=20

Just for context for the original problem, in case it helps:

The sleepq_timeout went into the case:

       if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo =3D=3D 0) =
{
               /*
                * The thread does not want a timeout (yet).
                */

and after that the specific sleep did not try again (deleted?),
thus the hangup for the sleeping thread.

This was with a call backtrace looking like the below
at the time:

0xe00000009af7c730: at sleepq_timeout+0x148
0xe00000009af7c7d0: at softclock_call_cc+0x234
0xe00000009af7c910: at callout_process+0x2e0
0xe00000009af7c9f0: at handleevents+0x22c
0xe00000009af7caa0: at timercb+0x340
0xe00000009af7cba0: at decr_intr+0x140
0xe00000009af7cbd0: at powerpc_interrupt+0x268

(I added a call to cause the backtrace to be reported.)

For this call chain:

timercb gets a "now" value that is passsed along
and into callout_process but not to softclock_call_cc or
sleepq_timeout .

The callout_process is doing CALLOUT_DIRECT handling when
it directly calls softclock_call_cc:

. . .
        /* Iterate callwheel from firstb to nowb and then up to lastb. =
*/
        do {
                sc =3D &cc->cc_callwheel[firstb & callwheelmask];
                tmp =3D LIST_FIRST(sc);
                while (tmp !=3D NULL) {
                        /* Run the callout if present time within =
allowed. */
                        if (tmp->c_time <=3D now) {
                                /*
                                 * Consumer told us the callout may be =
run
                                 * directly from hardware interrupt =
context.
                                 */
                                if (tmp->c_iflags & CALLOUT_DIRECT) {
#ifdef CALLOUT_PROFILING
                                        ++depth_dir;
#endif
                                        cc_exec_next(cc) =3D
                                            LIST_NEXT(tmp, c_links.le);
                                        cc->cc_bucket =3D firstb & =
callwheelmask;
                                        LIST_REMOVE(tmp, c_links.le);
                                        softclock_call_cc(tmp, cc,
#ifdef CALLOUT_PROFILING
                                            &mpcalls_dir, =
&lockcalls_dir, NULL,
#endif
                                            1);
                                        tmp =3D cc_exec_next(cc);
                                        cc_exec_next(cc) =3D NULL;
                                } else {
. . .


=3D=3D=3D
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)