From owner-freebsd-hackers@freebsd.org Fri Apr 5 19:35:09 2019 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 27099155BB9B for ; Fri, 5 Apr 2019 19:35:09 +0000 (UTC) (envelope-from marklmi@yahoo.com) Received: from sonic314-20.consmr.mail.gq1.yahoo.com (sonic314-20.consmr.mail.gq1.yahoo.com [98.137.69.83]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id F3CB8712BD for ; Fri, 5 Apr 2019 19:35:07 +0000 (UTC) (envelope-from marklmi@yahoo.com) X-YMail-OSG: uUzurf0VM1lxFYMoT_VBwfIKKg_s0eLih6LSYHMSNeXWXxiBkoP2rvVrWL0OxJD d8iJGhL1jake0_oyOkaNVOZmCy27aLpZYcM7gofxA3CuOfsqVndeXhVlcWxqhNkdNM1Yj2Cz6dPw wybRMPE2BjOlPvh3l10jl9ZdUuSJBCf4_MZhvroD6pKBdxavkTzMzWSVijZM8xfJ44v73EzoclQ0 h4PJVrGvn9H4aH1WR8DCKi6L1AQcTXiTTSxvSpqrwWx_aypqsI7SOs6VZzly.exRnX8B4v0BLUuy _WLEECk3rs4xPadTY12V32_ljrmWrGBMGcyDBhcFEowc7rY5kGlDZEZCEA1gNqmaBPq3VsSuJsvV qlqUJGPCMiGMvCgI.FUltTTlK4hq_FXG9N.eL6hFmHRz0ILtfUCdUDM3.DXbD8izpDIoYPwfziS2 WYWbyO8S4oTErJbZwXnHwZkDIghitYAWGhOfXXMVQ0r9SGn6iH9KWI14eucdXQxwtZ.had2oHv.L b83htDwmBZXc_XFul3qrF3lIYWBc2URPcFVdCFx1l0WBo.Zpj8IX3vdcpSS5JrNvsPobpp7y9pU. vS5RQplaBuKiRi5VaSmdvTPWieY3pILVoMlud70OKbbXvwpOFApWaPP0r36g2Ar3tDE3.t2SXwfP bvkBTSxPTZ4H1ZKC8sNr8p2aC9Rm6SuvdCvzvoGxVsQRlnu2tQRgOVGBh8xKCFuOLKpOkh.iEviJ uTQbaN92zku1Ca578H5Zqt4R8mNzedj6utmcE7wjoFi9o7XIUx45DNaZ4MDauMWr7UPPD2MdZacC 8qouyPItMtGT0Ad3n4vxBgPp7q0UdLgx2v64bd2pyttqJqMHiyVqOJNzCd3zznyfLDdIIQac9oXf Pc5pDZujnGMyPEjKCezUBD0HjRdHLHCnzXXW1GaCIbUQxTvkmomk1Of_KStVIzRZTzFzlC1RWhLm MG_d33ruDYKLaox2h9wQc_f5RFssNJSSNYBiJ_AoMEdNwxwOPNpIxWcKBePqJTFFEbSzzsHVjlcl Ql30S_V860plIFcyCYqsFMP78XVSMIBQLsKDOBZlFEewacGzYDYlGvALX_YrWwICidLO88SpLZQA 8zA-- Received: from sonic.gate.mail.ne1.yahoo.com by sonic314.consmr.mail.gq1.yahoo.com with HTTP; Fri, 5 Apr 2019 19:35:01 +0000 Received: from c-67-170-167-181.hsd1.or.comcast.net (EHLO [192.168.1.113]) ([67.170.167.181]) by smtp431.mail.gq1.yahoo.com (Oath Hermes SMTP Server) with ESMTPA ID 86af237d517259e6241beadd5b873703; Fri, 05 Apr 2019 19:35:00 +0000 (UTC) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\)) Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed] From: Mark Millard In-Reply-To: <20190406014724.X4174@besplex.bde.org> Date: Fri, 5 Apr 2019 12:34:59 -0700 Cc: Konstantin Belousov , freebsd-hackers Hackers , Michael Tuexen , FreeBSD PowerPC ML Content-Transfer-Encoding: quoted-printable Message-Id: <48DB5A87-1681-47DE-969F-FA569EBF6DF5@yahoo.com> References: <20190307222220.GK2492@kib.kiev.ua> <20190309144844.K1166@besplex.bde.org> <20190324110138.GR1923@kib.kiev.ua> <20190403070045.GW1923@kib.kiev.ua> <20190404011802.E2390@besplex.bde.org> <20190405113912.GB1923@kib.kiev.ua> <20190405230717.D3383@besplex.bde.org> <20190405132128.GD1923@kib.kiev.ua> <20190406003907.C3872@besplex.bde.org> <20190405143619.GF1923@kib.kiev.ua> <20190406014724.X4174@besplex.bde.org> To: Bruce Evans X-Mailer: Apple Mail (2.3445.104.8) X-Rspamd-Queue-Id: F3CB8712BD X-Spamd-Bar: +++ X-Spamd-Result: default: False [3.14 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[]; R_SPF_ALLOW(-0.20)[+ptr:yahoo.com]; MV_CASE(0.50)[]; FREEMAIL_FROM(0.00)[yahoo.com]; RCPT_COUNT_FIVE(0.00)[5]; RCVD_COUNT_THREE(0.00)[3]; TO_DN_ALL(0.00)[]; DKIM_TRACE(0.00)[yahoo.com:+]; MX_GOOD(-0.01)[cached: mta6.am0.yahoodns.net]; DMARC_POLICY_ALLOW(-0.50)[yahoo.com,reject]; FREEMAIL_TO(0.00)[optusnet.com.au]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_LAST(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[yahoo.com]; ASN(0.00)[asn:36647, ipnet:98.137.64.0/21, country:US]; MID_RHS_MATCH_FROM(0.00)[]; DWL_DNSWL_NONE(0.00)[yahoo.com.dwl.dnswl.org : 127.0.5.0]; ARC_NA(0.00)[]; R_DKIM_ALLOW(-0.20)[yahoo.com:s=s2048]; FROM_HAS_DN(0.00)[]; NEURAL_SPAM_SHORT(0.27)[0.270,0]; MIME_GOOD(-0.10)[text/plain]; IP_SCORE(1.90)[ip: (7.87), ipnet: 98.137.64.0/21(0.95), asn: 36647(0.76), country: US(-0.06)]; NEURAL_SPAM_MEDIUM(0.95)[0.948,0]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_SPAM_LONG(0.53)[0.533,0]; RCVD_IN_DNSWL_NONE(0.00)[83.69.137.98.list.dnswl.org : 127.0.5.0]; RWL_MAILSPIKE_POSSIBLE(0.00)[83.69.137.98.rep.mailspike.net : 127.0.0.17]; FREEMAIL_CC(0.00)[gmail.com] X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 05 Apr 2019 19:35:09 -0000 On 2019-Apr-5, at 08:26, Bruce Evans wrote: > On Fri, 5 Apr 2019, Konstantin Belousov wrote: >=20 >> On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote: >>> On Fri, 5 Apr 2019, Konstantin Belousov wrote: >>>=20 >>>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote: >>>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote: >>>>>=20 >>>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote: >>>>>>> I noticed (or better realized) a general problem with multiple >>>>>>> timehands. ntpd can slew the clock at up to 500 ppm, and at = least an >>>>>>> old version of it uses a rate of 50 ppm to fix up fairly small = drifts >>>>>>> in the milliseconds range. 500 ppm is enormous in CPU cycles -- = it is >>>>>>> 500 thousand nsec or 2 million cycles at 4GHz. Winding up the = timecounter >>>>>>> every 1 msec reduces this to only 2000 cycles. >>>>>>> ... >>>>>>> The main point of having multiple timehands (after introducing = the per- >>>>>>> timehands generation count) is to avoid blocking thread N during = the >>>>>>> update, but this doesn't actually work, even for only 2 = timehands and >>>>>>> a global generation count. >>>>>>=20 >>>>>> You are describing the generic race between reader and writer. = The same >>>>>> race would exist even with one timehand (and/or one global = generation >>>>>> counter), where ntp adjustment might come earlier or later of = some >>>>>> consumer accessing the timehands. If timehand instance was read = before >>>>>> tc_windup() run but code consumed the result after the windup, it = might >>>>>> appear as if time went backward, and this cannot be fixed without = either >>>>>> re-reading the time after time-depended calculations were done = and >>>>>> restarting, or some globabl lock ensuring serialization. >>>>>=20 >>>>> With 1 timehand, its generation count would be global. I think = its ordering >>>>> is strong enough to ensure serialization. >>>> Yes, single timehands result in global generation. But it would = not solve >>>> the same race appearing in slightly different manner, as I = described above. >>>> If reader finished while generation number in th was not yet reset, = but >>>> caller uses the result after tc_windup(), the effect is same as if = we >>>> have two th's and reader used the outdated one. >>>=20 >>> You described it too concisely for me to understand :-). >>>=20 >>> I now see that a single generation count doesn't give serialization. = I >>> thought that setting the generation to 0 at the start of tc_windup() >>> serialized the reader and writer. But all it does is prevent use of = the >>> results of the windup while only some of them are visible. If the >>> setting the generation count to 0 doesn't become before tc_windup() = reads >>> the hardware timecounter, then this read may be before other reads = using >>> the old timehands, but it needs to be after. >> If we have either single th or global gen counter, current code would >> become serialized, but this is not what I am about. Lets assume, for >=20 > No, generation counts used like they are now (or in any way that I can > see) can't give serialization. >=20 >> the sake of the discussion only, that all readers take the same = spinlock >> as tc_windup (i.e. tc_setclock_mtx). >=20 > Spinlocks are far too heavyweight. Most of the complications in = timecounter > locking are to avoid using them. But OK for the discussion. >=20 >> It is still possible that reader unlocked the mutex, tc_windup() = kicked >> in, locked the mutex and moved timehands (as you noted, this might >> even happen on the same CPU), and only then the reader continues. For >> consumer of bintime() or any other function' result, it looks exactly >> the same as if we did not serialized with writer but used outdated >> timehands. >=20 > Not with full serialization. The writer tc_windup() is also a reader, = and > serializing its read gives the necessary monotonicity (for a single = thread): > - normal reader locks the mutex, reads the timecounter and unlocks. = The > mutex makes visible all previous writes, so the reader doesn't use a > stale timehands. Consumers of bintime(), etc., use this time in the = past. > - tc_windup() locks the mutex, reads the timecounter hardware and = writes the > timecounter soft state. The new offset is after all previous times = read, > since this is serialized. > - normal reader as above sees the new state, so it reads times after = the > time of the windup, so also after the time of previous normal reads. >=20 >> Let me formulate this diffeently: as far as consumer of the bintime() >> result does not serialize itself with tc_windup(), serializing = bintime() >> itself against tc_windup() does not close the race, but it is not >> obvious that the race matters. >=20 > Readers can easily see times far in the past, but the times increase = in > program order. >=20 >> Either we should just accept the race as >> we currently do, or readers must take the spinlock where the exact = value >> of the current time is important, >=20 > Disabling interrupts should be enough. In my version of 5.2, = spinlocks > don't disable hardware interrupts and may be preempted by fast = interrupt > handlers which may be not so fast and take hundreds of usec. = Actually, > even disabling interrupts might not be enough. A single isa bus read > can take at least 138 usec (when it is behind a DMA queue or = something). > There are also NMI's and SMI's. >=20 >> or readers must re-read the time after >> doing something important, and redo if the new measuremedtime is = outside >> the acceptable range. >=20 > This method seems to be essential for robustness. >=20 > But I don't see any race (for a single thread and no timecounter skew > across CPUs). Sloppy readers just see times an unknown but usually = small > time in the past. Non-sloppy readers can also defend against = timecounter > skew by binding to 1 CPU. >=20 > Mutex locking of the timecounter doesn't give monotonic times across = threads. > It gives some order, but you don't know which. Another mutex or = rendezvous > is needed to control the order. >=20 Just for context for the original problem, in case it helps: The sleepq_timeout went into the case: if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo =3D=3D 0) = { /* * The thread does not want a timeout (yet). */ and after that the specific sleep did not try again (deleted?), thus the hangup for the sleeping thread. This was with a call backtrace looking like the below at the time: 0xe00000009af7c730: at sleepq_timeout+0x148 0xe00000009af7c7d0: at softclock_call_cc+0x234 0xe00000009af7c910: at callout_process+0x2e0 0xe00000009af7c9f0: at handleevents+0x22c 0xe00000009af7caa0: at timercb+0x340 0xe00000009af7cba0: at decr_intr+0x140 0xe00000009af7cbd0: at powerpc_interrupt+0x268 (I added a call to cause the backtrace to be reported.) For this call chain: timercb gets a "now" value that is passsed along and into callout_process but not to softclock_call_cc or sleepq_timeout . The callout_process is doing CALLOUT_DIRECT handling when it directly calls softclock_call_cc: . . . /* Iterate callwheel from firstb to nowb and then up to lastb. = */ do { sc =3D &cc->cc_callwheel[firstb & callwheelmask]; tmp =3D LIST_FIRST(sc); while (tmp !=3D NULL) { /* Run the callout if present time within = allowed. */ if (tmp->c_time <=3D now) { /* * Consumer told us the callout may be = run * directly from hardware interrupt = context. */ if (tmp->c_iflags & CALLOUT_DIRECT) { #ifdef CALLOUT_PROFILING ++depth_dir; #endif cc_exec_next(cc) =3D LIST_NEXT(tmp, c_links.le); cc->cc_bucket =3D firstb & = callwheelmask; LIST_REMOVE(tmp, c_links.le); softclock_call_cc(tmp, cc, #ifdef CALLOUT_PROFILING &mpcalls_dir, = &lockcalls_dir, NULL, #endif 1); tmp =3D cc_exec_next(cc); cc_exec_next(cc) =3D NULL; } else { . . . =3D=3D=3D Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)