Date: Sat, 6 Apr 2019 15:16:50 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Michael Tuexen <tuexen@fh-muenster.de>,
        freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: powerpc64 head -r344018 stuck sleeping problems:
        th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes
        [patched failed]
Message-ID: <20190406121650.GJ1923@kib.kiev.ua>
In-Reply-To: <20190406014724.X4174@besplex.bde.org>
References: <20190324110138.GR1923@kib.kiev.ua>
        <E0785613-2B6E-4BB3-95CD-03DD96902CD8@fh-muenster.de>
        <20190403070045.GW1923@kib.kiev.ua>
        <20190404011802.E2390@besplex.bde.org>
        <20190405113912.GB1923@kib.kiev.ua>
        <20190405230717.D3383@besplex.bde.org>
        <20190405132128.GD1923@kib.kiev.ua>
        <20190406003907.C3872@besplex.bde.org>
        <20190405143619.GF1923@kib.kiev.ua>
        <20190406014724.X4174@besplex.bde.org>
On Sat, Apr 06, 2019 at 02:26:11AM +1100, Bruce Evans wrote:
> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>
> > On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote:
> >> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> >>
> >>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote:
> >>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
> >>>>>> I noticed (or rather, realized) a general problem with multiple
> >>>>>> timehands.  ntpd can slew the clock at up to 500 ppm, and at least
> >>>>>> an old version of it uses a rate of 50 ppm to fix up fairly small
> >>>>>> drifts in the milliseconds range.  500 ppm is enormous in CPU
> >>>>>> cycles -- it is 500 thousand nsec or 2 million cycles at 4GHz.
> >>>>>> Winding up the timecounter every 1 msec reduces this to only
> >>>>>> 2000 cycles.
> >>>>>> ...
> >>>>>> The main point of having multiple timehands (after introducing the
> >>>>>> per-timehands generation count) is to avoid blocking thread N
> >>>>>> during the update, but this doesn't actually work, even for only
> >>>>>> 2 timehands and a global generation count.
> >>>>>
> >>>>> You are describing the generic race between reader and writer.  The
> >>>>> same race would exist even with one timehands (and/or one global
> >>>>> generation counter), where an ntp adjustment might come earlier or
> >>>>> later than some consumer accessing the timehands.  If the timehands
> >>>>> instance was read before tc_windup() ran but the code consumed the
> >>>>> result after the windup, it might appear as if time went backward,
> >>>>> and this cannot be fixed without either re-reading the time after
> >>>>> the time-dependent calculations were done and restarting, or some
> >>>>> global lock ensuring serialization.
> >>>>
> >>>> With 1 timehands, its generation count would be global.  I think its
> >>>> ordering is strong enough to ensure serialization.
> >>> Yes, a single timehands results in a global generation.  But it would
> >>> not solve the same race appearing in a slightly different manner, as
> >>> I described above.  If the reader finished while the generation
> >>> number in th was not yet reset, but the caller uses the result after
> >>> tc_windup(), the effect is the same as if we had two th's and the
> >>> reader used the outdated one.
> >>
> >> You described it too concisely for me to understand :-).
> >>
> >> I now see that a single generation count doesn't give serialization.
> >> I thought that setting the generation to 0 at the start of tc_windup()
> >> serialized the reader and writer.  But all it does is prevent use of
> >> the results of the windup while only some of them are visible.  If
> >> setting the generation count to 0 doesn't become visible before
> >> tc_windup() reads the hardware timecounter, then this read may be
> >> before other reads using the old timehands, but it needs to be after.
> > If we have either a single th or a global gen counter, the current
> > code would become serialized, but this is not what I am talking about.
> > Let's assume, for
>
> No, generation counts used like they are now (or in any way that I can
> see) can't give serialization.
>
> > the sake of the discussion only, that all readers take the same
> > spinlock as tc_windup() (i.e. tc_setclock_mtx).
>
> Spinlocks are far too heavyweight.  Most of the complications in
> timecounter locking are to avoid using them.  But OK for the discussion.
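
[Context for readers joining the thread: the generation-count protocol
under discussion is the lock-free reader loop in sys/kern/kern_tc.c.
A lightly trimmed sketch, close to the code in head at the time but not
a verbatim copy:]

    /*
     * Reader side: load the generation with acquire semantics, compute
     * the time from this timehands, then check that the generation did
     * not change underneath us (0 means a windup is in progress).
     * Retry otherwise.
     */
    static __inline u_int
    tc_delta(struct timehands *th)
    {
            struct timecounter *tc;

            tc = th->th_counter;
            return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
                tc->tc_counter_mask);
    }

    void
    binuptime(struct bintime *bt)
    {
            struct timehands *th;
            u_int gen;

            do {
                    th = timehands;
                    gen = atomic_load_acq_int(&th->th_generation);
                    *bt = th->th_offset;
                    /*
                     * The product from the Subject line: it can
                     * overflow 64 bits when windups are too far apart.
                     */
                    bintime_addx(bt, th->th_scale * tc_delta(th));
                    atomic_thread_fence_acq();
            } while (gen == 0 || gen != th->th_generation);
    }

[The race described above is that a reader can pass this loop
successfully, and only afterwards does tc_windup() advance the
timehands, so the consumer acts on a result that is already stale.]
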
> > It is still possible that the reader unlocked the mutex, tc_windup()
> > kicked in, locked the mutex and moved the timehands (as you noted,
> > this might even happen on the same CPU), and only then the reader
> > continued.  For a consumer of the result of bintime() or any other
> > such function, it looks exactly the same as if we had not serialized
> > with the writer but used outdated timehands.
>
> Not with full serialization.  The writer tc_windup() is also a reader,
> and serializing its read gives the necessary monotonicity (for a single
> thread):
> - normal reader locks the mutex, reads the timecounter and unlocks.
>   The mutex makes visible all previous writes, so the reader doesn't
>   use a stale timehands.  Consumers of bintime(), etc., use this time
>   in the past.
> - tc_windup() locks the mutex, reads the timecounter hardware and
>   writes the timecounter soft state.  The new offset is after all
>   previous times read, since this is serialized.
> - normal reader as above sees the new state, so it reads times after
>   the time of the windup, so also after the time of previous normal
>   reads.
>
> > Let me formulate this differently: as long as the consumer of the
> > bintime() result does not serialize itself with tc_windup(),
> > serializing bintime() itself against tc_windup() does not close the
> > race, but it is not obvious that the race matters.
>
> Readers can easily see times far in the past, but the times increase
> in program order.

This is true for the current implementation (single-thread monotonic).

> > Either we should just accept the race as we currently do, or readers
> > must take the spinlock where the exact value of the current time is
> > important,
>
> Disabling interrupts should be enough.  In my version of 5.2, spinlocks
> don't disable hardware interrupts and may be preempted by fast
> interrupt handlers which may be not so fast and take hundreds of usec.
> Actually, even disabling interrupts might not be enough.  A single isa
> bus read can take at least 138 usec (when it is behind a DMA queue or
> something).  There are also NMIs and SMIs.
>
> > or readers must re-read the time after doing something important, and
> > redo it if the newly measured time is outside the acceptable range.
>
> This method seems to be essential for robustness.
>
> But I don't see any race (for a single thread and no timecounter skew
> across CPUs).  Sloppy readers just see times an unknown but usually
> small time in the past.  Non-sloppy readers can also defend against
> timecounter skew by binding to 1 CPU.

Yes, this is true.

> Mutex locking of the timecounter doesn't give monotonic times across
> threads.  It gives some order, but you don't know which.  Another mutex
> or rendezvous is needed to control the order.

Define monotonic between threads?
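
[Illustration of the re-read-and-redo defense discussed above: sample
the clock, do the time-dependent work, then sample again and redo if
the second reading went backwards or the elapsed interval exceeds an
acceptable bound.  This is a minimal kernel-side sketch under stated
assumptions; do_timed_work() and the one-second bound are hypothetical
names chosen here, not anything in the tree:]

    #include <sys/param.h>
    #include <sys/time.h>

    static void
    do_timed_work(struct bintime *result)
    {
            struct bintime before, after;

            for (;;) {
                    binuptime(&before);
                    /* ... time-dependent calculation using 'before' ... */
                    binuptime(&after);
                    /*
                     * Accept only if time did not appear to go backward
                     * and the work finished within the (hypothetical)
                     * one-second window.
                     */
                    if (bintime_cmp(&after, &before, >=) &&
                        after.sec - before.sec < 1) {
                            *result = before;
                            return;
                    }
                    /* tc_windup() or a slew intervened; redo the work. */
            }
    }

[As Bruce notes, this gives robustness without readers taking a
spinlock, at the cost of redoing the work when a windup lands inside
the measurement window.]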