Date: Tue, 5 Jun 2012 06:51:00 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: John Baldwin <jhb@FreeBSD.org>
Cc: Gianni <gianni@FreeBSD.org>, Alan Cox <alc@rice.edu>, Alexander Kabaev <kan@FreeBSD.org>, Attilio Rao <attilio@FreeBSD.org>, Konstantin Belousov <kib@FreeBSD.org>, freebsd-arch@FreeBSD.org, Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: Fwd: [RFC] Kernel shared variables
Message-ID: <20120605054930.H3236@besplex.bde.org>
In-Reply-To: <201206041101.57486.jhb@freebsd.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org>
On Mon, 4 Jun 2012, John Baldwin wrote:

> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
>> On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>>> What is timehands offsets ? Do you mean things like leap seconds ?
>>
>> Yes. binuptime() is:
>>
>> % void
>> % binuptime(struct bintime *bt)
>> % {
>> %     struct timehands *th;
>> %     u_int gen;
>> %
>> %     do {
>> %         th = timehands;
>> %         gen = th->th_generation;
>> %         *bt = th->th_offset;
>> %         bintime_addx(bt, th->th_scale * tc_delta(th));
>> %     } while (gen == 0 || gen != th->th_generation);
>> % }
>>
>> Without the kernel providing th->th_offset, you have to do lots of ntp
>> handling for yourself (compatibly with the kernel) just to get an
>> accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but
>> they do affect CLOCK_REALTIME which is the clock id used by
>> gettimeofday(). For the former, you only have to advance the offset
>> for yourself occasionally (compatibly with the kernel) and manage
>> (compatibly with the kernel, especially in the long term) ntp slewing
>> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
>
> I think duplicating this logic in userland would just be wasteful. I have

Sure. I modestly proposed it.

> a private fast gettimeofday() at my current job and it works by exporting
> the current timehands structure (well, the equivalent) to userland. The
> userland bits then fetch a copy of the details and do the same as bintime().

How do you keep this up to date, especially for leap seconds?

> (I move the math (bintime_addx() and the multiply)) out of the loop however.

My version has a comment saying to do that, but I just noticed that it
wouldn't work so well -- the timehands fields would have to be copied to
local variables while under protection of the generation count, so it
would give messier code to optimize a case that occurs _very_ rarely.

>> timehands in a shared pages is close to working.
>> th_generation protects
>> things in the same way as in the kernel, modulo assumptions that writes
>> are ordered.
>
> It would work fine. And in fact, having multiple timehands is actually a
> bug, not a feature. It lets you compute bogus timestamps if you get preempted
> at the wrong time and end up with time jumping around. At Yahoo! we reduced
> the number of timehands structures down to 2 or some such, and I'm now of
> the opinion we should just have one and dispense with the entire array.

No, it is a feature. The time should never jump around (backwards), but
it can easily jump forwards. It makes little difference if preemption
occurs after the timehands have been read, or while reading them but in
such a way that the timehands become stale during preemption but not
stale enough for their generation to change so that you notice that they
are stale -- you get a stale timestamp either way (with staleness
approximately the preemption time). Times read by different threads can
easily have different staleness according to which timehands they ended
up using and this may be quite different from which timehands they
started using and from which timehands is active after they return.
Perhaps this is what you mean. But again, this happens anyway when the
preemption occurs after the timehands have been read.

The main point of timehands was originally to give a copy of the time
that was stable for a time hopefully long enough for the timehands to be
read without them being clobbered by an update.
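A rough sketch of the userland reader discussed above, with the fields copied to locals under the generation count and the math moved out of the loop. Everything here is an assumption made for illustration: the struct layout, the names timehands_sketch, bintime_sketch and binuptime_sketch, and passing the counter value in as now_count instead of reading hardware. C11 atomics stand in for the "writes are ordered" assumption; this is not the kernel's code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userland copy of the exported timehands; not the kernel layout. */
struct timehands_sketch {
    _Atomic uint32_t gen;          /* generation; 0 while the writer is mid-update */
    _Atomic int64_t  sec;          /* offset: whole seconds */
    _Atomic uint64_t frac;         /* offset: fraction, in units of 2^-64 s */
    _Atomic uint64_t scale;        /* 2^-64 s per counter tick */
    _Atomic uint32_t offset_count; /* counter value when the offset was taken */
    _Atomic uint32_t counter_mask; /* valid bits of the hardware counter */
};

/* Hypothetical result type, shaped like the kernel's struct bintime. */
struct bintime_sketch {
    int64_t  sec;
    uint64_t frac;
};

/*
 * Compute the uptime for counter value now_count (an assumed parameter;
 * real code would read the TSC or similar).
 */
void
binuptime_sketch(struct timehands_sketch *th, uint32_t now_count,
    struct bintime_sketch *bt)
{
    uint32_t gen, count, mask;
    uint64_t scale, frac, old_frac;
    int64_t sec;

    /* Copy every field to a local while protected by the generation. */
    do {
        gen = atomic_load_explicit(&th->gen, memory_order_acquire);
        sec = atomic_load_explicit(&th->sec, memory_order_relaxed);
        frac = atomic_load_explicit(&th->frac, memory_order_relaxed);
        scale = atomic_load_explicit(&th->scale, memory_order_relaxed);
        count = atomic_load_explicit(&th->offset_count, memory_order_relaxed);
        mask = atomic_load_explicit(&th->counter_mask, memory_order_relaxed);
        /* Keep the field loads above from moving below the recheck. */
        atomic_thread_fence(memory_order_acquire);
    } while (gen == 0 ||
        gen != atomic_load_explicit(&th->gen, memory_order_relaxed));

    /* The multiply and the add-with-carry run once, on private copies. */
    old_frac = frac;
    frac += scale * ((now_count - count) & mask);
    if (frac < old_frac)
        sec++;
    bt->sec = sec;
    bt->frac = frac;
}
```

The six local copies are the messier code referred to above: the loop body gets longer so that a retry, which should be very rare, skips one multiply and one add.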
binuptime() was:

1.59  (phk 26-Mar-98): void
1.113 (phk 07-Feb-02): binuptime(struct bintime *bt)
1.113 (phk 07-Feb-02): {
1.113 (phk 07-Feb-02):     struct timecounter *tc;
1.113 (phk 07-Feb-02):
1.113 (phk 07-Feb-02):     tc = timecounter;
1.113 (phk 07-Feb-02):     *bt = tc->tc_offset;
1.113 (phk 07-Feb-02):     bintime_addx(bt, tc->tc_scale * tco_delta(tc));
1.113 (phk 07-Feb-02): }

This has an obvious race if the thread running this is preempted for a
long time, so that the copy of the time is actually not stable for long
enough. This was fixed (except I think in some cases using ddb) by using
the generation count. With the generation count, multiple timehands are
probably unnecessary, but they reduce locking bugs (no memory ordering
for the generation count) and give the optimization that binuptime()
etc. don't have to spin waiting for updates.

Now it is the thread doing the updates that gets the most advantages
from multiple timehands. It doesn't have to worry much about locking, or
being preempted, or blocking for a long time, since it knows that
binuptime() etc. will keep using a previous generation safely and not
busy-wait for it, provided only that it doesn't block for so long that
the oldest previous generation becomes too old to work. 2 timehands are
probably enough for this, but 1 isn't.

> For my userland case I only export a single timehands copy.

So readers block for a long time if the writer is updating and the
writer blocks? Works best for UP :-).

Actually, there are problems in the kernel even for UP. Consider the
writer doing an update and being preempted by ddb, and ddb using
binuptime(), though it shouldn't. This is deadlock if there is only 1
timehands. My version runs the update as a normal interrupt handler so
that it can be interrupted by fast interrupt handlers.
This gives similar problems -- fast interrupt handlers shouldn't call
binuptime() either (this can deadlock in the timecounter hardware
function for at least the i8254 timecounter), but they do and this is
useful for things like timestamps from serial hardware. Multiple
timehands at least limit this problem. Applications have similar
problems (more like my kernel version since applications can't get as
exclusive access as a fast interrupt handler can).

>>>> rdtsc is also very unportable, even on CPUs that have it. But all other
>>>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>>>> and as accurate as it is now.
>
> For all the hardware where people run mysql and similar software that calls
> getimeofday() a lot, rdtsc() works just fine.

That wasn't the case until recently (except 10-15 years ago for UP with
no SMM). Someone just fixed the rdtsc()-based time function in dtrace.
It tries to add a per-cpu rdtsc() offset, but the offset was backwards.
It takes P-state invariance and maybe more for the offset to be 0 and
not drift.

>>> !rdtsc hardware is probably cannot be used at all due to need to provide
>>> usermode access to device registers. The mere presence of rdtsc does not
>>> means that usermode indeed can use it, it should be decided by kernel
>>> based on the current in-kernel time source. If rdtsc is not usable, the
>>> corresponding data should not be exported, or implementation should go
>>> directly into syscall or whatever.
>
> Yes, the patches I have only work if the kernel uses the TSC as its main
> timecounter as well.

The detail I miss most is the TSC being available for use in userland
even if it is not the primary timecounter. Maybe its quality is enough
for the application, or the application can fix it up using per-cpu
offsets.

Bruce
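A minimal sketch of why a second timehands slot lets readers keep using a previous generation instead of spinning on the writer. It assumes a single writer and C11 atomics. The names clock_sketch, hands_sketch, clock_init, clock_update and clock_read are hypothetical, and the slot carries only an offset rather than the full timehands contents.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NHANDS 2    /* the message argues 2 are probably enough and 1 is not */

/* Hypothetical, cut-down timehands slot holding only an offset. */
struct hands_sketch {
    _Atomic uint32_t gen;   /* 0 while this slot is being rewritten */
    _Atomic int64_t  sec;
    _Atomic uint64_t frac;
};

struct clock_sketch {
    struct hands_sketch ring[NHANDS];
    _Atomic uint32_t current;   /* index of the slot readers should start from */
};

void
clock_init(struct clock_sketch *c, int64_t sec, uint64_t frac)
{
    for (int i = 0; i < NHANDS; i++) {
        atomic_init(&c->ring[i].gen, (uint32_t)(i == 0 ? 1 : 0));
        atomic_init(&c->ring[i].sec, i == 0 ? sec : (int64_t)0);
        atomic_init(&c->ring[i].frac, i == 0 ? frac : (uint64_t)0);
    }
    atomic_init(&c->current, (uint32_t)0);
}

/* Single writer: rewrite the slot readers are NOT using, then publish it. */
void
clock_update(struct clock_sketch *c, int64_t sec, uint64_t frac)
{
    uint32_t nxt = (atomic_load_explicit(&c->current,
        memory_order_relaxed) + 1) % NHANDS;
    struct hands_sketch *th = &c->ring[nxt];
    uint32_t gen = atomic_load_explicit(&th->gen, memory_order_relaxed);

    /* Mark the slot invalid before touching its fields. */
    atomic_store_explicit(&th->gen, (uint32_t)0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&th->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&th->frac, frac, memory_order_relaxed);
    if (++gen == 0)     /* skip 0, which means "being rewritten" */
        gen = 1;
    atomic_store_explicit(&th->gen, gen, memory_order_release);
    atomic_store_explicit(&c->current, nxt, memory_order_release);
}

/* Reader: retries only if the slot it picked is rewritten under it. */
void
clock_read(struct clock_sketch *c, int64_t *sec, uint64_t *frac)
{
    struct hands_sketch *th;
    uint32_t gen;

    do {
        th = &c->ring[atomic_load_explicit(&c->current,
            memory_order_acquire)];
        gen = atomic_load_explicit(&th->gen, memory_order_acquire);
        *sec = atomic_load_explicit(&th->sec, memory_order_relaxed);
        *frac = atomic_load_explicit(&th->frac, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while (gen == 0 ||
        gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

With NHANDS set to 1 the writer would zero the generation of the only slot, so a reader arriving while the writer is blocked would spin until the writer finishes. With 2 slots the reader keeps getting a stale but consistent value from the previous slot.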