Date:      Mon, 4 Jun 2012 17:30:05 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Gianni <gianni@freebsd.org>, Alan Cox <alc@rice.edu>, Alexander Kabaev <kan@freebsd.org>, Attilio Rao <attilio@freebsd.org>, Konstantin Belousov <kib@freebsd.org>, freebsd-arch@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: Fwd: [RFC] Kernel shared variables
Message-ID:  <201206041730.05478.jhb@freebsd.org>
In-Reply-To: <20120605054930.H3236@besplex.bde.org>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <201206041101.57486.jhb@freebsd.org> <20120605054930.H3236@besplex.bde.org>

On Monday, June 04, 2012 4:51:00 pm Bruce Evans wrote:
> On Mon, 4 Jun 2012, John Baldwin wrote:
> > On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
> >> On Sun, 3 Jun 2012, Konstantin Belousov wrote:
> >>> What are timehands offsets?  Do you mean things like leap seconds?
> >>
> >> Yes.  binuptime() is:
> >>
> >> % void
> >> % binuptime(struct bintime *bt)
> >> % {
> >> % 	struct timehands *th;
> >> % 	u_int gen;
> >> %
> >> % 	do {
> >> % 		th = timehands;
> >> % 		gen = th->th_generation;
> >> % 		*bt = th->th_offset;
> >> % 		bintime_addx(bt, th->th_scale * tc_delta(th));
> >> % 	} while (gen == 0 || gen != th->th_generation);
> >> % }
> >>
> >> Without the kernel providing th->th_offset, you have to do lots of ntp
> >> handling for yourself (compatibly with the kernel) just to get an
> >> accuracy of 1 second.  Leap seconds don't affect CLOCK_MONOTONIC, but
> >> they do affect CLOCK_REALTIME which is the clock id used by
> >> gettimeofday().  For the former, you only have to advance the offset
> >> for yourself occasionally (compatibly with the kernel) and manage
> >> (compatibly with the kernel, especially in the long term) ntp slewing
> >> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
> >
> > I think duplicating this logic in userland would just be wasteful.  I have
> 
> Sure.  I modestly proposed it.
> 
> > a private fast gettimeofday() at my current job and it works by exporting
> > the current timehands structure (well, the equivalent) to userland.  The
> > userland bits then fetch a copy of the details and do the same as bintime().
> 
> How do you keep this up to date, especially for leap seconds?

I added a hack to tc_windup() where it updates the shared copy of the variables
with the results of the tc_windup() call each time it is invoked.

> My version has a comment saying to do that, but I just noticed that
> it wouldn't work so well -- the timehands fields would have to be
> copied to local variables while under protection of the generation
> count, so it would give messier code to optimize a case that occurs
> _very_ rarely.

It's not that messy in my experience.

> >> timehands in a shared page is close to working.  th_generation protects
> >> things in the same way as in the kernel, modulo assumptions that writes
> >> are ordered.
> >
> > It would work fine.  And in fact, having multiple timehands is actually a
> > bug, not a feature.  It lets you compute bogus timestamps if you get preempted
> > at the wrong time and end up with time jumping around.  At Yahoo! we reduced
> > the number of timehands structures down to 2 or some such, and I'm now of
> > the opinion we should just have one and dispense with the entire array.
> 
> No, it is a feature.  The time should never jump around (backwards), but
> it can easily jump forwards.  It makes little difference if preemption
> occurs after the timehands have been read, or while reading them but in
> such a way that the timehands become stale during preemption but not stale
> enough for their generation to change so that you notice that they are
> stale -- you get a stale timestamp either way (with staleness approximately
> the preemption time).  Times read by different threads can easily have
> different staleness according to which timehands they ended up using and
> this may be quite different from which timehands they started using and
> from which timehands is active after they return.  Perhaps this is what
> you mean.  But again, this happens anyway when the preemption occurs after
> the timehands have been read.

Time definitely jumped backwards at Yahoo!.  The problem case was when NTP
was adjusting the time.  If you used a timehands structure that was a
few generations old (stale), the (delta * scale) component could be fairly
large.  If the scale had been reduced in subsequent updates, the time
computed from the stale timehands would jump out into the future.  On the
next calculation with a newer timehands, the effective base was less than
the previous calculation assumed and the scale was smaller, so if the TSC
had not advanced very far, the new time came out less than the previous
time, and thus time jumped backwards.
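
The arithmetic can be made concrete with a small worked example.  This is a
simplified model, not kernel code: the struct, the function names, the
picoseconds-per-tick integer scale and all of the numbers are assumptions
chosen for illustration.  One reader uses a stale snapshot with the old
(larger) scale, and a later reader uses a snapshot taken after the scale was
reduced.

```c
#include <stdint.h>

/* Hypothetical, simplified timehands snapshot (integer picoseconds). */
struct th_snapshot {
	uint64_t offset_ps;	/* time at the moment of the windup */
	uint64_t ps_per_tick;	/* scale in effect after the windup */
	uint64_t base_count;	/* counter value at the windup */
};

/* offset + scale * delta, as in binuptime(). */
uint64_t
th_time_ps(const struct th_snapshot *th, uint64_t count)
{
	return (th->offset_ps + th->ps_per_tick * (count - th->base_count));
}

/* Advance the offset with the old scale, then install the new scale. */
struct th_snapshot
th_windup(const struct th_snapshot *old, uint64_t count,
    uint64_t new_ps_per_tick)
{
	struct th_snapshot nth;

	nth.offset_ps = th_time_ps(old, count);
	nth.ps_per_tick = new_ps_per_tick;
	nth.base_count = count;
	return (nth);
}

/*
 * Returns how far (in ps) the later reading lands behind the earlier one.
 * A positive result means time went backwards between the two readings.
 */
int64_t
backwards_jump_ps(void)
{
	struct th_snapshot stale_th = { 0, 1000, 0 };	/* 1000 ps/tick */
	/* NTP slows the clock slightly at count 1,000,000: 999 ps/tick. */
	struct th_snapshot fresh_th = th_windup(&stale_th, 1000000, 999);

	/* First reader still holds the stale snapshot at count 5,000,000. */
	uint64_t first = th_time_ps(&stale_th, 5000000);
	/* Second reader, 100 ticks later, uses the fresh snapshot. */
	uint64_t second = th_time_ps(&fresh_th, 5000100);

	return ((int64_t)first - (int64_t)second);
}
```

With these numbers the first reading is 5,000,000,000 ps and the second is
4,996,099,900 ps, so the later reading is about 3.9 microseconds earlier.
The gap grows with how long the stale snapshot is used after the scale
changes.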

> The main point of timehands was originally to give a copy of the time
> that was stable for a time hopefully long enough for the timehands to be
> read without them being clobbered by an update.  binuptime() was:
> 
> 1.59         (phk      26-Mar-98): void
> 1.113        (phk      07-Feb-02): binuptime(struct bintime *bt)
> 1.113        (phk      07-Feb-02): {
> 1.113        (phk      07-Feb-02): 	struct timecounter *tc;
> 1.113        (phk      07-Feb-02): 
> 1.113        (phk      07-Feb-02): 	tc = timecounter;
> 1.113        (phk      07-Feb-02): 	*bt = tc->tc_offset;
> 1.113        (phk      07-Feb-02): 	bintime_addx(bt, tc->tc_scale * tco_delta(tc));
> 1.113        (phk      07-Feb-02): }
> 
> This has an obvious race if the thread running this is preempted for a long
> time, so that the copy of the time is actually not stable for long enough.
> This was fixed (except I think in some cases using ddb) by using the
> generation count.

The problem with having too many timehands structures is you can get a stable
timehands structure that is too stale.

> > For my userland case I only export a single timehands copy.
> 
> So readers block for a long time if the writer is updating and the
> writer blocks?  Works best for UP :-).

The update to the shared timehands structure does not take a long time.
Specifically, for userland it does not require all of tc_windup()'s
execution time, merely the time to store the new values.

> > For all the hardware where people run mysql and similar software that calls
> > gettimeofday() a lot, rdtsc() works just fine.
> 
> That wasn't the case until recently (except 10-15 years ago for UP with
> no SMM).  Someone just fixed the rdtsc()-based time function in dtrace.  It
> tries to add a per-cpu rdtsc() offset, but the offset was backwards.  It
> takes P-state invariance and maybe more for the offset to be 0 and
> not drift.

I do have the luxury of using fairly modern Intel CPUs at work, and all of them
have invariant TSCs.

-- 
John Baldwin


