Date:      Thu, 7 Jun 2012 11:35:49 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, freebsd-arch@FreeBSD.org
Subject:   Re: Fast gettimeofday(2) and clock_gettime(2)
Message-ID:  <20120607084229.C1474@besplex.bde.org>
In-Reply-To: <201206061423.53179.jhb@freebsd.org>
References:  <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org>

On Wed, 6 Jun 2012, John Baldwin wrote:

> On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote:
>> A positive result from the recent flame-bait on arch@ is the working
>> implementation of the fast gettimeofday(2) and clock_gettime(2). The
>> speedup I see is around 6-7x on the 2600K. I think the speedup could
>> be even bigger on the previous generation of CPUs, where lock
>> operations and syscall entry are costlier. A sample test runs of
>> tools/tools/syscall_timing are presented at the end of message.
>
> In general this looks good but I see a few nits / races:

It is awefully (sic) complete and large.  The patch is almost twice as
large as the entire kern_tc.c in FreeBSD-4, and that was quite bloated.

> 1) You don't follow the model of clearing tk_current to 0 while you
>   are updating the structure that the in-kernel timecounter code
>   uses.  This also means you have to avoid using a tk_current of 0
>   and that userland has to keep spinning as long as tk_current is 0.
>   Without this I believe userland can read a partially updated
>   structure.

I thought that too at first, but after looking at the patch I decided
that it may be correct; it is just too hard for me to understand.
Urk, we both missed that tk_current is an index into the timehands
array, so it cannot act as a generation count, and it seems to be
harder to lock.
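
To make the distinction concrete: an index only says which slot to
read; it cannot tell you whether that slot was rewritten underneath
you, but a per-slot generation can.  A minimal sketch of the two
layouts, where th_gen is hypothetical (the posted patch has no such
field):

    struct vdso_timehands {
        uint32_t    th_gen;     /* hypothetical: 0 while the slot is
                                   being rewritten, else a nonzero
                                   generation that changes on reuse */
        /* ... scale, counter and offset fields ... */
    };

    struct vdso_timekeep {
        uint32_t    tk_current; /* an index: says which slot to read,
                                   not whether that slot is stable */
        struct vdso_timehands tk_th[VDSO_TH_NUM];
    };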

> 2) You read tk->tk_boottime without the tk_current protection in your
>   non-uptime routines.  This is racey as the kernel alters the
>   boottime when it skews time for large adjustments from ntp, etc.
>   To be really safe you need to read the boottime inside the loop
>   into a local variable and perhaps use a boolean parameter to decide
>   if you should add it to the computed uptime.

The critical problems seem to be mostly here:

+static void
+timehands_update(void *arg)
+{
+	struct sysentvec *sv;
+	struct vdso_timehands th;
+	uint32_t enabled, idx;
+
+	sv = arg;
+	sx_xlock(&shared_page_alloc_sx);
+	enabled = tc_fill_vdso_timehands(&th);

I think tc_windup() should just write to the shared page using the same
delicate order that it uses for its own variables now, but instead there
are callbacks and fill functions like this.

This fill function seems to be OK, since it copies to a local variable
and checks th_generation to get a consistent snapshot.  Now we have to
copy it to the shared page atomically.
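
For reference, the pattern that makes such a snapshot safe on the
kernel side is the usual kern_tc.c one (a sketch of that style, not
the patch's exact code):

    struct timehands *th;
    struct bintime bt;
    u_int gen;

    do {
        th = timehands;             /* grab the current pointer once */
        gen = th->th_generation;    /* 0 means an update is in progress */
        bt = th->th_offset;
        bintime_addx(&bt, th->th_scale * tc_delta(th));
        /* retry if the slot was mid-update or was recycled under us */
    } while (gen == 0 || gen != th->th_generation);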

+	idx = sv->sv_timekeep_curr;
+	if (++idx >= VDSO_TH_NUM)
+		idx = 0;
+	sv->sv_timekeep_curr = idx;
+	if (enabled) {
+		shared_page_write(sv->sv_timekeep_off +
+		    sizeof(struct vdso_timekeep) + idx *
+		    sizeof(struct vdso_timehands), sizeof(th), &th);
+	}

Now I seem to understand this.  It has race (1), as you said.  The
damage is limited by it copying to the (previously) oldest timehands,
which is unlikely to be in use: a user must have grabbed the pointer to
it 10-100 msec ago, been preempted, and still be using it.  But that is
precisely the corner case that the generation count is supposed to fix.
shared_page_write() is essentially bcopy(), so it writes non-atomically
and in no particular order.
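
The standard way to make a multi-word copy like this safe is the order
tc_windup() already uses for the kernel timehands: invalidate the
slot's generation, copy the payload, then publish a new nonzero
generation, with write barriers between the steps.  A sketch of that
order applied to the shared page, using the hypothetical th_gen from
above, a made-up slot_off for the slot's offset, and a made-up
kernel-private gen_counter:

    uint32_t zero, newgen;
    static uint32_t gen_counter;

    /* 1. Invalidate: a reader that sees th_gen == 0 spins. */
    zero = 0;
    shared_page_write(slot_off + offsetof(struct vdso_timehands, th_gen),
        sizeof(zero), &zero);
    wmb();
    /* 2. Copy the payload; its internal write order no longer matters. */
    th.th_gen = 0;
    shared_page_write(slot_off, sizeof(th), &th);
    wmb();
    /* 3. Publish with the next nonzero generation. */
    if (++gen_counter == 0)
        gen_counter++;
    newgen = gen_counter;
    shared_page_write(slot_off + offsetof(struct vdso_timehands, th_gen),
        sizeof(newgen), &newgen);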

+	shared_page_write(sv->sv_timekeep_off + offsetof(struct vdso_timekeep,
+	    tk_boottime), sizeof(struct bintime), &boottimebin);
+	shared_page_write(sv->sv_timekeep_off + offsetof(struct vdso_timekeep,
+	    tk_enabled), sizeof(uint32_t), &enabled);

Then more large variables are written non-atomically and in any order.
The kernel has bugs in this area too: tc_setclock() hacks on boottimebin
and then does an invalid (possibly concurrent) call to tc_windup().
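
One way to remove this particular race is to give each timehands slot
its own copy of the boottime, so that it is covered by the same
generation protection as the rest of the snapshot and there is no
separate tk_boottime to tear.  A sketch, continuing the hypothetical
layout from above:

    struct vdso_timehands {
        uint32_t        th_gen;         /* hypothetical, as above */
        /* ... */
        struct bintime  th_boottime;    /* per-slot copy, protected by
                                           th_gen like everything else */
    };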

+	wmb();

Then everything written above is forced to become visible before
anything written after this point.

+	shared_page_write(sv->sv_timekeep_off +
+	    offsetof(struct vdso_timekeep, tk_current), sizeof(uint32_t),
+	    &idx);

I don't understand this.  Why isn't it before the wmb(), or at least
done atomically?  Ah, it is tk_current.  Writing this atomically as 0 at
the start and then atomically here should be enough (no wmb()), except
for the problems with boottimebin.  Except that tk_current is actually
the timehands index and there is no timehands generation in userland.  I
don't understand this.

+	sx_xunlock(&shared_page_alloc_sx);
+}

The enabled flag should be cleared when the timecounter is switched
away from a TSC.  I can't see where that happens.  Also, things should
change if a TSC is switched to another one (TSC-low <-> TSC).  That is
a bit more delicate and not covered by the enabled flag.
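
A sketch of the minimal hook, assuming timehands_update() stays as-is:
call it from wherever the new timecounter is installed, so that
tc_fill_vdso_timehands() gets a chance to report the new hardware:

    /* Sketch only: in the code that installs a new timecounter. */
    timecounter = newtc;
    /*
     * Republish the shared page.  tc_fill_vdso_timehands() returns 0
     * for a counter userland cannot read, and timehands_update()
     * writes that out as tk_enabled = 0; a TSC-low <-> TSC switch
     * would likewise need to change th_algo or the published scale.
     */
    timehands_update(sv);   /* for each affected sysentvec */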

% +static int
% +binuptime(struct bintime *bt, struct vdso_timekeep *tk)
% +{
% +	struct vdso_timehands *th;
% +	uint32_t curr;
% +
% +	do {
% +		if (!tk->tk_enabled)
% +			return (ENOSYS);

This should not be acted on before the generation count stabilizes.

% +
% +		/*
% +		 * XXXKIB. The load of tk->tk_current should use
% +		 * atomic_load_acq_32 to provide load barrier. But
% +		 * since tk points to r/o mapped page, x86
% +		 * implementation of atomic_load_acq faults.
% +		 */
% +		curr = tk->tk_current;
% +		rmb();

Memory barriers are intentionally left out in the kernel version.  Isn't
the generation count enough, provided it is stored using atomic_rel?
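
The pairing I have in mind, sketched: a release store on the kernel
side orders the payload before the generation, and on the userland side
a plain load plus rmb() has to stand in for atomic_load_acq_32, since
(as the XXXKIB comment says) the x86 atomic_load_acq is implemented as
a locked read-modify-write and faults on the read-only mapping:

    /* Kernel (writer), after copying the payload into the slot: */
    atomic_store_rel_32(&th->th_gen, newgen);

    /* Userland (reader): */
    do {
        gen = th->th_gen;
        rmb();      /* order the gen load before the payload loads */
        /* ... copy the payload into locals ... */
        rmb();      /* order the payload loads before the re-check */
    } while (gen == 0 || gen != th->th_gen);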

% +		th = &tk->tk_th[curr];
% +		if (th->th_algo != VDSO_TH_ALGO_1)
% +			return (ENOSYS);

I don't like having 2 conditional tests.  One more than in the kernel
does seem to be needed, because all the timehands may become unusable
here (when the kernel timecounter hardware stops being a TSC, and this
happens after any previous userland check of the flags).

% +		*bt = th->th_offset;
% +		bintime_addx(bt, th->th_scale * tc_delta(th));
% +	} while (curr != tk->tk_current);

With generation counts, only this second access to the generation count
would need to be atomic.  If the first read was stale, then it differs
from this one and the loop retries.

% +	return (0);
% +}

It is a large regression to use the current index instead of the old
timehands.  The old timehands is stable for 10-100 msec after you load
it -- nothing in it, including its generation count, changes in that
time.  So the kernel version of the above loop almost never iterates
more than once -- it only iterates if it is preempted for 10-100 msec.
But using the current index, you see it change as soon as the kernel
updates it, and then you iterate, and you aren't protected by the
10-100 msec of time-based locking.  Even accessing the current index
requires more locking.
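
Concretely, the loop I would expect is keyed to the grabbed slot's own
generation, kernel-style, with the boottime read inside the loop (per
your point (2)); th_gen and th_boottime are the hypothetical per-slot
fields sketched above:

    static int
    binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
    {
        struct vdso_timehands *th;
        uint32_t gen;

        do {
            if (!tk->tk_enabled)
                return (ENOSYS);
            th = &tk->tk_th[tk->tk_current];
            gen = th->th_gen;   /* plain load + rmb(); see above */
            rmb();
            if (th->th_algo != VDSO_TH_ALGO_1)
                return (ENOSYS);
            *bt = th->th_offset;
            bintime_addx(bt, th->th_scale * tc_delta(th));
            if (abs)
                bintime_add(bt, &th->th_boottime);
            rmb();
            /*
             * This almost never retries: the slot we grabbed stays
             * stable for 10-100 msec after the kernel recycles the
             * index away from it.
             */
        } while (gen == 0 || gen != th->th_gen);
        return (0);
    }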

Bruce


