Date:      Fri, 22 Jun 2012 13:46:26 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Alexander Motin <mav@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r237433 - in head/sys: amd64/include arm/include conf i386/include ia64/include kern mips/include pc98/include powerpc/include sparc64/include sys x86/include x86/x86
Message-ID:  <20120622104626.GE2337@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120622102342.GD2337@deviant.kiev.zoral.com.ua>
References:  <201206220706.q5M76fbO062751@svn.freebsd.org> <4FE42812.3050807@FreeBSD.org> <20120622082502.GB2337@deviant.kiev.zoral.com.ua> <4FE432C4.7000608@FreeBSD.org> <20120622102342.GD2337@deviant.kiev.zoral.com.ua>

On Fri, Jun 22, 2012 at 01:23:42PM +0300, Konstantin Belousov wrote:
> On Fri, Jun 22, 2012 at 11:54:28AM +0300, Alexander Motin wrote:
> > On 22.06.2012 11:25, Konstantin Belousov wrote:
> > >On Fri, Jun 22, 2012 at 11:08:50AM +0300, Alexander Motin wrote:
> > >>On 06/22/12 10:06, Konstantin Belousov wrote:
> > >>>Author: kib
> > >>>Date: Fri Jun 22 07:06:40 2012
> > >>>New Revision: 237433
> > >>>URL: http://svn.freebsd.org/changeset/base/237433
> > >>>
> > >>>Log:
> > >>>   Implement mechanism to export some kernel timekeeping data to
> > >>>   usermode, using shared page.  The structures and functions have vdso
> > >>>   prefix, to indicate the intended location of the code in some future.
> > >>>
> > >>>   The versioned per-algorithm data is exported in the format of struct
> > >>>   vdso_timehands, which mostly repeats the content of in-kernel struct
> > >>>   timehands. Usermode reading of the structure can be lockless.
> > >>>   Compatibility export for 32bit processes on 64bit host is also
> > >>>   provided. Kernel also provides usermode with indication about
> > >>>   currently used timecounter, so that libc can fall back to syscall if
> > >>>   configured timecounter is unknown to usermode code.
> > >>>
> > >>>   The shared data updates are initiated both from the tc_windup(), where
> > >>>   a fast task is queued to do the update, and from sysctl handlers which
> > >>>   change timecounter. A manual override switch
> > >>>   kern.timecounter.fast_gettime allows to turn off the mechanism.
> > >>>
> > >>>   Only x86 architectures export the real algorithm data, and there, only
> > >>>   for tsc timecounter. HPET counters page could be exported as well, but
> > >>>   I prefer to not further glue the kernel and libc ABI there until
> > >>>   proper vdso-based solution is developed.
> > >>>
> > >>>   Minimal stubs necessary for non-x86 architectures to still compile
> > >>>   are provided.
> > >>>
> > >>>   Discussed with:	bde
> > >>>   Reviewed by:	jhb
> > >>>   Tested by:	flo
> > >>>   MFC after:	1 month
> > >>
> > >>
> > >>>@@ -1360,6 +1367,7 @@ tc_windup(void)
> > >>>  #endif
> > >>>
> > >>>  	timehands = th;
> > >>>+	taskqueue_enqueue_fast(taskqueue_fast,&tc_windup_push_vdso_task);
> > >>>  }
> > >>>
> > >>>  /* Report or change the active timecounter hardware. */
> > >>
> > >>This taskqueue_enqueue_fast() will schedule extra thread to run each
> > >>time hardclock() fires. One thread may be not a big problem, but
> > >>together with callout swi and possible other threads woken up there it
> > >>will wake up several other CPU cores from sleep just to put them back in
> > >>few microseconds. Now davide@ and me are trying to fix that by avoiding
> > >>callout SWI use for simple tasks. Please, let's not create another
> > >>problem same time.
> > >
> > >The patch was public for quite a time. If you have some comments about
> > >it, it would be much more productive to let me know about them before
> > >the commit, not after.
> >
> > I'm sorry, I haven't seen it. My bad.
> >
> > >Anyway, what is your proposal for 'let's not create another problem
> > >same time' part of the message ? It was discussed, as a possibility,
> > >to have permanent mapping for the shared page in the KVA and to perform
> > >lock-less update of the struct vdso_timehands directly from hardclock
> > >handler. My opinion was that amount of work added by tc_windup
> > >eventhandler is not trivial, so it is better to be postponed to
> > >less critical context. It also slightly more safe to not perform
> > >lockless update for vdso_timehands, since otherwise module load which
> > >register exec handler could cause transient gettimeofday() failure
> > >in usermode.
> > >
> > >This might boil down to the fact that tc_windup function is called
> > >too often, in fact. Also, packing execution of tc_windup eventhandler
> > >together with the clock swi is fine from my POV.
> >
> > I have nothing against using shared pages. On UP system I would probably
> > have not so much against several threads. But on SMP system it will
> > cause at least one, but in many cases two extra CPUs to be woken up.
> > There are two or more threads to run on hardclock(): this taskqueue,
> > callout swi and some thread(s) woken from callouts. Scheduler has no
> > idea how heavy they are. So it will try to move each of them to separate
> > idle CPU. Does the amount of work done in event handlers worth extra
> > Watts consumed by rapidly waking CPUs? As quite rare person running
> > FreeBSD on laptop, I am not sure. I am not sure even that on
> > desktop/server this won't kill all benefit of fast clocks by limiting
> > TurboBoost.
>
> Patch below would probably work, but I cannot test it right now on real
> hardware due to ACPI issue. It worked for me in qemu.
>
> commit 4f2ffd93b36d20eae61495776fc6b0855745fd7f
> Author: Konstantin Belousov <kib@freebsd.org>
> Date:   Fri Jun 22 13:19:22 2012 +0300
>
>     Use persistent kernel mapping of the shared page, and update the
>     vdso_timehands from hardclock, instead of scheduling task.

Slightly improved version. Since tc_fill_vdso_timehands is now
called from hardclock context, there is no need to spin waiting for
a valid current generation of timehands.
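
For readers less familiar with the generation scheme, the following is
a minimal userland sketch of the idea, written with C11 atomics rather
than the kernel's atomic_store_rel_32(). It is not the code from the
patch: the names th_slot, timekeep, tk_publish and tk_read are
hypothetical, and the structures are far smaller than the real
vdso_timekeep. It assumes a single writer that cannot be re-entered,
which is why the writer never loops, while a reader that may race with
an update still has to retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define	TH_NUM	4	/* hypothetical; plays the role of VDSO_TH_NUM */

/* Hypothetical, simplified stand-ins for the exported structures. */
struct th_slot {
	_Atomic uint32_t	gen;	/* 0 marks a slot being rewritten */
	_Atomic uint64_t	scale;
	_Atomic uint64_t	offset_count;
};

struct timekeep {
	_Atomic uint32_t	current;	/* index of the newest valid slot */
	struct th_slot		th[TH_NUM];
};

/*
 * Writer side.  Assumes a single updater that cannot be re-entered,
 * so it never waits for anything.
 */
static void
tk_publish(struct timekeep *tk, uint32_t *genp, uint64_t scale,
    uint64_t offset_count)
{
	struct th_slot *s;
	uint32_t idx;

	assert(tk != NULL && genp != NULL);
	idx = (atomic_load_explicit(&tk->current, memory_order_relaxed) + 1) %
	    TH_NUM;
	s = &tk->th[idx];

	/* Invalidate the slot before touching its payload. */
	atomic_store_explicit(&s->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&s->offset_count, offset_count,
	    memory_order_relaxed);

	/* Generation 0 is reserved, so skip it on wraparound. */
	if (++*genp == 0)
		*genp = 1;
	atomic_store_explicit(&s->gen, *genp, memory_order_release);
	atomic_store_explicit(&tk->current, idx, memory_order_release);
}

/*
 * Reader side.  Retries while the slot is being rewritten (gen == 0)
 * or was rewritten during the copy (gen changed).  It spins forever
 * if nothing was ever published; a real consumer would bound this.
 */
static void
tk_read(struct timekeep *tk, uint64_t *scale, uint64_t *offset_count)
{
	struct th_slot *s;
	uint32_t gen;

	do {
		s = &tk->th[atomic_load_explicit(&tk->current,
		    memory_order_acquire)];
		gen = atomic_load_explicit(&s->gen, memory_order_acquire);
		*scale = atomic_load_explicit(&s->scale, memory_order_relaxed);
		*offset_count = atomic_load_explicit(&s->offset_count,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&s->gen, memory_order_relaxed));
}
```

A caller would keep one zero-initialized struct timekeep and one
generation counter, call tk_publish() from the single update path, and
call tk_read() from any number of readers. Rotating through several
slots means a reader that picked up the previous index usually still
sees a stable generation and does not need to retry.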


diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c
index 80502e3..9365223 100644
--- a/sys/kern/kern_exec.c
+++ b/sys/kern/kern_exec.c
@@ -1517,42 +1517,13 @@ exec_unregister(execsw_arg)
 static struct sx shared_page_alloc_sx;
 static vm_object_t shared_page_obj;
 static int shared_page_free;
-
-struct sf_buf *
-shared_page_write_start(int base)
-{
-	vm_page_t m;
-	struct sf_buf *s;
-
-	VM_OBJECT_LOCK(shared_page_obj);
-	m = vm_page_grab(shared_page_obj, OFF_TO_IDX(base), VM_ALLOC_RETRY);
-	VM_OBJECT_UNLOCK(shared_page_obj);
-	s = sf_buf_alloc(m, SFB_DEFAULT);
-	return (s);
-}
-
-void
-shared_page_write_end(struct sf_buf *sf)
-{
-	vm_page_t m;
-
-	m = sf_buf_page(sf);
-	sf_buf_free(sf);
-	VM_OBJECT_LOCK(shared_page_obj);
-	vm_page_wakeup(m);
-	VM_OBJECT_UNLOCK(shared_page_obj);
-}
+char *shared_page_mapping;
 
 void
 shared_page_write(int base, int size, const void *data)
 {
-	struct sf_buf *sf;
-	vm_offset_t sk;
 
-	sf = shared_page_write_start(base);
-	sk = sf_buf_kva(sf);
-	bcopy(data, (void *)(sk + (base & PAGE_MASK)), size);
-	shared_page_write_end(sf);
+	bcopy(data, shared_page_mapping + base, size);
 }
 
 static int
@@ -1596,6 +1567,7 @@ static void
 shared_page_init(void *dummy __unused)
 {
 	vm_page_t m;
+	vm_offset_t addr;
 
 	sx_init(&shared_page_alloc_sx, "shpsx");
 	shared_page_obj = vm_pager_allocate(OBJT_PHYS, 0, PAGE_SIZE,
@@ -1605,25 +1577,24 @@ shared_page_init(void *dummy __unused)
 	    VM_ALLOC_ZERO);
 	m->valid = VM_PAGE_BITS_ALL;
 	VM_OBJECT_UNLOCK(shared_page_obj);
+	addr = kmem_alloc_nofault(kernel_map, PAGE_SIZE);
+	pmap_qenter(addr, &m, 1);
+	shared_page_mapping = (char *)addr;
 }
 
 SYSINIT(shp, SI_SUB_EXEC, SI_ORDER_FIRST, (sysinit_cfunc_t)shared_page_init,
     NULL);
 
 static void
-timehands_update(void *arg)
+timehands_update(struct sysentvec *sv)
 {
-	struct sysentvec *sv;
-	struct sf_buf *sf;
 	struct vdso_timehands th;
 	struct vdso_timekeep *tk;
 	uint32_t enabled, idx;
 
-	sv = arg;
-	sx_xlock(&shared_page_alloc_sx);
 	enabled = tc_fill_vdso_timehands(&th);
-	sf = shared_page_write_start(sv->sv_timekeep_off);
-	tk = (void *)(sf_buf_kva(sf) + (sv->sv_timekeep_off & PAGE_MASK));
+	tk = (struct vdso_timekeep *)(shared_page_mapping +
+	    sv->sv_timekeep_off);
 	idx = sv->sv_timekeep_curr;
 	atomic_store_rel_32(&tk->tk_th[idx].th_gen, 0);
 	if (++idx >= VDSO_TH_NUM)
@@ -1637,25 +1608,19 @@ timehands_update(void *arg)
 	tk->tk_enabled = enabled;
 	atomic_store_rel_32(&tk->tk_th[idx].th_gen, sv->sv_timekeep_gen);
 	tk->tk_current = idx;
-	shared_page_write_end(sf);
-	sx_xunlock(&shared_page_alloc_sx);
 }
 
 #ifdef COMPAT_FREEBSD32
 static void
-timehands_update32(void *arg)
+timehands_update32(struct sysentvec *sv)
 {
-	struct sysentvec *sv;
-	struct sf_buf *sf;
 	struct vdso_timekeep32 *tk;
 	struct vdso_timehands32 th;
 	uint32_t enabled, idx;
 
-	sv = arg;
-	sx_xlock(&shared_page_alloc_sx);
 	enabled = tc_fill_vdso_timehands32(&th);
-	sf = shared_page_write_start(sv->sv_timekeep_off);
-	tk = (void *)(sf_buf_kva(sf) + (sv->sv_timekeep_off & PAGE_MASK));
+	tk = (struct vdso_timekeep32 *)(shared_page_mapping +
+	    sv->sv_timekeep_off);
 	idx = sv->sv_timekeep_curr;
 	atomic_store_rel_32(&tk->tk_th[idx].th_gen, 0);
 	if (++idx >= VDSO_TH_NUM)
@@ -1669,11 +1634,32 @@ timehands_update32(void *arg)
 	tk->tk_enabled = enabled;
 	atomic_store_rel_32(&tk->tk_th[idx].th_gen, sv->sv_timekeep_gen);
 	tk->tk_current = idx;
-	shared_page_write_end(sf);
-	sx_xunlock(&shared_page_alloc_sx);
 }
 #endif
 
+/*
+ * This is hackish, but easiest way to avoid creating list structures
+ * that needs to be iterated over from the hardclock interrupt
+ * context.
+ */
+static struct sysentvec *host_sysentvec;
+#ifdef COMPAT_FREEBSD32
+static struct sysentvec *compat32_sysentvec;
+#endif
+
+void
+timekeep_push_vdso(void)
+{
+
+	if (host_sysentvec != NULL && host_sysentvec->sv_timekeep_base != 0)
+		timehands_update(host_sysentvec);
+#ifdef COMPAT_FREEBSD32
+	if (compat32_sysentvec != NULL &&
+	    compat32_sysentvec->sv_timekeep_base != 0)
+		timehands_update32(compat32_sysentvec);
+#endif
+}
+
 void
 exec_sysvec_init(void *param)
 {
@@ -1688,29 +1674,32 @@ exec_sysvec_init(void *param)
 	sv->sv_shared_page_obj = shared_page_obj;
 	sv->sv_sigcode_base = sv->sv_shared_page_base +
 	    shared_page_fill(*(sv->sv_szsigcode), 16, sv->sv_sigcode);
+	if ((sv->sv_flags & SV_ABI_MASK) != SV_ABI_FREEBSD)
+		return;
 	tk_ver = VDSO_TK_VER_CURR;
 #ifdef COMPAT_FREEBSD32
 	if ((sv->sv_flags & SV_ILP32) != 0) {
 		tk_base = shared_page_alloc(sizeof(struct vdso_timekeep32) +
 		    sizeof(struct vdso_timehands32) * VDSO_TH_NUM, 16);
 		KASSERT(tk_base != -1, ("tk_base -1 for 32bit"));
-		EVENTHANDLER_REGISTER(tc_windup, timehands_update32, sv,
-		    EVENTHANDLER_PRI_ANY);
 		shared_page_write(tk_base + offsetof(struct vdso_timekeep32,
 		    tk_ver), sizeof(uint32_t), &tk_ver);
+		KASSERT(compat32_sysentvec == 0,
+		    ("Native compat32 already registered"));
+		compat32_sysentvec = sv;
 	} else {
 #endif
 		tk_base = shared_page_alloc(sizeof(struct vdso_timekeep) +
 		    sizeof(struct vdso_timehands) * VDSO_TH_NUM, 16);
 		KASSERT(tk_base != -1, ("tk_base -1 for native"));
-		EVENTHANDLER_REGISTER(tc_windup, timehands_update, sv,
-		    EVENTHANDLER_PRI_ANY);
 		shared_page_write(tk_base + offsetof(struct vdso_timekeep,
 		    tk_ver), sizeof(uint32_t), &tk_ver);
+		KASSERT(host_sysentvec == 0, ("Native already registered"));
+		host_sysentvec = sv;
 #ifdef COMPAT_FREEBSD32
 	}
 #endif
 	sv->sv_timekeep_base = sv->sv_shared_page_base + tk_base;
 	sv->sv_timekeep_off = tk_base;
-	EVENTHANDLER_INVOKE(tc_windup);
+	timekeep_push_vdso();
 }
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 0b8fefe..4a75af5 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -31,7 +31,6 @@ __FBSDID("$FreeBSD$");
 #include <sys/systm.h>
 #include <sys/timeffc.h>
 #include <sys/timepps.h>
-#include <sys/taskqueue.h>
 #include <sys/timetc.h>
 #include <sys/timex.h>
 #include <sys/vdso.h>
@@ -121,12 +120,8 @@ SYSCTL_INT(_kern_timecounter, OID_AUTO, stepwarnings, CTLFLAG_RW,
     &timestepwarnings, 0, "Log time steps");
 
 static void tc_windup(void);
-static void tc_windup_push_vdso(void *ctx, int pending);
 static void cpu_tick_calibrate(int);
 
-static struct task tc_windup_push_vdso_task = TASK_INITIALIZER(0,
-    tc_windup_push_vdso,  0);
-
 static int
 sysctl_kern_boottime(SYSCTL_HANDLER_ARGS)
 {
@@ -1367,7 +1362,7 @@ tc_windup(void)
 #endif
 
 	timehands = th;
-	taskqueue_enqueue_fast(taskqueue_fast, &tc_windup_push_vdso_task);
+	timekeep_push_vdso();
 }
 
 /* Report or change the active timecounter hardware. */
@@ -1394,7 +1389,7 @@ sysctl_kern_timecounter_hardware(SYSCTL_HANDLER_ARGS)
 		(void)newtc->tc_get_timecount(newtc);
 
 		timecounter = newtc;
-		EVENTHANDLER_INVOKE(tc_windup);
+		timekeep_push_vdso();
 		return (0);
 	}
 	return (EINVAL);
@@ -1865,7 +1860,7 @@ sysctl_fast_gettime(SYSCTL_HANDLER_ARGS)
 	if (error != 0)
 		return (error);
 	vdso_th_enable = old_vdso_th_enable;
-	EVENTHANDLER_INVOKE(tc_windup);
+	timekeep_push_vdso();
 	return (0);
 }
 SYSCTL_PROC(_kern_timecounter, OID_AUTO, fast_gettime,
@@ -1877,19 +1872,15 @@ tc_fill_vdso_timehands(struct vdso_timehands *vdso_th)
 {
 	struct timehands *th;
 	uint32_t enabled;
-	int gen;
 
-	do {
-		th = timehands;
-		gen = th->th_generation;
-		vdso_th->th_algo = VDSO_TH_ALGO_1;
-		vdso_th->th_scale = th->th_scale;
-		vdso_th->th_offset_count = th->th_offset_count;
-		vdso_th->th_counter_mask = th->th_counter->tc_counter_mask;
-		vdso_th->th_offset = th->th_offset;
-		vdso_th->th_boottime = boottimebin;
-		enabled = cpu_fill_vdso_timehands(vdso_th);
-	} while (gen == 0 || timehands->th_generation != gen);
+	th = timehands;
+	vdso_th->th_algo = VDSO_TH_ALGO_1;
+	vdso_th->th_scale = th->th_scale;
+	vdso_th->th_offset_count = th->th_offset_count;
+	vdso_th->th_counter_mask = th->th_counter->tc_counter_mask;
+	vdso_th->th_offset = th->th_offset;
+	vdso_th->th_boottime = boottimebin;
+	enabled = cpu_fill_vdso_timehands(vdso_th);
 	if (!vdso_th_enable)
 		enabled = 0;
 	return (enabled);
@@ -1901,30 +1892,19 @@ tc_fill_vdso_timehands32(struct vdso_timehands32 *vdso_th32)
 {
 	struct timehands *th;
 	uint32_t enabled;
-	int gen;
 
-	do {
-		th = timehands;
-		gen = th->th_generation;
-		vdso_th32->th_algo = VDSO_TH_ALGO_1;
-		*(uint64_t *)&vdso_th32->th_scale[0] = th->th_scale;
-		vdso_th32->th_offset_count = th->th_offset_count;
-		vdso_th32->th_counter_mask = th->th_counter->tc_counter_mask;
-		vdso_th32->th_offset.sec = th->th_offset.sec;
-		*(uint64_t *)&vdso_th32->th_offset.frac[0] = th->th_offset.frac;
-		vdso_th32->th_boottime.sec = boottimebin.sec;
-		*(uint64_t *)&vdso_th32->th_boottime.frac[0] = boottimebin.frac;
-		enabled = cpu_fill_vdso_timehands32(vdso_th32);
-	} while (gen == 0 || timehands->th_generation != gen);
+	th = timehands;
+	vdso_th32->th_algo = VDSO_TH_ALGO_1;
+	*(uint64_t *)&vdso_th32->th_scale[0] = th->th_scale;
+	vdso_th32->th_offset_count = th->th_offset_count;
+	vdso_th32->th_counter_mask = th->th_counter->tc_counter_mask;
+	vdso_th32->th_offset.sec = th->th_offset.sec;
+	*(uint64_t *)&vdso_th32->th_offset.frac[0] = th->th_offset.frac;
+	vdso_th32->th_boottime.sec = boottimebin.sec;
+	*(uint64_t *)&vdso_th32->th_boottime.frac[0] = boottimebin.frac;
+	enabled = cpu_fill_vdso_timehands32(vdso_th32);
 	if (!vdso_th_enable)
 		enabled = 0;
 	return (enabled);
 }
 #endif
-
-static void
-tc_windup_push_vdso(void *ctx, int pending)
-{
-
-	EVENTHANDLER_INVOKE(tc_windup);
-}
diff --git a/sys/sys/sysent.h b/sys/sys/sysent.h
index 22769c2..6de72d9 100644
--- a/sys/sys/sysent.h
+++ b/sys/sys/sysent.h
@@ -265,8 +265,6 @@ int shared_page_alloc(int size, int align);
 int shared_page_fill(int size, int align, const void *data);
 void shared_page_write(int base, int size, const void *data);
 void exec_sysvec_init(void *param);
-struct sf_buf *shared_page_write_start(int base);
-void shared_page_write_end(struct sf_buf *sf);
 
 #define INIT_SYSENTVEC(name, sv)					\
     SYSINIT(name, SI_SUB_EXEC, SI_ORDER_ANY,				\
diff --git a/sys/sys/vdso.h b/sys/sys/vdso.h
index 9f3f3af..653a606 100644
--- a/sys/sys/vdso.h
+++ b/sys/sys/vdso.h
@@ -29,7 +29,6 @@
 #define	_SYS_VDSO_H
 
 #include <sys/types.h>
-#include <sys/eventhandler.h>
 #include <machine/vdso.h>
 
 struct vdso_timehands {
@@ -74,6 +73,8 @@ u_int __vdso_gettc(const struct vdso_timehands *vdso_th);
 
 #ifdef _KERNEL
 
+void timekeep_push_vdso(void);
+
 uint32_t tc_fill_vdso_timehands(struct vdso_timehands *vdso_th);
 
 /*
@@ -86,9 +87,6 @@ uint32_t tc_fill_vdso_timehands(struct vdso_timehands *vdso_th);
  */
 uint32_t cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th);
 
-typedef void (*tc_windup_fn)(void *);
-EVENTHANDLER_DECLARE(tc_windup, tc_windup_fn);
-
 #define	VDSO_TH_NUM	4
 
 #ifdef COMPAT_FREEBSD32
