Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 20 Jan 2014 15:22:23 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-virtualization@freebsd.org
Cc:        Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Subject:   Re: KVM Clock
Message-ID:  <3298215.G6vx8QjdMB@pippin.baldwin.cx>
In-Reply-To: <1389805150.16498.18.camel@janus.xn--pl-wia.net>
References:  <lb3jnb$qo8$1@ger.gmane.org> <52D6B18D.4080204@citrix.com> <1389805150.16498.18.camel@janus.xn--pl-wia.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wednesday 15 January 2014 17:59:10 Julian Stecklina wrote:
> On Mi, 2014-01-15 at 17:04 +0100, Roger Pau Monn=E9 wrote:
> > On 15/01/14 14:45, Julian Stecklina wrote:
> > > On Mi, 2014-01-15 at 14:08 +0100, Roger Pau Monn=E9 wrote:
> > >> On 15/01/14 13:05, Julian Stecklina wrote:
> > >>> On 01/14/2014 05:13 PM, Peter Grehan wrote:
> > >>>> Hi Julian,
> > >>>>=20
> > >>>>> is anyone working on KVM clock support for FreeBSD? If not, I=

> > >>>>> might take a shot at it.
> > >>>>=20
> > >>>> None I know of: go for it :)
> > >>>=20
> > >>> Works for me so far:
> > >>> https://github.com/blitz/freebsd/commit/cdc5f872b3e48cc0dda031f=
c7d6bde
> > >>> dc65c3148f> >>=20
> > >> Looking
> > >>=20
> > >> at the code it seems some common routines could be shared
> > >> between the KVM PV clock and the Xen PV clock
> > >> (sys/dev/xen/timer/timer.c). The data passed from the hypervisor=
 to
> > >> the guest has exactly the same format (see struct vcpu_time_info=
 in
> > >> Xen public headers).
> > >=20
> > > Yes, I know. Didn't get around to making it pretty yesterday even=
ing. ;)
> > > I'll post an updated patch, when I have some time. Any suggestion=
s where
> > > to put the two common functions?
> > >=20
> > >> At a first sight the KVM clock can benefit from using scale_delt=
a
> > >> (which is going to be faster than the C version implemented in
> > >> kvmclock_get_timecount),
> > >=20
> > > At least somewhat on amd64. 32-bit looks pretty identical.
> > >=20
> > >>  and xen_fetch_vcpu_tinfo is exactly the same
> > >>=20
> > >> as kvmclock_fetch.
> > >=20
> > > I think xen_fetch_vcpu_tinfo has a subtle bug:
> > >   217         do {
> > >   218                 dst->version =3D src->version;
> > >   219                 rmb();
> > >   220                 dst->tsc_timestamp =3D src->tsc_timestamp;
> > >   221                 dst->system_time =3D src->system_time;
> > >   222                 dst->tsc_to_system_mul =3D src->tsc_to_syst=
em_mul;
> > >   223                 dst->tsc_shift =3D src->tsc_shift;
> > >   224                 rmb();
> > >   225         } while ((src->version & 1) | (dst->version ^
> > >=20
> > > src->version));
> > >=20
> > > In line 225 src->version is potentially read twice. If you consid=
er the
> > > following schedule:
> > >=20
> > > host starts updating data, sets src->version to 1
> > > guest reads src->version (1) and stores it into dst->version.
> > > guest copies inconsistent data
> > > guest reads src->version (1) and computes xor with dst->version.
> > > host finishes updating data and sets src->version to 2
> > > guest reads src->version (2) and checks whether lower bit is not =
set.
> > > while loop exits with inconsistent data!
> > >=20
> > > I think the C standard (at least C11) permits this as it does not=

> > > specify in which order the two reads in line 225 need to happen.
> >=20
> > According to the operator precedence and associativity in C
> > (http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B#Operator_p=
receden
> > ce), if I'm reading it right, the condition in the while line will =
be
> > evaluated as follows (because of the left-to-right associativity of=
 the
> >=20
> > | operator):
> > 1. (src->version & 1)
> > 2. (dst->version ^ src->version)
> > 3. result of 1 | result of 2
> >=20
> > So AFAICT the flow that you describe could never happen, because
> > (src->version & 1) is always done before (dst->version ^ src->versi=
on).
>=20
> Operator precedence doesn't matter. Consider it written like this:
>=20
> or(and(src->version, 1), xor(dst->version, src->version))
>=20
> C gives you no guarantees whether the and or the xor will be executed=

> first. There is no sequence point in between. And even with a sequenc=
e
> point, the C11 memory model gives you no guarantees about how the rea=
ds
> are ordered.
>=20
> This discussion is somewhat theoretical, because
>  a) the hypervisor will probably update the data structure in the
> current vCPU context (making memory consistency issues go away).
>  b) the compiler will likely not be an asshole. ;)
>=20
> Pseudocode for a better fetch could be:
>=20
> dst->version =3D src->version;
> rmb();
> ... read data ...
> rmb();
> version_post =3D src->version;
> rmb(); <- keeps the compiler from reading src->version multiple times=

>=20
> and then doing the test with version_post and dst->version.
> Unfortunately, rmb() expands into lfence, even if there is no need fo=
r
> that here.

There is the __compiler_membar() macro in <sys/cdefs.h> that you could =
use if=20
this code is x86-specific (and thus knows it only needs a compiler barr=
ier).

--=20
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3298215.G6vx8QjdMB>