From owner-freebsd-arch@FreeBSD.ORG  Sat Jun  2 20:05:21 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 60D2C1065672;
	Sat,  2 Jun 2012 20:05:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au
	[211.29.133.51])
	by mx1.freebsd.org (Postfix) with ESMTP id E47598FC12;
	Sat,  2 Jun 2012 20:05:20 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q52K5Atd015942
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 3 Jun 2012 06:05:11 +1000
Date: Sun, 3 Jun 2012 06:05:10 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120602164847.GB2358@deviant.kiev.zoral.com.ua>
Message-ID: <20120603053445.Y3302@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@FreeBSD.org>, alc@FreeBSD.org,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	Alexander Kabaev <kan@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Jun 2012 20:05:21 -0000

On Sat, 2 Jun 2012, Konstantin Belousov wrote:

> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
>> ...
>> I have some reservations on both the implementation and the approach
>> for retrieving data from the page.
>> In particular, I don't like that a new vm_object is allocated for this
>> page. What I really would like would be:
>> 1) very minimal implementation -- you just use
>> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> fork(), execve(), etc. cases
> Oh, this simply cannot work.
>
>> 2) more complete approach -- you make a very quick layer which let you
>> map pages from the static image of the kernel and the shared page
>> becomes just a specific consumer of this. This way the object has much
>> more sense because it becomes an object associated to all the static
>> image of the kernel
> So you want to circumvent the vm layer.
>>
>> About the layering, I don't like that you require both a kernel and
>> userland header to locate the objects within the page. This is very
>> likely ABI breakage prone. A mechanism is needed for retrieving at
>> run time what Giovanni calls "indexes", or making it indexes-agnostic.
>
> And this is what VDSO is for. VDSO with the standard ELF symbol
> interposition rules allow to have libc that is completely unaware of the
> shared page and 'indexes', i.e. which works both for older kernel that
> do not export required index, and for new kernels that export the same
> information in some more advanced format. By having VDSO that exports

I have no strong ideas about the ABI issues.  Even shared libraries are
too large and complicated for me :-).

> e.g. gettimeofday() we would get override for libc gettimeofday, while
> having fully functional libc for other, future and past, kernels, even
> if the format of the data exported for super-fast gettimeofday changes.

Please no gettimeofday() for the example :-).

> As an additional note, already put by Bruce, the implementation of
> usermode gettimeofday is exactly opposite of any reasonable implementation.
> It loses the precision to the frequency of the event timer. Obvious
> approach is to not have any periodically updating data for gettimeofday
> purpose, and use some formula with rdtsc and kernel-provided coefficients
> on the machines where rdtsc is usable.

Actually, you can probably do gettimeofday() by exporting mounds of
execute-only and read-only kernel code and data in the shared
page(s).  The kernel code becomes just another way of implementing a
shared library that is especially good for syscalls.  It needs to run
with only user privilege.  x86 rdtsc normally has user privilege.  User
privilege for timecounter hardware in bus space would be problematic.
Actually^2, you only need a small amount of kernel code for this --
just microtime() and what it calls, with only the timecounter hardware
call being a problem.  The kernel maintains lots of not-quite-constant
timecounter state (primarily timehands offsets) that can be locked in
the time domain in the same way that it is in the kernel.
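
A rough sketch of what such time-domain locking could look like from
userland, in the style of the generation-count loop in the kernel's
binuptime().  Everything here is an assumption made for illustration:
the struct layout, the field names, and the sketch_* functions are
hypothetical and are not the kernel's actual timehands.  The counter is
passed in as a function pointer so that rdtsc (or a stand-in) can be
plugged in.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical read-only state that the kernel would export. */
struct shared_timehands {
	_Atomic uint32_t gen;	/* 0 while the kernel is updating */
	uint64_t offset_count;	/* counter value at the last update */
	uint64_t scale;		/* units of 2^-64 s per counter tick */
	uint64_t base_sec;	/* time at the last update, seconds */
	uint64_t base_frac;	/* and fraction, in units of 2^-64 s */
};

struct sketch_time {
	uint64_t sec;
	uint64_t frac;		/* units of 2^-64 s */
};

/* Stand-in for rdtsc so that the sketch is deterministic. */
static uint64_t sketch_fake_counter;

static uint64_t
sketch_read_counter(void)
{
	return (sketch_fake_counter);
}

/*
 * Read a consistent snapshot without taking any lock: retry if the
 * generation was 0 (update in progress) or changed while we read.
 * Assumes scale * delta does not overflow, i.e. updates are frequent.
 * A real implementation would also need the data fields themselves to
 * be read in a race-free way; plain loads are used here for brevity.
 */
static struct sketch_time
sketch_read_time(struct shared_timehands *th, uint64_t (*read_counter)(void))
{
	struct sketch_time t;
	uint64_t add, delta;
	uint32_t gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		delta = read_counter() - th->offset_count;
		add = th->scale * delta;
		t.sec = th->base_sec;
		t.frac = th->base_frac + add;
		if (t.frac < add)	/* carry out of the fraction */
			t.sec++;
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (t);
}
```

The reader never writes to the page, so the page can stay read-only;
only the kernel side bumps the generation around its updates.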

> Interesting question is how much shared the shared page needs be.
> Obvious needs are shared between all same-ABI processes, but I can also
> easily see a need for the per-process private information be present in
> the 'private-shared' page. For silly but typical example, useful for
> moronix-style benchmarks, see getpid().

Slightly better benchmarks use getppid() since the parent pid is not
quite constant so it can't easily be cached in userland.  But with
kernel read-only pages, it doesn't even need time domain locking,
since getppid() is inherently racy (the parent may go away before it
returns).
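
As a minimal sketch of that, assuming a hypothetical per-process
read-only page with a single ppid field (the struct, the sketch_page
pointer and sketch_getppid() are invented names, not an existing
interface), the userland side reduces to one load, with the ordinary
syscall as the fallback when no page is mapped:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout of the per-process 'private-shared' page. */
struct private_shared_page {
	_Atomic pid_t ppid;	/* kept up to date by the kernel */
};

/* NULL means that the kernel did not map such a page (assumption). */
static struct private_shared_page *sketch_page;

static pid_t
sketch_getppid(void)
{
	/* Older kernel: reduce to the same library code as now. */
	if (sketch_page == NULL)
		return (getppid());
	/*
	 * A single relaxed load suffices.  No generation count is
	 * needed since the value may be stale by the time the caller
	 * looks at it, just as with the real syscall.
	 */
	return (atomic_load_explicit(&sketch_page->ppid,
	    memory_order_relaxed));
}
```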

Lots of read-only syscalls that don't require privilege or much locking
could be implemented similarly.  All syscalls can be put in the shared
executable page(s), with most reducing to the same library code as now
to actually enter the kernel.  This is too large and complicated for me.

Bruce