From owner-freebsd-hackers  Thu Aug 29 15:24:21 2002
Message-ID: <3D6E9E94.41942024@mindspring.com>
Date: Thu, 29 Aug 2002 15:22:12 -0700
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Aaro J Koskinen <akoskine@cc.helsinki.fi>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: More dynamic KVA_SPACE
References: <Pine.OSF.4.30.0208292249080.14797-100000@sirppi.helsinki.fi>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Aaro J Koskinen wrote:
> > > Am I completely off the track? What are the main reasons behind the
> > > current KVM layout?
> >
> > Kernel code is not position independent.
> 
> Yes, I understand this. It seems I failed to explain what I meant. :-(
> 
> The reason for moving the kernel text, data and bss to the end of the
> virtual memory was that this is the "not position independent" part of
> the kernel that would be at fixed known virtual address and always be
> linked and loaded correctly. What I want is to resize the KVA space
> without breaking the linked code. Therefore I try to organize the KVM
> so that the position dependant part does not move if the KVA size
> changes.

When the bootstrap loads the kernel, it loads it to a physical
address.  This physical address ends up being the same as the
kernel's virtual address, once the kernel switches from real to
protected mode using a stack trampoline to return from non-PIC
real mode code to non-PIC protected mode code.  This code is in
btext(), e.g.:

        The address of the "begin" symbol is pushed onto the stack,
        and the btext() function returns ...not to its caller, but to
        the begin() function, in the relocated address space.

Basically, you aren't going to be able to separate these things
so you can load the kernel at the top of the address space, unless
you spend a huge amount of effort, and have the code copy itself
up to high physical memory, or make the bootstrap process much more
complex than it currently is (it's much more complex than it needs
to be, at least conceptually, already).

For this to work, the relocation address can no longer be simply
the physical load address OR'ed with a value equal to the start
of the KVA space.
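
To make the difference concrete, here is a rough sketch of the two
mappings.  The constants KVA_BASE and LOAD_PADDR and both function
names are made up for illustration; the real values come from the
kernel configuration and the link step, and none of this is lifted
from locore.s.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define KVA_BASE   0xC0000000u  /* hypothetical start of the KVA space */
#define LOAD_PADDR 0x00100000u  /* hypothetical physical load address  */

/*
 * Current scheme: the linked (virtual) address of a kernel location
 * is its physical load address OR'ed with the KVA base.
 */
static uint32_t
fixed_vaddr(uint32_t paddr)
{
	assert((paddr & KVA_BASE) == 0);  /* must lie below the KVA base */
	return (paddr | KVA_BASE);
}

/*
 * Proposed scheme: the image is mapped at a virtual base chosen
 * independently of the KVA base, so the translation becomes an
 * arbitrary offset instead of a fixed OR mask.
 */
static uint32_t
relocated_vaddr(uint32_t paddr, uint32_t image_vbase)
{
	assert(paddr >= LOAD_PADDR);
	return (image_vbase + (paddr - LOAD_PADDR));
}
```

Anything that today assumes the first form (the boot code, libkvm
consumers, modules) would have to learn the second.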

This has numerous implications.  For example, anything that uses
libkvm opens /dev/kmem and references memory locations zero
relative to the KVA space base address, using symbolic names
from the ELF symbol table attached to the image of the kernel
itself.  As a result, the base address is assumed to be the
base address relative to the start of a virtual 4G address space.

Kernel modules have the same issue with symbol references.


> The "dynamic" part of the KVA space would be below the kernel code, and
> expand towards the address 0, and this would contain all the stuff that
> are dimensioned according to anticipated system usage, and could be
> configurable by a boot parameter. E.g. if I want a lot more mbufs or
> kmalloc memory I could boot with a huge KVA space.

I understand the memory map you want to use.  8-).  Understand
that the assumption of the kernel physical memory load address
being relative to the virtual address mapping is really, really
hard-coded, and it's not always easy to understand where there
are dependencies that matter.


> My understanding is that there's lot of relevant data areas in KVM that
> could be dimensioned and relocated run-time in the kernel
> initialization. But since they are currently located above the kernel
> code, they will hit the ceiling at some point, unless the kernel is
> moved lower (KVA space expanded), which needs recompilation.

Yes, that's true.  But relocating the kernel dynamically means
not relocating the kernel relative to the base address of the
KVA.

Actually, it would be very tempting to split the kernel into two
pieces, and have it self-relocate after entering protected mode
with paging enabled by creating page table entries for the kernel
proper totally separately.  This avoids requiring all of the kernel
to be PIC.

Unfortunately, this ups the complexity of that code by an order of
magnitude, and other than people who've worked in that area, no one
really has any documentation.  And there certainly is no *published*
documentation.  What documentation people like Peter, Alfred, Bosko,
and myself have is not likely to be useful, since in as far as it
refers to -current, it's not up to date, and in as far as it's up to
date, it doesn't refer to -current (e.g. I have some seriously
extensive documentation through FreeBSD 4.4, and much less so after
that, though I have the information in my head).

IMO, it's probably not worth the jump in complexity (but don't let
that stop you ;^)).


> > The way protected mode OSs work, in the simplest terms, is by
> > crafting a KVA space that looks exactly like the physical space
> > after the bootstrap load, so that none of the code needs to be
> > relocated, and it all "just works".
> 
> But doesn't the kernel relocate some of its data in the initialization
> (in locore.s, pmap_bootstrap() or whatever),

There is code in locore.s in the function btext() that is the real
mode kernel entry point from the bootstrap (i386; for Alpha, the
function is named locorestart()).  It's carefully crafted so that it
can run before paging is enabled at the non-relocated address, and
after paging is enabled at the non-relocated address, and then uses
a stack-hack to "return" to a relocated address *after* paging is
enabled (the "begin()" function).

The answer is that you would have to change all this code, and you
would have to write additional code that sat between the begin()
and the rest of reality to do the relocation of all the code that
gets executed after the btext() code, while in protected mode.

This is possible, but itself raises a couple of issues -- unless
the kernel is compiled PIC.  The primary issue is that before you
compile the kernel, you don't know its size, and after you compile
the kernel, it has all of its relocation hints wedged into it in
all sorts of nooks and crannies.  Short of extracting every one of
them, you would need to build the kernel twice.

This also means that adding a single driver, or modifying a driver,
which ends up pushing you over the space between the end of the
kernel and the end of the 4G virtual address space, means that you
must recompile *everything*.
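
A small sketch of the chicken-and-egg problem, assuming a layout in
which the image must end exactly at the top of a 32-bit address
space.  The function top_link_base() and the PAGE_SIZE_BYTES constant
are hypothetical names for this example, not anything in the tree.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* i386 page size */

/*
 * Link base for an image that must end at the top of the 4G virtual
 * address space.  The size is rounded up to a whole page first.
 */
static uint32_t
top_link_base(uint32_t image_size)
{
	uint32_t rounded;

	assert(image_size > 0 && image_size <= 0xFFFFF000u);
	rounded = (image_size + PAGE_SIZE_BYTES - 1) &
	    ~(PAGE_SIZE_BYTES - 1);
	return ((uint32_t)(0u - rounded));  /* 2^32 - rounded, mod 2^32 */
}
```

The base is a function of the final image size, which is only known
after the link.  Growing the image by a single byte across a page
boundary moves the base, and with it every absolute address in a
non-PIC image.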

The cheap answer is "PIC".  The expensive answer is an anal probe
of the code in order to factor out every instance of the knowledge
of the relocation base address.  It's largely possible, but there
are a couple of really, really tricky parts that I know of, and I
don't claim exhaustive knowledge (i.e. even if I could dump for
you everything I knew about what would have to change, there's no
guarantee that I wouldn't have missed something).


> > It's theoretically possible to do what you want, but it's not very
> > easy, and there are other reasons it's not desirable.  The number
> > one reason that it's not desirable is that you would not have any
> > additional memory available for kernel modules or kernel data
> > structures.  For example, if you wanted to have a very large number
> > of network connections on a 4G machine, the practical limit that
> > you will run into is the KVA space available for representation of
> > connection data.
> 
> The point was not to make KVA space smaller but to be able to state in
> the boot how many GBs the KVA should be, so that the kernel could
> dimension e.g. the area from which the connection data is allocated
> according to the KVA space limits. Surely the connection data is not
> position dependant or statically allocated.

Yes and no.  For example, the tcpcbs and inpcbs are allocated
via zalloci(), which does a reserve of a physical address
space, very early on in boot.  The point of these is to preallocate
a large chunk of KVA space, so that the backing pages can be allocated
via a fault trap (this allows allocations to be done at interrupt time,
when normally they are only permitted to occur at non-interrupt time).

There are a number of similar allocations which occur for sockets (for
example) based on the value of maxfiles at boot time.  Therefore, even
though the sysctl permits you to increase "maxfiles", you run out of
socket, tcpcb, inpcb, or other space before you hit the "maxfiles"
limit.
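
As a toy model of that effect (this is not the zone allocator code;
struct zone, struct limits, zone_init() and socket_create() are all
invented for the example), consider a reservation that is sized once
from the boot-time maxfiles and never grows:

```c
#include <assert.h>
#include <stddef.h>

struct limits {
	size_t maxfiles;   /* tunable at run time */
	size_t nfiles;     /* currently open      */
};

struct zone {
	size_t reserved;   /* fixed when the zone is set up at boot */
	size_t used;
};

/* Size the zone from the value of maxfiles seen at boot. */
static void
zone_init(struct zone *z, size_t boot_maxfiles)
{
	z->reserved = boot_maxfiles;
	z->used = 0;
}

/* Returns 0 on success, -1 at the file limit, -2 if the zone is full. */
static int
socket_create(struct limits *l, struct zone *z)
{
	if (l->nfiles >= l->maxfiles)
		return (-1);
	if (z->used >= z->reserved)
		return (-2);
	l->nfiles++;
	z->used++;
	return (0);
}
```

Raising l->maxfiles after zone_init() has run changes the first check
but not the second, so the zone runs dry first.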

Similarly, there are a number of regions allocated in machdep.c using
direct allocation of linear physical address space.  These regions are
allocated using "valloc()" ("virtual allocation"), which uses two passes:
one to get the expectation size, and one to do the allocation.  These
are allocated following the kernel data space, and would have to be
allocated "somewhere before the kernel, but after the base of the KVA
space".  Possible to do, but pretty tricky.  One issue with this is
that some of the allocations are for page tables which *must* be there
for the system to function, and are relative to the size of memory,
rather than the size of the KVA.
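
The two-pass idea can be sketched in userland roughly as follows.
This is modeled on the technique, not copied from machdep.c: struct
boot_tables, carve(), layout() and boot_alloc() are hypothetical
names, and malloc() stands in for whatever the kernel uses to obtain
the region between the passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tables standing in for the real boot-time arrays. */
struct boot_tables {
	int  *callouts;
	long *buffers;
};

/* Advance the cursor past 'count' aligned objects; return their start. */
static uintptr_t
carve(uintptr_t *cursor, size_t size, size_t align, size_t count)
{
	uintptr_t start = (*cursor + align - 1) & ~(uintptr_t)(align - 1);

	*cursor = start + size * count;
	return (start);
}

/*
 * Lay out every table from 'base' and return the end cursor.  With
 * base == 0 nothing is touched and the return value is the total size.
 */
static uintptr_t
layout(struct boot_tables *t, uintptr_t base, size_t ncallout, size_t nbuf)
{
	uintptr_t v = base;

	t->callouts = (int *)carve(&v, sizeof(int), _Alignof(int), ncallout);
	t->buffers = (long *)carve(&v, sizeof(long), _Alignof(long), nbuf);
	return (v);
}

/* Pass 1 sizes the region, pass 2 assigns real addresses inside it. */
static void *
boot_alloc(struct boot_tables *t, size_t ncallout, size_t nbuf,
    size_t *total)
{
	void *region;

	*total = (size_t)layout(t, 0, ncallout, nbuf);
	region = malloc(*total);
	if (region != NULL)
		(void)layout(t, (uintptr_t)region, ncallout, nbuf);
	return (region);
}
```

The point to notice is that the same layout code runs twice with a
different starting address, so the region can in principle be placed
anywhere, as long as every consumer takes its pointer from the second
pass and nothing assumes "right after the kernel data".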

The implication there is that you will need to perform the necessary
calculations for identifying the beginning of the KVA at runtime,
rather than at boot time, or live with the fact that the user can
specify defaults which will render the box unbootable.


> I'm not interested in whether it's desirable or not to have huge KVA or
> UVA space; I'm interested just in the problem of moving this decision
> from compile-time to run-time.

As I said, I understand the problem space -- the solution space is
just really, really complicated.  8-).

-- Terry
