From: Terry Lambert <tlambert2@mindspring.com>
Date: Thu, 29 Aug 2002 15:22:12 -0700
To: Aaro J Koskinen
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: More dynamic KVA_SPACE

Aaro J Koskinen wrote:
> > > Am I completely off the track? What are the main reasons behind the
> > > current KVM layout?
> >
> > Kernel code is not position independent.
>
> Yes, I understand this. It seems I failed to explain what I meant. :-(
>
> The reason for moving the kernel text, data and bss to the end of the
> virtual memory was that this is the "not position independent" part of
> the kernel that would be at fixed known virtual address and always be
> linked and loaded correctly. What I want is to resize the KVA space
> without breaking the linked code. Therefore I try to organize the KVM
> so that the position dependant part does not move if the KVA size
> changes.

When the bootstrap loads the kernel, it loads it to a physical address.
This physical address ends up being the same as the kernel's virtual
address, once the kernel switches from real to protected mode using a
stack trampoline to return from non-PIC real mode code to non-PIC
protected mode code. This code is in btext(): the address of the
"begin" symbol is pushed onto the stack, and the btext() function
returns ... not to its caller, but to the begin() function, in the
relocated address space.

Basically, you aren't going to be able to separate these things so
you can load the kernel at the top of the address space, unless you
spend a huge amount of effort, and have the code copy itself up to
high physical memory, or make the bootstrap process much more complex
than it currently is (it's much more complex than it needs to be, at
least conceptually, already).

For this to work, the relocation address will have to not equal the
physical load address OR'ed with some value equal to the start of the
KVA space.

This has numerous implications. For example, anything that uses
libkvm opens /dev/kmem and references memory locations zero relative
to the KVA space base address, using symbolic names from the ELF
symbol table attached to the image of the kernel itself. As a result,
the base address is assumed to be the base address relative to the
start of a virtual 4G address space. Kernel modules have the same
issue with symbol references.

> The "dynamic" part of the KVA space would be below the kernel code, and
> expand towards the address 0, and this would contain all the stuff that
> are dimensioned according to anticipated system usage, and could be
> configurable by a boot parameter. E.g. if I want a lot more mbufs or
> kmalloc memory I could boot with a huge KVA space.

I understand the memory map you want to use. 8-).
Understand that the assumption of the kernel physical memory load
address being relative to the virtual address mapping is really,
really hard-coded, and it's not always easy to understand where there
are dependencies that matter.

> My understanding is that there's lot of relevant data areas in KVM that
> could be dimensioned and relocated run-time in the kernel
> initialization. But since they are currently located above the kernel
> code, they will hit the ceiling at some point, unless the kernel is
> moved lower (KVA space expanded), which needs recompilation.

Yes, that's true. But relocating the kernel dynamically means not
relocating the kernel relative to the base address of the KVA.

Actually, it would be very tempting to split the kernel into two
pieces, and have it self-relocate after entering protected mode with
paging enabled, by creating page table entries for the kernel proper
totally separately. This avoids requiring all of the kernel to be PIC.

Unfortunately, this ups the complexity of that code by an order of
magnitude, and other than people who've worked in that area, no one
really has any documentation. And there certainly is no *published*
documentation. What documentation people like Peter, Alfred, Bosko,
and myself have is not likely to be useful, since in as far as it
refers to -current, it's not up to date, and in as far as it's up to
date, it doesn't refer to -current (e.g. I have some seriously
extensive documentation through FreeBSD 4.4, and much less so after
that, though I have the information in my head).

IMO, it's probably not worth the jump in complexity (but don't let
that stop you ;^)).

> > The way protected mode OSs work, in the simplest terms, is by
> > crafting a KVA space that looks exactly like the physical space
> > after the bootstrap load, so that none of the code needs to be
> > relocated, and it all "just works".
>
> But doesn't the kernel relocate some of its data in the initialization
> (in locore.s, pmap_bootstrap() or whatever),

There is code in locore.s in the function btext() that is the real
mode kernel entry point from the bootstrap (i386; for Alpha, the
function is named locorestart()). It's carefully crafted so that it
can run before paging is enabled at the non-relocated address, and
after paging is enabled at the non-relocated address, and then uses a
stack-hack to "return" to a relocated address *after* paging is
enabled (the "begin()" function).

The answer is that you would have to change all this code, and you
would have to write additional code that sat between begin() and the
rest of reality to do the relocation of all the code that gets
executed after the btext() code, while in protected mode.

This is possible, but itself raises a couple of issues -- unless the
kernel is compiled PIC. The primary issue is that before you compile
the kernel, you don't know its size, and after you compile the kernel,
it has all of its relocation hints wedged into it in all sorts of
nooks and crannies. Short of extracting every one of them, you would
need to build the kernel twice. This also means that adding a single
driver, or modifying a driver, which ends up pushing you over the
space between the end of the kernel and the end of the 4G virtual
address space, means that you must recompile *everything*.

The cheap answer is "PIC".

The expensive answer is a painstaking audit of the code in order to
factor out every instance of the knowledge of the relocation base
address. It's largely possible, but there are a couple of really,
really tricky parts that I know of, and I don't claim exhaustive
knowledge (i.e. even if I could dump for you everything I knew about
what would have to change, there's no guarantee that I wouldn't have
missed something).

> > It's theoretically possible to do what you want, but it's not very
> > easy, and there are other reasons it's not desirable.
> > The number one reason that it's not desirable is that you would not
> > have any additional memory available for kernel modules or kernel
> > data structures. For example, if you wanted to have a very large
> > number of network connections on a 4G machine, the practical limit
> > that you will run into is the KVA space available for representation
> > of connection data.
>
> The point was not to make KVA space smaller but to be able to state in
> the boot how many GBs the KVA should be, so that the kernel could
> dimension e.g. the area from which the connection data is allocated
> according to the KVA space limits. Surely the connection data is not
> position dependant or statically allocated.

Yes and no. For example, the tcpcbs and inpcbs are allocated via
zalloci(), which reserves a region of address space very early on in
boot. The point of this is to preallocate a large chunk of KVA space,
so that the backing pages can be allocated via a fault trap (this
allows allocations to be done at interrupt time, when normally they
are only permitted to occur at non-interrupt time).

There are a number of similar allocations which occur for sockets
(for example) based on the value of maxfiles at boot time. Therefore,
even though the sysctl permits you to increase "maxfiles", you run out
of socket, tcpcb, inpcb, or other space before you hit the "maxfiles"
limit.

Similarly, there are a number of regions allocated in machdep.c using
direct allocation of linear physical address space. These regions are
allocated using "valloc()" ("virtual allocation"), which uses two
passes: one to get the expected size, and one to do the allocation.
These are allocated following the kernel data space, and would have to
be allocated "somewhere before the kernel, but after the base of the
KVA space". Possible to do, but pretty tricky.
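The two-pass valloc() pattern described above can be sketched in
user-space C. This is only an illustration of the idea, not the
machdep.c code: struct tables, carve(), layout_tables() and
alloc_tables() are hypothetical names, the three tables are made up,
and malloc() stands in for whatever the kernel uses to obtain the
region. The same layout routine runs twice, once from a base of zero
to learn the total size and once from the real base to assign the
pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical boot-time tables; the names are illustrative only. */
struct tables {
    int  *swbuf;
    long *callout;
    char *buf;
};

/* Round the cursor up to `align` (a power of two), reserve num * size
 * bytes, and return the start of the reserved range. */
static uintptr_t carve(uintptr_t *cursor, size_t size, size_t align,
                       size_t num)
{
    uintptr_t start = (*cursor + (align - 1)) & ~((uintptr_t)align - 1);
    *cursor = start + size * num;
    return start;
}

/* Lay the tables out starting at `base` and return the bytes consumed.
 * Nothing is dereferenced, so base == 0 works as a pure sizing pass. */
static size_t layout_tables(struct tables *t, uintptr_t base,
                            size_t nswbuf, size_t ncallout, size_t nbuf)
{
    uintptr_t v = base;
    t->swbuf   = (int *)carve(&v, sizeof(int), _Alignof(int), nswbuf);
    t->callout = (long *)carve(&v, sizeof(long), _Alignof(long), ncallout);
    t->buf     = (char *)carve(&v, sizeof(char), _Alignof(char), nbuf);
    return (size_t)(v - base);
}

/* Pass 1 sizes the region, pass 2 assigns real addresses inside it. */
void *alloc_tables(struct tables *t, size_t nswbuf, size_t ncallout,
                   size_t nbuf, size_t *out_size)
{
    size_t size = layout_tables(t, 0, nswbuf, ncallout, nbuf);
    void *mem = malloc(size ? size : 1);   /* stand-in for the kernel allocator */
    if (mem == NULL)
        return NULL;
    size_t used = layout_tables(t, (uintptr_t)mem, nswbuf, ncallout, nbuf);
    assert(used == size);                  /* both passes must agree */
    *out_size = size;
    return mem;
}
```

Because both passes share one layout routine, adding a table changes
the size and the pointer assignment together. Moving the region
"before the kernel" would mean changing where the second pass gets
its base, which is the tricky part.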
One issue with this is that some of the allocations are for page
tables which *must* be there for the system to function, and are
relative to the size of memory, rather than the size of the KVA. The
implication there is that you will need to perform the necessary
calculations for identifying the beginning of the KVA at runtime,
rather than at boot time, or live with the fact that the user can
specify defaults which will render the box unbootable.

> I'm not interested in whether it's desirable or not to have huge KVA or
> UVA space; I'm interested just in the problem of moving this decision
> from compile-time to run-time.

As I said, I understand the problem space -- the solution space is
just really, really complicated. 8-).

-- Terry