From owner-freebsd-hackers@FreeBSD.ORG Mon Mar 17 00:43:14 2008
Date: Sun, 16 Mar 2008 17:43:02 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200803170043.m2H0h2qO010175@apollo.backplane.com>
To: Kris Kennaway
Cc: Jordan Gordeev, freebsd-hackers@freebsd.org
References: <47DBC800.8030601@dir.bg> <47DD1FFF.6070004@FreeBSD.org>
Subject: Re: vkernel & GSoC, some questions

:Finally, the way vkernels were implemented in dragonfly was *very*
:disruptive to the kernel source (lots of function renaming etc), so it
:is likely that this would also have to be completely reimplemented in a
:FreeBSD port.
:...
:Kris

Well, I don't agree with your assessment.  In particular, the way
vkernels are implemented in DragonFly is NOT in the least disruptive to
the kernel source.  It has about 1/10 the code pollution of FreeBSD's
current jail implementation.  The implementation is just about as clean
as it is possible to make it from the point of view of code pollution.
You could try reimplementing the concepts and APIs in a FreeBSD port,
but good luck with that.

The 'pollution' involved, i.e. the kernel shims needed, is fairly minor:

* VM fault code detects a fault in the special mmap entry type and
  passes control to the virtual kernel.

* Trap code tests that the fault occurred in a managed VM context and
  passes control to the virtual kernel.

* (real process) signal code checks that the signal occurred while
  running in a managed VM context and switches the context back to the
  virtual kernel before taking the signal (duh! gotta do that!).

Note that there was some other work related to the vkernel, such as
signal mailboxes, but that isn't actually needed to port the vkernel.
You do, however, need some way to properly deal with scheduling races
without having to make signal blocking and unblocking system calls
(which would make system calls made by a virtualized process even more
expensive).

No matter how you twist it, you can't avoid any of that.

The added APIs are:

* mmap supporting emulated user-accessible page tables.  This is
  unavoidable.  There is no way a user process can control virtualized
  processes without page-level control of their pages, or without
  page-level sharing of pages, with separate access domains, between
  the virtual kernel process and the virtualized user process running
  under it.
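  To make that first item a bit more concrete, here is a minimal sketch
  of the setup as seen from the vkernel side.  MAP_VPAGETABLE and
  MADV_SETMAP are the DragonFly names for the mmap flag and the "set
  page directory" operation; the vk_set_pagetable() wrapper, the sizes,
  and all error handling below are invented for the illustration and
  are not the real prototypes.

/*
 * Sketch only: creating a region whose translations come from an
 * emulated, user-accessible page table.  MAP_VPAGETABLE/MADV_SETMAP are
 * the DragonFly names; vk_set_pagetable() is a hypothetical wrapper and
 * the sizes are illustrative.
 */
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MAP_VPAGETABLE		/* only defined on DragonFly */
#define MAP_VPAGETABLE	0
#endif

#define GUEST_MEM_SIZE	(256UL * 1024 * 1024)	/* illustrative size */

/*
 * Hypothetical wrapper: tell the real kernel where the top level of the
 * emulated page table for [addr, addr+len) lives.  On DragonFly this is
 * the MADV_SETMAP control operation; the call itself is stubbed here.
 */
static int
vk_set_pagetable(void *addr, size_t len, uint64_t pdir_base)
{
	(void)addr; (void)len; (void)pdir_base;
	return 0;
}

int
main(void)
{
	uint64_t *pdir;
	void	 *guest;

	/* The emulated page directory is ordinary vkernel memory which
	 * the vkernel edits directly, entry by entry. */
	pdir = calloc(512, sizeof(*pdir));
	if (pdir == NULL)
		return 1;

	/* Back the guest address space with a mapping whose translations
	 * come from the emulated page table, not a simple linear offset. */
	guest = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
		     MAP_ANON | MAP_PRIVATE | MAP_VPAGETABLE, -1, 0);
	if (guest == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Hand the kernel the page directory location; from then on
	 * faults taken in 'guest' are resolved by walking *pdir. */
	if (vk_set_pagetable(guest, GUEST_MEM_SIZE,
	    (uint64_t)(uintptr_t)pdir) < 0)
		perror("vk_set_pagetable");

	return 0;
}

  The idea is that the emulated page table lives in ordinary user memory
  that the virtual kernel edits directly, which is what makes the
  page-level control and sharing described above possible.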
  Not only does a virtual kernel need to be able to manipulate pages
  within the virtualized VM context (representing a virtualized
  process), but it must also be able to manipulate pages within its OWN
  context to properly share pages between the virtual kernel and
  virtualized processes, or it can't do things like, oh, implement
  mmap()ing of files which have pages in both places, let alone
  implement the buffer cache.

  I did have an issue with mmap() in that full 32-bit ranges are not
  supported by the current mmap code, i.e. I can't tell it in a single
  mmap() to map a 3G chunk of memory.  I hacked around that... the
  vkernel code just does three adjacent mmap()'s to map the emulated
  address space in the VM context.  Hokey, but it works.  That's not
  really a kernel pollution issue anyway, since it is in the vkernel
  platform code.

* syscalls to switch into and out of a managed VM context.  Kinda need
  to be able to control the virtualized contexts (a rough sketch of the
  resulting dispatch loop is in the P.S. below).

* syscalls to manipulate managed VM contexts.  Kinda need to be able to
  manipulate page-by-page mappings within managed VM contexts.

* signal mailboxes (the only thing that could be done away with,
  really), used to avoid the vkernel having to block and unblock
  signals.

The most complex part of the whole mess is the emulated page table
support added to mmap.  I don't think there is any way to avoid it,
particularly if you intend to support SMP virtualization (which we do,
completely, even though it may lack performance).  The MMU interactions
are tricky at best when one is trying to implement a virtual SMP kernel
running inside a real SMP kernel, because the real kernel MUST implement
real page tables inaccessible to the virtual kernel.  Synchronizing page
table modifications between the emulated and real page tables on SMP is
*NOT* trivial but, hey, I wrote it, so you guys have a working template
for all that crap now.  It took something like two months to make it
work properly in an SMP environment.

Now one thing you can do, which I considered but ultimately discarded,
is to associate the managed VM context with a real kernel process
separate from the virtual kernel process.  This does simplify the
signal processing somewhat, and I believe it may also reduce context
switch overhead slightly.  The reason I discarded it was twofold.
First, for an SMP build there are now two real processes per cpu
instead of one, making scheduling more complex.  Second, the emulated
page table is not confined to the VM contexts under the virtual
kernel's control; the virtual kernel itself uses the same feature, so
additional MP-related synchronization would have to occur to properly
emulate the MMU, and I got a headache trying to think about how to do
it.

What I strongly recommend you NOT do is try to associate each
virtualized process running under the virtual kernel with a real-kernel
process.  The reason is that it is extremely wasteful of real-kernel
resources and exposes the real kernel to resource starvation
originating in the virtual kernel.

My solution was to separate struct vmspace out from everything else and
give it its own API.  This isn't pollution... really it is a major
clean-up, and we already had partial separation due to our 'resident'
code support.  It was easy and cleaned up a chunk of the kernel source
at the same time.  In any case, unless you do a 1:1 process model for
the emulated processes you need the code to swap VM spaces for a
process.

					-Matt
					Matthew Dillon
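P.S. To give a rough picture of what the "switch into and out of a
managed VM context" syscall amounts to from the vkernel side, here is a
sketch of the per-process dispatch loop mentioned above.  Everything
prefixed vk_ is a hypothetical helper invented for this sketch; in
DragonFly the actual entry point is the vmspace_ctl() syscall, whose
real prototype is not reproduced here.

/*
 * Sketch only: a virtual kernel's dispatch loop for one virtualized
 * process.  All vk_* names are hypothetical helpers for illustration;
 * the real "run the managed context until something happens" step is a
 * system call into the real kernel, not the stub shown below.
 */
#include <stdio.h>

/* Reasons the real kernel would hand control back to the vkernel. */
enum vk_exit_reason {
	VK_EXIT_PAGE_FAULT,	/* fault in the emulated address space */
	VK_EXIT_SYSCALL,	/* virtualized process issued a system call */
	VK_EXIT_INTERRUPTED	/* real signal / timer tick hit the vkernel */
};

struct vk_vproc {
	void	*ctx;	/* token naming the managed VM context */
	void	*regs;	/* saved register state of the virtualized process */
};

/* Hypothetical wrapper around the real context-switch syscall. */
static enum vk_exit_reason
vk_run_context(struct vk_vproc *vp)
{
	(void)vp;
	/* ...enter the managed VM context, return why we came back... */
	return VK_EXIT_INTERRUPTED;
}

/* Run the virtualized process until the vkernel scheduler must act. */
static void
vk_dispatch(struct vk_vproc *vp)
{
	for (;;) {
		switch (vk_run_context(vp)) {
		case VK_EXIT_PAGE_FAULT:
			/* consult the emulated page table, fix it up or
			 * post a SIGSEGV to the virtualized process */
			break;
		case VK_EXIT_SYSCALL:
			/* decode and service the virtualized syscall */
			break;
		case VK_EXIT_INTERRUPTED:
			/* let the vkernel's scheduler pick the next process */
			return;
		}
	}
}

int
main(void)
{
	struct vk_vproc p = { NULL, NULL };

	vk_dispatch(&p);
	printf("vkernel dispatch loop exited\n");
	return 0;
}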