Date: Sun, 16 Mar 2008 16:13:37 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Igor Shmukler <shmukler@mail.ru>, Robert Watson <rwatson@freebsd.org>, jgordeev@dir.bg, "Andrey V. Elsukov" <bu7cher@yandex.ru>, freebsd-hackers@freebsd.org
Subject: Re: Re[2]: vkernel & GSoC, some questions
Message-ID: <200803162313.m2GNDbvl009550@apollo.backplane.com>
References: <20080316122108.S44049@fledge.watson.org> <E1JatyK-000FfY-00.shmukler-mail-ru@f8.mail.ru>
Basically DragonFly has a syscall API that allows a userland process to create and completely control any number of VM spaces, including the ability to pass execution control to a VM space and get it back, and to control memory mappings within that VM space (and in the virtual kernel process itself) on a page-by-page basis, so only 'invalid' PTEs are passed through to the virtual kernel by the real kernel and the real kernel caches page mappings with real hardware pmaps. Any exception that occurs within a running VM space is routed back to the virtual kernel process by the real kernel. Any real signal (e.g. the vkernel's 'clock' interrupt) or exception that occurs also forces control to return to the vkernel process.

A DragonFly virtual kernel is just a user process which uses this feature to manipulate VM contexts (i.e. for processes running under the vkernel itself), providing a complete emulation environment that is opaque to userland. The vkernel itself is not running in an emulated environment; it is a 'real' (and singular) user process running on the machine.

These VM contexts are managed by the real kernel as pure VM contexts, NOT as threads or processes or anything else. Since the VM context in the real kernel basically has one VM entry (representing the software emulated mmap of the entire address space), and since pmaps use throw-away PTEs, the real-kernel overhead is minimal and there is no real limit to the number of virtualized processes the virtual kernel can control, nor any other resource limitation within the real kernel. One can even run a virtual kernel inside a virtual kernel... not sure why anyone would want to do it, but it works! I can even thrash the virtual kernel without it having any effect whatsoever on the real kernel or system.

The ENTIRE operational overhead rests solely in operations which must perform a context switch. CPU-bound programs will run at full speed and I/O-bound programs aren't too bad either.
Context-switch-heavy programs suffer as they do in a hardware virtualized environment. Make no mistake about that: run any sort of kernel in a hardware virtualized environment that it wasn't designed to run in and you are going to have horrible performance, as many people trying to simply 'move' their existing machines to virtualized environments have found out the hard way.

I could probably shave off a microsecond from our virtual kernel syscall path, but it isn't a priority for me... I'm using a code-efficient but performance-inefficient implementation to pass contextual information between the emulated VM context and the virtual kernel, and it's a fairly expensive copy op that would benefit greatly if it were converted to shared memory or if I simply cached the userland page in the real kernel to avoid the copyout/lookup/pmap op. I could probably also parallelize the real I/O backend for the 'disk' better, but it isn't a priority for me either.

SMP is supported the same way it is supported in a real kernel: the virtual kernel simply creates an LWP for each 'cpu' (for all intents and purposes you can think of it as forking once for each cpu). All the LWPs have access to the same pool of VM contexts and thus the virtual kernel can schedule its processes to any of the LWPs on a whim. It just uses the same process scheduler that the real kernel does... nearly all the code in the virtual kernel is the same; in fact, the vkernel 'platform' is only 700K of source code.

There are some minor (and admittedly not very well developed) shims to reduce the load on the real machine when you do things like run a vkernel simulating many cpus on a machine which only has a few physical cpus. Spinning in a thread vs on a hard cpu is not the best thing in the world to do, after all. In any case, this means that generally speaking SMP performance in a virtual kernel will scale as DragonFly's own SMP performance is improved.
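
The SMP arrangement described above (one LWP per 'cpu', all drawing from one shared pool of VM contexts) can be pictured with a toy model. This is not vkernel code: Python threads stand in for LWPs, integers stand in for VM contexts, and the name run_vkernel is made up for the illustration. The real vkernel uses the normal DragonFly process scheduler rather than a simple queue.

```python
import queue
import threading


def run_vkernel(ncpus, context_ids):
    """Toy model: one worker thread per virtual 'cpu', one shared pool.

    Returns a list of (context_id, cpu_index) pairs, one per context run.
    """
    pool = queue.Queue()
    for ctx in context_ids:
        pool.put(ctx)

    runs = []
    lock = threading.Lock()

    def cpu_loop(cpu):
        # Each 'cpu' takes whichever context is runnable next; no context
        # is tied to a particular 'cpu'.
        while True:
            try:
                ctx = pool.get_nowait()
            except queue.Empty:
                return
            with lock:
                runs.append((ctx, cpu))

    lwps = [threading.Thread(target=cpu_loop, args=(i,)) for i in range(ncpus)]
    for t in lwps:
        t.start()
    for t in lwps:
        t.join()
    return runs


if __name__ == "__main__":
    for ctx, cpu in sorted(run_vkernel(ncpus=4, context_ids=range(16))):
        print(f"context {ctx} ran on cpu {cpu}")
```

Which 'cpu' picks up which context varies from run to run; the only invariant is that every context in the pool gets run exactly once.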
Right now the vkernels can be built SMP but it isn't recommended... those kinds of builds are best used to test SMP work and not for real applications.

--

Insofar as virtual kernels versus machine emulation and performance goes, people need to realize that *NO* machine emulation technology is going to perform well for any task requiring a lot of context switching or a lot of non-MMU-resolvable page faults. No matter WHAT technology you use, at some point any real I/O operation will have to pass through the real kernel, period.

For example, a syscall-heavy process running under a virtual kernel will perform just about as badly as a syscall-heavy process running under something like VMWare. Hardware virtualized MMU support isn't quite advanced enough to solve the performance bottleneck for any virtualization technology that I am aware of.

The only reason VMWare is perceived to have better performance in certain cases is simply because they have invested a ridiculous number of man-hours in instruction rewriting, plus targeted optimizations which do not stand the test of time (they work with particular software and do not generally survive the evolution of that software without retargeting the optimization). It's like the assembly-vs-C arguments we had in the mid-80's. It isn't a good precedent.

Hardware virtualization is still the only real avenue for true cross-platform emulation, but it isn't ultimately going to be the best solution for same-platform emulation.
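
The cost argument above (overhead comes only from round trips between the guest and the kernel that hosts it) can be sketched as a toy dispatch loop. This is an illustration under loud assumptions, not the vkernel implementation: Python generators stand in for VM contexts, each yield stands in for a trap that returns control to the vkernel, and the per-trap cost passed to overhead_us is a placeholder, not a measurement.

```python
def cpu_bound(work_units):
    """A guest that computes and traps only once, to exit."""
    total = 0
    for i in range(work_units):
        total += i
    yield ("exit", total)


def syscall_heavy(ncalls):
    """A guest that traps back to the vkernel on every iteration."""
    for i in range(ncalls):
        yield ("syscall", i)
    yield ("exit", ncalls)


def run_guest(guest):
    """Toy vkernel loop: resume the guest, regain control on each trap.

    Returns how many times control came back to the 'vkernel'.
    """
    traps = 0
    for kind, _value in guest:  # each iteration = one return of control
        traps += 1
        if kind == "exit":
            break
    return traps


def overhead_us(traps, cost_per_trap_us):
    """Overhead attributable to context switching alone."""
    return traps * cost_per_trap_us


if __name__ == "__main__":
    HYPOTHETICAL_COST_US = 5  # placeholder value, not a measured number
    for name, guest in [("cpu-bound", cpu_bound(1_000_000)),
                        ("syscall-heavy", syscall_heavy(1000))]:
        traps = run_guest(guest)
        print(name, traps, overhead_us(traps, HYPOTHETICAL_COST_US))
```

In this model the CPU-bound guest does a million units of work and returns control once, while the syscall-heavy guest returns control on every call, so its overhead grows linearly with the number of calls regardless of which technology sits underneath.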
Frankly a virtualized kernel such as DragonFly's virtual kernel or User Mode Linux (which uses a similar but slightly different context switch handling model) is a better development path than machine emulation for SAME-OS kernels, because the virtualized kernel is explicitly designed to operate in that environment, allowing all the context-transitional interfaces to be customized far better than what you can do with any hardware virtualization technology, not to mention that a virtual kernel is actually better positioned to use hardware virtualization technologies than a hardware emulated kernel is. Sounds nuts, but it's true.

Hardware virtualization technologies currently have far more eyeballs writing insanely complex instruction rewriting code, which is why they are perceived as having a performance benefit at the moment, but the development path is extremely inelegant and there is far more room for optimization in a virtualized kernel environment than there is in a hardware emulated environment. The virtualized kernel environment can take advantage of the same hardware features as the hardware emulated environment, after all, but a hardware emulated environment cannot take advantage of all the direct syscall features available to a virtual kernel.

--

Types of optimizations we can do to improve virtual kernel technologies, which also apply to hardware emulated kernels:

* Prefetch more pages to avoid excessive invalid page exceptions. Right now most prefetching is turned off, resulting in fairly horrible performance for malloc-intensive programs.
* Improve the system call context switching path. Right now it uses excessive copyin/copyout ops. What it really needs to do is use an up-call mechanism that allows the register and FP context to be thrown away in the vkernel process (cutting out half the copyin/copyouts).
* Do a better job bundling I/O (buffer cache interactions in the vkernel require very different optimizations vs buffer cache interactions in a real kernel).
* Asynchronize real I/O better. Right now, I admit, I'm basically just using a write() to the disk file. True asynchronization requires creating some 'I/O' LWPs outside of the SMP model, and I haven't done that yet. Right now I have a few LWPs inside the SMP model to parallelize I/O but it doesn't work very well.

Not really a big list, and nothing earthshattering.

-Matt
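
The prefetch item in the list above can be illustrated with a toy fault counter. It is a sketch only: the function name, the page numbering and the 'map the next N pages on a fault' policy are assumptions made for the example, not the vkernel's actual prefetch logic.

```python
def count_faults(accesses, prefetch):
    """Count invalid-page faults for a sequence of page numbers.

    On a fault for page p, pages p .. p+prefetch-1 are mapped in one go.
    prefetch=1 models 'no prefetching'.
    """
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1")
    mapped = set()
    faults = 0
    for page in accesses:
        if page not in mapped:
            faults += 1  # one trip back to the vkernel
            mapped.update(range(page, page + prefetch))
    return faults


if __name__ == "__main__":
    touched = range(64)  # e.g. a program touching 64 fresh pages in order
    for window in (1, 8):
        print(window, count_faults(touched, window))
```

For a sequential sweep the fault count drops roughly in proportion to the prefetch window, which is why a program that keeps touching fresh pages suffers most when prefetching is off.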