Date:      Sun, 16 Mar 2008 17:43:02 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Kris Kennaway <kris@freebsd.org>
Cc:        Jordan Gordeev <jgordeev@dir.bg>, freebsd-hackers@freebsd.org
Subject:   Re: vkernel & GSoC, some questions
Message-ID:  <200803170043.m2H0h2qO010175@apollo.backplane.com>
References:  <47DBC800.8030601@dir.bg> <47DD1FFF.6070004@FreeBSD.org>

:Finally, the way vkernels were implemented in dragonfly was *very* 
:disruptive to the kernel source (lots of function renaming etc), so it 
:is likely that this would also have to be completely reimplemented in a 
:FreeBSD port.
:...
:Kris

    Well, I don't agree with your assessment.  In particular, the way
    vkernels are implemented in DragonFly is NOT in the least disruptive
    to kernel source.  It has about 1/10 the code pollution of FreeBSD's
    current jail implementation.  The implementation is just about as
    clean as it is possible to make it from the point of view of code
    pollution.

    You could try reimplementing the concepts and APIs in a FreeBSD port,
    but good luck with that.  The 'pollution' involved, aka the kernel
    shims needed, is fairly minor:

    * VM fault code detects fault in special mmap entry type and passes
      control to the virtual kernel.

    * Trap code tests that the fault occurred in a managed VM context and
      passes control to the virtual kernel.

    * (real process) signal code checks that the signal occurred while running
      a managed VM context and switches the context back to the virtual
      kernel before taking the signal (duh! gotta do that!).

    Note that there was some other work related to the vkernel work, such
    as signal mailboxes, but those aren't actually needed to port the vkernel,
    though you do need some way to properly deal with scheduling races
    without having to make signal blocking and unblocking system calls
    (which make system calls made by a virtualized process even more
    expensive).
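
    To illustrate the idea of a signal mailbox, here is a minimal userland
    sketch.  It is only an analogy under stated assumptions: the names
    sig_mailbox, mailbox_install() and mailbox_poll() are hypothetical and
    are not DragonFly's interface.  The handler does nothing but record
    that a signal arrived, and the main code polls the mailbox at points
    where it is safe to act, so no signal blocking or unblocking system
    calls are needed around critical sections.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical mailbox: nonzero means at least one signal is pending. */
static volatile sig_atomic_t sig_mailbox;

/* The handler only posts to the mailbox; all real work is deferred. */
static void mailbox_handler(int signo)
{
    (void)signo;
    sig_mailbox = 1;
}

/* Route signo to the mailbox.  Returns 0 on success, -1 on error. */
int mailbox_install(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = mailbox_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/*
 * Called at a safe point, e.g. when leaving a critical section.
 * Returns 1 if a signal was pending (and clears the mailbox), else 0.
 * A signal that arrives after the clear simply sets the mailbox again
 * and is picked up by the next poll.
 */
int mailbox_poll(void)
{
    if (!sig_mailbox)
        return 0;
    sig_mailbox = 0;
    return 1;
}
```

    A caller would run mailbox_install(SIGUSR1) once, do its critical work
    without touching the signal mask, and call mailbox_poll() on the way
    out.  Multiple signals arriving between polls are coalesced into one
    notification, which is acceptable only if the caller then checks all
    of its event sources.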

    No matter how you twist it, you can't avoid any of that.  The added APIs
    are:

    * mmap supporting emulated user-accessible page tables.

      This is unavoidable.  There is no way a user process can control
      virtualized processes without page-level control of their pages or
      without page-level sharing of pages, with separate access domains,
      between the virtual kernel process and the virtualized user process
      running under it.

      Not only does a virtual kernel need to be able to manipulate pages
      within the virtualized VM context (representing a virtualized process),
      but it must also be able to manipulate pages within its OWN context
      to properly share pages between the virtual kernel and virtualized
      processes, or it can't do things like, oh, implement mmap()ing of files
      which have pages in both places, let alone implement the buffer cache.

      I did have an issue with mmap() in that 32-bit ranges are not supported
      by the current mmap code.  That is, I can't tell it in a single mmap()
      call to map a 3G chunk of memory.  I did hack that... the vkernel code
      just does three adjacent mmap() calls to map the emulated address space
      in the VM context.  Hokey but it works.  That's not really a kernel
      pollution issue anyway since it is in the vkernel platform code.

    * syscalls to switch into and out of a managed VM context.

      Kinda need to be able to control the virtualized contexts.

    * syscalls to manipulate managed VM contexts.

      Kinda need to be able to manipulate page-by-page mappings within
      managed VM contexts.

    * signal mailboxes (the only thing that could be done away with, really),
      used to avoid the vkernel having to block and unblock signals.
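
    The "three adjacent mmap()'s" workaround mentioned under the first
    item can be sketched in portable userland C.  This is an assumption
    about the general shape, not the vkernel platform code: it maps plain
    anonymous memory in the caller's own address space rather than an
    emulated page table in a VM context, and the function name
    map_three_adjacent() is hypothetical.  It first reserves one
    contiguous range so that the three fixed mappings are guaranteed to
    land next to each other.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Cover a contiguous range of 3 * chunk bytes with three adjacent
 * mappings of chunk bytes each.  chunk must be a nonzero multiple of
 * the page size.  Returns the base address, or NULL on failure.
 */
void *map_three_adjacent(size_t chunk)
{
    size_t total;
    char *base;
    int i;

    if (chunk == 0 || chunk > SIZE_MAX / 3)
        return NULL;
    total = 3 * chunk;

    /* Reserve the whole range so nothing else can be placed inside it. */
    base = mmap(NULL, total, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Replace the reservation piece by piece with usable mappings. */
    for (i = 0; i < 3; i++) {
        void *p = mmap(base + (size_t)i * chunk, chunk,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED) {
            munmap(base, total);
            return NULL;
        }
    }
    return base;
}
```

    With chunk set to 1G on a 64-bit host this yields a 3G range that the
    caller can treat as one region, and a single munmap() of the whole
    range releases all three pieces.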

    The most complex part of the whole mess is the emulated page table
    support added to mmap.  I don't think there is any way to avoid it,
    particularly if you intend to support SMP virtualization (which we do,
    completely, even though it may lack performance).  The MMU interactions
    are tricky at best when one is trying to implement a virtual SMP kernel
    running inside a real SMP kernel, because the real kernel MUST implement
    real page tables inaccessible to the virtual kernel.  Synchronizing page
    table modifications between the emulated and real page tables on SMP
    is *NOT* trivial but, hey, I wrote it so you guys have a working
    template for all that crap now.  It took something like two months
    to make it work properly in a SMP environment.

    Now one thing you can do, which I considered but ultimately discarded,
    is to associate the managed VM context with a real kernel process 
    separate from the virtual kernel process.  This does simplify the 
    signal processing somewhat and I believe it may also reduce context
    switch overhead slightly.  The reason I discarded it was twofold:  First,
    for a SMP build there are now two real processes per cpu instead of
    one, making scheduling more complex.  Second, the emulated page table
    is not confined to the VM contexts under the virtual kernel's control;
    the virtual kernel itself uses the same feature, so additional MP-related
    synchronization would have to occur to properly emulate the MMU, and I
    got a headache trying to think about how to do it.

    What I strongly recommend you NOT do is try to associate each virtualized
    process running under the virtual kernel with a real-kernel process.  The
    reason is that it is extremely wasteful of real-kernel resources and
    exposes the real kernel to resource starvation originating in the virtual
    kernel.  My solution was to separate struct vmspace out from everything
    else and give it its own API.  This isn't pollution... really it is a major
    clean-up and we already had partial separation due to our 'resident'
    code support.  It was easy and cleaned up a chunk of the kernel source
    at the same time.  In any case, unless you do a 1:1 process model for the
    emulated processes you need the code to swap VM spaces for a process.
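
    As a rough illustration of what "separating struct vmspace out and
    giving it its own API" buys you, here is a toy model.  All names
    (toy_vmspace, toy_proc, toy_proc_swap_vmspace) are hypothetical and
    are not the DragonFly structures or functions.  The point is only
    that once the address space is a separately reference-counted object,
    one real process can switch between many VM spaces instead of needing
    one real process per emulated process.

```c
#include <stdlib.h>

/* Toy address-space object with its own lifetime. */
struct toy_vmspace {
    int refcnt;     /* number of holders */
    int id;         /* stand-in for the real mapping state */
};

/* Toy process: points at whichever VM space it is currently running. */
struct toy_proc {
    struct toy_vmspace *vm;
};

/* Allocate a VM space holding one reference for the caller. */
struct toy_vmspace *toy_vmspace_alloc(int id)
{
    struct toy_vmspace *vm = malloc(sizeof(*vm));

    if (vm != NULL) {
        vm->refcnt = 1;
        vm->id = id;
    }
    return vm;
}

void toy_vmspace_hold(struct toy_vmspace *vm)
{
    vm->refcnt++;
}

/* Drop one reference; the object is freed when the last one goes. */
void toy_vmspace_drop(struct toy_vmspace *vm)
{
    if (--vm->refcnt == 0)
        free(vm);
}

/*
 * Install newvm as the VM space of p, taking a new reference on it.
 * Returns the previous VM space; the reference the process held on it
 * is handed to the caller, who must eventually drop it.
 */
struct toy_vmspace *toy_proc_swap_vmspace(struct toy_proc *p,
                                          struct toy_vmspace *newvm)
{
    struct toy_vmspace *old = p->vm;

    toy_vmspace_hold(newvm);
    p->vm = newvm;
    return old;
}
```

    In this model a virtual kernel would keep one toy_vmspace per
    emulated process and swap the one it wants to run into a single real
    process, then swap its own back in when control returns.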

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>