Date: Mon, 23 Nov 1998 20:33:57 -0800 (PST)
From: Matthew Dillon
Message-Id: <199811240433.UAA12209@apollo.backplane.com>
To: Luoqi Chen
Cc: cnielsen@pobox.com, freebsd-hackers@FreeBSD.ORG
Subject: Re: Kernel threads
References: <199811240328.WAA26688@lor.watermarkgroup.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG

:> needs to be done to fix it? I'm curious because I was looking into writing
:> an encrypting VFS layer, but I decided it was biting off more than I could
:> chew when I discovered the VFS stacking problems.
:>
:We need a cache manager (as described in Heidemann's paper) to ensure
:coherence between vm objects hanging off vnodes from upper and lower layer.
:For a transparent layer (no data translation, such as null/union/umap),
:maybe we could do without the cache manager by implementing vm object
:sharing between upper and lower vnodes. But for a data translation layer,
:a cache manager is definitely required.
:
:-lq

    Ok, I'm going to put forth a general wide-view infomercial for general
    discussion.  I am definitely *not* planning on implementing this here,
    since it would be too much work for one person (or even two), but it's
    basically something I have been working on for one of my side OS
    projects.

    What we really need to do is integrate the VFS layer abstraction into
    the buffer cache and get rid of all the baggage in the buffer pointer
    (bp) layer.  That is, abstract the backing store out of the core
    vm_page_t structure and put it into a vm_cache_t type of structure.
    We then preallocate the vm_page_t's as we currently do (one per
    physical page of memory), and preallocate the smaller vm_cache_t
    backing store reference structures as well (for example, four times
    as many vm_cache_t's as we have vm_page_t's).

    We get rid of the KVM maps entirely and instead do a simple direct map
    of all physical memory into KVM (John doesn't like this part because
    it limits the amount of physical memory to the KVM address space,
    typically 2GB on a 32-bit machine).  Device drivers would no longer
    use consolidated KVM buffers (e.g. the filesystem code would make
    distinctions on page boundaries when indexing into filesystem buffers
    rather than on filesystem block boundaries; raw disk units would
    consolidate disk I/O into page-sized blocks, which they pretty much do
    anyway).

    Each vm_cache_t would contain a reference to an underlying vm_page_t,
    the backing store abstraction (vnode, block#), a lock, and a chain to
    other vm_cache_t's sharing the same page.  vm_cache_t's are mostly
    throw-away structures, but with strict unwinding rules to maintain
    cache consistency.
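    As a minimal sketch, a vm_cache_t along those lines might look
    something like the following (all field names and types here are
    illustrative assumptions, not actual code from FreeBSD or from the
    side project):

        #include <stdint.h>

        struct vm_page;                 /* existing per-physical-page structure */
        struct vnode;

        typedef int32_t block_t;        /* signed, so negative block numbers can
                                           index metadata such as block maps */

        typedef struct vm_cache {
            struct vm_page  *vc_page;   /* underlying physical page */
            struct vnode    *vc_vnode;  /* backing store: which vnode... */
            block_t         vc_blkno;   /* ...and which block within it */
            int             vc_lock;    /* placeholder for the per-entry lock */
            struct vm_cache *vc_chain;  /* other vm_cache_t's layered over the
                                           same page (VFS stacking, COW shadows) */
            struct vm_cache *vc_hnext;  /* hash chain for (vnode, block#) lookups */
        } vm_cache_t;

    On a 32-bit machine a structure like this comes to roughly 24-32
    bytes, in line with the 32/64 byte figures used for the memory
    overhead estimates below.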
    To illustrate the unwinding rules: the vm_cache_t representing the
    ultimate physical backing store of an object cannot be deleted until
    the vm_cache_t's higher in the chain (referencing the same page) are
    deleted, but vm_cache_t's in the middle of a stacking chain might very
    well be deletable if they can be resynthesized later.  It also becomes
    trivial to break and unify chains as needed.  For example, a 'hole' or
    'fragment' in a file that grows into a page can be unified with the
    disk backing store for the file when the full block is allocated by
    the filesystem, but can remain detached until then.

    Here are the disadvantages:

        * No buffer concatenation in KVM, i.e. if the filesystem block
          size is 8K, the filesystem code must still look up data in the
          cache on page boundaries.

        * All of physical memory (or most of it, anyway) must be mapped
          into KVM for easy reference, which typically gives us a 2G limit
          on a 32-bit cpu.

        * KVM maps go away entirely or almost entirely (we might have to
          keep a few around to optimize memory management of some of the
          larger kernel structures).

        * vm_cache_t eats wired memory.  If it is a 64-byte structure and
          we have four for each page of physical memory, we eat 6.25% of
          physical memory for our preallocated vm_cache_t array.  With a
          32-byte structure size we eat 3.125% of physical memory.  (Even
          so, I would still recommend preallocation in order to eke out
          extreme performance.)

    Here are the advantages:

        * Supervisor code does not have to touch UVM at all, which means
          it doesn't have to touch page tables for kernel operations
          (except to map/unmap things for user processes), i.e. there is
          no hardware MMU interaction required for kernel-mode operations.

        * No KVM maps need be maintained for kernel memory pools, buffer
          pointers, or other related items.  We would still need them for
          the per-process kernel stack and a few of the larger kernel
          structures.

        * Breaking and consolidating VFS stacks is trivialized.

        * Physical devices are well and truly abstracted into a VNode with
          no loss in performance.

        * Most of the kernel-level page and VM locking issues related to
          manipulation of kernel data go away.

        * The remaining locking issues become clearer.

        * vm_cache_t structures can be used to temporarily cluster backing
          store operations, including swap operations, delaying
          consolidation of pages into swap structures.

        * 'Dummy' vnodes can be used to abstract memory maps almost
          trivially.

        * Intermediate, short-lived vm_cache_t's can be used to handle
          simple mappings, such as a disk stripe.  More complex mappings
          such as RAID5 and mirroring would require the vm_cache_t to
          remain in the chain in order to handle fail-over, parity
          consolidation, and other RAID features.

        * If you thought page tables were throwaway before, they are
          *truly* throwaway and abstracted now.  In fact, it is even
          theoretically possible to share page tables across like maps
          (e.g. 100 processes mmap()ing the same file shared+rw or ro and
          poking at it need not cost more than one common page table).

    The biggest operational feature is that when you look up a page in the
    buffer cache, you are now looking up an arbitrary (vnode, block#) pair
    and getting a vm_cache_t in return.  And when I say 'arbitrary vnode',
    I really mean arbitrary: it can even be a pseudo vnode representing a
    memory map for a process, for example.  In fact, all the side effects,
    both synchronous and asynchronous, wind up being encapsulated in the
    buffer cache code with very little additional complexity.
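    As a minimal sketch of that core (vnode, block#) lookup, building on
    the vm_cache_t sketched above (the hash table and the names
    vc_hashkey() and vc_lookup() are invented for illustration; this is
    not FreeBSD code):

        #include <stddef.h>
        #include <stdint.h>

        #define VC_HASHSIZE     4096    /* assumed power-of-two table size */

        static struct vm_cache *vc_hash[VC_HASHSIZE];   /* hash chain heads */

        /* Mix the vnode pointer and the block number into a table index. */
        static unsigned int
        vc_hashkey(struct vnode *vn, block_t blkno)
        {
            return (unsigned int)(((uintptr_t)vn >> 4) ^ (uintptr_t)blkno) &
                (VC_HASHSIZE - 1);
        }

        /*
         * Return the vm_cache_t for (vn, blkno), or NULL if it is not
         * cached.  The vnode can be a regular file, a raw device, or a
         * pseudo vnode abstracting a process memory map; the lookup does
         * not care which.
         */
        static vm_cache_t *
        vc_lookup(struct vnode *vn, block_t blkno)
        {
            vm_cache_t *vc;

            for (vc = vc_hash[vc_hashkey(vn, blkno)]; vc != NULL;
                vc = vc->vc_hnext) {
                if (vc->vc_vnode == vn && vc->vc_blkno == blkno)
                    return vc;
            }
            return NULL;
        }

    Routines like the bcread()/bcwrite() described next would presumably
    sit on top of a lookup like this, grabbing one of the preallocated
    vm_cache_t's and issuing the asynchronous I/O on a miss.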
    The cache routines devolve down into (taking from some of the
    unrelated-to-FreeBSD code that I am working on in my side project):

        vm_cache_t *bcread(VNode *vn, block_t blkNo);

            Look up (vn, block#), issue asynchronous I/O as required, and
            return a vm_cache_t whose underlying page is either suitable
            for reading or undergoing I/O that we can block on if we want.

        vm_cache_t *bcwrite(VNode *vn, block_t blkNo);

            Look up (vn, block#), issue asynchronous I/O as required, and
            return a vm_cache_t whose underlying page is either suitable
            for writing or undergoing I/O that we can block on if we want.
            This may also result in a copy-on-write, a chain detachment,
            or other side effects necessary to return a writable page.
            (E.g. the vnode might represent a file in a MAP_SHARED+RW
            situation, or might be a dummy vnode that abstracts a
            MAP_PRIVATE+RW situation.  A copy-on-write would allocate a
            new page and associate it directly with the dummy vnode.  A
            fork would fork the dummy vnode into two new vnodes A and B
            that both abstract the original dummy vnode V, which itself
            abstracted the file.)

        vm_cache_t *bcfree(VNode *vn, block_t blkNo);

            Indicate that a page can be thrown away entirely.

    There are all sorts of cool optimizations that can be implemented with
    this sort of scheme as well, but I'm only going to mention the most
    common one: caching block maps in order to avoid having to dive into a
    device in the 'cache' case.  For example, you can abstract a file
    block map like this (note the negative block number supplied to
    bcread(); also note that this whole schmear could be encapsulated and
    hidden in a procedure):

        block_t blkMapNo = fileOffset >> MMUPageShift;

        vc = bcread(vnFile, -blkMapNo);         /* read the block map page */
        while ((kvmptr = bckvmaddress(vc)) == NULL)
            block;                              /* wait for the map I/O to finish */
        chainBlk = ((block_t *)kvmptr)[blkMapNo & (MMUBlocksInPage - 1)];
        if (chainBlk == (block_t)-1)
            fvc = ... attach zero-fill page ... /* hole: no backing block yet */
        else
            fvc = bcread(vnFile->vn_UnderlyingDevice, chainBlk);

    Etc.  I could go on forever.

                                        -Matt

    Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet
                    Communications & God knows what else.
                    (Please include original email in any response)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message