Date: Mon, 23 Nov 1998 20:33:57 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Luoqi Chen <luoqi@watermarkgroup.com>
Cc: cnielsen@pobox.com, freebsd-hackers@FreeBSD.ORG
Subject: Re: Kernel threads
Message-ID: <199811240433.UAA12209@apollo.backplane.com>
References: <199811240328.WAA26688@lor.watermarkgroup.com>
:> needs to be done to fix it? I'm curious because I was looking into writing
:> an encrypting VFS layer, but I decided it was biting off more than I could
:> chew when I discovered the VFS stacking problems.
:>
:We need a cache manager (as described in Heidemann's paper) to ensure
:coherence between vm objects hanging off vnodes from upper and lower layer.
:For a transparent layer (no data translation, such as null/union/umap),
:maybe we could do without the cache manager by implementing vm object sharing
:between upper and lower vnodes. But for a data translation layer, a cache
:manager is definitely required.
:
:-lq
Ok, I'm going to put forth a wide-view infomercial for general
discussion. I am definitely *not* planning on implementing this here;
it would be too much work for one person (or even two), but it's basically
something I have been working on for one of my side OS projects.
What we really need to do is integrate the VFS layer abstraction into
the buffer cache and get rid of all the baggage in the buffer pointer
(bp) layer. That is, abstract the backing store out of the core
vm_page_t structure and put it into a vm_cache_t type of structure.
We then preallocate the vm_page_t's as we currently do (one per physical
page of memory), and preallocate the smaller vm_cache_t backing store
reference structures as well. For example, preallocate 4x the number
of vm_cache_t's as we have vm_page_t's. We get rid of the KVM maps
entirely and instead do a simple direct map of all physical memory
into KVM (John doesn't like this part because it limits the amount of
physical memory to the KVM address space, typically 2GB on a 32 bit
machine). Device drivers would no longer use consolidated KVM buffers
(e.g. the filesystem code would make distinctions on page boundaries when
indexing into filesystem buffers rather than on filesystem block boundaries.
Raw disk units would consolidate disk I/O into page-sized blocks, which
they pretty much do anyway).
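To make the direct-map idea concrete, here is a minimal sketch
(KVM_DIRECT_BASE and both helper names are purely illustrative, not
existing kernel code): a physical address translates to a kernel pointer
with plain arithmetic, with no page tables or KVM maps involved.

    #define KVM_DIRECT_BASE 0xC0000000UL    /* assumed base of the direct map */

    /* translate a physical address to its fixed KVM alias */
    static __inline void *
    phys_to_kvm(unsigned long pa)
    {
        return ((void *)(KVM_DIRECT_BASE + pa));
    }

    /* and back again */
    static __inline unsigned long
    kvm_to_phys(void *kva)
    {
        return ((unsigned long)kva - KVM_DIRECT_BASE);
    }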
Each vm_cache_t would contain a reference to an underlying vm_page_t,
the backing store abstraction (vnode, block#), a lock, and a chain to
other vm_cache_t's sharing the same page. vm_cache_t's are mostly
throw-away structures, but with strict unwinding rules to maintain
cache consistency. For example, the vm_cache_t representing the
ultimate physical backing store of an object cannot be deleted until
vm_cache_t's higher in the chain (referencing the same page) are
deleted, but vm_cache_t's in the middle of a stacking chain might
very well be deletable if they can be resynthesized later. It also
becomes trivial to break and unify chains as circumstances dictate. For example, a
'hole' or 'fragment' in a file that grows into a page can be unified
with the disk backing store for the file when the full block is allocated
by the filesystem, but can remain detached until then.
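For concreteness, a vm_cache_t might look something like this (the field
names, types, and flags are illustrative only, not a proposed layout):

    typedef struct vm_cache {
        struct vm_page  *vc_page;       /* underlying physical page */
        struct vnode    *vc_vnode;      /* backing store: which vnode */
        block_t         vc_blkno;       /* backing store: which block */
        struct lock     vc_lock;        /* per-entry lock */
        struct vm_cache *vc_chain;      /* next vm_cache_t sharing this page */
        int             vc_flags;       /* dirty, busy, resynthesizable, ... */
    } vm_cache_t;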
Here are the disadvantages:
* no buffer concatenation in KVM, i.e. if the filesystem block size
is 8K, the filesystem code must still look up data
in the cache on page boundaries.
* All of physical memory must be mapped into KVM for easy reference.
(typically gives us a 2G limit on a 32 bit cpu). Or most of it,
anyway.
* KVM maps go away entirely or almost entirely (We might have to keep
a few around to optimize memory management of some of the larger
kernel structures).
* vm_cache_t eats wired memory. If it is a 64 byte structure and we
have four for each page of physical memory, we eat 6.25% of physical
memory for our preallocated vm_cache_t array. With a 32 byte
structure size we eat 3.125% of physical memory (the arithmetic is
worked below). Even so, I would still recommend preallocation in
order to eke out extreme performance.
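The arithmetic behind those percentages, assuming 4K pages and four
preallocated vm_cache_t's per physical page:

    4 * 64 bytes / 4096 bytes per page = 256 / 4096 = 6.25%
    4 * 32 bytes / 4096 bytes per page = 128 / 4096 = 3.125%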
Here are the advantages:
* supervisor code does not have to touch UVM at all, which means it
doesn't have to touch page tables for kernel operations (except
to map/unmap things for user processes). i.e. there is no hardware
MMU interaction required for kernel-mode operations.
* no KVM maps need be maintained for kernel memory pools, buffer
pointers, or other related items. We would still need them for
the per-process kernel stack and a few of the larger kernel
structures.
* breaking and consolidating VFS stacks is trivialized.
* physical devices are well and truly abstracted into a VNode with
no loss in performance.
* most of the kernel level page and VM locking issues related
to manipulation of kernel data go away.
* remaining locking issues become more clear.
* vm_cache_t structures can be used to temporarily cluster backing
store operations, including swap operations, delaying consolidation
of pages into swap structures.
* 'dummy' vnodes can be used to abstract memory maps almost
trivially.
* intermediate short-lasting vm_cache_t's can be used to handle simple
mappings, such as a disk stripe. More complex mappings such as
RAID5 and MIRRORing would require the vm_cache_t to remain in the
chain in order to handle fail-over, parity consolidation, and
other RAID features.
* If you thought page tables were throwaway before, they are
*truly* throwaway and abstracted now. In fact, it is even
theoretically possible to share pagetables across like-maps
(e.g. 100 processes mmap()ing the same file shared+rw or ro and
poking at it need not cost more than one common pagetable).
The biggest operational feature is that when you lookup a page in the
buffer cache, you are now looking up an arbitrary (vnode,block#) pair
and getting a vm_cache_t in return. And when I say 'arbitrary vnode',
I really mean 'arbitrary vnode'. This can be inclusive of a pseudo vnode
representing a memory map for a process, for example. In fact, all the
side effects, both synchronous and asynchronous, wind up being encapsulated
in the buffer cache code with very little additional complexity.
The cache routines devolve down into (taking from some of the
unrelated-to-FreeBSD code that I am working on in my side project):
vm_cache_t *bcread(VNode *vn, block_t blkNo);
Lookup (vn,block#), issue asynchronous I/O as required, return
vm_cache_t with underlying page either suitable for reading or
undergoing I/O that we can block on if we want.
vm_cache_t *bcwrite(VNode *vn, block_t blkNo);
Lookup (vn,block#), issue asynchronous I/O as required, return
vm_cache_t with underlying page either suitable for writing or
undergoing I/O that we can block on if we want. This may also
result in a copy-on-write, chain detachment, or other side effects
necessary to return a writable page.
(e.g. the vnode might represent a file in a MAP_SHARED+RW
situation, or might represent a dummy vnode that abstracts
a MAP_PRIVATE+RW situation. A copy on write would allocate a
new page and associate it directly with the dummy vnode. A fork
would fork the dummy vnode into two new vnodes A and B that both
abstract the original dummy vnode V that itself abstracted the
file).
vm_cache_t *bcfree(VNode *vn, block_t blkNo);
Indicate that a page can be thrown away entirely.
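For example, a caller wanting a block's contents might look roughly like
this; bcwait() is a made-up helper standing in for the "block on it if we
want" step, and bckvmaddress() is borrowed from the block-map example
further down:

    ssize_t
    read_one_block(VNode *vn, block_t blkNo, void *dst, size_t len)
    {
        vm_cache_t *vc;
        void *kvmptr;

        vc = bcread(vn, blkNo);                 /* may kick off async I/O */
        while ((kvmptr = bckvmaddress(vc)) == NULL)
            bcwait(vc);                         /* hypothetical: sleep until valid */
        bcopy(kvmptr, dst, len);                /* copy out of the cached page */
        return (len);
    }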
There are all sorts of cool optimizations that can be implemented with
this sort of scheme as well, but I'm only going to mention the most common
one: caching block maps in order to avoid having to dive into a device
in the 'cache' case. For example, you can abstract a file block map
like this (note the negative block number supplied to bcread())
(also note that this whole schmear could be encapsulated and hidden
in a procedure):
    block_t blkMapNo = fileOffset >> MMUPageShift;

    vc = bcread(vnFile, -blkMapNo);             /* negative block# = block map */
    while ((kvmptr = bckvmaddress(vc)) == NULL)
        block;                                  /* wait for the map page's I/O */

    chainBlk = ((block_t *)kvmptr)[blkMapNo & (MMUBlocksInPage - 1)];
    if (chainBlk == (block_t)-1)
        fvc = ... attach zero-fill page ...     /* hole in the file */
    else
        fvc = bcread(vnFile->vn_UnderlyingDevice, chainBlk);
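And the "hidden in a procedure" version might look roughly like this;
everything other than bcread() and bckvmaddress() is invented for
illustration:

    static vm_cache_t *
    bcreadfile(VNode *vnFile, off_t fileOffset)
    {
        block_t blkMapNo = fileOffset >> MMUPageShift;
        vm_cache_t *vc;
        block_t *map;
        block_t chainBlk;

        vc = bcread(vnFile, -blkMapNo);         /* block-map metadata page */
        while ((map = (block_t *)bckvmaddress(vc)) == NULL)
            bcwait(vc);                         /* hypothetical blocking call */

        chainBlk = map[blkMapNo & (MMUBlocksInPage - 1)];
        if (chainBlk == (block_t)-1)
            return (bczerofill(vnFile, blkMapNo));      /* hole: hypothetical zero-fill */
        return (bcread(vnFile->vn_UnderlyingDevice, chainBlk));
    }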
Etc. I could go on forever.
-Matt
Matthew Dillon Engineering, HiWay Technologies, Inc. & BEST Internet
Communications & God knows what else.
<dillon@backplane.com> (Please include original email in any response)
