Date: Mon, 23 Nov 1998 20:33:57 -0800 (PST)
From: Matthew Dillon
Message-Id: <199811240433.UAA12209@apollo.backplane.com>
To: Luoqi Chen
Cc: cnielsen@pobox.com, freebsd-hackers@FreeBSD.ORG
Subject: Re: Kernel threads
References: <199811240328.WAA26688@lor.watermarkgroup.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG

:> needs to be done to fix it? I'm curious because I was looking into writing
:> an encrypting VFS layer, but I decided it was biting off more than I could
:> chew when I discovered the VFS stacking problems.
:>
:We need a cache manager (as described in Heidemann's paper) to ensure
:coherence between vm objects hanging off vnodes from upper and lower layer.
:For a transparent layer (no data translation, such as null/union/umap),
:maybe we could do without the cache manager by implementing vm object
:sharing between upper and lower vnodes. But for a data translation layer,
:a cache manager is definitely required.
:
:-lq

    Ok, I'm going to put forth a general wide-view infomercial for general
    discussion.  I am definitely *not* planning on implementing this here,
    since it would be too much work for one person (or even two), but it's
    basically something I have been working on for one of my side OS
    projects.

    What we really need to do is integrate the VFS layer abstraction into
    the buffer cache and get rid of all the baggage in the buffer pointer
    (bp) layer.  That is, abstract the backing store out of the core
    vm_page_t structure and put it into a vm_cache_t type of structure.
    We then preallocate the vm_page_t's as we currently do (one per
    physical page of memory), and preallocate the smaller vm_cache_t
    backing store reference structures as well (for example, four times
    as many vm_cache_t's as we have vm_page_t's).

    We get rid of the KVM maps entirely and instead do a simple direct map
    of all physical memory into KVM (John doesn't like this part because
    it limits the amount of physical memory to the KVM address space,
    typically 2GB on a 32-bit machine).  Device drivers would no longer
    use consolidated KVM buffers (e.g. the filesystem code would make
    distinctions on page boundaries when indexing into filesystem buffers
    rather than on filesystem block boundaries; raw disk units would
    consolidate disk I/O into page-sized blocks, which they pretty much do
    anyway).

    Each vm_cache_t would contain a reference to an underlying vm_page_t,
    the backing store abstraction (vnode, block#), a lock, and a chain to
    other vm_cache_t's sharing the same page.  vm_cache_t's are mostly
    throw-away structures, but with strict unwinding rules to maintain
    cache consistency.
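    As a minimal sketch, a vm_cache_t along those lines might look
    something like the following (all field names and types here are
    illustrative assumptions, not actual code from FreeBSD or from the
    side project):

        #include <stdint.h>

        struct vm_page;                 /* existing per-physical-page structure */
        struct vnode;

        typedef int32_t block_t;        /* signed, so negative block numbers can
                                           index metadata such as block maps */

        typedef struct vm_cache {
            struct vm_page  *vc_page;   /* underlying physical page */
            struct vnode    *vc_vnode;  /* backing store: which vnode... */
            block_t         vc_blkno;   /* ...and which block within it */
            int             vc_lock;    /* placeholder for the per-entry lock */
            struct vm_cache *vc_chain;  /* other vm_cache_t's layered over the
                                           same page (VFS stacking, COW shadows) */
            struct vm_cache *vc_hnext;  /* hash chain for (vnode, block#) lookups */
        } vm_cache_t;

    On a 32-bit machine a structure like this comes to roughly 24-32
    bytes, in line with the 32/64 byte figures used for the memory
    overhead estimates below.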
    To illustrate the unwinding rules: the vm_cache_t representing the
    ultimate physical backing store of an object cannot be deleted until
    the vm_cache_t's higher in the chain (referencing the same page) are
    deleted, but vm_cache_t's in the middle of a stacking chain might very
    well be deletable if they can be resynthesized later.  It also becomes
    trivial to break and unify chains as needed.  For example, a 'hole' or
    'fragment' in a file that grows into a page can be unified with the
    disk backing store for the file when the full block is allocated by
    the filesystem, but can remain detached until then.

    Here are the disadvantages:

        * No buffer concatenation in KVM, i.e. if the filesystem block
          size is 8K, the filesystem code must still look up data in the
          cache on page boundaries.

        * All of physical memory (or most of it, anyway) must be mapped
          into KVM for easy reference, which typically gives us a 2G limit
          on a 32-bit cpu.

        * KVM maps go away entirely or almost entirely (we might have to
          keep a few around to optimize memory management of some of the
          larger kernel structures).

        * vm_cache_t eats wired memory.  If it is a 64-byte structure and
          we have four for each page of physical memory, we eat 6.25% of
          physical memory for our preallocated vm_cache_t array.  With a
          32-byte structure size we eat 3.125% of physical memory.  (Even
          so, I would still recommend preallocation in order to eke out
          extreme performance.)

    Here are the advantages:

        * Supervisor code does not have to touch UVM at all, which means
          it doesn't have to touch page tables for kernel operations
          (except to map/unmap things for user processes), i.e. there is
          no hardware MMU interaction required for kernel-mode operations.

        * No KVM maps need be maintained for kernel memory pools, buffer
          pointers, or other related items.  We would still need them for
          the per-process kernel stack and a few of the larger kernel
          structures.

        * Breaking and consolidating VFS stacks is trivialized.

        * Physical devices are well and truly abstracted into a VNode with
          no loss in performance.

        * Most of the kernel-level page and VM locking issues related to
          manipulation of kernel data go away.

        * The remaining locking issues become clearer.

        * vm_cache_t structures can be used to temporarily cluster backing
          store operations, including swap operations, delaying
          consolidation of pages into swap structures.

        * 'Dummy' vnodes can be used to abstract memory maps almost
          trivially.

        * Intermediate, short-lived vm_cache_t's can be used to handle
          simple mappings, such as a disk stripe.  More complex mappings
          such as RAID5 and mirroring would require the vm_cache_t to
          remain in the chain in order to handle fail-over, parity
          consolidation, and other RAID features.

        * If you thought page tables were throwaway before, they are
          *truly* throwaway and abstracted now.  In fact, it is even
          theoretically possible to share page tables across like maps
          (e.g. 100 processes mmap()ing the same file shared+rw or ro and
          poking at it need not cost more than one common page table).

    The biggest operational feature is that when you look up a page in the
    buffer cache, you are now looking up an arbitrary (vnode, block#) pair
    and getting a vm_cache_t in return.  And when I say 'arbitrary vnode',
    I really mean arbitrary: it can even be a pseudo vnode representing a
    memory map for a process, for example.  In fact, all the side effects,
    both synchronous and asynchronous, wind up being encapsulated in the
    buffer cache code with very little additional complexity.
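    As a minimal sketch of that core (vnode, block#) lookup, building on
    the vm_cache_t sketched above (the hash table and the names
    vc_hashkey() and vc_lookup() are invented for illustration; this is
    not FreeBSD code):

        #include <stddef.h>
        #include <stdint.h>

        #define VC_HASHSIZE     4096    /* assumed power-of-two table size */

        static struct vm_cache *vc_hash[VC_HASHSIZE];   /* hash chain heads */

        /* Mix the vnode pointer and the block number into a table index. */
        static unsigned int
        vc_hashkey(struct vnode *vn, block_t blkno)
        {
            return (unsigned int)(((uintptr_t)vn >> 4) ^ (uintptr_t)blkno) &
                (VC_HASHSIZE - 1);
        }

        /*
         * Return the vm_cache_t for (vn, blkno), or NULL if it is not
         * cached.  The vnode can be a regular file, a raw device, or a
         * pseudo vnode abstracting a process memory map; the lookup does
         * not care which.
         */
        static vm_cache_t *
        vc_lookup(struct vnode *vn, block_t blkno)
        {
            vm_cache_t *vc;

            for (vc = vc_hash[vc_hashkey(vn, blkno)]; vc != NULL;
                vc = vc->vc_hnext) {
                if (vc->vc_vnode == vn && vc->vc_blkno == blkno)
                    return vc;
            }
            return NULL;
        }

    Routines like the bcread()/bcwrite() described next would presumably
    sit on top of a lookup like this, grabbing one of the preallocated
    vm_cache_t's and issuing the asynchronous I/O on a miss.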
    The cache routines devolve down into (taking from some of the
    unrelated-to-FreeBSD code that I am working on in my side project):

        vm_cache_t *bcread(VNode *vn, block_t blkNo);

            Look up (vn, block#), issue asynchronous I/O as required, and
            return a vm_cache_t whose underlying page is either suitable
            for reading or undergoing I/O that we can block on if we want.

        vm_cache_t *bcwrite(VNode *vn, block_t blkNo);

            Look up (vn, block#), issue asynchronous I/O as required, and
            return a vm_cache_t whose underlying page is either suitable
            for writing or undergoing I/O that we can block on if we want.
            This may also result in a copy-on-write, a chain detachment,
            or other side effects necessary to return a writable page.
            (E.g. the vnode might represent a file in a MAP_SHARED+RW
            situation, or might be a dummy vnode that abstracts a
            MAP_PRIVATE+RW situation.  A copy-on-write would allocate a
            new page and associate it directly with the dummy vnode.  A
            fork would fork the dummy vnode into two new vnodes A and B
            that both abstract the original dummy vnode V, which itself
            abstracted the file.)

        vm_cache_t *bcfree(VNode *vn, block_t blkNo);

            Indicate that a page can be thrown away entirely.

    There are all sorts of cool optimizations that can be implemented with
    this sort of scheme as well, but I'm only going to mention the most
    common one: caching block maps in order to avoid having to dive into a
    device in the 'cache' case.  For example, you can abstract a file
    block map like this (note the negative block number supplied to
    bcread(); also note that this whole schmear could be encapsulated and
    hidden in a procedure):

        block_t blkMapNo = fileOffset >> MMUPageShift;

        vc = bcread(vnFile, -blkMapNo);         /* read the block map page */
        while ((kvmptr = bckvmaddress(vc)) == NULL)
            block;                              /* wait for the map I/O to finish */
        chainBlk = ((block_t *)kvmptr)[blkMapNo & (MMUBlocksInPage - 1)];
        if (chainBlk == (block_t)-1)
            fvc = ... attach zero-fill page ... /* hole: no backing block yet */
        else
            fvc = bcread(vnFile->vn_UnderlyingDevice, chainBlk);

    Etc.  I could go on forever.

                                        -Matt

    Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet
                    Communications & God knows what else.
                    (Please include original email in any response)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message