Date: Mon, 20 Mar 2000 09:36:22 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200003201736.JAA70124@apollo.backplane.com>
To: Poul-Henning Kamp
Cc: current@FreeBSD.ORG
Subject: Re: patches for test / review
References: <18039.953549289@critter.freebsd.dk>

:> Kirk and I have already mapped out a plan to drastically update
:> the buffer cache API which will encapsulate much of the state within
:> the buffer cache module.
:
:Sounds good.  Combined with my stackable BIO plans that sounds like
:a really great win for FreeBSD.
:
:--
:Poul-Henning Kamp             FreeBSD coreteam member
:phk@FreeBSD.ORG               "Real hackers run -current on their laptop."

    I think so.  I can give -current a quick synopsis of the plan,
    but I've probably forgotten some of the bits (note: the points
    below are not in any particular order).

    Probably the most important thing to keep in mind when reading
    over this list is that nearly all the changes being contemplated
    can be implemented without breaking the current interfaces.  The
    current interfaces can then be shifted over to the new ones one
    subsystem at a time (shift, test, shift, test, ...) until none of
    the original use remains.  At that point the support for the
    original API can be removed.

    * Make the VOP locking calls recursive.  That is, obtain exclusive
      recursive locks by default rather than non-recursive locks.

    * Clean up all the VOP_*() interfaces with regard to the special
      handling of the case where a locked vnode is passed in, a locked
      vnode is returned, and the returned vnode winds up being the
      same as the one passed in (allow a double-locked vnode on return
      and get rid of all the stupid code that juggles locks around to
      work around the non-recursive nature of the current exclusive
      locks).  VOP_LOOKUP is the most confused interface and tops the
      cleanup list.

      With only a small amount of additional work, mainly KASSERTs to
      catch potential problems, we should be able to turn on exclusive
      recursion.  The VOP_*() interfaces will have to be fixed one at
      a time, with VOP_LOOKUP first.

    * Make exclusive buffer cache locks recursive.  Kirk has completed
      all the preliminary work on this and we should be able to just
      turn it on; we simply haven't gotten around to it (and the
      release got in the way).  This is necessary to support upcoming
      softupdates mechanisms (e.g. background fsck, snapshot dumps) as
      well as to better support device recursion.

    * Clean up the buffer cache API (bread(), BUF_STRATEGY(), and so
      forth).  Specifically, split out the call functionality so the
      buffer cache can determine whether a buffer being obtained is
      going to be used for reading or for writing.  At the moment we
      do not know whether the system is going to dirty a buffer until
      after the fact, and this has caused a lot of pain in dealing
      with low-memory situations.

	getblk() -> getblk_sh() and getblk_ex()

	    Obtain a bp without issuing I/O, taking either a shared or
	    an exclusive lock on it.  With a shared lock you may issue
	    READ I/O but you may not modify the contents of the
	    buffer.  With an exclusive lock you may issue both READ
	    and WRITE I/O and you may modify the contents of the
	    buffer.

	bread() -> bread_sh() and bread_ex()

	    Obtain and validate (issuing read I/O as appropriate) a
	    bp.  bread_sh() allows the buffer to be accessed but not
	    modified or rewritten.  bread_ex() allows the buffer to be
	    modified and written.
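    To make that concrete, here is a rough sketch of how the split
    might look.  Nothing below is committed code -- the names come
    from the proposal above and the argument lists simply mirror
    today's getblk() and bread():

	#include <sys/param.h>
	#include <sys/buf.h>
	#include <sys/vnode.h>
	#include <sys/ucred.h>

	/*
	 * Illustrative prototypes only.  The _sh variants take a
	 * shared lock (read-only access), the _ex variants take an
	 * exclusive lock (read/write access).
	 */
	struct buf *getblk_sh(struct vnode *vp, daddr_t blkno,
			int size, int slpflag, int slptimeo);
	struct buf *getblk_ex(struct vnode *vp, daddr_t blkno,
			int size, int slpflag, int slptimeo);
	int bread_sh(struct vnode *vp, daddr_t blkno, int size,
			struct ucred *cred, struct buf **bpp);
	int bread_ex(struct vnode *vp, daddr_t blkno, int size,
			struct ucred *cred, struct buf **bpp);

	/*
	 * A hypothetical read-only consumer.  Because it declares
	 * its intent up front, the buffer cache knows this buffer
	 * will never be dirtied, and two readers of the same block
	 * no longer serialize against each other.
	 */
	static int
	inspect_block(struct vnode *vp, daddr_t lbn, int bsize,
	    struct ucred *cred)
	{
		struct buf *bp;
		int error;

		error = bread_sh(vp, lbn, bsize, cred, &bp);
		if (error)
			return (error);
		/* ... look at bp->b_data, but do not modify it ... */
		brelse(bp);
		return (0);
	}

    The point is that the read path can run entirely on shared locks,
    and the cache learns at acquisition time whether a write is
    coming instead of discovering a dirtied buffer after the fact.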
    * Many uses of the buffer cache in the critical path do not
      actually require the buffer data to be mapped into KVM.  For
      example, a number of I/O devices need only the b_pages[] array
      and do not need a b_data mapping at all.  It would not take a
      huge amount of work to adjust the uiomove*() interfaces
      appropriately.

      The general plan is to try to remove whole portions of the
      current buffer cache functionality and shift them into the new
      vm_pager_*() API -- that is, to operate on VM objects directly
      whenever possible.  The idea is to reduce the buffer cache to
      something used solely to issue device I/O and to keep track of
      dirty areas for proper sequencing of I/O (e.g. softupdates' use
      of the buffer cache to placemark I/O will not change).  The
      core buffer cache code would no longer map things into KVM
      through b_data; that functionality would shift to the VM object
      vm_pager_*() API.  The buffer cache would continue to use the
      b_pages[] array mechanism to collect pages for I/O, for
      clustering, and so forth.

      It should be noted that the buffer cache's perceived slowness
      is almost entirely due to all the KVM manipulation it does for
      b_data, and that such manipulation is not necessary for the
      vast majority of the critical path: reading and writing file
      data (which can run through the VM object API) and issuing I/O
      (which can avoid b_data KVM mappings entirely).  Meta-data,
      such as inode and bitmap blocks, will almost certainly still
      require b_data mappings.  It would be far too much work to
      change those at this time, but meta-data is not in the critical
      path, so that is not a big deal.

    * VOP_PUTPAGES() and VOP_GETPAGES().  During discussions with Tor
      on how to implement O_DIRECT I/O (direct I/O that bypasses the
      buffer cache), we appear to have hit upon a solution that
      dovetails well with the other plans.  The API change is simply
      to declare that the VOP_PUTPAGES() and VOP_GETPAGES() calls
      become a lower-level direct-I/O interface that bypasses the
      buffer cache.  This means reimplementing them in UFS (not
      difficult) and in a few other filesystems (because they
      currently fall back to a degenerate case using the IO_VMIO flag
      and then run through the normal read/write VOP_*() calls, which
      obviously go through the buffer cache).  Only minor changes are
      required to the callers of VOP_PUTPAGES()/VOP_GETPAGES() -- for
      the most part the callers *ALREADY* assume a direct-I/O-ish
      interface.

      The additional advantages are glaringly obvious: it instantly
      gives us an optimal path through the VFS layers for things like
      VN, CCD, and even vinum, if Greg ever wants to support
      file-based vinum partitions.  Not to mention an O_DIRECT file
      I/O flag (ala Solaris or IRIX, but without the silly
      restrictions on mixing buffer-cache and non-buffer-cache I/O).

						-Matt
						Matthew Dillon
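    P.S. To make the O_DIRECT idea concrete, a userland consumer
    would look just like its Solaris/IRIX counterparts.  The flag is
    hypothetical here -- nothing is in the tree yet:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/*
	 * Hypothetical O_DIRECT consumer: reads bypass the buffer
	 * cache and run through the proposed VOP_GETPAGES()
	 * direct-I/O path.  Large, block-aligned transfers benefit
	 * the most.
	 */
	int
	main(int argc, char **argv)
	{
		static char buf[64 * 1024];
		ssize_t n;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s file\n", argv[0]);
			exit(1);
		}
		if ((fd = open(argv[1], O_RDONLY | O_DIRECT)) < 0) {
			perror("open");
			exit(1);
		}
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			write(STDOUT_FILENO, buf, n);
		close(fd);
		return (0);
	}

    Unlike the Solaris/IRIX versions, there would be no restriction
    on mixing this with normal buffer-cache I/O on the same file.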