Date:      Mon, 20 Mar 2000 09:36:22 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc:        current@FreeBSD.ORG
Subject:   Re: patches for test / review 
Message-ID:  <200003201736.JAA70124@apollo.backplane.com>
References:   <18039.953549289@critter.freebsd.dk>


:
:
:>    Kirk and I have already mapped out a plan to drastically update
:>    the buffer cache API which will encapsulate much of the state within
:>    the buffer cache module.
:
:Sounds good.  Combined with my stackable BIO plans that sounds like
:a really great win for FreeBSD.
:
:--
:Poul-Henning Kamp             FreeBSD coreteam member
:phk@FreeBSD.ORG               "Real hackers run -current on their laptop."

    I think so.  I can give -current a quick synopsis of the plan, but I've
    probably forgotten some of the bits (note: the points below are not
    in any particular order):

    Probably the most important thing to keep in mind when reading over
    this list is to note that nearly all the changes being contemplated 
    can be implemented without breaking current interfaces, and the current
    interfaces can then be shifted over to the new interfaces one subsystem
    at a time (shift, test, shift, test, shift, test) until none of the
    original use remains.  At that point the support for the original API
    can be removed.

    * make VOP locking calls recursive.  That is, obtain exclusive
      recursive locks by default rather than non-recursive locks, as in
      the sketch below.
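
      A minimal sketch of the intended default, using existing lockmgr
      flag names (today the second vn_lock() below would deadlock unless
      LK_CANRECURSE were passed explicitly; under the proposal recursion
      becomes the default for exclusive vnode locks):

	vn_lock(vp, LK_EXCLUSIVE | LK_CANRECURSE | LK_RETRY, p);
	vn_lock(vp, LK_EXCLUSIVE | LK_CANRECURSE | LK_RETRY, p);
	/* ... */
	VOP_UNLOCK(vp, 0, p);	/* pops one recursion level */
	VOP_UNLOCK(vp, 0, p);	/* releases the lock */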

    * cleanup all VOP_*() interfaces with regard to the special handling
      of the case where a locked vnode is passed in, a locked vnode is
      returned, and the returned vnode happens to wind up being the same
      as the one passed in (allow a double-locked vnode on return and get
      rid of all the stupid code that juggles locks around to work around
      the non-recursive nature of the current exclusive locks).

      VOP_LOOKUP is the most confused of these interfaces and the one
      most in need of cleanup.

      With only a small amount of additional work, mainly KASSERTs to
      catch potential problems, we should be able to turn on exclusive
      recursion.  The VOP_*() interfaces will have to be fixed one at
      a time, with VOP_LOOKUP topping the list; a sketch of the intended
      convention follows.
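
      A hedged illustration of the convention being targeted (the
      vp == dvp test is the "lookup returned the locked directory
      itself" case described above):

	error = VOP_LOOKUP(dvp, &vp, cnp);
	if (error == 0 && vp == dvp) {
		/*
		 * With recursive exclusive locks the directory simply
		 * comes back locked twice; the caller pairs each level
		 * with a VOP_UNLOCK() instead of juggling an
		 * unlock/relock sequence.
		 */
		VOP_UNLOCK(dvp, 0, p);	/* drop the extra level */
	}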

    * Make exclusive buffer cache locks recursive.  Kirk has completed all
      the preliminary work on this and we should be able to just turn it
      on.  We just haven't gotten around to it (and the release got in the
      way).  This is necessary to support upcoming softupdates mechanisms
      (e.g. background fsck, snapshot dumps) as well as to better support
      device recursion.
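
      An illustrative sketch, again using existing lockmgr flag names
      (the change is simply that recursion would be on by default):

	BUF_LOCK(bp, LK_EXCLUSIVE | LK_CANRECURSE);
	/* e.g. a snapshot or background-fsck path re-enters: */
	BUF_LOCK(bp, LK_EXCLUSIVE | LK_CANRECURSE);
	BUF_UNLOCK(bp);
	BUF_UNLOCK(bp);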

    * Cleanup the buffer cache API (bread(), BUF_STRATEGY(), and so forth).
      Specifically, split out the call functionality such that the buffer
      cache can determine whether a buffer being obtained is going to be
      used for reading or writing.  At the moment we don't know if the
      system is going to dirty a buffer until after the fact, and this has
      caused a lot of pain with regard to low-memory situations.

      getblk() -> getblk_sh() and getblk_ex()

	Obtain bp without issuing I/O, getting either a shared or exclusive
	lock on the bp.  With a shared lock you are allowed to issue READ
	I/O but you are not allowed to modify the contents of the buffer.
	With an exclusive lock you are allowed to issue both READ and WRITE
	I/O and you can modify the contents of the buffer.

      bread()  -> bread_sh() and bread_ex()

	Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
	allows a buffer to be accessed but not modified or rewritten.
	bread_ex() allows a buffer to be modified and written.
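
      Hypothetical prototypes for the split -- the names come from the
      plan above, but the argument lists (mirroring the existing
      getblk()/bread()) are an assumption:

	struct buf *getblk_sh(struct vnode *vp, daddr_t blkno, int size,
			int slpflag, int slptimeo);	/* read-only */
	struct buf *getblk_ex(struct vnode *vp, daddr_t blkno, int size,
			int slpflag, int slptimeo);	/* may dirty */
	int	bread_sh(struct vnode *vp, daddr_t blkno, int size,
			struct ucred *cred, struct buf **bpp);
	int	bread_ex(struct vnode *vp, daddr_t blkno, int size,
			struct ucred *cred, struct buf **bpp);

      With the shared/exclusive intent visible up front, the buffer
      cache can make its low-memory decisions before a buffer is dirtied
      rather than after the fact.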

    * Many uses of the buffer cache in the critical path do not actually 
      require the buffer data to be mapped into KVM.  For example, a number 
      of I/O devices need only the b_pages[] array and do not need a b_data
      mapping.  It would not take a huge amount of work to adjust the 
      uiomove*() interfaces appropriately.

      The general plan is to try to remove whole portions of the current
      buffer cache functionality and shift them into the new vm_pager_*()
      API.  That is, to operate on VM objects directly whenever possible.

      The idea is to reduce the buffer cache to a role solely concerned
      with issuing device I/O and keeping track of dirty areas for proper
      sequencing of I/O (e.g. softupdates' use of the buffer cache to
      placemark I/O will not change).  The core buffer cache code would
      no longer map things into KVM with b_data; that functionality would
      be shifted to the VM object vm_pager_*() API.  The buffer cache
      would continue to use the b_pages[] array mechanism to collect
      pages for I/O, for clustering, and so forth.


      It should be noted that the buffer cache's perceived slowness is
      almost entirely due to all the KVM manipulation it does for b_data,
      and that such manipulation is not necessary for the vast majority
      of the critical path: reading and writing file data (which can run
      through the VM object API) and issuing I/O (which can avoid b_data
      KVM mappings entirely).

      Metadata, such as inode and bitmap blocks, will almost certainly
      still require b_data mappings.  It would be far too much work to
      change those at this time, but metadata is not in the critical
      path, so this is not a big deal.
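
      As a hedged sketch of the direction -- uiomove_frompages() is a
      made-up name, no such interface exists today -- a file read could
      feed the uio straight from b_pages[] via transient per-page
      mappings instead of a persistent b_data mapping:

	/*
	 * Hypothetical helper: copy n bytes starting at offset boff
	 * within the buffer between bp->b_pages[] and the uio,
	 * mapping each page transiently rather than using bp->b_data.
	 */
	int	uiomove_frompages(vm_page_t *m, vm_ooffset_t boff, int n,
			struct uio *uio);

	error = uiomove_frompages(bp->b_pages, boff, n, uio);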

    * VOP_PUTPAGES() and VOP_GETPAGES().  During discussions with Tor on
      how to implement O_DIRECT I/O (direct I/O that bypasses the buffer
      cache), we appear to have hit upon a solution that dovetails well
      with the other plans.

      The API change is simply to say that the VOP_PUTPAGES() and
      VOP_GETPAGES() calls become a lower-level direct-I/O interface that
      bypasses the buffer cache.  This would mean reimplementing them in
      UFS (not difficult) and a few other filesystems (because they
      currently fall back to a degenerate case using the IO_VMIO flag and
      then run through the normal read/write VOP_*() calls, which
      obviously go through the buffer cache).

      Only minor changes are required to the callers of VOP_PUTPAGES() /
      VOP_GETPAGES() - for the most part the callers *ALREADY* assume 
      a direct-I/O-ish interface.

      Additional advantages are glaringly obvious -- it instantly gives us
      an optimal path through the VFS layers for things like VN, CCD, and
      even vinum if Greg ever wants to support file-based vinum partitions.
      Not to mention an O_DIRECT file I/O flag (a la Solaris or IRIX, but
      without the silly restrictions on mixing buffer-cache and
      non-buffer-cache I/O).
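
      For reference, the existing vnode_if signatures would not change --
      only their contract would, to "bypass the buffer cache":

	int	VOP_GETPAGES(struct vnode *vp, vm_page_t *m, int count,
			int reqpage, vm_ooffset_t offset);
	int	VOP_PUTPAGES(struct vnode *vp, vm_page_t *m, int count,
			int sync, int *rtvals, vm_ooffset_t offset);

      An O_DIRECT read, for example, might wire the pages backing the
      user's buffer and hand them straight to VOP_GETPAGES(), never
      touching the buffer cache.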

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


