Date:      Mon, 20 Mar 2000 13:55:05 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Alfred Perlstein <bright@wintelcom.net>
Cc:        Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG
Subject:   Re: patches for test / review
Message-ID:  <200003202155.NAA72035@apollo.backplane.com>
References:  <18039.953549289@critter.freebsd.dk> <200003201736.JAA70124@apollo.backplane.com> <20000320111544.A14789@fw.wintelcom.net>


:> 	lock on the bp.  With a shared lock you are allowed to issue READ
:> 	I/O but you are not allowed to modify the contents of the buffer.
:> 	With an exclusive lock you are allowed to issue both READ and WRITE
:> 	I/O and you can modify the contents of the buffer.
:> 
:>       bread()  -> bread_sh() and bread_ex()
:> 
:> 	Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
:> 	allows a buffer to be accessed but not modified or rewritten.
:> 	bread_ex() allows a buffer to be modified and written.
:
:This seems to allow for expressing intent to write to buffers,
:which would be an excellent place to cow the pages 'in software'
:rather than obsd's way of using cow'd pages to accomplish the same
:thing.

    Yes, absolutely.  DG (if I remember right) is rabid about not taking
    VM faults while sitting in the kernel and I tend to agree with him that
    it's a cop-out to use VM faults in the kernel to get around those
    sorts of problems.
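
    For concreteness, the entry points might look something like this
    (only the bread_sh/bread_ex names come from the proposal above; the
    argument list just mirrors the existing bread() and is purely
    illustrative):

	/*
	 * Illustrative prototypes only.  A shared bp may be read but
	 * not modified; an exclusive bp may be modified and rewritten.
	 */
	int	bread_sh(struct vnode *vp, daddr_t blkno, int size,
		    struct ucred *cred, struct buf **bpp);
	int	bread_ex(struct vnode *vp, daddr_t blkno, int size,
		    struct ucred *cred, struct buf **bpp);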

:I'm not sure if you remember what I brought up at BAFUG, but I'd
:like to see something along the lines of BX_BKGRDWRITE that Kirk
:is using for the bitmap blocks in softupdates to be enabled on a
:system wide basis.  That way rewriting data that has been sent to
:the driver isn't blocked and at the same time we don't need to page
:protect during every strategy call.
:
:I may have misunderstood your intent, but using page protections
:on each IO would seem to introduce a lot of performance issues that
:the rest of these points are all trying to get rid of.

    At the low-level device there is no concept of page protections.
    If you pass an array of vm_page_t's then that is where the data
    will be taken from or written to.

    A background-write capability is actually much more easily implemented
    at the VM Object level than the buffer cache level.  If you think about
    it, all you need to do is add another VM Object layer *below* the
    one representing the device.  Whenever a device write is initiated the
    pages are moved to the underlying layer.  If a process (or the kernel)
    needs to modify the pages while the write is in progress, a copy-on-write
    occurs through normal mechanisms.  On completion of the I/O the pages
    are moved back to the main VM Object device layer except for those that
    would conflict with any copy-on-write that occurred (the original device
    pages in the conflict case simply get thrown away).

    Problem solved.  Plus this deals with low-memory situations properly...
    we do not introduce any new deadlocks.
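
    In rough pseudo-code (every name below is made up for illustration,
    none of it exists yet):

	/*
	 * Hypothetical sketch.  'push' moves the pages into the VM
	 * Object layer below the device object before the write is
	 * issued; 'pop' moves the survivors back on I/O completion.
	 */
	dev_write_begin(obj, pages, count)
	{
		vm_object_bkgrd_push(obj, pages, count);
		/* issue the device write; any modification made while
		 * it is in flight COWs new pages in the layer above */
	}

	dev_write_done(obj, pages, count)
	{
		/* pages shadowed by a COW stay thrown away, the rest
		 * migrate back up to the device layer */
		vm_object_bkgrd_pop(obj, pages, count);
	}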

:>       The idea for the buffer cache is to shift its functionality to one that
:>       is solely used to issue device I/O and to keep track of dirty areas for
:>       proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
:>       to placemark I/O will not change).  The core buffer cache code would
:...
:
:Keeping the current cluster code is a bad idea.  If the drivers were
:taught how to traverse the linked list in the buf struct rather
:than just notice "a big buffer" we could avoid a lot of page
:twiddling and also allow for massive IO clustering ( > 64k ), because
:we wouldn't be limited by the size of the b_pages[] array as the
:upper bound on the number of buffers we can effectively issue a
:scatter/gather on (since the drivers must VTOPHYS them anyway).

    This devolves into how simple (or complex) an interface we
    are willing to use to talk to the low-level device.

    The reason I would hesitate to move to a 'linked list of buffers'
    methodology is that *ALL* of the current VM APIs pass a single
    array of vm_page_t's... not just the current struct buf code, but also
    the VOP_PUTPAGES and VOP_GETPAGES APIs.

    I would much prefer to keep this simplicity intact in order to avoid
    introducing even more bugs into the source than we will when we try
    to do this stuff, which means changing the clustering code from:

	* copies vm_page_t's into the cluster pbuf's b_pages[] array
	* maps the pages into b_data

    to:

	* copies vm_page_t's into the cluster pbuf's b_pages[] array

    In other words, keeping the clustering changes as simple as possible.
    I think once the new I/O path is operational we can then start thinking
    about how to optimize it -- for example, by having a default (embedded)
    static array but also allowing the b_pages array to be dynamically
    allocated.
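
    Concretely, the cluster build then reduces to a gather loop (the
    variable names follow vfs_cluster.c, but the fragment is only
    illustrative):

	/* gather each component buffer's pages into the cluster pbuf */
	for (i = 0; i < tbp->b_npages; ++i)
		bp->b_pages[bp->b_npages++] = tbp->b_pages[i];

	/*
	 * The old second step -- mapping the gathered pages into
	 * b_data -- simply disappears:
	 *
	 *	pmap_qenter((vm_offset_t)bp->b_data, bp->b_pages,
	 *	    bp->b_npages);
	 */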

:To realize my "nfs super commit" stuff all we'd need to do is make
:the max cluster size something like 0-1 and instantly get an almost
:unbounded IO burst.
:
:-- 
:-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
:

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>





