Date:      Mon, 20 Mar 2000 15:16:28 -0800
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG
Subject:   Re: patches for test / review
Message-ID:  <20000320151628.F14789@fw.wintelcom.net>
In-Reply-To: <200003202155.NAA72035@apollo.backplane.com>; from dillon@apollo.backplane.com on Mon, Mar 20, 2000 at 01:55:05PM -0800
References:  <18039.953549289@critter.freebsd.dk> <200003201736.JAA70124@apollo.backplane.com> <20000320111544.A14789@fw.wintelcom.net> <200003202155.NAA72035@apollo.backplane.com>

* Matthew Dillon <dillon@apollo.backplane.com> [000320 14:18] wrote:
> 
> :> 	lock on the bp.  With a shared lock you are allowed to issue READ
> :> 	I/O but you are not allowed to modify the contents of the buffer.
> :> 	With an exclusive lock you are allowed to issue both READ and WRITE
> :> 	I/O and you can modify the contents of the buffer.
> :> 
> :>       bread()  -> bread_sh() and bread_ex()
> :> 
> :> 	Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
> :> 	allows a buffer to be accessed but not modified or rewritten.
> :> 	bread_ex() allows a buffer to be modified and written.
> :
> :This seems to allow for expressing intent to write to buffers,
> :which would be an excellent place to cow the pages 'in software'
> :rather than OpenBSD's way of using cow'd pages to accomplish the same
> :thing.
> 
>     Yes, absolutely.  DG (if I remember right) is rabid about not taking
>     VM faults while sitting in the kernel and I tend to agree with him that
>     it's a cop-out to use VM faults in the kernel to get around those
>     sorts of problems.

ok, so we're on the same page then. :)
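For what it's worth, here is how I read the proposed split interface.
This is a sketch only -- bread_sh(), bread_ex(), and brelse_sh() don't
exist yet, the signatures are my guesses (I've left out the cred
argument the current bread() takes), and only bcopy()/bdwrite() are
the real, existing calls:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/buf.h>
    #include <sys/vnode.h>

    /*
     * Hypothetical: bread_sh() returns the bp under a shared lock
     * (READ I/O allowed, contents must not be dirtied); bread_ex()
     * returns it under an exclusive lock (READ and WRITE I/O,
     * contents may be modified).
     */
    static int
    read_then_update(struct vnode *vp, daddr_t blkno, int size,
        const char *src)
    {
            struct buf *bp;
            int error;

            /* shared: several holders can inspect b_data at once */
            error = bread_sh(vp, blkno, size, &bp);
            if (error)
                    return (error);
            /* ... examine bp->b_data here, no modification ... */
            brelse_sh(bp);          /* hypothetical shared release */

            /* exclusive: dirty the buffer and schedule the write */
            error = bread_ex(vp, blkno, size, &bp);
            if (error)
                    return (error);
            bcopy(src, bp->b_data, size);
            bdwrite(bp);            /* delayed write as usual */
            return (0);
    }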

> 
> :I'm not sure if you remember what I brought up at BAFUG, but I'd
> :like to see something along the lines of BX_BKGRDWRITE that Kirk
> :is using for the bitmap blocks in softupdates to be enabled on a
> :system wide basis.  That way rewriting data that has been sent to
> :the driver isn't blocked and at the same time we don't need to page
> :protect during every strategy call.
> :
> :I may have misunderstood your intent, but using page protections
> :on each IO would seem to introduce a lot of performance issues that
> :the rest of these points are all trying to get rid of.
> 
>     At the low-level device there is no concept of page protections.
>     If you pass an array of vm_page_t's then that is where the data
>     will be taken from or written to.
> 
>     A background-write capability is actually much more easily implemented
>     at the VM Object level than the buffer cache level.  If you think about
>     it, all you need to do is add another VM Object layer *below* the 
>     one representing the device.  Whenever a device write is initiated the
>     pages are moved to the underlying layer.  If a process (or the kernel)
>     needs to modify the pages while the write is in progress, a copy-on-write
>     occurs through normal mechanisms.  On completion of the I/O the pages
>     are moved back to the main VM Object device layer except for those that
>     would conflict with any copy-on-write that occurred (the original device
>     pages in the conflict case simply get thrown away).  
> 
>     Problem solved.  Plus this deals with low-memory situations properly...
>     we do not introduce any new deadlocks.

That does sound a lot better; using the buffer system for anything more
than describing an IO is a hack, and I'd like to see an implementation
like this become possible.
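To make sure I'm picturing the layering right, here is a conceptual
sketch of the write path.  Every function below is hypothetical -- this
only captures the sequencing you describe, not any real or planned
interface:

    #include <vm/vm.h>
    #include <vm/vm_object.h>
    #include <vm/vm_page.h>

    /*
     * Background write via a VM Object layered *below* the device
     * object (all helpers hypothetical).
     */
    static void
    bg_write(vm_object_t devobj, vm_pindex_t start, vm_pindex_t count)
    {
            vm_object_t backing;

            /* the object sitting underneath the device object */
            backing = devobj_backing_object(devobj);

            /*
             * Move the pages being written down to the backing
             * layer; anything modifying them in devobj while the
             * I/O is in flight now copy-on-writes through the
             * normal fault path instead of blocking on the I/O.
             */
            move_pages_down(devobj, backing, start, count);

            issue_device_write(backing, start, count);

            /*
             * On completion: move each page back up into devobj,
             * except where a COW copy was made in the meantime --
             * those original pages simply get thrown away.
             */
    }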

> 
> :>       The idea for the buffer cache is to shift its functionality to one that
> :>       is solely used to issue device I/O and to keep track of dirty areas for
> :>       proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
> :>       to placemark I/O will not change).  The core buffer cache code would
> :...
> :
> :Keeping the current cluster code is a bad idea; if the drivers were
> :taught how to traverse the linked list in the buf struct rather
> :than just notice "a big buffer", we could avoid a lot of page
> :twiddling and also allow for massive IO clustering ( > 64k ),
> :because we wouldn't be limited by the size of the b_pages[] array
> :as the upper bound on the number of buffers we can effectively
> :issue a scatter/gather on (since the drivers must VTOPHYS them
> :anyway).
> 
>     This devolves down into how simple (or complex) an interface we
>     are willing to use to talk to the low-level device.
> 
>     The reason I would hesitate to move to a 'linked list of buffers'
>     methodology is that *ALL* of the current VM API's pass a single
>     array of vm_page_t's... not just the current struct buf code, but also
>     the VOP_PUTPAGES and VOP_GETPAGES API.
> 
>     I would much prefer to keep this simplicity intact in order to avoid
>     introducing even more bugs into the source than we will when we try
>     to do this stuff, which means changing the clustering code from:
> 
> 	* copies vm_page_t's into the cluster pbuf's b_pages[] array
> 	* maps the pages into b_data
> 
>     to:
> 
> 	* copies vm_page_t's into the cluster pbuf's b_pages[] array
> 
>     In other words, keeping the clustering changes as simple as possible.
>     I think once the new I/O path is operational we can then start thinking
>     about how to optimize it -- for example, by having a default (embedded)
>     static array but also allowing the b_pages array to be dynamically
>     allocated.

Why?  Why allocate a special pbuf just for all of this?  Once this is
implemented, clustering can grow almost without bound, but you've also
bought some really inane complexity: the pbuf locks down _all_ the
buffers associated with it until the entire pbuf is marked done.  I
really don't think that is what you want.

The abstraction is sort of nice, but let's break down the annoyances of
pbufs:

  an extra allocation that can fail when the system is issuing a lot of IO
    (exactly when we need more of them)
  an array fill to track the buffers
  a possible additional allocation for the b_pages if that gets implemented
  buffer lockdown (which, although addressed above, is still something to avoid)
  other stuff to make sure the rest of the kernel only sees the pbuf.
  
If anything, perhaps a zone for b_pages arrays would allow easy use of
GETPAGES/PUTPAGES, with the page run then copied into the list of
buffers; something like the sketch below.
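A minimal sketch of what I mean, using the existing zone allocator.
The zone name and the sizing are made up, and I'm going from memory on
the zinit() arguments:

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>
    #include <vm/vm_zone.h>

    /* illustrative: one run covers a MAXPHYS-sized transfer */
    #define BPAGES_RUN      (MAXPHYS / PAGE_SIZE)

    static vm_zone_t bpages_zone;

    static void
    bpages_zone_init(void)
    {
            bpages_zone = zinit("bpagesrun",
                BPAGES_RUN * sizeof(vm_page_t), 0, 0, 1);
    }

    /* may fail under memory pressure -- callers must check */
    static vm_page_t *
    bpages_alloc(void)
    {
            return (zalloc(bpages_zone));
    }

    static void
    bpages_free(vm_page_t *run)
    {
            zfree(bpages_zone, run);
    }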

It's really up to you as you have the time and clue to do this, but I
wanted to bring up the issues I had with the current implementation
as well as an alternate suggestion.

Since you're actually doing the work and I do agree with the direction
it's going, I wish you and Kirk the best of luck.

I have to get back to my html/php/db stuff (kill me now). :)

Thanks for taking the time to go over it.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]

