Date: Mon, 20 Mar 2000 13:55:05 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Alfred Perlstein <bright@wintelcom.net>
Cc: Poul-Henning Kamp <phk@critter.freebsd.dk>, current@FreeBSD.ORG
Subject: Re: patches for test / review
Message-ID: <200003202155.NAA72035@apollo.backplane.com>
References: <18039.953549289@critter.freebsd.dk> <200003201736.JAA70124@apollo.backplane.com> <20000320111544.A14789@fw.wintelcom.net>
:> lock on the bp.  With a shared lock you are allowed to issue READ
:> I/O but you are not allowed to modify the contents of the buffer.
:> With an exclusive lock you are allowed to issue both READ and WRITE
:> I/O and you can modify the contents of the buffer.
:>
:>	bread() -> bread_sh() and bread_ex()
:>
:>	Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
:>	allows a buffer to be accessed but not modified or rewritten.
:>	bread_ex() allows a buffer to be modified and written.
:
:This seems to allow for expressing intent to write to buffers,
:which would be an excellent place to cow the pages 'in software'
:rather than obsd's way of using cow'd pages to accomplish the same
:thing.

    Yes, absolutely.  DG (if I remember right) is rabid about not taking
    VM faults while sitting in the kernel, and I tend to agree with him
    that it's a cop-out to use VM faults in the kernel to get around
    those sorts of problems.

:I'm not sure if you remember what I brought up at BAFUG, but I'd
:like to see something along the lines of BX_BKGRDWRITE, which Kirk
:is using for the bitmap blocks in softupdates, enabled on a
:system-wide basis.  That way rewriting data that has been sent to
:the driver isn't blocked, and at the same time we don't need to
:page protect during every strategy call.
:
:I may have misunderstood your intent, but using page protections
:on each I/O would seem to introduce a lot of performance issues that
:the rest of these points are all trying to get rid of.

    At the low-level device there is no concept of page protections.
    If you pass an array of vm_page_t's, then that is where the data
    will be taken from or written to.

    A background-write capability is actually much more easily implemented
    at the VM object level than at the buffer cache level.  If you think
    about it, all you need to do is add another VM object layer *below*
    the one representing the device.  Whenever a device write is initiated,
    the pages are moved to the underlying layer.  If a process (or the
    kernel) needs to modify the pages while the write is in progress, a
    copy-on-write occurs through the normal mechanisms.  On completion of
    the I/O the pages are moved back to the main VM object device layer,
    except for those that would conflict with a copy-on-write that occurred
    in the meantime (the original device pages in the conflict case simply
    get thrown away).  Problem solved.  Plus this deals with low-memory
    situations properly... we do not introduce any new deadlocks.
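    A minimal sketch in C of that object-layering scheme.  Every helper
    here (move_pages_down, issue_device_write, page_exists, discard_page,
    move_page_up) is invented for illustration and does not exist in the
    kernel; only the vm_object_t/vm_pindex_t types are real.  The sketch
    just fixes the order of operations: push the pages down, write, then
    reconcile with any copy-on-write that happened in the meantime.

	static void
	bkgrd_write_start(vm_object_t devobj, vm_object_t under,
	    vm_pindex_t base, int npages)
	{
		/*
		 * Push the pages being written into the object *below*
		 * the device object.  Anyone who modifies them while the
		 * write is in flight copy-on-writes fresh pages into
		 * devobj via the normal fault path; the in-flight
		 * originals are never touched, so no page protections
		 * are needed and no new deadlocks are introduced.
		 */
		move_pages_down(devobj, under, base, npages);
		issue_device_write(under, base, npages);
	}

	static void
	bkgrd_write_done(vm_object_t devobj, vm_object_t under,
	    vm_pindex_t base, int npages)
	{
		vm_pindex_t i;

		for (i = base; i < base + npages; i++) {
			if (page_exists(devobj, i))
				discard_page(under, i);    /* COW won; original is stale */
			else
				move_page_up(under, devobj, i); /* no conflict; restore */
		}
	}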
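    Going back to the bread_sh()/bread_ex() split quoted at the top of
    this message, a usage sketch follows.  Neither function exists yet;
    the prototypes assume they mirror the current bread() signature, and
    check_block()/update_block() are invented callers.

	int bread_sh(struct vnode *vp, daddr_t blkno, int size,
	    struct ucred *cred, struct buf **bpp);  /* shared: look, don't touch */
	int bread_ex(struct vnode *vp, daddr_t blkno, int size,
	    struct ucred *cred, struct buf **bpp);  /* exclusive: modify + write */

	/* A read-only consumer: may inspect b_data but never dirty it. */
	static int
	check_block(struct vnode *vp, daddr_t blkno, int size, struct ucred *cred)
	{
		struct buf *bp;
		int error;

		if ((error = bread_sh(vp, blkno, size, cred, &bp)) != 0)
			return (error);
		/* ... examine bp->b_data ... */
		brelse(bp);
		return (0);
	}

	/* A read-modify-write consumer: needs the exclusive flavor. */
	static int
	update_block(struct vnode *vp, daddr_t blkno, int size, struct ucred *cred)
	{
		struct buf *bp;
		int error;

		if ((error = bread_ex(vp, blkno, size, cred, &bp)) != 0)
			return (error);
		/* ... modify bp->b_data ... */
		return (bwrite(bp));
	}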
:> The idea for the buffer cache is to shift its functionality to one that
:> is solely used to issue device I/O and to keep track of dirty areas for
:> proper sequencing of I/O (e.g. softupdates' use of the buffer cache
:> to placemark I/O will not change).  The core buffer cache code would
:...
:
:Keeping the current cluster code is a bad idea.  If the drivers were
:taught how to traverse the linked list in the buf struct rather
:than just notice "a big buffer", we could avoid a lot of page
:twiddling and also allow for massive I/O clustering (> 64k), because
:we won't be limited by the size of the b_pages[] array for our
:upper bound on the number of buffers we can effectively issue a
:scatter/gather on (since the drivers must VTOPHYS them anyway).

    This devolves down into how simple (or complex) an interface we are
    willing to use to talk to the low-level device.  The reason I would
    hesitate to move to a 'linked list of buffers' methodology is that
    *ALL* of the current VM APIs pass a single array of vm_page_t's...
    not just the current struct buf code, but also the VOP_PUTPAGES and
    VOP_GETPAGES APIs.  I would much prefer to keep this simplicity
    intact, in order to avoid introducing even more bugs into the source
    than we will anyway when we try to do this stuff, which means
    changing the clustering code from:

	* copies vm_page_t's into the cluster pbuf's b_pages[] array
	* maps the pages into b_data

    to:

	* copies vm_page_t's into the cluster pbuf's b_pages[] array

    In other words, keeping the clustering changes as simple as possible.
    I think once the new I/O path is operational we can then start
    thinking about how to optimize it -- for example, by having a default
    (embedded) static array but also allowing the b_pages array to be
    dynamically allocated, as sketched below.
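    A sketch of how small the clustering step becomes, plus one possible
    shape for the embedded-but-growable page array.  cluster_collect(),
    struct buf_pages, and BUF_EMBEDPAGES are invented names; pmap_qenter()
    and the b_pages/b_npages fields of struct buf are the real ones.

	/*
	 * Illustrative only: collecting the component buffers' pages into
	 * the cluster pbuf is the entire job; the old mapping step goes away.
	 */
	static void
	cluster_collect(struct buf *cbp, struct buf **list, int nbuf)
	{
		int i, j, n;

		n = 0;
		for (i = 0; i < nbuf; i++)
			for (j = 0; j < list[i]->b_npages; j++)
				cbp->b_pages[n++] = list[i]->b_pages[j];
		cbp->b_npages = n;
		/*
		 * The old path continued with roughly:
		 *	pmap_qenter((vm_offset_t)cbp->b_data, cbp->b_pages, n);
		 * That mapping is the page twiddling the new path avoids;
		 * the driver consumes the vm_page_t array directly.
		 */
	}

	/*
	 * One possible later optimization (names invented): an embedded
	 * array that the page pointer normally points at, swapped for an
	 * allocated one only on very large I/Os.
	 */
	#define BUF_EMBEDPAGES	32		/* arbitrary default */
	struct buf_pages {
		vm_page_t	*bp_pages;	/* usually == bp_store */
		int		bp_npages;
		vm_page_t	bp_store[BUF_EMBEDPAGES];
	};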
:To realize my "nfs super commit" stuff all we'd need to do is make
:the max cluster size something like 0-1 and instantly get an almost
:unbounded IO burst.
:
:--
:-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]

					-Matt
					Matthew Dillon
					<dillon@backplane.com>