Date: Fri, 14 Aug 2020 05:54:23 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Kirk McKusick <mckusick@mckusick.com>, "freebsd-current@FreeBSD.org" <freebsd-current@FreeBSD.org>
Subject: Re: can buffer cache pages be used in ext_pgs mbufs?
Message-ID: <YTBPR01MB337592575CAF5CEF13F10120DD400@YTBPR01MB3375.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20200811175422.GP2551@kib.kiev.ua>
References: <202008080443.0784hEfh084650@chez.mckusick.com>
 <20200808144040.GD2551@kib.kiev.ua>
 <QB1PR01MB33646963EB1E38F2968A859BDD470@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
 <20200810170956.GL2551@kib.kiev.ua>
 <QB1PR01MB33643DFE1D132C76E33A9276DD450@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>,
 <20200811175422.GP2551@kib.kiev.ua>

Konstantin Belousov wrote:
>On Tue, Aug 11, 2020 at 03:10:39AM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Mon, Aug 10, 2020 at 12:46:00AM +0000, Rick Macklem wrote:
>> >> Konstantin Belousov wrote:
>> >> >On Fri, Aug 07, 2020 at 09:43:14PM -0700, Kirk McKusick wrote:
>> >> >> I do not have the answer to your question, but I am copying Kostik
>> >> >> as if anyone knows the answer, it is probably him.
>> >> >>
>> >> >> ~Kirk
>> >> >>
>> >> >> =-=-=
>> >> >I do not know the exact answer, this is why I did not follow up on the
>> >> >original question on current@. In particular, I have no idea about the
>> >> >ext_pgs mechanism.
>> >> >
>> >> >Still I can point out one semi-obvious aspect of your proposal.
>> >> >
>> >> >When the buffer is written (with bwrite()), its pages are sbusied and
>> >> >the write mappings of them are invalidated. The end effect is that no
>> >> >modifications to the pages are possible until they are unbusied. This,
>> >> >together with the lock of the buffer that holds the pages, effectively
>> >> >stops all writes either through write(2) or by mmaped regions.
>> >> >
>> >> >In other words, any access for write to the range of file designated by
>> >> >the buffer, causes the thread to block until the pages are unbusied and
>> >> >the buffer is unlocked.
>> >> >Which in the described case would mean, until the NFS
>> >> >server responds.
>> >> >
>> >> >If this is fine, then ok.
>> >> For what I am thinking of, I would say that is fine, since the ktls code reads
>> >> the pages to encrypt/send them, but can use other allocated pages for
>> >> the encrypted data.
>> >>
>> >> >Rick, do you know anything about the vm page lifecycle as mb_ext_pgs?
>> >> Well, the anonymous pages (the only ones I've been using so far) are
>> >> allocated with:
>> >>     vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
>> >>         VM_ALLOC_NODUMP | VM_ALLOC_WIRED);
>> >>
>> >> and then the m_ext_ext_free function (mb_free_mext_pgs()) does:
>> >>     vm_page_unwire_noq(pg);
>> >>     vm_page_free(pg);
>> >> on each of them.
>> >>
>> >> m->m_ext_ext_free() is called in tls_encrypt() when it no longer wants the
>> >> pages, but is normally called via m_free(m), which calls mb_free_extpg(m),
>> >> although there are a few other places.
>> >>
>> >> Since m_ext_ext_free is whatever function you want to make it, I suppose the
>> >> answer is "until your m_ext.ext_free" function is called.
>> >>
>> >> At this time, for ktls, if you are using software encryption, the call to
>> >> ktls_encrypt(), which is done before passing the mbufs down to TCP, is when
>> >> it is done with the unencrypted data pages. (I suppose there is no absolute
>> >> guarantee that this happens before the kernel RPC layer times out waiting
>> >> for an RPC reply, but it is almost inconceivable, since this happens before
>> >> the RPC request is passed down to TCP.)
>> >>
>> >> The case I now think is more problematic is the "hardware assist" case.
>> >> Although no hardware/driver yet does this afaik, I suspect that the
>> >> unencrypted data page mbufs could end up stuck in TCP for a long time,
>> >> in case a retransmit is needed.
>> >>
>> >> So, I now think I might need to delay the bufdone() call until the
>> >> m_ext_ext_free() call has been done for the pages, if they are buffer
>> >> cache pages?
>> >> --> Usually I would expect the m_ext_ext_free() call for the mbuf(s) that
>> >>     hold the data to be written to the server to be done long before
>> >>     bufdone() would be called for the buffer that is being written,
>> >>     but there is no guarantee.
>> >>
>> >> Am I correct in assuming that the pages for the buffer will remain valid and
>> >> readable through the direct map until bufdone() is called?
>> >> If I am correct w.r.t. this, it should work so long as the m_ext_ext_free()
>> >> calls for the pages happen before the bufdone() call on the bp, I think?
>> >
>> >I think there is further complication with non-anonymous pages.
>> >You want (or perhaps need) the page content to be immutable and not
>> >changed while you pass pages around and give them for ktls sw or hw
>> >processing.
>> >Otherwise it could not pass the TLS authentication if
>> >page was changed in process.
>> >
>> >Similar issue exists when normal buffer writes are scheduled through
>> >the strategy(), and you can see that bufwrite() does vfs_busy_pages()
>> >with clear_modify=1, which does two things:
>> >- sbusy the pages (sbusy pages can get new read-only mappings, but cannot
>> >  be mapped rw)
>> >- pmap_remove_write() on the pages to invalidate all current writeable
>> >  mappings.
>> >
>> >This state should be kept until ktls is completely done with the pages.
>> I am now thinking that this is done exactly as you describe above and
>> doesn't require any changes.
>>
>> The change I am planning is below the strategy routine in the function
>> that does the write RPC.
>> It currently copies the data from the buffer into mbuf clusters.
>> After this change, it would put the physical page #s for the buffer in the
>> mbuf(s) and then wait for them all to be m_ext_ext_free()d before calling
>> bufdone().
>> --> The only difference is the wait before the bufdone() call in the RPC layer
>>     below the strategy routine. (bufdone() is the only call the NFS client
>>     seems to do below the strategy routine, so I assume it ends the state
>>     you describe above?)
>>
>As long as pages are put into mbuf clusters only after bwrite() that
>did vfs_busy_pages(), and bufdone() is called not earlier than network
>finished with the mbufs, it should be ok.
I've coded it up and, at least for a little testing so far, it seems to work ok.

Thanks for your comments, rick