Date:      Tue, 11 Aug 2020 03:10:39 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Kirk McKusick <mckusick@mckusick.com>, "freebsd-current@FreeBSD.org" <freebsd-current@FreeBSD.org>
Subject:   Re: can buffer cache pages be used in ext_pgs mbufs?
Message-ID:  <QB1PR01MB33643DFE1D132C76E33A9276DD450@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20200810170956.GL2551@kib.kiev.ua>
References:  <202008080443.0784hEfh084650@chez.mckusick.com> <20200808144040.GD2551@kib.kiev.ua> <QB1PR01MB33646963EB1E38F2968A859BDD470@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>, <20200810170956.GL2551@kib.kiev.ua>

Konstantin Belousov wrote:
>On Mon, Aug 10, 2020 at 12:46:00AM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Fri, Aug 07, 2020 at 09:43:14PM -0700, Kirk McKusick wrote:
>> >> I do not have the answer to your question, but I am copying Kostik
>> >> as if anyone knows the answer, it is probably him.
>> >>
>> >>       ~Kirk
>> >>
>> >> =-=-=
>> >I do not know the exact answer, this is why I did not follow up on the
>> >original question on current@.  In particular, I have no idea about the
>> >ext_pgs mechanism.
>> >
>> >Still I can point out one semi-obvious aspect of your proposal.
>> >
>> >When the buffer is written (with bwrite()), its pages are sbusied and
>> >the write mappings of them are invalidated. The end effect is that no
>> >modifications to the pages are possible until they are unbusied. This,
>> >together with the lock of the buffer that holds the pages, effectively
>> >stops all writes either through write(2) or by mmaped regions.
>> >
>> >In other words, any access for write to the range of file designated by
>> >the buffer, causes the thread to block until the pages are unbusied and
>> >the buffer is unlocked.  Which in described case would mean, until NFS
>> >server responds.
>> >
>> >If this is fine, then ok.
>> For what I am thinking of, I would say that is fine, since the ktls code reads
>> the pages to encrypt/send them, but can use other allocated pages for
>> the encrypted data.
>>
>> >Rick, do you know anything about the vm page lifecycle as mb_ext_pgs ?
>> Well, the anonymous pages (the only ones I've been using so far) are
>> allocated with:
>>         vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
>>                VM_ALLOC_NODUMP | VM_ALLOC_WIRED);
>>
>> and then the m_ext_ext_free function (mb_free_mext_pgs()) does:
>>         vm_page_unwire_noq(pg);
>>         vm_page_free(pg);
>> on each of them.
>>
>> m->m_ext_ext_free() is called in tls_encrypt() when it no longer wants the
>> pages, but is normally called via m_free(m), which calls mb_free_extpg(m),
>> although there are a few other places.
>>
>> Since m_ext_ext_free is whatever function you want to make it, I suppose the
>> answer is "until your m_ext.ext_free function is called".
>>
>> At this time, for ktls, if you are using software encryption, the call to ktls_encrypt(),
>> which is done before passing the mbufs down to TCP is when it is done with the
>> unencrypted data pages. (I suppose there is no absolute guarantee that this
>> happens before the kernel RPC layer times out waiting for an RPC reply, but it
>> is almost inconceivable, since this happens before the RPC request is passed
>> down to TCP.)
>>
>> The case I now think is more problematic is the "hardware assist" case. Although
>> no hardware/driver yet does this afaik, I suspect that the unencrypted data page
>> mbufs could end up stuck in TCP for a long time, in case a retransmit is needed.
>>
>> So, I now think I might need to delay the bufdone() call until the m_ext_ext_free()
>> call has been done for the pages, if they are buffer cache pages?
>> --> Usually I would expect the m_ext_ext_free() call for the mbuf(s) that
>>        hold the data to be written to the server to be done long before
>>        bufdone() would be called for the buffer that is being written,
>>        but there is no guarantee.
>>
>> Am I correct in assuming that the pages for the buffer will remain valid and
>> readable through the direct map until bufdone() is called?
>> If I am correct w.r.t. this, it should work so long as the m_ext_ext_free() calls
>> for the pages happen before the bufdone() call on the bp, I think?
>
>I think there is further complication with non-anonymous pages.
>You want (or perhaps need) the page content to be immutable and not
>changed while you pass pages around and give them for ktls sw or hw
>processing.  Otherwise it could not pass the TLS authentication if
>page was changed in process.
>
>Similar issue exists when normal buffer writes are scheduled through
>the strategy(), and you can see that bufwrite() does vfs_busy_pages()
>with clear_modify=1, which does two things:
>- sbusy the pages (sbusy pages can get new read-only mappings, but cannot
>  be mapped rw)
>- pmap_remove_write() on the pages to invalidate all current writeable
>  mappings.
>
>This state should be kept until ktls is completely done with the pages.
I am now thinking that this is done exactly as you describe above and
doesn't require any changes.

The change I am planning is below the strategy routine in the function
that does the write RPC.
It currently copies the data from the buffer into mbuf clusters.
After this change, it would put the physical page #s for the buffer in the
mbuf(s) and then wait for them all to be m_ext_ext_free()d before calling
bufdone().
--> The only difference is the wait before the bufdone() call in the RPC layer
       below the strategy routine. (bufdone() is the only call the NFS client
       seems to do below the strategy routine, so I assume it ends the state
       you describe above?)
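
The ordering described above (no bufdone() until every page handed to the
mbufs has been freed) could be modelled roughly as below. This is only a
single-threaded user-space sketch, not kernel code: struct wbuf and all of
the wbuf_*() names are hypothetical stand-ins, the "done" flag merely marks
where bufdone(bp) would be called, and the model is event-driven rather
than sleeping. A real implementation would need an atomic counter or a
mutex plus a sleep/wakeup, since the free callbacks can run in other
contexts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the buffer being written by the write RPC. */
struct wbuf {
	int  pages_outstanding;	/* pages still referenced by mbufs */
	bool rpc_replied;	/* the write RPC reply has arrived */
	bool done;		/* models "bufdone() has been called" */
};

/* Model of handing npages buffer pages to the mbuf chain. */
static void
wbuf_attach_pages(struct wbuf *bp, int npages)
{
	bp->pages_outstanding = npages;
	bp->rpc_replied = false;
	bp->done = false;
}

/* Complete the buffer only when the reply is in AND all pages are freed. */
static void
wbuf_try_done(struct wbuf *bp)
{
	if (bp->rpc_replied && bp->pages_outstanding == 0)
		bp->done = true;	/* bufdone(bp) would go here */
}

/* Model of the ext_free callback running for one page of the buffer. */
static void
wbuf_page_free_cb(struct wbuf *bp)
{
	assert(bp->pages_outstanding > 0);
	bp->pages_outstanding--;
	wbuf_try_done(bp);
}

/* Model of the write RPC reply arriving from the server. */
static void
wbuf_rpc_reply(struct wbuf *bp)
{
	bp->rpc_replied = true;
	wbuf_try_done(bp);
}
```

In the usual case all wbuf_page_free_cb() calls happen before
wbuf_rpc_reply(), so the reply completes the buffer at once. In the
retransmit case the reply arrives first and the last page free is what
completes it.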

rick



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?QB1PR01MB33643DFE1D132C76E33A9276DD450>