From: Zhenlei Huang <zlei@FreeBSD.org>
Subject: Re: What's going on with vnets and epairs w/ addresses?
Date: Thu, 22 Dec 2022 23:38:07 +0800
To: Mark Johnston <markj@freebsd.org>
Cc: Kyle Evans, Gleb Smirnoff <glebius@freebsd.org>, "Bjoern A. Zeeb", freebsd-jail@freebsd.org
Message-Id: <069FE014-1D62-4479-B830-9FBDF36F2109@FreeBSD.org>
List-Id: Discussion about FreeBSD jail(8)

> 
> On Dec 21, 2022, at 12:12 AM, Mark Johnston <markj@freebsd.org> wrote:
> 
> On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:
>> On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff <glebius@freebsd.org> wrote:
>>> 
>>> Zhenlei,
>>> 
>>> On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
>>> Z> I managed to reproduce this issue on CURRENT/14 with this small snippet:
>>> Z> 
>>> Z> -------------------------------------------
>>> Z> #!/bin/sh
>>> Z> 
>>> Z> # test jail name
>>> Z> n="test_ref_leak"
>>> Z> 
>>> Z> jail -c name=$n path=/ vnet persist
>>> Z> # The following line triggers the jail pr_ref leak
>>> Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
>>> Z> 
>>> Z> jail -R $n
>>> Z> 
>>> Z> # wait a moment
>>> Z> sleep 1
>>> Z> 
>>> Z> jls -j $n
>>> Z> -------------------------------------------
>>> Z> 
>>> Z> After DDB debugging and tracing, it seems to be triggered by a combination of [1] and [2]:
>>> Z> 
>>> Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915
>>> Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b
>>> Z> 
>>> Z> In [1] the per-VNET uma zone is shared with the global one:
>>> Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
>>> Z> 
>>> Z> In [2] the unref of `inp->inp_cred` is deferred to inpcb_dtor(), called via uma_zfree_smr().
>>> Z> 
>>> Z> Unfortunately inps freed by uma_zfree_smr() are cached, and inpcb_dtor() is not called immediately,
>>> Z> thus leaking the `inp->inp_cred` ref and hence `prison->pr_ref`.
>>> Z> 
>>> Z> And it is also not possible to free up the cache via the per-VNET SYSUNINIT tcp_destroy / udp_destroy / rip_destroy handlers.
>>> 
>>> This is a known issue and I'd prefer not to call it a problem. The "leak" of a jail
>>> happens only if the machine is idle wrt networking activity.
>>> 
>>> Getting back to the problem that started this thread - the epair(4)s not immediately
>>> popping back to prison0. IMHO, the problem again lies in the design of if_vmove and
>>> epair(4) in particular. if_vmove should not exist; instead we should do a full
>>> if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove doesn't
>>> carry any useful information. With Alexander melifaro@ we discussed better options
>>> for creating or attaching interfaces to jails than if_vmove. Until they are ready,
>>> the easiest workaround for the annoying epair(4) come-back problem is to
>>> remove it manually before destroying a jail, like I did in 80fc25025ff.
>>> 
>> 
>> It still behaved much better prior to eb93b99d6986, which you and Mark
>> were going to work on a solution for to allow the cred "leak" to close
>> up much more quickly. CC markj@, since I think it's been six months
>> since the last time I inquired about it, making this a good time to do
>> it again...
> 
> I spent some time trying to see if we could fix this in UMA/SMR and
> talked to Jeff about it a bit.  At this point I don't think it's the
> right approach, at least for now.  Really we have a composability
> problem where different layers are using different techniques to signal
> that they're done with a particular piece of memory, and they just
> aren't compatible.

I originally thought that `uma_zfree_smr()` is somewhat like `epoch_call()` with
an internal `epoch_callback_t`, but after digging into the source code that is not true.
`uma_zfree_smr()` puts the item into a cache, and the destructor only gets a chance
to run on the next allocation from that cache.

Can SMR provide some means, similar to `epoch_callback_t`, so that the destructors
are eventually invoked?
> 
> One thing I tried is to implement a UMA function which walks over all
> SMR zones and synchronizes all cached items (so that their destructors
> are called).  This is really expensive; at minimum it has to bind to all
> CPUs in the system so that it can flush per-CPU buckets.  If
> jail_deref() calls that function, the bug goes away, at least in my
> limited testing, but its use is really a layering violation.

I've proposed a `vnet_shutdown()` stage in another mail. Maybe we can introduce
a `vnet_cleanup()` stage, and have the INPCB layer register a listener for the
`cleanup` event, in which the function that synchronizes cached items gets invoked.
Is that still a layering violation?

> 
> We could, say, periodically scan cached UMA/SMR items and invoke their
> destructors, but for most SMR consumers this is unnecessary, and again
> there's a layering problem: the inpcb layer shouldn't "know" that it has
> to do that for its zones, since it's the jail layer that actually cares.
> 
> It also seems kind of strange that dying jails still occupy a slot in
> the jail namespace.  I don't really understand why the existence of a
> dying jail prevents creation of a new jail with the same name, but
> presumably there's a good reason for it?
> 
> Now my inclination is to try and fix this in the inpcb layer, by not
> accessing the inp_cred at all in the lookup path until we hold the inpcb
> lock, and then releasing the cred ref before freeing a PCB to its zone.
> I think this is doable based on a few observations:
> - When doing an SMR-protected lookup, we always lock the returned inpcb
>   before handing it to the caller.  So we could in principle perform
>   inp_cred checking after acquiring the lock but before returning.
> - If there are no jailed PCBs in a hash chain, in_pcblookup_hash_locked()
>   always scans the whole chain.
> - If we match only one PCB in a lookup, we can probably(?) return that
>   PCB without dereferencing the cred pointer at all.
>   If not, then the
>   scan only has to keep track of a fixed number of PCBs before picking
>   which one to return.  So it looks like we can perform a lockless scan
>   and keep track of matches on the stack, then lock the matched PCBs and
>   perform prison checks if necessary, without making the common case
>   more expensive.
> 
> In fact there is a parallel thread on freebsd-jail which reports that
> this inp_cred access is a source of frequent cache misses.  I was
> surprised to see that the scan calls prison_flag() before even checking
> the PCB's local address.  So if the hash chain is large then we're
> potentially performing a lot of unnecessary memory accesses (though
> presumably it's common for most of the PCBs to be sharing a single
> cred?).  In particular we can perhaps solve two problems at once.
> 
> Any thoughts?  Are there some fundamental reasons this can't work?