Date: Tue, 20 Dec 2022 11:12:24 -0500 From: Mark Johnston <markj@freebsd.org> To: Kyle Evans <kevans@freebsd.org> Cc: Gleb Smirnoff <glebius@freebsd.org>, Zhenlei Huang <zlei.huang@gmail.com>, "Bjoern A. Zeeb" <bz@freebsd.org>, "freebsd-jail@freebsd.org" <freebsd-jail@freebsd.org> Subject: Re: What's going on with vnets and epairs w/ addresses? Message-ID: <Y6He6OD6PA0ntoK9@nuc> In-Reply-To: <CACNAnaE6UQB=zNBjVNrF%2Bpd%2Bmh=6H0%2BROYf1%2BD=HKBTp8aX27g@mail.gmail.com> References: <5r22os7n-ro15-27q-r356-rps331o06so5@mnoonqbm.arg> <B6C70A88-11F8-40D7-85E4-63BBA0F7931D@FreeBSD.org> <150A60D6-6757-46DD-988F-05A9FFA36821@FreeBSD.org> <Y534qgEG1nX5i1iB@FreeBSD.org> <CACNAnaE6UQB=zNBjVNrF%2Bpd%2Bmh=6H0%2BROYf1%2BD=HKBTp8aX27g@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote: > On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff <glebius@freebsd.org> wrote: > > > > Zhenlei, > > > > On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote: > > Z> I managed to repeat this issue on CURRENT/14 with this small snip: > > Z> > > Z> ------------------------------------------- > > Z> #!/bin/sh > > Z> > > Z> # test jail name > > Z> n="test_ref_leak" > > Z> > > Z> jail -c name=$n path=/ vnet persist > > Z> # The following line trigger jail pr_ref leak > > Z> jexec $n ifconfig lo0 inet 127.0.0.1/8 > > Z> > > Z> jail -R $n > > Z> > > Z> # wait a moment > > Z> sleep 1 > > Z> > > Z> jls -j $n > > Z> > > Z> After DDB debugging and tracing , it seems that is triggered by a combine of [1] and [2] > > Z> > > Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915 <https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915> > > Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b <https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b> > > Z> > > Z> > > Z> In [1] the per-VNET uma zone is shared with the global one. > > Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;` > > Z> > > Z> In [2] unref `inp->inp_cred` is deferred called in inpcb_dtor() by uma_zfree_smr() . > > Z> > > Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is not called immediately , > > Z> thus leaking `inp->inp_cred` ref and hence `prison->pr_ref`. > > Z> > > Z> And it is also not possible to free up the cache by per-VNET SYSUNINIT tcp_destroy / udp_destroy / rip_destroy. > > > > This is known issue and I'd prefer not to call it a problem. The "leak" of a jail > > happens only if machine is idle wrt the networking activity. > > > > Getting back to the problem that started this thread - the epair(4)s not immediately > > popping back to prison0. IMHO, the problem again lies in the design of if_vmove and > > epair(4) in particular. The if_vmove shall not exist, instead we should do a full > > if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove doesn't > > carry any useful information. With Alexander melifaro@ we discussed better options > > for creating or attaching interfaces to jails that if_vmove. Until they are ready > > the most easy workaround to deal with annoying epair(4) come back problem is to > > remove it manually before destroying a jail, like I did in 80fc25025ff. > > > > It still behaved much better prior to eb93b99d6986, which you and Mark > were going to work on a solution for to allow the cred "leak" to close > up much more quickly. CC markj@, since I think it's been six months > since the last time I inquired about it, making this a good time to do > it again... I spent some time trying to see if we could fix this in UMA/SMR and talked to Jeff about it a bit. At this point I don't think it's the right approach, at least for now. Really we have a composability problem where different layers are using different techniques to signal that they're done with a particular piece of memory, and they just aren't compatible. One thing I tried is to implement a UMA function which walks over all SMR zones and synchronizes all cached items (so that their destructors are called). This is really expensive, at minimum it has to bind to all CPUs in the system so that it can flush per-CPU buckets. If jail_deref() calls that function, the bug goes away at least in my limited testing, but its use is really a layering violation. We could, say, periodically scan cached UMA/SMR items and invoke their destructors, but for most SMR consumers this is unnecessary, and again there's a layering problem: the inpcb layer shouldn't "know" that it has to do that for its zones, since it's the jail layer that actually cares. It also seems kind of strange that dying jails still occupy a slot in the jail namespace. I don't really understand why the existence of a dying jail prevents creation of a new jail with the same name, but presumably there's a good reason for it? Now my inclination is to try and fix this in the inpcb layer, by not accessing the inp_cred at all in the lookup path until we hold the inpcb lock, and then releasing the cred ref before freeing a PCB to its zone. I think this is doable based on a few observations: - When doing an SMR-protected lookup, we always lock the returned inpcb before handing it to the caller. So we could in principle perform inp_cred checking after acquiring the lock but before returning. - If there are no jailed PCBs in a hash chain in_pcblookup_hash_locked() always scans the whole chain. - If we match only one PCB in a lookup, we can probably(?) return that PCB without dereferencing the cred pointer at all. If not, then the scan only has to keep track of a fixed number of PCBs before picking which one to return. So it looks like we can perform a lockless scan and keep track of matches on the stack, then lock the matched PCBs and perform prison checks if necessary, without making the common case more expensive. In fact there is a parallel thread on freebsd-jail which reports that this inp_cred access is a source of frequent cache misses. I was surprised to see that the scan calls prison_flag() before even checking the PCB's local address. So if the hash chain is large then we're potentially performing a lot of unnecessary memory accesses (though presumably it's common for most of the PCBs to be sharing a single cred?). In particular we can perhaps solve two problems at once. Any thoughts? Are there some fundamental reasons this can't work?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?Y6He6OD6PA0ntoK9>