Date: Tue, 20 Dec 2022 11:12:24 -0500
From: Mark Johnston
To: Kyle Evans
Cc: Gleb Smirnoff, Zhenlei Huang, "Bjoern A. Zeeb", "freebsd-jail@freebsd.org"
Subject: Re: What's going on with vnets and epairs w/ addresses?
List-Id: Discussion about FreeBSD jail(8)
List-Archive: https://lists.freebsd.org/archives/freebsd-jail

On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:
> On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff wrote:
> >
> > Zhenlei,
> >
> > On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
> > Z> I managed to repeat this issue on CURRENT/14 with this small snip:
> > Z>
> > Z> -------------------------------------------
> > Z> #!/bin/sh
> > Z>
> > Z> # test jail name
> > Z> n="test_ref_leak"
> > Z>
> > Z> jail -c name=$n path=/ vnet persist
> > Z> # The following line trigger jail pr_ref leak
> > Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
> > Z>
> > Z> jail -R $n
> > Z>
> > Z> # wait a moment
> > Z> sleep 1
> > Z>
> > Z> jls -j $n
> > Z>
> > Z> After DDB debugging and tracing , it seems that is triggered by a combine of [1] and [2]
> > Z>
> > Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915
> > Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b
> > Z>
> > Z>
> > Z> In [1] the per-VNET uma zone is shared with the global one.
> > Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
> > Z>
> > Z> In [2] unref `inp->inp_cred` is deferred called in inpcb_dtor() by uma_zfree_smr() .
> > Z>
> > Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is not called immediately ,
> > Z> thus leaking `inp->inp_cred` ref and hence `prison->pr_ref`.
> > Z>
> > Z> And it is also not possible to free up the cache by per-VNET SYSUNINIT tcp_destroy / udp_destroy / rip_destroy.
> >
> > This is known issue and I'd prefer not to call it a problem. The "leak" of a jail
> > happens only if machine is idle wrt the networking activity.
> >
> > Getting back to the problem that started this thread - the epair(4)s not immediately
> > popping back to prison0. IMHO, the problem again lies in the design of if_vmove and
> > epair(4) in particular. The if_vmove shall not exist, instead we should do a full
> > if_attach() and if_detach(). The state of an ifnet when it undergoes if_vmove doesn't
> > carry any useful information. With Alexander melifaro@ we discussed better options
> > for creating or attaching interfaces to jails that if_vmove. Until they are ready
> > the most easy workaround to deal with annoying epair(4) come back problem is to
> > remove it manually before destroying a jail, like I did in 80fc25025ff.
> >
>
> It still behaved much better prior to eb93b99d6986, which you and Mark
> were going to work on a solution for to allow the cred "leak" to close
> up much more quickly. CC markj@, since I think it's been six months
> since the last time I inquired about it, making this a good time to do
> it again...

I spent some time trying to see if we could fix this in UMA/SMR and
talked to Jeff about it a bit.  At this point I don't think it's the
right approach, at least for now.  Really we have a composability
problem: different layers use different techniques to signal that
they're done with a particular piece of memory, and those techniques
just aren't compatible.

One thing I tried is to implement a UMA function which walks over all
SMR zones and synchronizes all cached items (so that their destructors
are called).  This is really expensive: at minimum it has to bind to
every CPU in the system so that it can flush the per-CPU buckets.  If
jail_deref() calls that function, the bug goes away, at least in my
limited testing, but its use is really a layering violation.

We could, say, periodically scan cached UMA/SMR items and invoke their
destructors, but for most SMR consumers this is unnecessary, and again
there's a layering problem: the inpcb layer shouldn't "know" that it has
to do that for its zones, since it's the jail layer that actually cares.

It also seems kind of strange that dying jails still occupy a slot in
the jail namespace.  I don't really understand why the existence of a
dying jail prevents creation of a new jail with the same name, but
presumably there's a good reason for it?

Now my inclination is to try and fix this in the inpcb layer, by not
accessing the inp_cred at all in the lookup path until we hold the inpcb
lock, and then releasing the cred ref before freeing a PCB to its zone.
I think this is doable based on a few observations:

- When doing an SMR-protected lookup, we always lock the returned inpcb
  before handing it to the caller.  So we could in principle perform
  inp_cred checking after acquiring the lock but before returning.
- If there are no jailed PCBs in a hash chain,
  in_pcblookup_hash_locked() always scans the whole chain.
- If we match only one PCB in a lookup, we can probably(?) return that
  PCB without dereferencing the cred pointer at all.  If not, then the
  scan only has to keep track of a fixed number of PCBs before picking
  which one to return.

So it looks like we can perform a lockless scan and keep track of
matches on the stack, then lock the matched PCBs and perform prison
checks if necessary, without making the common case more expensive.

In fact there is a parallel thread on freebsd-jail which reports that
this inp_cred access is a source of frequent cache misses.  I was
surprised to see that the scan calls prison_flag() before even checking
the PCB's local address.  So if the hash chain is large then we're
potentially performing a lot of unnecessary memory accesses (though
presumably it's common for most of the PCBs to be sharing a single
cred?).  So we can perhaps solve two problems at once.

Any thoughts?  Are there some fundamental reasons this can't work?
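
To make the proposed lookup a bit more concrete, here is a rough
userspace sketch of the two-phase idea: scan the chain without touching
the cred, remember matches on the stack, and only look at the cred once
a candidate is locked.  This is not the in_pcb.c code.  struct pcb,
struct cred, pcb_lookup(), pcb_rank(), PCB_MAX_CAND and the cred_checks
counter are all hypothetical names made up for illustration, the "lock"
is a plain flag, and the ranking rule (exact address beats wildcard,
jailed beats non-jailed) is an assumption rather than the kernel's
actual preference order.  SMR sections and revalidation of a PCB after
locking are not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCB_MAX_CAND 4		/* hypothetical cap on tracked matches */

struct cred {			/* stand-in for the credential */
	bool jailed;
};

struct pcb {			/* stand-in for an inpcb */
	struct pcb *next;	/* hash chain link */
	uint32_t laddr;		/* local address, 0 means wildcard */
	uint16_t lport;		/* local port */
	struct cred *cred;	/* only dereferenced while locked */
	bool locked;		/* stand-in for the inpcb lock */
};

static int cred_checks;		/* counts cred dereferences, for the demo */

static void
pcb_lock(struct pcb *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void
pcb_unlock(struct pcb *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Rank a locked candidate; this is the only place the cred is read. */
static int
pcb_rank(struct pcb *p, uint32_t laddr)
{
	assert(p->locked);
	cred_checks++;
	return ((p->laddr == laddr ? 2 : 0) + (p->cred->jailed ? 1 : 0));
}

/* Returns the chosen PCB locked, or NULL if nothing matches. */
struct pcb *
pcb_lookup(struct pcb *head, uint32_t laddr, uint16_t lport)
{
	struct pcb *cand[PCB_MAX_CAND], *best = NULL, *p;
	int best_rank = -1, i, n = 0, rank;

	/*
	 * Phase 1: scan using address and port only, no cred access.
	 * Matches beyond PCB_MAX_CAND are dropped in this sketch; real
	 * code would need a defined fallback for that case.
	 */
	for (p = head; p != NULL; p = p->next) {
		if (p->lport != lport)
			continue;
		if (p->laddr != laddr && p->laddr != 0)
			continue;
		if (n < PCB_MAX_CAND)
			cand[n++] = p;
	}
	if (n == 0)
		return (NULL);

	/* Phase 2: a single match is returned without reading its cred. */
	if (n > 1) {
		for (i = 0; i < n; i++) {
			pcb_lock(cand[i]);
			rank = pcb_rank(cand[i], laddr);
			pcb_unlock(cand[i]);
			if (rank > best_rank) {
				best_rank = rank;
				best = cand[i];
			}
		}
	} else
		best = cand[0];
	pcb_lock(best);
	return (best);
}
```

With a chain holding one PCB on port 22 and two on port 80 (one bound to
the address by a jailed cred, one wildcard and unjailed), a lookup for
port 22 returns without any cred access, while a lookup for port 80
reads two creds and picks the jailed, address-bound PCB.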