Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 13 Dec 2022 19:03:54 +0000 (UTC)
From:      "Bjoern A. Zeeb" <bz@freebsd.org>
To:        James Gritton <jamie@freebsd.org>
Cc:        jail@freebsd.org, "glebius@FreeBSD.org" <glebius@freebsd.org>,  Andrew Gallatin <gallatin@gmail.com>
Subject:   Re: prison_flag() check in hot path of in_pcblookup()
Message-ID:  <89pn26q0-pps9-q8n7-1334-q15o5896p6p@serrofq.bet>
In-Reply-To: <e5ef5a4dfae8f7723c10dfb8db9b7d9a@freebsd.org>
References:  <CADwhF6VuoPCNEqyBmt%2BdZgDwHdaGty2%2BsYU4eYg0_62CMHq-BA@mail.gmail.com> <e5ef5a4dfae8f7723c10dfb8db9b7d9a@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 13 Dec 2022, James Gritton wrote:

Hi,

Argh, sorry Drew, I looked at the wrong of the two checks in the
function earlier.  Sorry, that's what happens if trying to be helpful
when firefighting elsewhere.

> On 2022-12-13 09:18, Andrew Gallatin wrote:
>
>> I was trying to improve the performance of in_pcblookup(), as it is a very 
>> hot path for us (Netflix). One thing I noticed was the prison_flag() check 
>> in in_pcblookup_hash_locked() can cause a cache miss just by deref'ing the 
>> cred pointer, and it can also cause multiple misses in tables with 
>> collisions by causing us to walk the entire chain even after finding a 
>> perfect match.
>> 
>> I'm curious why this check is needed.  Can you explain it to me?  It 
>> originated in this commit:
>> 
>> commit 413628a7e3d23a897cd959638d325395e4c9691b
>> Author: Bjoern A. Zeeb <bz@FreeBSD.org>
>> Date:   Sat Nov 29 14:32:14 2008 +0000
>> 
>> MFp4:
>> Bring in updated jail support from bz_jail branch.
>> 
>> This enhances the current jail implementation to permit multiple
>> addresses per jail. In addtion to IPv4, IPv6 is supported as well.
>> 
>> My thinking is that a jail will either use the host IP, and share its port 
>> space, or it will have its own IP entirely (but I know nothing about 
>> jails).

Well having its own IP address entirely doesn't work with classic jails
as there is only one network stack where they can be configured on
an interface, and that is the base system.

So de-facto all jail address/port space is always shared with the host
(ignoring vnet jails with their own entire virtual network stack).
Whether the host would bind/listen is a different story.  I know 15
years ago people would bind the sshd of the base to a single-IP address
which was not assigned to jails and then the sshd inside a jail would
bind to a different (single) IP address.   If one doesn't do that and the
sshd inside the jail isn't runnig one would end up trying to connect to
the sshd in the base system (sshd being a popular example).  Not sure if
people still do but that's still the case.

There are special cases with classic multi-IP jails in that they then
cannot have overlapping IP-ranges as otherwise it would not be
deterministic which jail would get an inbound connection on
inaddr_any:port_n if two jails were to listen on the same port_n.

Hope that helps for basic understanding.


>> In either case, a perfect 4-tuple match should be enough to 
>> uniquely identify the connection.
>> 
>> Even if this somehow is not the case and we have multiple connections 
>> somehow sharing the same 4-tuple, how does checking the prison flag help 
>> us?

That logic predates me and came from [1].  The jail_jailed_sockets_first
sysctl got removed in the review process with rwatson.  I am still trying
to see where the SO_REUSEPORT comment (back then) came from.  I know I
only had the first lines initially, so must have been sometime during
review with rwatson as well.   Sadly p4 emails where truncated to 1000
lines so I cannot simply grep for the change (if it is in my mail
archives) or had a useful commit message (but at least would give a
date to check further private email).

My current guess is that if we have the 4-tuple in both the base
and a jail (hence the SO_REUSEPORT comment) we want the jail not getting
a socket of the base system returned as that would mean one could "break
out of prison".  But if the inp belongs to a jail we know we can simply
return.  So if you find the one of the base system first you'll have to
go and look through the others.

XXX-jamie: is that all still true in hierarchical jails?

Now whether fabricating the case of having both inps on the hash is
still theorectically possible I cannot say.  I haven't followed all the
changes to that code lately close enough.


>> It would prefer the jailed connection over the non jailed, but that 
>> would shadow a host connection.  And if we had 2 jails sharing the same 
>> 4-tuple, the first jail would win.
>> 
>> I can't see how this check is doing anything useful, so I'd very much like 
>> to remove this check if possible.   Untested patch attached.
>
> For a complete 4-tuple, it should indeed be the case that a match would only 
> ever identify a single prison.  The later part of the function that examines 
> wildcards definitely needs the check.  I don't get the XXX comment about both 
> being bound with SO_REUSEPORT, because I would only expect that to apply to 
> listening, not to full connections. But I also expect Bjoern to know more 
> than I do here...

/bz

[1] https://people.freebsd.org/~pjd/patches/jail_2004120901.patch

-- 
Bjoern A. Zeeb                                                     r15:7



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?89pn26q0-pps9-q8n7-1334-q15o5896p6p>