Date: Thu, 19 Nov 2020 22:56:45 -0600 From: Kyle Evans <kevans@freebsd.org> To: James Gritton <jamie@freebsd.org> Cc: freebsd-jail <freebsd-jail@freebsd.org> Subject: Re: cpuset/jail creation Message-ID: <CACNAnaF1U_R5wNrLyDc2KkdzsgmyWVeewrrOs4w7J-8vVUoC4g@mail.gmail.com> In-Reply-To: <08c97ed86c3d64fea1cedbc111841b7a@freebsd.org> References: <CACNAnaFC4fhYTC7T3zWzEsHO=M-7Ny9KNxh47-Jdi_4yha%2BzZg@mail.gmail.com> <08c97ed86c3d64fea1cedbc111841b7a@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Nov 18, 2020 at 12:39 PM James Gritton <jamie@freebsd.org> wrote: > > On 2020-11-17 10:40, Kyle Evans wrote: > > Hello! > > > > I've done some work to try and make qemu-user-static honor cpuset and > > advertise a fake hw.ncpu to emulated processes based on the number of > > cpus actually available to it. > > > > In doing so, I discovered that created jail's inherit the parent > > jail's cpuset mask, rather than the creating process. This is OK for > > persistent jails, as one can create the jail, cpuset(1), then jexec > > whatever tasks they want to do; but jails that would otherwise not be > > persistent have to either deal with racy cpuset(1) after creation or > > settle for instead creating a persistent jail because they need a > > specific cpuset. > > > > I've got this patch that I'd like to propose[0], but it's unclear if > > it's really OK to do or if anyone else cares about this. I can't see > > any theoretical problem with it off-hand, as the creating thread > > should be guaranteed to have a cpuset that's valid as a child of the > > parent prison's cpuset. > > > > The patch creates a new poorly-named cpuset_create_root_td KPI to > > inherit the cpuset from the creating thread, and leaves the previous > > KPI intact in case something else is using it or to leave the door > > open to adding an option to go either way with this (inherit from > > parent jail vs. inherit from thread). > > > > From a MAC perspective, I think it makes a lot of sense to inherit > > from the thread by default. e.g. a non-root user could be granted > > PRIV_JAIL_SET, then they're freely able to create jails using the > > parent jail's root cpuset even if they've been limited themselves via > > login.conf(5) restriction. > > > > For most existing use-cases, it should effectively be a nop unless > > they were cpuset(1)ing a process not expecting the created jail to > > inherit that. > > > > Thoughts? Thanks, > > > > Kyle Evans > > > > [0] https://people.freebsd.org/~kevans/jail-cpuset.diff > > There are two separate issues here: the cpuset of the jail, and the > cpuset of the processes in the jail. > > For the first, I think the current code does the right thing, > inheriting the cpuset from the parent jail. > > For the second, the current code appears to be *almost* correct. When > you attach to a jail, including a non-persistent one, you keep the > thread's cpuset, but it is masked to the jail's cpuset. Thus a jailed > process is always: > * at least as restricted as the jail's cpuset. > * at least as restricted as the process' original cpuset. > > So a process that has a restricted cpuset may be able to create a jail > that allows CPUs that the process lacks access to, but it still can't > run on those CPUs. This applies to non-persistent jails: the cpuset > of the jail ends up being somewhat illusory, since the only thing > running in the jail is still restricted to its original cpuset (or > less, if the parent jail is also restricted). True, if some other > process later attaches to that jail it could use CPUs that the jail > creator couldn't. But IMHO that's as it should be: this is only if > the later-attaching process had access to those CPUs anyway. > Sure, ok, that makes sense. > Now I mention that the current behavior is almost correct. In trying > this out, I found that the attached process keeps its cpuset, with a > mask applied from the jail cpuset. I think it would be better the > other way around: use the jail's cpuset, and mask it to the current > thread mask. Interestingly, if attaching to a jail from a normal > (unrestricted) system thread, that's exactly what happens. This is > down to an implementation detail of cpuset_setproc_setthread that > looks simple to change but that where I'm unsure of side effects > without further exploration. The only case where this matters is the > "racy cpuset(1) after creation" you mentioned, where setting the > jail's cpu list wouldn't change anything running in the non-persistent > jail, because those are still in the process' cpuset (even if masked). > > So I don't think cpuset_create_root_td() is necessary, but I do think > cpuset_setproc_sethread and cpuset_setproc_setthread_mask may need > some tweaking > I can also agree with that, and that the changes are a little more involved. cpuset_setproc w/ non-NULL set is also used for cpuset(2) and cpuset_setid(2), so the current contract needs to be maintained for those interfaces because that is pretty much the essence of what they do. I'm having a hard time reconciling this, though: * 1) Set is non-null. This reparents all anonymous sets to the provided * set and replaces all non-anonymous td_cpusets with the provided set The verbiage would seem to indicate to me that the new parent provided by cpuset_setproc_setthread() should be set, but it's not: return cpuset_shadow(tdset, nsetp, &mask, &domain, freelist, domainlist); cpuset_shadow makes tdset the parent, unless it's anon -- then it uses the base of tdset, which appears to be wrong. I think this really wants to be passing 'set' as the first parameter to meet the promise of cpuset_settproc. I've mulled over the rest of it a little bit, and I think this is what I came up with based on your assessment: https://people.freebsd.org/~kevans/cpuset-reroot.diff The basic premise of this patch is that we shouldn't use the set passed to cpuset_setproc directly if it has a different root than whatever root the process has [*]. More importantly, the process should not lose the identity of its base in the move. Currently, in the scenario that I'm going through, the process and its single thread goes from having its own cpuset 5 that was just created with a different affinity to an anonymous set hanging off of the jail's cpuset. We can make the obvious change and make that an anonymous set with the correct mask, but I think that's actually not an ideal behavior; if you inspect the process cpuset id from before and after the attach, it goes from having a cpuset id that you can modify to one that you cannot modify because the one thread has an anonymous cpuset with parent set to the prison's cpuset. On the other hand, let's go the other way with it: say I have a process based on cpuset 1 and just a different affinity, so the one thread has an anonymous cpuset based on the mother of all prison0 processes. With the above patch applied, it suddenly gets its own numbered cpuset [**]. [*] This isn't quite right, though, because unjailed processes can cpuset_setid(2) any cpuset on the system, jailed or not. We likely need to have the caller indicate if this is a jail update for the process cpuset. [**] I think this particular issue can be easily addressed by walking through all of the threads in the process one more time, just before we allocate a new base cpuset for it if we're going to. If every thread has an anonymous set, then we can simply avoid creating a new base for it and just keep the existing behavior, which will do the right thing -- create a new anonymous set hanging off of the jail's root. The jail's root set can be modified by real root, so from that perspective nothing has changed. This should be safe, even, because we hold the proc lock and all of the paths that could modify one of its thread cpusets to invalidate the walk must also grab the proc lock.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CACNAnaF1U_R5wNrLyDc2KkdzsgmyWVeewrrOs4w7J-8vVUoC4g>