Date: Fri, 22 Feb 2008 17:12:46 -0600
From: Brooks Davis <brooks@freebsd.org>
To: Jeff Roberson <jroberson@chesapeake.net>
Cc: Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org,
    Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>,
    Andrew Gallatin <gallatin@cs.duke.edu>
Subject: Re: getaffinity/setaffinity and cpu sets.
Message-ID: <20080222231245.GA28788@lor.one-eyed-alien.net>
In-Reply-To: <20080222121253.N920@desktop>
References: <20080112194521.I957@desktop> <20080219234101.D920@desktop>
    <20080220101348.D44565@fledge.watson.org> <20080220005030.Y920@desktop>
    <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org>
    <20080220175532.Q920@desktop> <20080220213253.A920@desktop>
    <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop>

On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>
> On Thu, 21 Feb 2008, Robert Watson wrote:
>
>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>
>>> I also have a 'cpuset' command which can run a new program with a given
>>> cpu set, view and modify sets of arbitrary pids. This is all working and
>>> I can supply patches if anyone is interested. I have to implement 4BSD
>>> support before I can commit.
>>>
>>> I have a proposal for Solaris style processor sets which I think is
>>> simple and sufficient for most cases. It involves the following new
>>> syscalls:
>>>
>>> int cpuset(void);
>>> int setcpuset(pid_t pid, int setid);
>>> int getcpuset(pid_t pid);
>>>
>>> The notion would be that you can create a new numbered cpuset with
>>> cpuset(). You can modify or inspect its affinity with get/setaffinity
>>> above and the CPU_WHICH_SET argument. The cpuset exists as long as there
>>> are members of the set. Sort of like a process group or session. The
>>> {get,set}cpuset calls can inspect or modify the state.
>>>
>>> This set would not be modifiable by user processes or by processes in a
>>> jail. It would create the restriction that differs between 'avail' and
>>> 'sys' above. Processes would be able to bind directly to any processor
>>> within the set. Changing the set would apply to all processes in the
>>> set. The cpuset would be per-process while the mask is per-thread. Set
>>> membership is inherited on fork().
>>>
>>> In Solaris, sets can be named and have a more complete management api.
>>> I'm not really interested in implementing all of that but I believe what
>>> I have outlined here would be a subset of this and no code/syscalls
>>> would be wasted.
>>>
>>> Comments? Objections? I'm fairly pleased with this arrangement now.
>>
>> Just to put a few notes from our conversation on IRC in e-mail:
>>
>> - I think I'd prefer int cpuset(cpuset_t *set), int getcpuset(pid_t,
>> cpuset_t *) so that we don't mix up IDs and return values. More recent
>> interfaces tend to do this, I believe, and it means that the prototype,
>> even if not the ABI, remains the same if the set identifier changes in
>> the future.
>
> Ok, this is a good suggestion and I did this. This is actually my
> preferred method as well but most syscalls don't follow this pattern and I
> was trying to make it look syscallish.
>
>> - You don't mention what happens if a process's cpu set changes to
>> preclude a CPU the process has a thread with affinity for. Online, you
>> suggested SIGKILL, and I thought maybe a new SIGCPUGONE with a default
>> SIGKILL action might be a friendlier model. We should see what Solaris
>> and others do here though. I like the idea that the affinity is a
>> guarantee in userspace because it means that you can rely on it; I'm OK
>> with the idea that your thread always runs on the CPUs you have affinity
>> for unless in the SIGCPUGONE handler :-).
>
> I could also reject changes to the cpuset if they leave a thread with
> nothing to run on. It might be confusing for the administrator and hard to
> tell them which thread caused the problem. However, it might be nicer than
> killing a thread as well.
>
> Another option would be to expel the offending thread from the set that is
> in violation and reparent it to the real system root along with a syslog
> message or similar. If the administrator addressed the problem with the
> set he could then reassign the grouping.
>
> This is what I would most like comments about. Should we have a force
> mode? Which of these behaviors sound best to you?

Refusing by default and reparenting when forced sounds right to me. There
might also be some value in adding the ability to signal all
processes/threads bound to a cpu set so you can kill them if that's what you
want to do.

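To make the refuse-by-default, reparent-when-forced behavior concrete, here
is a minimal userland sketch. Everything in it is hypothetical and exists
only for illustration: the 64-bit `cpumask_t`, `struct sketch_thread`, and
`sketch_set_modify()` are not the kernel structures or the proposed syscalls,
and reparenting is modeled simply as handing the thread the root mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the FreeBSD kernel structures. */
typedef uint64_t cpumask_t;		/* one bit per CPU, up to 64 CPUs */

struct sketch_thread {
	cpumask_t affinity;		/* CPUs this thread may run on */
	int	  in_root;		/* 1 once expelled to the root set */
};

enum { SET_OK = 0, SET_REFUSED = -1 };

/*
 * Try to change a set's mask to new_mask.
 * Without force: refuse if any member thread would have no CPU left.
 * With force: apply the change and move stranded threads to the root set
 * (modeled here by giving them root_mask).
 * *stranded reports how many threads the new mask would strand.
 */
int
sketch_set_modify(cpumask_t *set_mask, cpumask_t new_mask,
    cpumask_t root_mask, struct sketch_thread *threads, size_t nthreads,
    int force, size_t *stranded)
{
	size_t i, n = 0;

	assert(set_mask != NULL && stranded != NULL);

	/* First pass: count threads whose affinity misses the new mask. */
	for (i = 0; i < nthreads; i++)
		if (!threads[i].in_root &&
		    (threads[i].affinity & new_mask) == 0)
			n++;
	*stranded = n;

	if (n > 0 && !force)
		return (SET_REFUSED);	/* default: nothing is modified */

	/* Second pass: reparent stranded threads, clip the others. */
	for (i = 0; i < nthreads; i++) {
		if (threads[i].in_root)
			continue;
		if ((threads[i].affinity & new_mask) == 0) {
			threads[i].in_root = 1;
			threads[i].affinity = root_mask;
		} else {
			threads[i].affinity &= new_mask;
		}
	}
	*set_mask = new_mask;
	return (SET_OK);
}
```

Doing the check as a separate first pass means a refused request leaves every
thread untouched, and the stranded count gives an administrative tool
something to report before the operator retries with force.
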
>> - It would be nice to be able to use CPU sets in jail as well,
>> suggesting a hierarchical model with some sort of tagging so you know
>> what CPU sets were created in a jail such that you know whether they can
>> be changed in a jail. While I recognize this makes things a lot more
>> tricky, I think we should basically be planning more carefully with
>> respect to virtualization when we add new interfaces, since it's a
>> widely used feature, and the current set of "stragglers" unsupported in
>> Jail is growing rather than shrinking.
>
> I have implemented a hierarchical model. Each thread has a pointer to the
> cpuset that it's in. If it makes a local modification via setaffinity() it
> gets an anonymous cpuset that is a child of the set assigned to the
> process. This anonymous set will also be inherited across fork/thread
> creation.
>
> In this model presently there are nodes marked as root. To query the
> 'system' cpus available we walk up from the current node until we find a
> root. These are the 'system' set. A thread may not break out of its
> system set. A process may join the root set but it may not modify a root
> that is a parent. Jails would create a new root. A process outside of the
> jail can modify the set of processors in the jail but a process within the
> jail/root may not.
>
> The next level down from the root is the assigned set. The root may be an
> assigned set or this may be a subset of the root. Processes may create
> sets which are parented back to their root and may include any processors
> within their root. The mask of the assigned set is returned as 'available'
> processors.
>
> This gives a 1 to 3 level hierarchy. The root, an assigned set, and an
> anonymous set. Any of these but the root may be omitted. There is no
> current way for userland to create subsets of assigned sets to permit
> further nesting. I'm not sure I see value in it right now and it gives the
> possibility of unbounded tree depth.
>
> Anonymous sets are immutable as they are shared and changes only apply to
> the thread/pid in the WHICH argument and not others which have inherited
> from it. Anonymous sets have no id and may not be specifically manipulated
> via a setid. You must refer to the process/thread. From the
> administration point of view they don't exist.
>
> When a set is modified we walk down the children recursively and apply the
> new mask. This is done with a global set lock under which all
> modifications and tree operations are performed. The td_cpuset pointer is
> protected under the thread_lock() and may read the set without a lock. This
> gives the possibility for certain kinds of races but I believe they are all
> safe.
>
> Hopefully I explained that well enough for people to follow. I realize
> it's a lot of text but it's fairly simple bookkeeping code. This is all
> implemented and I'm debugging now.

One place I'd like to implement CPU affinity is in the Sun Grid Engine
execution daemon. I think anonymous sets would not be sufficient there
because the model allows new tasks to be started on a particular node at
any time during a parallel job. I'd have to do some more digging in the
code to be entirely certain. I think the fewer limits we place on the
hierarchy, the better off we'll be unless there are compelling complexity
reasons to avoid them.

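For readers following the hierarchy discussion, the two tree walks Jeff
describes (up to the nearest root to find the 'system' set, and down through
the children to apply a new mask) might look roughly like the sketch below.
This is a guess at the shape of the bookkeeping, not the actual patch: the
`struct sketch_set` layout, the `SKETCH_ROOT` flag, and the function names
are all made up here, locking is omitted entirely, and intersecting each
child's mask with its parent's new mask is an assumption about what "apply
the new mask" means.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpumask_t;		/* hypothetical: one bit per CPU */

#define	SKETCH_ROOT	0x1		/* node is a root (system or jail) */

struct sketch_set {
	struct sketch_set *parent;	/* NULL for the top of the tree */
	struct sketch_set *child;	/* first child, or NULL */
	struct sketch_set *sibling;	/* next child of the same parent */
	cpumask_t	   mask;
	int		   flags;
};

/* Link a child set under a parent (pushes onto the child list). */
void
sketch_set_link(struct sketch_set *parent, struct sketch_set *child)
{
	assert(parent != NULL && child != NULL);
	child->parent = parent;
	child->sibling = parent->child;
	parent->child = child;
}

/* Walk up from a set until a node marked as root is found. */
struct sketch_set *
sketch_set_root(struct sketch_set *set)
{
	assert(set != NULL);
	while ((set->flags & SKETCH_ROOT) == 0 && set->parent != NULL)
		set = set->parent;
	return (set);
}

/*
 * Apply a new mask to a set and recursively clip every descendant.
 * Assumption: a child keeps only the CPUs that remain in its parent.
 */
void
sketch_set_apply(struct sketch_set *set, cpumask_t mask)
{
	struct sketch_set *c;

	set->mask = mask;
	for (c = set->child; c != NULL; c = c->sibling)
		sketch_set_apply(c, c->mask & mask);
}
```

Nothing in these walks depends on the tree being at most three levels deep,
although the recursion in the downward walk is where an unbounded depth
would start to cost something. A descendant whose mask ends up empty is
exactly the stranded case discussed earlier in the thread.
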
>> - There's still no way to specify an affinity policy rather than explicit
>> affinity, but if our CPU set model is sufficiently general, that might be
>> a vehicle to do that. I.e., cpuset_setpolicy() rather than setting a
>> mask.
>
> Yes, I think this is orthogonal and can be addressed separately. I'm not
> sure how many userland programs are smart enough or even capable of making
> determinations about their cache behavior however. We should open another
> discussion once this one is done.
>
>>
>> - In the interests of boring API changes, recent APIs tend to prefix the
>> method on the object name. Have you thought about cpuset_create(),
>> cpuset_foo(), etc? That reduces the chances of interfering with
>> application namespaces. I think, anyway. :-).
>
> Yes, I prefer that as well, as I mentioned syscalls tended to favor
> brevity. I'm fine with changing that trend.
>
>>
>> I need to ponder the proposal a little more, ideally over a hot beverage
>> this morning, and will follow up if I have further thoughts. Thanks for
>> working on this, BTW -- affinity is well-overdue for FreeBSD.
>
> A little more to ponder now! Your feedback is much appreciated.
>
> I believe the present hierarchical model satisfies the jail requirements of
> restricting cpus in the jail while still allowing the jail to create sets.
>
> The unanswered questions are:
>
> 1) What to do about sets that strand threads, options described above.
> 2) Are people ok with the transient nature of sets?
> 3) Does anyone want to help with man pages, administrative tools, etc? I
> have a prototype tool called 'cpuset' that fully exercises the api but is
> probably ugly. Will post details soon.

I could help with some of this as it furthers a funded project at work.
-- Brooks
