Date:      Sat, 23 Feb 2008 15:35:07 -0600
From:      Brooks Davis <brooks@freebsd.org>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080223213507.GD39699@lor.one-eyed-alien.net>
In-Reply-To: <20080223111659.K920@desktop>
References:  <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop> <20080222231245.GA28788@lor.one-eyed-alien.net> <20080222134923.M920@desktop> <20080223194047.GB38485@lor.one-eyed-alien.net> <20080223111659.K920@desktop>


On Sat, Feb 23, 2008 at 11:21:33AM -1000, Jeff Roberson wrote:
>
> On Sat, 23 Feb 2008, Brooks Davis wrote:
>
>> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>>
>>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>>
>>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>>
>>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>
>>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>>  suggesting a hierarchal model with some sort of tagging so you
>>>>>>  know what CPU sets were created in a jail such that you know
>>>>>>  whether they can be changed in a jail.  While I recognize this
>>>>>>  makes things a lot more tricky, I think we should basically be
>>>>>>  planning more carefully with respect to virtualization when we
>>>>>>  add new interfaces, since it's a widely used feature, and the
>>>>>>  current set of "stragglers" unsupported in Jail is growing rather
>>>>>>  than shrinking.
>>>>>
>>>>> I have implemented a hierarchical model.  Each thread has a pointer
>>>>> to the cpuset that it's in.  If it makes a local modification via
>>>>> setaffinity() it gets an anonymous cpuset that is a child of the
>>>>> set assigned to the process.  This anonymous set will also be
>>>>> inherited across fork/thread creation.
>>>>>
>>>>> In this model presently there are nodes marked as root.  To query
>>>>> the 'system' cpus available we walk up from the current node until
>>>>> we find a root.  These are the 'system' set.  A thread may not
>>>>> break out of its system set.  A process may join the root set but
>>>>> it may not modify a root that is a parent.  Jails would create a
>>>>> new root.  A process outside of the jail can modify the set of
>>>>> processors in the jail but a process within the jail/root may not.
>>>>>
>>>>> The next level down from the root is the assigned set.  The root
>>>>> may be an assigned set or this may be a subset of the root.
>>>>> Processes may create sets which are parented back to their root
>>>>> and may include any processors within their root.  The mask of the
>>>>> assigned set is returned as 'available' processors.
>>>>>
>>>>> This gives a 1 to 3 level hierarchy. The root, an assigned set, and
>>>>> an anonymous set.  Any of these but the root may be omitted.  There
>>>>> is no current way for userland to create subsets of assigned sets
>>>>> to permit further nesting.  I'm not sure I see value in it right
>>>>> now and it gives the possibility of unbound tree depth.
>>>>>
>>>>> Anonymous sets are immutable as they are shared and changes only
>>>>> apply to the thread/pid in the WHICH argument and not others which
>>>>> have inherited from it.  Anonymous sets have no id and may not be
>>>>> specifically manipulated via a setid.  You must refer to the
>>>>> process/thread.  From the administration point of view they don't
>>>>> exist.
>>>>>
>>>>> When a set is modified we walk down the children recursively and
>>>>> apply the new mask.  This is done with a global set lock under
>>>>> which all modifications and tree operations are performed.  The
>>>>> td_cpuset pointer is protected under the thread_lock() and may read
>>>>> the set without a lock.  This gives the possibility for certain
>>>>> kinds of races but I believe they are all safe.
>>>>>
>>>>> Hopefully I explained that well enough for people to follow.  I
>>>>> realize it's a lot of text but it's fairly simple book keeping
>>>>> code.  This is all implemented and I'm debugging now.
>>>>
>>>> One place I'd like to implement CPU affinity is in the Sun Grid
>>>> Engine execution daemon.  I think anonymous sets would not be
>>>> sufficient there because the model allows new tasks to be started
>>>> on a particular node at any time during a parallel job.  I'd have
>>>> to do some more digging in the code to be entirely certain.  I
>>>> think the fewer limits we place on the hierarchy, the better off
>>>> we'll be unless there are compelling complexity reasons to avoid
>>>> them.
>>>
>>> With the anonymous set you can bind any thread to any cpu that is
>>> visible to it.  How would this not work?
>>
>> I'm still trying to wrap my head around the anonymous sets.  Is the idea
>> that once you are in an anonymous set, you can't expand it, or can you
>> expand out as far as the assigned set?  I'd like for parallel jobs to
>> be allocated a set of cpus that they can't change, but still be able
>> to make their own decisions about thread affinity if they desire (for
>> example OpenMPI has some support for this so processes stay put and in
>> theory benefit from positive cache effects).  If that's feasible in
>> this model, I'm happy with it.  I think we should keep in mind that these
>> SGE execution daemons might be sitting inside jails. ;-)
>
> Ah, when I said the anonymous sets were immutable, that only means that
> they are copy-on-write.  Because you can't know who shares a copy via
> fork or thread creation you must make a new set each time you write.
>
> I made the anonymous sets so that the parent would have a list of all
> derivative children sets so that modifications to the parent would be
> reflected in the child.  This also means that the scheduler only has to
> look at one bitmap to determine the available cpus for a thread.
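
To make the propagation concrete, here is a rough sketch in C.  None of
it is taken from the actual patch: struct sketch_set, sketch_add_child(),
sketch_apply(), the 64-bit mask and the fixed-size child array are all
made up for illustration.  In particular, clipping a child to
"wanted & parent" is only an assumption about how a parent change would
be reflected in its children.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_MAXCHILD	4	/* arbitrary limit, for the sketch only */

/* Hypothetical set node; not the kernel's real structure. */
struct sketch_set {
	uint64_t	 wanted;	/* mask requested for this set */
	uint64_t	 effective;	/* wanted, clipped to the parent */
	struct sketch_set *children[SKETCH_MAXCHILD];
	size_t		 nchildren;
};

/* Link a child under a parent and compute its clipped mask. */
static void
sketch_add_child(struct sketch_set *parent, struct sketch_set *child)
{
	assert(parent->nchildren < SKETCH_MAXCHILD);
	parent->children[parent->nchildren++] = child;
	child->effective = child->wanted & parent->effective;
}

/*
 * Recompute the effective mask of 'set' given the CPUs its parent
 * makes available, then walk down the children recursively.
 */
static void
sketch_apply(struct sketch_set *set, uint64_t avail)
{
	size_t i;

	set->effective = set->wanted & avail;
	for (i = 0; i < set->nchildren; i++)
		sketch_apply(set->children[i], set->effective);
}
```

With this layout a scheduler would only ever read the effective field of
the set a thread points at, which matches the "one bitmap" remark above.
What should happen when clipping leaves a child with an empty mask is not
covered here.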

I think the anonymous sets seem like a good idea.  One solution to my
problem might be to make changing your current set to something that
is not a subset of your parent (or maybe your current set?) privileged.
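
Roughly, the check I have in mind would look like the sketch below.  The
names (sketch_mask_t, mask_is_subset(), may_change_set()) and the plain
64-bit mask are hypothetical, and the "privileged" flag stands in for
whatever privilege check the kernel would really use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-CPU mask; the real type is not shown in this thread. */
typedef uint64_t sketch_mask_t;

/* True when every CPU in 'child' is also present in 'parent'. */
static bool
mask_is_subset(sketch_mask_t child, sketch_mask_t parent)
{
	return ((child & ~parent) == 0);
}

/*
 * Decide whether a caller may switch to 'requested'.  Staying within
 * the parent's mask needs no privilege; anything wider is allowed
 * only for a privileged caller.
 */
static bool
may_change_set(sketch_mask_t requested, sketch_mask_t parent,
    bool privileged)
{
	assert(parent != 0);	/* a parent with no CPUs makes no sense */
	if (requested == 0)
		return (false);	/* an empty set leaves nowhere to run */
	if (mask_is_subset(requested, parent))
		return (true);
	return (privileged);
}
```

A job confined to CPUs 0-3 could then rebind its threads to any of those
four without privilege, while asking for CPU 4 as well would fail unless
the caller is privileged.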

-- Brooks
