Date:      Sat, 23 Feb 2008 15:35:07 -0600
From:      Brooks Davis <brooks@freebsd.org>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080223213507.GD39699@lor.one-eyed-alien.net>
In-Reply-To: <20080223111659.K920@desktop>
References:  <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop> <20080222231245.GA28788@lor.one-eyed-alien.net> <20080222134923.M920@desktop> <20080223194047.GB38485@lor.one-eyed-alien.net> <20080223111659.K920@desktop>


On Sat, Feb 23, 2008 at 11:21:33AM -1000, Jeff Roberson wrote:
> 
> On Sat, 23 Feb 2008, Brooks Davis wrote:
> 
>> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>> 
>>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>> 
>>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>> 
>>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>> 
>>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>>   suggesting a hierarchal model with some sort of tagging so you know
>>>>>>   what CPU sets were created in a jail such that you know whether they
>>>>>>   can be changed in a jail.  While I recognize this makes things a lot
>>>>>>   more tricky, I think we should basically be planning more carefully
>>>>>>   with respect to virtualization when we add new interfaces, since
>>>>>>   it's a widely used feature, and the current set of "stragglers"
>>>>>>   unsupported in Jail is growing rather than shrinking.
>>>>> 
>>>>> I have implemented a hierarchical model.  Each thread has a pointer to
>>>>> the cpuset that it's in.  If it makes a local modification via
>>>>> setaffinity() it gets an anonymous cpuset that is a child of the set
>>>>> assigned to the process.  This anonymous set will also be inherited
>>>>> across fork/thread creation.
>>>>> 
>>>>> In this model presently there are nodes marked as root.  To query the
>>>>> 'system' cpus available we walk up from the current node until we find
>>>>> a root.  These are the 'system' set.  A thread may not break out of its
>>>>> system set.  A process may join the root set but it may not modify a
>>>>> root that is a parent.  Jails would create a new root.  A process
>>>>> outside of the jail can modify the set of processors in the jail but a
>>>>> process within the jail/root may not.
>>>>> 
>>>>> The next level down from the root is the assigned set.  The root may be
>>>>> an assigned set or this may be a subset of the root.  Processes may
>>>>> create sets which are parented back to their root and may include any
>>>>> processors within their root.  The mask of the assigned set is returned
>>>>> as 'available' processors.
>>>>> 
>>>>> This gives a 1 to 3 level hierarchy: the root, an assigned set, and an
>>>>> anonymous set.  Any of these but the root may be omitted.  There is no
>>>>> current way for userland to create subsets of assigned sets to permit
>>>>> further nesting.  I'm not sure I see value in it right now and it gives
>>>>> the possibility of unbounded tree depth.
>>>>> 
>>>>> Anonymous sets are immutable as they are shared and changes only apply
>>>>> to the thread/pid in the WHICH argument and not others which have
>>>>> inherited from it.  Anonymous sets have no id and may not be
>>>>> specifically manipulated via a setid.  You must refer to the
>>>>> process/thread.  From the administration point of view they don't exist.
>>>>> 
>>>>> When a set is modified we walk down the children recursively and apply
>>>>> the new mask.  This is done with a global set lock under which all
>>>>> modifications and tree operations are performed.  The td_cpuset pointer
>>>>> is protected under the thread_lock() and may read the set without a
>>>>> lock.  This gives the possibility for certain kinds of races but I
>>>>> believe they are all safe.
>>>>> 
>>>>> Hopefully I explained that well enough for people to follow.  I realize
>>>>> it's a lot of text but it's fairly simple bookkeeping code.  This is all
>>>>> implemented and I'm debugging now.
>>>> 
>>>> One place I'd like to implement CPU affinity is in the Sun Grid Engine
>>>> execution daemon.  I think anonymous sets would not be sufficient there
>>>> because the model allows new tasks to be started on a particular node at
>>>> any time during a parallel job.  I'd have to do some more digging in the
>>>> code to be entirely certain.  I think the fewer limits we place on the
>>>> hierarchy, the better off we'll be unless there are compelling complexity
>>>> reasons to avoid them.
>>> 
>>> With the anonymous set you can bind any thread to any cpu that is visible
>>> to it.  How would this not work?
>> 
>> I'm still trying to wrap my head around the anonymous sets.  Is the idea
>> that once you are in an anonymous set, you can't expand it, or can you
>> expand out as far as the assigned set?  I'd like for parallel jobs to
>> be allocated a set of cpus that they can't change, but still be able
>> to make their own decisions about thread affinity if they desire (for
>> example OpenMPI has some support for this so processes stay put and in
>> theory benefit from positive cache effects).  If that's feasible in
>> this model, I'm happy with it.  I think we should keep in mind that these
>> SGE execution daemons might be sitting inside jails. ;-)
> 
> Ah, when I said the anonymous sets were immutable, that only means that 
> they are copy-on-write.  Because you can't know who shares a copy via fork 
> or thread creation you must make a new set each time you write.
> 
> I made the anonymous sets so that the parent would have a list of all 
> derivative children sets so that modifications to the parent would be 
> reflected in the child.  This also means that the scheduler only has to 
> look at one bitmap to determine the available cpus for a thread.
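
For readers following the thread, the hierarchy described above can be
pictured with a small sketch.  This is not the code under discussion:
struct cpuset_node, node_create(), node_find_root() and node_set_mask()
are hypothetical names, a 64-bit integer stands in for the real CPU mask,
and "apply the new mask" is assumed here to mean clipping each descendant
to the new parent mask.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a CPU mask: bit n set means CPU n is usable. */
typedef uint64_t cpumask_t;

/* Hypothetical tree node; field names are illustrative only. */
struct cpuset_node {
	cpumask_t		 mask;		/* the one bitmap a scheduler would read */
	bool			 is_root;	/* marks a 'system' (e.g. jail) root */
	struct cpuset_node	*parent;
	struct cpuset_node	*first_child;
	struct cpuset_node	*next_sibling;
};

/*
 * Create a set under `parent`, clipped to the parent's mask.  Calling this
 * for every write, instead of editing a shared node, models copy-on-write.
 */
static struct cpuset_node *
node_create(struct cpuset_node *parent, cpumask_t mask, bool is_root)
{
	struct cpuset_node *n = calloc(1, sizeof(*n));

	if (n == NULL)
		return (NULL);
	n->mask = (parent != NULL) ? (mask & parent->mask) : mask;
	n->is_root = is_root || parent == NULL;	/* a parentless node is a root */
	n->parent = parent;
	if (parent != NULL) {
		/* The parent keeps a list of every derived child set. */
		n->next_sibling = parent->first_child;
		parent->first_child = n;
	}
	return (n);
}

/* Walk up from any node to the nearest root, i.e. the 'system' set. */
static struct cpuset_node *
node_find_root(struct cpuset_node *n)
{
	while (!n->is_root)
		n = n->parent;
	return (n);
}

/* Set a node's mask and recursively clip all descendants (assumed policy). */
static void
node_set_mask(struct cpuset_node *n, cpumask_t mask)
{
	struct cpuset_node *c;

	n->mask = mask;
	for (c = n->first_child; c != NULL; c = c->next_sibling)
		node_set_mask(c, c->mask & mask);
}
```

The sketch leaves out locking, reference counting and freeing, and it
does not say what should happen when clipping leaves a child with an
empty mask.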

I think the anonymous sets seem like a good idea.  One solution to my
problem might be to make changing your current set to something that is
not a subset of your parent (or maybe of your current set?) a privileged
operation.
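
As a rough sketch of that rule (hypothetical names, a 64-bit integer
standing in for the real CPU mask, and no claim about how the kernel
would decide who is privileged), the check could look like this.
Rejecting an empty mask is an extra assumption, not something proposed
in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU mask: bit n set means CPU n is in the set. */
typedef uint64_t cpumask_t;

/* True if every CPU in `child` is also present in `parent`. */
static bool
mask_is_subset(cpumask_t child, cpumask_t parent)
{
	return ((child & ~parent) == 0);
}

/*
 * Hypothetical policy check.  `limit` is whichever mask the policy picks
 * as the boundary: the parent set, or the caller's current set.  Staying
 * inside it needs no privilege; going outside it does.
 */
static bool
may_set_mask(cpumask_t requested, cpumask_t limit, bool privileged)
{
	if (requested == 0)
		return (false);	/* assumption: an empty mask is never valid */
	return (privileged || mask_is_subset(requested, limit));
}
```

With the parent set as the limit, a job could move its threads around
freely inside its allocation.  With the current set as the limit, it
could only ever shrink its mask without privilege.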

-- Brooks

