From owner-freebsd-arch@FreeBSD.ORG  Fri Mar 14 18:32:42 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2A79A106566B
	for <freebsd-arch@freebsd.org>; Fri, 14 Mar 2008 18:32:42 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219])
	by mx1.freebsd.org (Postfix) with ESMTP id C63F38FC19
	for <freebsd-arch@freebsd.org>; Fri, 14 Mar 2008 18:32:41 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from server.baldwin.cx (unverified [66.23.211.162]) 
	by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235507532-1834499 
	for multiple; Fri, 14 Mar 2008 14:30:33 -0400
Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2EIW7e9043494;
	Fri, 14 Mar 2008 14:32:16 -0400 (EDT) (envelope-from jhb@freebsd.org)
From: John Baldwin <jhb@freebsd.org>
To: "Joseph Koshy" <joseph.koshy@gmail.com>
Date: Fri, 14 Mar 2008 14:31:53 -0400
User-Agent: KMail/1.9.7
References: <20080313180805.GA83406@dragon.NUXI.org>
	<200803131516.12284.jhb@freebsd.org>
	<84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com>
In-Reply-To: <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200803141431.53846.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by
	milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]);
	Fri, 14 Mar 2008 14:32:16 -0400 (EDT)
X-Virus-Scanned: ClamAV 0.91.2/6232/Fri Mar 14 12:43:44 2008 on
	server.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 
	autolearn=ham version=3.1.3
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx
Cc: freebsd-arch@freebsd.org
Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of
	'mp_ncpus'.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 14 Mar 2008 18:32:42 -0000

On Friday 14 March 2008 01:32:43 am Joseph Koshy wrote:
> On Fri, Mar 14, 2008 at 12:46 AM, John Baldwin <jhb@freebsd.org> wrote:
> > On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote:
> >  > Hi folks,
> >  > Some folks at Juniper have submitted these changes to hwpmc(4).
> >  > I am sending them here for public review.
> >  >
> >  > Their thoughts are:
> >  >     The mp_ncpus refers to the count of the active CPU's.  Where as
> >  >     mp_maxid refers to the count of all the cpus on the SMP.  Using
> >  >     mp_ncpus in the cpu_id range-check of hwpmc module would lead to 
the
> >  >     assumption that all the active CPU's in the SMP are not 
interleaved.
> >  >     But for running on some platforms, the active and inactive cpus 
could
> >  >     be interleaved making hwpmc not work for the cpus whose cpu_id is
> >  >     greater than the active-cpu count.
> 
> jhb>  This is correct, but you need to handle CPUs that are absent.  It 
might be
> jhb>  sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check
> jhb>  CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm 
not
> jhb>  sure that will catch everything as that seems aimed at handling having 
a
> jhb>  non-absent CPU halted (such as disabling HTT on i386).
> 
> That is inline with the feedback (and sample patch to kern_pmc.c) that I
> had sent in to O'Brien.
> 
> But there are other problems with the patch at various levels,
> probably not obvious to someone who is just looking at the kernel
> code.
> 
> First, the relevance.  My understanding is that these changes are for
> a proprietary SMP platform that uses non-mainstream (Tier3 or
> Tier4) CPUs.  It so happens that Juniper decided to numbers CPUs
> 'sparsely' in their kernel variant and that is the motivation for this
> patch.
> 
> IMO, as a policy, code changes for exotic hardware need to be
> maintained by vendors of said exotic hardware and not dumped on
> volunteers.

I would respond with two things:

1) I commited an overhaul of the x86 new-bus code to make it easier 
for "exotic" embedded x86 hardware platforms in use at companies such as 
NetApp to hook into new-bus more cleanly.  By making it easier for companies 
to use FreeBSD we a) make it possible for them to even consider using 
FreeBSD, and b) for companies that use FreeBSD and devote resources 
(employees) to working on FreeBSD, those resources (e.g. grehan@ at NetApp) 
can spend more of their time working on stuff that might be able to given 
back to FreeBSD than coming up with hacks to work around deficiencies in 
FreeBSD.

2) All the sparse CPU stuff actually dates back to 5.0 and was there to 
support Alpha which originally numbered the CPUs using the HWPRB CPU IDs 
which were not sparse at all.  (I think my DS20 has CPUs 6 and 7 or some 
such).  So this was actually done to support a Tier-1 plaform (at the time).

Also, note that the comments in sys/smp.h for CPU_ABSENT() and cpu_setmaxid() 
specifically refer to mp_maxid's purpose and the fact that sparse CPU ID sets 
are expected and should be handled by code in the kernel.

> Second, when I designed the PMCTools API I didn't consider that CPU
> numbers could be 'sparse'.  [They aren't sparsely allocated
> on the i386/amd64---the code I looked at when I was designing
> PmcTools.]  So there are assumptions sprinkled throughout userland
> that that the integers 0..hw.ncpus  can select a valid CPU.  While
> all that can be tracked down and changed, and documentation updated,
> it is still work that I would prefer to defer until there is a chance
> that someone
> in the general public can use it.  I do need to prioritize how I spend my
> volunteer hours.

FreeBSD has been trying to not be quite as i386-centric as it used to be.  If 
you look at other code in the kernel that handles per-cpu data such as UMA 
you will see that it uses mp_maxid and CPU_ABSENT().  There are other places 
in the kernel that are broken though (such as ndis(4)).

> Third, IFF we as a project are going to support 'sparse CPU numbering,
> I would like to see the form that takes before making changes to
> HWPMC and tools.  For example:
> - How will userland and in-kernel modules find out which CPUs are
>   physically present?   Would there be a bitmask on the lines of today's
>   machdep.hlt_cpus that we could query?  Could we make the
>   'all_cpus' bitmask visible to userland?  What happens when we
>   start supporting systems with more than 32 processors?

Yes, we can certainly export more stuff to userland.  The all_cpus mask would 
be good as would a MI online_cpus mask, though at this point they would be 
cpusets to handle > 32 rather than cpumask_t.  Note that machdep.hlt_cpus is 
x86-only and would be superseded by a MI online_cpus mask.

> - Will sysctl hw.ncpus represent the count of present CPUs or will it
>   represent the maximum CPU id?

hw.ncpus is always mp_ncpus
kern.smp.cpus is also mp_ncpus
kern.smp.maxcpus is MAX_CPUS.

Userland can just iterate from 0 to kern.smp.maxcpus while handling absent 
CPUs.  (For example, the kern.cp_time[] sysctl just writes out all 0's for 
absent CPUs so that is how userland can determine an absent CPU in that 
case.)

> - How will userland  distinguish between absent CPUs those that
>   could be temporarily administratively disabled?

See above re: all_cpus and online_cpus cpu sets.

> - Are we going to support 'transient' CPUs [that come and go]?  Why
>   would we want sparse CPU numbering otherwise?

Yes.

> Nit: 'mp_maxid' appears to be an index, not a count as claimed above.

Correct, and documented as such in sys/smp.h.

> If support for sparse CPU numbering is something useful, I feel the
> correct sequence should be to discuss it here, add sparse CPU
> numbering to the base i386/amd64 kernels (say) first and then
> propagate the feature to auxiliary code like HWPMC and userland.
> 
> Changing HWPMC and its userland before the base kernel itself
> changes does not seem to be the right thing to do.

While the userland interface is somewhat lacking, all of the in-kernel 
infrastructure has been in place for at least the past 4 years, and there is 
no excuse for any in-kernel code not properly handling sparse CPU IDs.

-- 
John Baldwin