From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 18:32:42 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A79A106566B for ; Fri, 14 Mar 2008 18:32:42 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id C63F38FC19 for ; Fri, 14 Mar 2008 18:32:41 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235507532-1834499 for multiple; Fri, 14 Mar 2008 14:30:33 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2EIW7e9043494; Fri, 14 Mar 2008 14:32:16 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Joseph Koshy" Date: Fri, 14 Mar 2008 14:31:53 -0400 User-Agent: KMail/1.9.7 References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> In-Reply-To: <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803141431.53846.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 14 Mar 2008 14:32:16 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6232/Fri Mar 14 12:43:44 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 18:32:42 -0000 On Friday 14 March 2008 01:32:43 am Joseph Koshy wrote: > On Fri, Mar 14, 2008 at 12:46 AM, John Baldwin wrote: > > On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote: > > > Hi folks, > > > Some folks at Juniper have submitted these changes to hwpmc(4). > > > I am sending them here for public review. > > > > > > Their thoughts are: > > > The mp_ncpus refers to the count of the active CPU's. Where as > > > mp_maxid refers to the count of all the cpus on the SMP. Using > > > mp_ncpus in the cpu_id range-check of hwpmc module would lead to the > > > assumption that all the active CPU's in the SMP are not interleaved. > > > But for running on some platforms, the active and inactive cpus could > > > be interleaved making hwpmc not work for the cpus whose cpu_id is > > > greater than the active-cpu count. > > jhb> This is correct, but you need to handle CPUs that are absent. It might be > jhb> sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check > jhb> CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm not > jhb> sure that will catch everything as that seems aimed at handling having a > jhb> non-absent CPU halted (such as disabling HTT on i386). > > That is inline with the feedback (and sample patch to kern_pmc.c) that I > had sent in to O'Brien. > > But there are other problems with the patch at various levels, > probably not obvious to someone who is just looking at the kernel > code. > > First, the relevance. My understanding is that these changes are for > a proprietary SMP platform that uses non-mainstream (Tier3 or > Tier4) CPUs. It so happens that Juniper decided to numbers CPUs > 'sparsely' in their kernel variant and that is the motivation for this > patch. > > IMO, as a policy, code changes for exotic hardware need to be > maintained by vendors of said exotic hardware and not dumped on > volunteers. I would respond with two things: 1) I commited an overhaul of the x86 new-bus code to make it easier for "exotic" embedded x86 hardware platforms in use at companies such as NetApp to hook into new-bus more cleanly. By making it easier for companies to use FreeBSD we a) make it possible for them to even consider using FreeBSD, and b) for companies that use FreeBSD and devote resources (employees) to working on FreeBSD, those resources (e.g. grehan@ at NetApp) can spend more of their time working on stuff that might be able to given back to FreeBSD than coming up with hacks to work around deficiencies in FreeBSD. 2) All the sparse CPU stuff actually dates back to 5.0 and was there to support Alpha which originally numbered the CPUs using the HWPRB CPU IDs which were not sparse at all. (I think my DS20 has CPUs 6 and 7 or some such). So this was actually done to support a Tier-1 plaform (at the time). Also, note that the comments in sys/smp.h for CPU_ABSENT() and cpu_setmaxid() specifically refer to mp_maxid's purpose and the fact that sparse CPU ID sets are expected and should be handled by code in the kernel. > Second, when I designed the PMCTools API I didn't consider that CPU > numbers could be 'sparse'. [They aren't sparsely allocated > on the i386/amd64---the code I looked at when I was designing > PmcTools.] So there are assumptions sprinkled throughout userland > that that the integers 0..hw.ncpus can select a valid CPU. While > all that can be tracked down and changed, and documentation updated, > it is still work that I would prefer to defer until there is a chance > that someone > in the general public can use it. I do need to prioritize how I spend my > volunteer hours. FreeBSD has been trying to not be quite as i386-centric as it used to be. If you look at other code in the kernel that handles per-cpu data such as UMA you will see that it uses mp_maxid and CPU_ABSENT(). There are other places in the kernel that are broken though (such as ndis(4)). > Third, IFF we as a project are going to support 'sparse CPU numbering, > I would like to see the form that takes before making changes to > HWPMC and tools. For example: > - How will userland and in-kernel modules find out which CPUs are > physically present? Would there be a bitmask on the lines of today's > machdep.hlt_cpus that we could query? Could we make the > 'all_cpus' bitmask visible to userland? What happens when we > start supporting systems with more than 32 processors? Yes, we can certainly export more stuff to userland. The all_cpus mask would be good as would a MI online_cpus mask, though at this point they would be cpusets to handle > 32 rather than cpumask_t. Note that machdep.hlt_cpus is x86-only and would be superseded by a MI online_cpus mask. > - Will sysctl hw.ncpus represent the count of present CPUs or will it > represent the maximum CPU id? hw.ncpus is always mp_ncpus kern.smp.cpus is also mp_ncpus kern.smp.maxcpus is MAX_CPUS. Userland can just iterate from 0 to kern.smp.maxcpus while handling absent CPUs. (For example, the kern.cp_time[] sysctl just writes out all 0's for absent CPUs so that is how userland can determine an absent CPU in that case.) > - How will userland distinguish between absent CPUs those that > could be temporarily administratively disabled? See above re: all_cpus and online_cpus cpu sets. > - Are we going to support 'transient' CPUs [that come and go]? Why > would we want sparse CPU numbering otherwise? Yes. > Nit: 'mp_maxid' appears to be an index, not a count as claimed above. Correct, and documented as such in sys/smp.h. > If support for sparse CPU numbering is something useful, I feel the > correct sequence should be to discuss it here, add sparse CPU > numbering to the base i386/amd64 kernels (say) first and then > propagate the feature to auxiliary code like HWPMC and userland. > > Changing HWPMC and its userland before the base kernel itself > changes does not seem to be the right thing to do. While the userland interface is somewhat lacking, all of the in-kernel infrastructure has been in place for at least the past 4 years, and there is no excuse for any in-kernel code not properly handling sparse CPU IDs. -- John Baldwin