Date: Sun, 8 Oct 2006 18:23:23 -0700 From: Paul Allen <pallen@ugcs.caltech.edu> To: Kip Macy <kmacy@fsmware.com>, Attilio Rao <attilio@freebsd.org>, David Xu <davidxu@freebsd.org>, freebsd-current@freebsd.org, Ivan Voras <ivoras@fer.hr> Subject: Re: [PATCH] MAXCPU alterable in kernel config - needs testers Message-ID: <20061009012323.GD3467@riyal.ugcs.caltech.edu> In-Reply-To: <20061009002200.GM793@funkthat.com> References: <2fd864e0610080423q7ba6bdeal656a223e662a5d@mail.gmail.com> <20061008135031.G83537@demos.bsdclusters.com> <4529667D.8070108@fer.hr> <200610090634.31297.davidxu@freebsd.org> <20061008225150.GK793@funkthat.com> <3bbf2fe10610081555r67265368sf7f12edbf35bff0d@mail.gmail.com> <20061008155817.G29803@demos.bsdclusters.com> <20061009002200.GM793@funkthat.com>
next in thread | previous in thread | raw e-mail | index | archive | help
>From John-Mark Gurney <gurney_j@resnet.uoregon.edu>, Sun, Oct 08, 2006 at 05:22:00PM -0700: > Wouldn't having a single run queue lock still serialize the cpu's when > getting a thread to run? Don't we really need a per cpu run queue, and > then have a scheduler that puts threads on the cpu's run queues? Just so. A fine-grain mutex model is much like pipelining a processor. You break a task up into many small tasks each with its own lock and then run the tasks in parallel. There is a problem though. 1) How many pipeline stages you want depends on the number of underlying CPUs. Too few stages means idle processors. Too many stages relative to the available resources means wasted time acquiring many locks. 2) The work has been relatively matched. One slow stage creates bottlenecks that everyone else eventually feels. 3) Cache coherency: handing off a datum between CPUs wastes cache resources. Allowing one CPU to process a given datum through a life-time of several different lock stages (tasks) means sharing the data covered by those task's locks. This sort of design problem is hard to solve and harder for several different developers some of them working in their spare-time to coordinate properly. It's even harder to do when the end application is unknown (i.e., in a general purpose OS). BUT, per-cpu algorithms are not necessarily better. you can waste cache resources by data replication. you need to provision load-balancing/migration. I had been hoping that mdillon @ dragonfly would provide good experimental evidence on this point. unfortunately, in my estimation he s distracted by some heavy goals (namely his single-image clustering). The consequence of this is 1) it will take a long-time before enough of the dragonfly kernel is out from under giant and 2) relative performance comparisons will be dubious because it will be unclear what cost he is paying for some of his single-image clustering support. Its a shame because the community would greatly benefit from clear experimental evidence contrasting some of the fine-grained locking approaches with a more per-cpu oriented design. AFAIK, the linux kernel has generally favored the per-cpu approach. In that respect, relative underperformance of FreeBSD vs. Linux is an indicator that per-cpu approaches deserve more weight in the FreeBSD world. Paul
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20061009012323.GD3467>