Date:      Sun, 8 Oct 2006 18:23:23 -0700
From:      Paul Allen <pallen@ugcs.caltech.edu>
To:        Kip Macy <kmacy@fsmware.com>, Attilio Rao <attilio@freebsd.org>, David Xu <davidxu@freebsd.org>, freebsd-current@freebsd.org, Ivan Voras <ivoras@fer.hr>
Subject:   Re: [PATCH] MAXCPU alterable in kernel config - needs testers
Message-ID:  <20061009012323.GD3467@riyal.ugcs.caltech.edu>
In-Reply-To: <20061009002200.GM793@funkthat.com>
References:  <2fd864e0610080423q7ba6bdeal656a223e662a5d@mail.gmail.com> <20061008135031.G83537@demos.bsdclusters.com> <4529667D.8070108@fer.hr> <200610090634.31297.davidxu@freebsd.org> <20061008225150.GK793@funkthat.com> <3bbf2fe10610081555r67265368sf7f12edbf35bff0d@mail.gmail.com> <20061008155817.G29803@demos.bsdclusters.com> <20061009002200.GM793@funkthat.com>

From John-Mark Gurney <gurney_j@resnet.uoregon.edu>, Sun, Oct 08, 2006 at 05:22:00PM -0700:
> Wouldn't having a single run queue lock still serialize the cpu's when
> getting a thread to run?  Don't we really need a per cpu run queue, and
> then have a scheduler that puts threads on the cpu's run queues?
Just so.  A fine-grained mutex model is much like pipelining a processor:
you break a task up into many small tasks, each with its own lock, and then
run the tasks in parallel.  There are problems, though.

1) How many pipeline stages you want depends on the number of underlying
   CPUs.  Too few stages means idle processors.  Too many stages relative
   to the available resources means time wasted acquiring many locks.
2) The work must be relatively well matched across stages.  One slow stage
   creates bottlenecks that everyone else eventually feels.
3) Cache coherency: handing off a datum between CPUs wastes cache resources.
   Allowing one CPU to process a given datum through a lifetime of several
   different lock stages (tasks) means sharing the data covered by those
   tasks' locks.

This sort of design problem is hard to solve, and harder still for several
different developers, some working in their spare time, to coordinate properly.

It's even harder to do when the end application is unknown (i.e., in a general
purpose OS).

BUT,

Per-cpu algorithms are not necessarily better: you can waste cache resources
through data replication, and you need to provision for load balancing and
migration.

I had been hoping that mdillon @ dragonfly would provide good experimental
evidence on this point.  Unfortunately, in my estimation he is distracted
by some heavy goals (namely his single-image clustering).  The consequence
of this is 1) it will take a long time before enough of the DragonFly kernel
is out from under Giant, and 2) relative performance comparisons will be
dubious because it will be unclear what cost he is paying for some of his
single-image clustering support.

It's a shame, because the community would greatly benefit from clear
experimental evidence contrasting some of the fine-grained locking approaches
with a more per-cpu oriented design.

AFAIK, the Linux kernel has generally favored the per-cpu approach.  In
that respect, the relative underperformance of FreeBSD vs. Linux is an
indicator that per-cpu approaches deserve more weight in the FreeBSD world.

                                          Paul
