From owner-freebsd-hackers Tue May 21 11:09:09 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id LAA14414 for hackers-outgoing; Tue, 21 May 1996 11:09:09 -0700 (PDT)
Received: from caipfs.rutgers.edu (root@caipfs.rutgers.edu [128.6.91.100]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id LAA14409; Tue, 21 May 1996 11:08:59 -0700 (PDT)
Received: from huahaga.rutgers.edu (huahaga.rutgers.edu [128.6.155.53]) by caipfs.rutgers.edu (8.6.9+bestmx+oldruq+newsunq+grosshack/8.6.9) with ESMTP id OAA05548; Tue, 21 May 1996 14:08:42 -0400
Received: (davem@localhost) by huahaga.rutgers.edu (8.6.9+bestmx+oldruq+newsunq+grosshack/8.6.9) id OAA12006; Tue, 21 May 1996 14:08:41 -0400
Date: Tue, 21 May 1996 14:08:41 -0400
Message-Id: <199605211808.OAA12006@huahaga.rutgers.edu>
From: "David S. Miller" <davem@caip.rutgers.edu>
To: terry@lambert.org
CC: terry@lambert.org, jehamby@lightside.com, jkh@time.cdrom.com, current@freebsd.org, hackers@freebsd.org
In-reply-to: <199605211722.KAA01411@phaeton.artisoft.com> (message from Terry Lambert on Tue, 21 May 1996 10:22:17 -0700 (MST))
Subject: Re: Congrats on CURRENT 5/1 SNAP...
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

   From: Terry Lambert <terry@lambert.org>
   Date: Tue, 21 May 1996 10:22:17 -0700 (MST)

   > Actually SunOS does do lwp scheduling where it checks for AST's
   > etc. although I don't know how relevant that is to whats being
   > discussed.

   Yes.  It uses aioread/aiowrite/aiowait/aiocancel; these are closer
   to an event flag cluster than AST's.

   Agreed.  It's possible to get a modified NUMA for an SMP
   environment using per processor page allocation pools.

I agree.  I once talked with Larry McVoy, and we discussed how we
thought that a machine with N cpus and N high speed network interfaces
(read as: SuperHippi or faster) should be able to keep the pipes of
all interfaces full all the time, with no contention whatsoever.  The
technique we came up with is similar to what you suggest.  Basically,
you pre-allocate a pool of sk_buffs/mbufs per interface; no locking is
necessary to allocate or free buffers from that pool as packets run
through the stack on their way in and out.  If an interface starves
(and this is precisely the thing we want to keep from happening), it
grabs the global allocation lock and acquires what it needs; since it
is eating cycles anyway, it also looks at the pools of all the other
interfaces (reading their state for the purposes of this heuristic
requires no locking) and feeds any queues which look like they could
approach starvation.
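To make that concrete, here is a rough userland sketch of the idea.
Every name in it, the watermark numbers, and the pthread mutex
standing in for the global allocation lock are invented for this
message; it is not code from any real stack, only an illustration of
the lock-free per-interface fast path with a global-lock slow path:

#include <pthread.h>
#include <stddef.h>

#define REFILL_BATCH 64  /* buffers pulled from the global pool on starvation */
#define LOW_WATER    16  /* a pool below this looks like it may starve soon   */

struct pkt_buf {
        struct pkt_buf *next;
        char data[2048];
};

struct iface_pool {
        struct pkt_buf *free;   /* touched only by its own interface: no lock */
        int nfree;              /* peers may read this unlocked, as a hint    */
};

/* The shared pool of last resort, protected by the one global lock. */
static struct {
        pthread_mutex_t lock;
        struct pkt_buf *free;
        int nfree;
} global_pool = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/* Fast path: pure per-interface list operations, no locking at all. */
static struct pkt_buf *pool_get(struct iface_pool *p)
{
        struct pkt_buf *b = p->free;

        if (b != NULL) {
                p->free = b->next;
                p->nfree--;
        }
        return b;
}

static void pool_put(struct iface_pool *p, struct pkt_buf *b)
{
        b->next = p->free;
        p->free = b;
        p->nfree++;
}

/* Move one buffer from the (already locked) global pool into 'dst'. */
static int global_to_pool(struct iface_pool *dst)
{
        struct pkt_buf *b = global_pool.free;

        if (b == NULL)
                return 0;
        global_pool.free = b->next;
        global_pool.nfree--;
        pool_put(dst, b);
        return 1;
}

/*
 * Slow path, taken only when this interface's pool has run dry: grab
 * the global allocation lock, refill ourselves, and, since we are
 * eating cycles anyway, peek (unlocked, possibly stale) at the peer
 * pools and feed any that look like they are approaching starvation.
 */
static struct pkt_buf *pool_get_slow(struct iface_pool *self,
                                     struct iface_pool *peers, int npeers)
{
        int i, n;

        pthread_mutex_lock(&global_pool.lock);
        for (n = 0; n < REFILL_BATCH && global_to_pool(self); n++)
                ;
        for (i = 0; i < npeers; i++) {
                if (&peers[i] == self)
                        continue;
                /* A real stack would hand these off through a per-pool
                 * inbox instead of touching the peer's list directly. */
                while (peers[i].nfree < LOW_WATER && global_to_pool(&peers[i]))
                        ;
        }
        pthread_mutex_unlock(&global_pool.lock);

        return pool_get(self);
}

A driver's receive path would call pool_get() and only drop into
pool_get_slow() when that comes back NULL, so in the steady state no
interface ever touches the global lock.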
   I *don't* think you'd want a non-symmetric implementation.

This is what I get for not explaining how the implementation I
mentioned functions.  It is symmetric.  Here is the basic idea:

1) Truly scaling in all areas to more than 4 cpus is a difficult and
   complex task.  However, medium graining to at most 4 cpus is "not
   too bad".  This 4 cpu grouping is the cluster, and all activities
   running on that cluster run in a truly symmetric fashion.  A
   somewhat stripped down instance of the kernel runs on each of these
   cpu clusters.  The cluster has the capability to run jobs and to
   communicate in a fairly simplistic manner with other clusters.  Of
   note is the ability to offload jobs, or groups of jobs, to another
   cluster if necessary and/or desirable.

2) Load state is maintained on a per-cluster basis.

3) Operations like fork() scan the clusters to find the cluster with
   the lowest aggregate load.  This cluster is where the job is
   "communicated to".

4) As jobs run and the per cluster loads go up and down, the clusters
   "push out" or "pull in" jobs so that they operate more efficiently
   and contention is decreased.  (A toy sketch of this placement and
   balancing appears further below.)

You get as symmetric as is reasonable, but no further.

The design can be modified to either:

1) Have one master "cluster maintainer" kernel thing which does all
   the work of periodically scanning the loads of all the clusters and
   moving things around.  This method offloads that processing from
   the clusters, and they just run what they have; the only question
   left is who does the arbitration during things like fork() etc.

2) Just put the cluster load processing in the per-cluster kernel.

In this particular implementation the hardware is a shared memory
machine with all N processors accessing that same memory.  It doesn't
stop here, however; this is just the example that shows a problem
which can be solved by the methodology described.  I personally
believe that the scheme described above can scale to a shared memory
machine with more processors than you can count in an unsigned char,
without any problems whatsoever.

If your design is clean and done The Right Way(tm), you can take this
architecture to other scenarios and setups.  Your "clusters" could be
uniprocessor workstations on a high speed FDDI ring: the FDDI ring
acts as the cluster communications mechanism, the workstations are
what is clustered for the jobs, and within themselves they act and
react as before.  The "cluster" could also be a bunch of UNIX PC boxes
on an everyday ethernet.

The final example serves not only as a nice proof of concept, it also
serves as a method by which "others can play".  By this I mean that
someone with a bunch of UNIX PC's and cheap ethernet to play with can
actively work on such a project, and is not left in the dark for lack
of a "645 processor SMP machine" ;-)

Other directions are moving more than just the jobs: move socket
objects around, move vnodes around, move session groups around, expand
the pid space.  Anything can be done at this point.  See?  A clean
solution to the SMP scaling problem is so clean that it creates a
direct pathway to a possible efficient distributed system
implementation.

   It should be noted that multithreading UFS in SVR4 (UnixWare)
   resulted in a 160% performance improvement -- even after the
   performance loss for using mutexes for locking was subtracted out.

   > Clustering is the answer and can scale to more CPU's than you can
   > count in an unsigned char.  ;-)

So can the scheme described above.

Here is what I am driving at.  Ok, break up the page pools so that
some locking is avoided.  Well, now the filesystem is the bottleneck,
so come up with a similar scheme of some sort so you don't grab so
many locks, and now this is faster.  Well, now FOO is the bottleneck,
let's hack that as well....  You see, there will always be something
else; you cannot eliminate all the contention problems (ever think
about what would be involved in implementing a low contention SMP
locking scheme for the entire Berkeley networking stack?).  You've dug
yourself into a hole and are now trying to dig yourself out; you may
get partway out, but inevitably you stay in the hole.  With clustering
we are eliminating the hole altogether.

There are many people who have been thinking constantly about SMP for
many years, and what I present is basically what they have agreed is
the most beneficial direction to move in.  (Careful readers will
notice that this is also a move away from kernel bloat...)
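To make points 3 and 4 above a little more concrete, here is a toy
userland sketch.  The structures, the crude load metric, and every
name in it are invented for this message and come from no real kernel;
it only illustrates lowest-load placement at fork() time plus a
trivial push/pull balancing pass:

#include <stdio.h>

#define NCLUSTERS 8     /* e.g. 32 cpus grouped 4 per cluster         */
#define IMBALANCE 2     /* only migrate when loads differ by more     */

struct cluster {
        int id;
        int njobs;      /* crude aggregate load: runnable jobs here   */
};

static struct cluster clusters[NCLUSTERS];

/* fork()-time placement: scan the clusters and hand the new job to
 * the one with the lowest aggregate load.  The scan is cheap because
 * load is kept per cluster, not per cpu or per job. */
static struct cluster *pick_cluster(void)
{
        struct cluster *best = &clusters[0];
        int i;

        for (i = 1; i < NCLUSTERS; i++)
                if (clusters[i].njobs < best->njobs)
                        best = &clusters[i];
        return best;
}

static void spawn_job(void)
{
        struct cluster *c = pick_cluster();

        c->njobs++;     /* the job is "communicated to" this cluster */
        printf("job placed on cluster %d (load now %d)\n", c->id, c->njobs);
}

/* Periodic "push out" / "pull in" pass: the most and least loaded
 * clusters trade one job when the gap gets big enough. */
static void balance_pass(void)
{
        struct cluster *hi = &clusters[0], *lo = &clusters[0];
        int i;

        for (i = 1; i < NCLUSTERS; i++) {
                if (clusters[i].njobs > hi->njobs)
                        hi = &clusters[i];
                if (clusters[i].njobs < lo->njobs)
                        lo = &clusters[i];
        }
        if (hi->njobs - lo->njobs > IMBALANCE) {
                hi->njobs--;
                lo->njobs++;
                printf("migrated one job: cluster %d -> cluster %d\n",
                       hi->id, lo->id);
        }
}

int main(void)
{
        int i;

        for (i = 0; i < NCLUSTERS; i++)
                clusters[i].id = i;
        for (i = 0; i < 20; i++)
                spawn_job();
        balance_pass();
        return 0;
}

Whether balance_pass() runs in one master maintainer or inside each
per-cluster kernel is exactly the design choice described above.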
Why get performance with added complexity and maintenance hassle when
you can get more performance with less complexity ;-)

I'm sure many people are interested in this topic, and work on it has
started and is continuing in the free operating system world.  I
invite those interested to join in the discussions taking place right
now in the MultiComputer project which has recently begun in the Linux
camp (no, this is not a shameless Linux plug; I would be truly
thrilled to see a FreeBSD project of the same vein and a high level of
communication between both groups).  It is still early for the
project, and it is at this point that I think the most interesting
discussion occurs, because we can talk about architecture instead of
how we fucked up and need to fix things.

The mailing list is linux-mc@samba.anu.edu.au and it is a ListServ
based mailing list.

Another interesting place for similar material is the AP+ CAP project
home page:

        http://cap.anu.edu.au/cap/projects/linux/

(The CAP AP+ project is work on a Sparc based parallel computer made
by Fujitsu running SparcLinux; they have real parallel programs
running and have done experiments with things like Gang Scheduling,
handling asynchronous faults from within interrupt context, etc.)

Later,
David S. Miller
davem@caip.rutgers.edu