From owner-freebsd-hackers Tue May 21 11:09:09 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id LAA14414 for hackers-outgoing; Tue, 21 May 1996 11:09:09 -0700 (PDT)
Received: from caipfs.rutgers.edu (root@caipfs.rutgers.edu [128.6.91.100]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id LAA14409; Tue, 21 May 1996 11:08:59 -0700 (PDT)
Received: from huahaga.rutgers.edu (huahaga.rutgers.edu [128.6.155.53]) by caipfs.rutgers.edu (8.6.9+bestmx+oldruq+newsunq+grosshack/8.6.9) with ESMTP id OAA05548; Tue, 21 May 1996 14:08:42 -0400
Received: (davem@localhost) by huahaga.rutgers.edu (8.6.9+bestmx+oldruq+newsunq+grosshack/8.6.9) id OAA12006; Tue, 21 May 1996 14:08:41 -0400
Date: Tue, 21 May 1996 14:08:41 -0400
Message-Id: <199605211808.OAA12006@huahaga.rutgers.edu>
From: "David S. Miller" <davem@caip.rutgers.edu>
To: terry@lambert.org
CC: terry@lambert.org, jehamby@lightside.com, jkh@time.cdrom.com, current@freebsd.org, hackers@freebsd.org
In-reply-to: <199605211722.KAA01411@phaeton.artisoft.com> (message from Terry Lambert on Tue, 21 May 1996 10:22:17 -0700 (MST))
Subject: Re: Congrats on CURRENT 5/1 SNAP...
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

   From: Terry Lambert <terry@lambert.org>
   Date: Tue, 21 May 1996 10:22:17 -0700 (MST)

   > Actually SunOS does do lwp scheduling where it checks for AST's
   > etc. although I don't know how relevant that is to whats being
   > discussed.

   Yes.  It uses aioread/aiowrite/aiowait/aiocancel; these are closer
   to an event flag cluster than AST's.

   Agreed.  It's possible to get a modified NUMA for an SMP
   environment using per processor page allocation pools.

I agree.  I once talked with Larry McVoy, and we discussed how we
thought that a machine with N cpus and N high speed network interfaces
(read as: SuperHippi or faster) should be able to keep the pipes of
all interfaces full all the time, with no contention whatsoever.  The
technique we came up with is similar to what you suggest.  Basically,
you pre-allocate a pool of sk_buffs/mbufs per interface; no locking is
necessary to allocate or free buffers from that pool as packets run
through the stack on their way in and out.  If an interface starves
(and this is precisely the thing we want to keep from happening), it
grabs the global allocation lock and acquires what it needs; since it
is eating cycles anyway, it also looks at the pools of all the other
interfaces (reading their state for the purposes of this heuristic
requires no locking) and feeds any queues which look like they could
approach starvation.
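To make that concrete, here is a rough userland sketch of the idea.
Every name in it, the watermark numbers, and the pthread mutex
standing in for the global allocation lock are invented for this
message; it is not code from any real stack, only an illustration of
the lock-free per-interface fast path with a global-lock slow path:

#include <pthread.h>
#include <stddef.h>

#define REFILL_BATCH 64  /* buffers pulled from the global pool on starvation */
#define LOW_WATER    16  /* a pool below this looks like it may starve soon   */

struct pkt_buf {
        struct pkt_buf *next;
        char data[2048];
};

struct iface_pool {
        struct pkt_buf *free;   /* touched only by its own interface: no lock */
        int nfree;              /* peers may read this unlocked, as a hint    */
};

/* The shared pool of last resort, protected by the one global lock. */
static struct {
        pthread_mutex_t lock;
        struct pkt_buf *free;
        int nfree;
} global_pool = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/* Fast path: pure per-interface list operations, no locking at all. */
static struct pkt_buf *pool_get(struct iface_pool *p)
{
        struct pkt_buf *b = p->free;

        if (b != NULL) {
                p->free = b->next;
                p->nfree--;
        }
        return b;
}

static void pool_put(struct iface_pool *p, struct pkt_buf *b)
{
        b->next = p->free;
        p->free = b;
        p->nfree++;
}

/* Move one buffer from the (already locked) global pool into 'dst'. */
static int global_to_pool(struct iface_pool *dst)
{
        struct pkt_buf *b = global_pool.free;

        if (b == NULL)
                return 0;
        global_pool.free = b->next;
        global_pool.nfree--;
        pool_put(dst, b);
        return 1;
}

/*
 * Slow path, taken only when this interface's pool has run dry: grab
 * the global allocation lock, refill ourselves, and, since we are
 * eating cycles anyway, peek (unlocked, possibly stale) at the peer
 * pools and feed any that look like they are approaching starvation.
 */
static struct pkt_buf *pool_get_slow(struct iface_pool *self,
                                     struct iface_pool *peers, int npeers)
{
        int i, n;

        pthread_mutex_lock(&global_pool.lock);
        for (n = 0; n < REFILL_BATCH && global_to_pool(self); n++)
                ;
        for (i = 0; i < npeers; i++) {
                if (&peers[i] == self)
                        continue;
                /* A real stack would hand these off through a per-pool
                 * inbox instead of touching the peer's list directly. */
                while (peers[i].nfree < LOW_WATER && global_to_pool(&peers[i]))
                        ;
        }
        pthread_mutex_unlock(&global_pool.lock);

        return pool_get(self);
}

A driver's receive path would call pool_get() and only drop into
pool_get_slow() when that comes back NULL, so in the steady state no
interface ever touches the global lock.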
   I *don't* think you'd want a non-symmetric implementation.

This is what I get for not explaining how the implementation I
mentioned functions.  It is symmetric.  Here is the basic idea:

1) Truly scaling in all areas to more than 4 cpus is a difficult and
   complex task.  However, medium graining to at most 4 cpus is "not
   too bad".  This 4 cpu grouping is the cluster, and all activities
   running on that cluster run in a truly symmetric fashion.  A
   somewhat stripped down instance of the kernel runs on each of these
   cpu clusters.  The cluster has the capability to run jobs and to
   communicate in a fairly simplistic manner with other clusters.  Of
   note is the ability to offload jobs, or groups of jobs, to another
   cluster if necessary and/or desirable.

2) Load state is maintained on a per-cluster basis.

3) Operations like fork() scan the clusters to find the cluster with
   the lowest aggregate load.  This cluster is where the job is
   "communicated to".

4) As jobs run and the per cluster loads go up and down, the clusters
   "push out" or "pull in" jobs so that they operate more efficiently
   and contention is decreased.  (A toy sketch of this placement and
   balancing appears further below.)

You get as symmetric as is reasonable, but no further.

The design can be modified to either:

1) Have one master "cluster maintainer" kernel thing which does all
   the work of periodically scanning the loads of all the clusters and
   moving things around.  This method offloads that processing from
   the clusters, and they just run what they have; the only question
   left is who does the arbitration during things like fork() etc.

2) Just put the cluster load processing in the per-cluster kernel.

In this particular implementation the hardware is a shared memory
machine with all N processors accessing that same memory.  It doesn't
stop here, however; this is just the example that shows a problem
which can be solved by the methodology described.  I personally
believe that the scheme described above can scale to a shared memory
machine with more processors than you can count in an unsigned char,
without any problems whatsoever.

If your design is clean and done The Right Way(tm), you can take this
architecture to other scenarios and setups.  Your "clusters" could be
uniprocessor workstations on a high speed FDDI ring: the FDDI ring
acts as the cluster communications mechanism, the workstations are
what is clustered for the jobs, and within themselves they act and
react as before.  The "cluster" could also be a bunch of UNIX PC boxes
on an everyday ethernet.

The final example serves not only as a nice proof of concept, it also
serves as a method by which "others can play".  By this I mean that
someone with a bunch of UNIX PC's and cheap ethernet to play with can
actively work on such a project, and is not left in the dark for lack
of a "645 processor SMP machine" ;-)

Other directions are moving more than just the jobs: move socket
objects around, move vnodes around, move session groups around, expand
the pid space.  Anything can be done at this point.  See?  A clean
solution to the SMP scaling problem is so clean that it creates a
direct pathway to a possible efficient distributed system
implementation.

   It should be noted that multithreading UFS in SVR4 (UnixWare)
   resulted in a 160% performance improvement -- even after the
   performance loss for using mutexes for locking was subtracted out.

   > Clustering is the answer and can scale to more CPU's than you can
   > count in an unsigned char.  ;-)

So can the scheme described above.

Here is what I am driving at.  Ok, break up the page pools so that
some locking is avoided.  Well, now the filesystem is the bottleneck,
so come up with a similar scheme of some sort so you don't grab so
many locks, and now this is faster.  Well, now FOO is the bottleneck,
let's hack that as well....  You see, there will always be something
else; you cannot eliminate all the contention problems (ever think
about what would be involved in implementing a low contention SMP
locking scheme for the entire Berkeley networking stack?).  You've dug
yourself into a hole and are now trying to dig yourself out; you may
get partway out, but inevitably you stay in the hole.  With clustering
we are eliminating the hole altogether.

There are many people who have been thinking constantly about SMP for
many years, and what I present is basically what they have agreed is
the most beneficial direction to move in.  (Careful readers will
notice that this is also a move away from kernel bloat...)
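To make points 3 and 4 above a little more concrete, here is a toy
userland sketch.  The structures, the crude load metric, and every
name in it are invented for this message and come from no real kernel;
it only illustrates lowest-load placement at fork() time plus a
trivial push/pull balancing pass:

#include <stdio.h>

#define NCLUSTERS 8     /* e.g. 32 cpus grouped 4 per cluster         */
#define IMBALANCE 2     /* only migrate when loads differ by more     */

struct cluster {
        int id;
        int njobs;      /* crude aggregate load: runnable jobs here   */
};

static struct cluster clusters[NCLUSTERS];

/* fork()-time placement: scan the clusters and hand the new job to
 * the one with the lowest aggregate load.  The scan is cheap because
 * load is kept per cluster, not per cpu or per job. */
static struct cluster *pick_cluster(void)
{
        struct cluster *best = &clusters[0];
        int i;

        for (i = 1; i < NCLUSTERS; i++)
                if (clusters[i].njobs < best->njobs)
                        best = &clusters[i];
        return best;
}

static void spawn_job(void)
{
        struct cluster *c = pick_cluster();

        c->njobs++;     /* the job is "communicated to" this cluster */
        printf("job placed on cluster %d (load now %d)\n", c->id, c->njobs);
}

/* Periodic "push out" / "pull in" pass: the most and least loaded
 * clusters trade one job when the gap gets big enough. */
static void balance_pass(void)
{
        struct cluster *hi = &clusters[0], *lo = &clusters[0];
        int i;

        for (i = 1; i < NCLUSTERS; i++) {
                if (clusters[i].njobs > hi->njobs)
                        hi = &clusters[i];
                if (clusters[i].njobs < lo->njobs)
                        lo = &clusters[i];
        }
        if (hi->njobs - lo->njobs > IMBALANCE) {
                hi->njobs--;
                lo->njobs++;
                printf("migrated one job: cluster %d -> cluster %d\n",
                       hi->id, lo->id);
        }
}

int main(void)
{
        int i;

        for (i = 0; i < NCLUSTERS; i++)
                clusters[i].id = i;
        for (i = 0; i < 20; i++)
                spawn_job();
        balance_pass();
        return 0;
}

Whether balance_pass() runs in one master maintainer or inside each
per-cluster kernel is exactly the design choice described above.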
Why get performance with added complexity and maintenance hassle when
you can get more performance with less complexity ;-)

I'm sure many people are interested in this topic, and work on it has
started and is continuing in the free operating system world.  I
invite those interested to join in the discussions taking place right
now in the MultiComputer project which has recently begun in the Linux
camp (no, this is not a shameless Linux plug; I would be truly
thrilled to see a FreeBSD project of the same vein and a high level of
communication between both groups).  It is still early for the
project, and it is at this point that I think the most interesting
discussion occurs, because we can talk about architecture instead of
how we fucked up and need to fix things.

The mailing list is linux-mc@samba.anu.edu.au and it is a ListServ
based mailing list.

Another interesting place for similar material is the AP+ CAP project
home page:

        http://cap.anu.edu.au/cap/projects/linux/

(The CAP AP+ project is work on a Sparc based parallel computer made
by Fujitsu running SparcLinux; they have real parallel programs
running and have done experiments with things like Gang Scheduling,
handling asynchronous faults from within interrupt context, etc.)

Later,
David S. Miller
davem@caip.rutgers.edu