From owner-freebsd-smp Fri Jan 31 06:36:09 1997
Date: Fri, 31 Jan 1997 07:31:12 -0500
Message-Id: <199701311231.HAA15221@jenolan.caipgeneral>
From: "David S. Miller" <davem@caip.rutgers.edu>
To: netdev@roxanne.nuclecu.unam.mx
CC: roque@di.fc.ul.pt
In-reply-to: <199701311117.LAA13030@oberon.di.fc.ul.pt> (message from
        Pedro Roque on Fri, 31 Jan 1997 11:17:04 GMT)
Subject: Re: SMP
Reply-To: netdev@roxanne.nuclecu.unam.mx, "David S. Miller" <davem@caip.rutgers.edu>

   Date: Fri, 31 Jan 1997 11:17:04 GMT
   From: Pedro Roque

   Ok dave how do you want us to do it :-) ?

Good question ;)

   Specific questions i have:

   - Is it ok to do locking on TCP by socket?  (i.e. is the
     granularity too low?)

   - A lot of the current locking is done by using atomic_bh to keep
     net_bh and the timers from running while the critical section is
     executing... This is a nice scheme in non-SMP... how do you want
     it replaced?

   - How will timers and bottom half handlers run?  Do they have a
     lock already?  Should they release the kernel lock and acquire a
     network one?

[ NOTE: This mail should be forwarded by some random person within
  the next 12 hours or so to freebsd-hackers@freebsd.org, and
  instantly everyone there will be "experts" on wait-free algorithms
  and lockless data structures.  This seems to be a new fad or
  something; almost everything I post about some new technique that
  is going to enter Linux gets forwarded there by someone.  I find
  the treatment my forwarded postings get extremely amusing. ;-) ]

If you want to attack the SMP problem in the way that will bring the
results Linus and I are really striving for, do this:

step 1: Play stupid.  "Locks, what are locks?"

This is completely serious.  Pretend that you know nothing about
"traditional" SMP synchronization techniques.  Now you are ready to
analyze the problem at hand.

As an example, self-locking (also termed "lockless") data structures,
which are self-synchronizing, are what you will most likely get the
best results from.
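To make the flavour of this concrete, here is a minimal sketch (not
from the original mail) of one such self-locking structure: a stack
whose push and pop are driven entirely by compare-and-swap retries.
The names (lf_stack, lf_push, lf_pop) are invented, it uses C11
<stdatomic.h> rather than anything a 1997 kernel had, and a real
lock-free pop would additionally need ABA protection.

    /* Lock-free ("self-locking") stack sketch: no lock is ever taken;
     * concurrent updaters simply retry the compare-and-swap until it
     * lands.  Illustrative only. */
    #include <stdatomic.h>
    #include <stdio.h>

    struct node {
            int value;
            struct node *next;
    };

    struct lf_stack {
            _Atomic(struct node *) head;
    };

    static void lf_push(struct lf_stack *s, struct node *n)
    {
            struct node *old = atomic_load(&s->head);
            do {
                    /* Link against what we believe is the head; if the
                     * CAS fails another CPU changed it, "old" has been
                     * reloaded for us, and we simply retry. */
                    n->next = old;
            } while (!atomic_compare_exchange_weak(&s->head, &old, n));
    }

    static struct node *lf_pop(struct lf_stack *s)
    {
            struct node *old = atomic_load(&s->head);
            while (old && !atomic_compare_exchange_weak(&s->head, &old, old->next))
                    ;               /* raced with someone, retry */
            return old;             /* NULL if the stack was empty */
    }

    int main(void)
    {
            struct lf_stack s = { NULL };
            struct node a = { 1, NULL }, b = { 2, NULL };

            lf_push(&s, &a);
            lf_push(&s, &b);
            printf("%d %d\n", lf_pop(&s)->value, lf_pop(&s)->value);
            return 0;
    }

No CPU ever blocks another here; contention only costs a retry, which
is the property being asked for.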
As an example, Eric Schenk has already told us that self-locking
algorithms for hash tables do exist, which is why he was stressing
that we look at that solution for the socket demuxing instead of
something involving a radix tree.

There are (at least to my knowledge) self-locking data structures and
algorithms for a decent set of the basic objects one would want to
manipulate.  Alan told me at Usenix that they exist for trees,
although he couldn't recall what kind of trees.  Also I'm pretty sure
there is some sort of linked list implementation which is
self-locking as well.

The problem here is that this is new territory; I only have a
reference or two to this area of work.  It would be nice if the other
hackers (Mr. Schenk, hint hint ;-) could cough up sources of
information for papers on existing self-locking techniques.

Another related area is that of "wait free" algorithms.  Most of the
existing research work depends upon a DCAS() instruction on the CPU;
luckily DCAS (Double Compare And Set) can be done in software with
not much more overhead than a real hardware implementation.  If you
structure your code the way their DCAS-based wait-free algorithms are
structured, you get the same effect and the same contention avoidance
characteristics.

step 2: Ok, no self-locking technique exists for what you are
tackling.  Ok, plan B.

At this point you are going to go after a traditional locking
technique, but make the most of it.  The traditional way people think
about this stuff for performance is to try and make it such that
fewer people want to lock the same stuff.  I tend to disagree with
that philosophy for certain situations, and instead I would encourage
someone to take the following approach first:

        "Try to find a locking strategy that encourages code which
         holds the lock for _extremely_ short periods of time."

That is, strive for short holds instead of less contention.

As for interrupts and base handlers, perhaps not their "interface",
but their implementation is indeed going to go through a major
overhaul.  At Usenix, Linus and I decided that we want the following
from the interrupt architecture the kernel will have:

1) It must be implemented via software.

2) It will extend to the multiprocessor model.  That is to say, when
   a cli() occurs, dammit, that is a cli() no matter what CPU you are
   on.  This essentially requires (1) if it is to be at all portable.

3) It must be low overhead, at least no more inefficient than real
   uniprocessor hardware interrupt disabling is on the supported
   architectures.

The interface, as I mentioned, will be the same: your good friends
save_flags, cli, sti, and restore_flags will still be there and do
what you expect them to.

What is the motivation behind this?  Think about it...  If the above
can be made to work, and efficiently, _every_ interrupt handler we
currently have (this means all device driver interrupt servicing
code) can run with no locks, no modifications, nothing at all.  And
what's more, in the SMP case a port can scatter the interrupts to
various processors if they all come in at once.  For example:

    do_IRQ()
    {
            if (intr_count[smp_processor_id()]) {
                    give_irq_to_some_other_cpu(irq);
                    return;
            }
            [ ... ]
    }

or

    do_IRQ()
    {
            make_next_irq_goto_some_cpu(irq);
            [ ... ]
    }

Again, everyone can still run lockless, completely.  Think about what
this does to interrupt response times.  And also consider what this
does for device driver writers: they need know nothing about ugly
traditional SMP mutual exclusion techniques.
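As a purely illustrative sketch of points (1) and (2) above, the
crudest possible software implementation is a single global flag that
every cli() must grab and every interrupt entry path must also
respect, so that "interrupts disabled" means disabled everywhere
rather than just on the local CPU.  The names (soft_cli, soft_sti)
and the C11 atomics are assumptions of this sketch; the code the
kernel eventually grew out of this idea was considerably more careful
about recursion and fairness.

    #include <stdatomic.h>
    #include <stdio.h>

    /* One global software "interrupts are off" token shared by every
     * CPU, standing in for the per-CPU hardware interrupt flag. */
    static atomic_flag global_irq_flag = ATOMIC_FLAG_INIT;

    static void soft_cli(void)
    {
            /* Whoever owns the token has interrupts "off" everywhere;
             * other CPUs (including interrupt entry paths) spin here. */
            while (atomic_flag_test_and_set_explicit(&global_irq_flag,
                                                     memory_order_acquire))
                    ;       /* busy wait; a real kernel would relax the CPU */
    }

    static void soft_sti(void)
    {
            atomic_flag_clear_explicit(&global_irq_flag,
                                       memory_order_release);
    }

    static int shared_driver_state;

    int main(void)
    {
            /* Driver code keeps its familiar cli()/sti() bracketing;
             * only the meaning has become machine-wide. */
            soft_cli();
            shared_driver_state++;
            soft_sti();
            printf("%d\n", shared_driver_state);
            return 0;
    }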
Device driver writers just block out "interrupts" when they need to
protect their top and bottom half code from some data accesses.
Sounds nice, right?

Anyways, these are the sorts of "clever" things you want to look for.

There are some cases where I will say just do the traditional locking
for SMP.  Per-process signal state is a good example.  This is a case
where there are enough "inherent races" that the only thing you
really need to protect against is multiple updates; reads are
essentially safe in all cases.  In fact I've coded the necessary
parallelization for the signal stuff already; it was all
straightforward and took around 45 minutes to implement and test.  It
was cool because it made 12 or 13 lock/unlock kernel sequences
"disappear".  ;-)

For the networking, I want to entertain an idea.  What would people
say if I told them that 75% or more of the locking in the networking
could be done via bh atomics?  Here is a case where my suggestion of
keeping locking to a minimum does not hold, and it shouldn't.  The
bh's can be made to do the bulk of the work, and are assured
single-threaded semantics at all times.  I could be (actually,
probably am) entirely wrong here, but I wanted to at least put that
sort of idea to people.  And I suppose trying to make everything
protected via bh's could suck for performance; essentially
start_bh_atomic() means "single thread me", and that's not so swift
for scalability at all.

Could the folks with references to lockless data structures and
similar please share any references you may have?  It would be
appreciated.

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                      ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><
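For readers unfamiliar with the bh-atomic idiom mentioned above:
roughly, start_bh_atomic() bumps a count that the bottom-half
dispatcher checks, so bottom halves (net_bh, the timer bh) stay out
of the way while the bracket is held, which is what gives the
bracketed code single-threaded semantics.  The sketch below is only
an approximation of that idiom, with invented function bodies and a
dispatcher that skips rather than defers the bottom half; it is not
the actual 2.x kernel code.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Global count of outstanding bh-atomic sections.  While it is
     * non-zero, bottom halves do not run, so the bracketed code is
     * effectively single threaded with respect to them. */
    static atomic_int bh_atomic_count;

    static void start_bh_atomic(void) { atomic_fetch_add(&bh_atomic_count, 1); }
    static void end_bh_atomic(void)   { atomic_fetch_sub(&bh_atomic_count, 1); }

    /* A real dispatcher would remember the bh and run it later; this
     * simplified one just declines to run it while protection is held. */
    static void maybe_run_bh(void (*bh)(void))
    {
            if (atomic_load(&bh_atomic_count) == 0)
                    bh();
    }

    static void net_bh(void) { puts("net_bh ran"); }

    int main(void)
    {
            start_bh_atomic();
            maybe_run_bh(net_bh);   /* does not run: we are protected */
            end_bh_atomic();
            maybe_run_bh(net_bh);   /* runs now */
            return 0;
    }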