From owner-freebsd-smp Fri Jan 31 06:36:09 1997
Date: Fri, 31 Jan 1997 07:31:12 -0500
Message-Id: <199701311231.HAA15221@jenolan.caipgeneral>
From: "David S. Miller" <davem@caip.rutgers.edu>
To: netdev@roxanne.nuclecu.unam.mx
CC: roque@di.fc.ul.pt
In-reply-to: <199701311117.LAA13030@oberon.di.fc.ul.pt> (message from
        Pedro Roque on Fri, 31 Jan 1997 11:17:04 GMT)
Subject: Re: SMP
Reply-To: netdev@roxanne.nuclecu.unam.mx, "David S. Miller" <davem@caip.rutgers.edu>

   Date: Fri, 31 Jan 1997 11:17:04 GMT
   From: Pedro Roque

   Ok dave how do you want us to do it :-) ?

Good question ;)

   Specific questions i have:

   - Is it ok to do locking on TCP by socket?  (i.e. is the
     granularity too low?)

   - A lot of the current locking is done by using atomic_bh to keep
     net_bh and the timers from running while the critical section is
     executing... This is a nice scheme in non-SMP... how do you want
     it replaced?

   - How will timers and bottom half handlers run?  Do they have a
     lock already?  Should they release the kernel lock and acquire a
     network one?

[ NOTE: This mail should be forwarded by some random person within
  the next 12 hours or so to freebsd-hackers@freebsd.org, and
  instantly everyone there will be "experts" on wait-free algorithms
  and lockless data structures.  This seems to be a new fad or
  something; almost everything I post about some new technique that
  is going to enter Linux gets forwarded there by someone.  I find
  the treatment my forwarded postings get extremely amusing. ;-) ]

If you want to attack the SMP problem in the way that will bring the
results Linus and I are really striving for, do this:

step 1: Play stupid.  "Locks, what are locks?"

This is completely serious.  Pretend that you know nothing about
"traditional" SMP synchronization techniques.  Now you are ready to
analyze the problem at hand.

As an example, self-locking (also termed "lockless") data structures,
which are self-synchronizing, are what you will most likely get the
best results from.
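To make the flavour of this concrete, here is a minimal sketch (not
from the original mail) of one such self-locking structure: a stack
whose push and pop are driven entirely by compare-and-swap retries.
The names (lf_stack, lf_push, lf_pop) are invented, it uses C11
<stdatomic.h> rather than anything a 1997 kernel had, and a real
lock-free pop would additionally need ABA protection.

    /* Lock-free ("self-locking") stack sketch: no lock is ever taken;
     * concurrent updaters simply retry the compare-and-swap until it
     * lands.  Illustrative only. */
    #include <stdatomic.h>
    #include <stdio.h>

    struct node {
            int value;
            struct node *next;
    };

    struct lf_stack {
            _Atomic(struct node *) head;
    };

    static void lf_push(struct lf_stack *s, struct node *n)
    {
            struct node *old = atomic_load(&s->head);
            do {
                    /* Link against what we believe is the head; if the
                     * CAS fails another CPU changed it, "old" has been
                     * reloaded for us, and we simply retry. */
                    n->next = old;
            } while (!atomic_compare_exchange_weak(&s->head, &old, n));
    }

    static struct node *lf_pop(struct lf_stack *s)
    {
            struct node *old = atomic_load(&s->head);
            while (old && !atomic_compare_exchange_weak(&s->head, &old, old->next))
                    ;               /* raced with someone, retry */
            return old;             /* NULL if the stack was empty */
    }

    int main(void)
    {
            struct lf_stack s = { NULL };
            struct node a = { 1, NULL }, b = { 2, NULL };

            lf_push(&s, &a);
            lf_push(&s, &b);
            printf("%d %d\n", lf_pop(&s)->value, lf_pop(&s)->value);
            return 0;
    }

No CPU ever blocks another here; contention only costs a retry, which
is the property being asked for.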
As an example, Eric Schenk has already told us that self-locking
algorithms for hash tables do exist, which is why he was stressing
that we look at that solution for the socket demuxing instead of
something involving a radix tree.

There are (at least to my knowledge) self-locking data structures and
algorithms for a decent set of the basic objects one would want to
manipulate.  Alan told me at Usenix that they exist for trees,
although he couldn't recall what kind of trees.  Also I'm pretty sure
there is some sort of linked list implementation which is
self-locking as well.

The problem here is that this is new territory; I only have a
reference or two to this area of work.  It would be nice if the other
hackers (Mr. Schenk, hint hint ;-) could cough up sources of
information for papers on existing self-locking techniques.

Another related area is that of "wait free" algorithms.  Most of the
existing research work depends upon a DCAS() instruction on the CPU;
luckily DCAS (Double Compare And Set) can be done in software with
not much more overhead than a real hardware implementation.  If you
structure your code the way their DCAS-based wait-free algorithms are
structured, you get the same effect and the same contention avoidance
characteristics.

step 2: Ok, no self-locking technique exists for what you are
tackling.  Ok, plan B.

At this point you are going to go after a traditional locking
technique, but make the most of it.  The traditional way people think
about this stuff for performance is to try and make it such that
fewer people want to lock the same stuff.  I tend to disagree with
that philosophy for certain situations, and instead I would encourage
someone to take the following approach first:

        "Try to find a locking strategy that encourages code which
         holds the lock for _extremely_ short periods of time."

That is, strive for short holds instead of less contention.

As for interrupts and base handlers, perhaps not their "interface",
but their implementation is indeed going to go through a major
overhaul.  At Usenix, Linus and I decided that we want the following
from the interrupt architecture the kernel will have:

1) It must be implemented via software.

2) It will extend to the multiprocessor model.  That is to say, when
   a cli() occurs, dammit, that is a cli() no matter what CPU you are
   on.  This essentially requires (1) if it is to be at all portable.

3) It must be low overhead, at least no more inefficient than real
   uniprocessor hardware interrupt disabling is on the supported
   architectures.

The interface, as I mentioned, will be the same: your good friends
save_flags, cli, sti, and restore_flags will still be there and do
what you expect them to.

What is the motivation behind this?  Think about it...  If the above
can be made to work, and efficiently, _every_ interrupt handler we
currently have (this means all device driver interrupt servicing
code) can run with no locks, no modifications, nothing at all.  And
what's more, in the SMP case a port can scatter the interrupts to
various processors if they all come in at once.  For example:

    do_IRQ()
    {
            if (intr_count[smp_processor_id()]) {
                    give_irq_to_some_other_cpu(irq);
                    return;
            }
            [ ... ]
    }

or

    do_IRQ()
    {
            make_next_irq_goto_some_cpu(irq);
            [ ... ]
    }

Again, everyone can still run lockless, completely.  Think about what
this does to interrupt response times.  And also consider what this
does for device driver writers: they need know nothing about ugly
traditional SMP mutual exclusion techniques.
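As a purely illustrative sketch of points (1) and (2) above, the
crudest possible software implementation is a single global flag that
every cli() must grab and every interrupt entry path must also
respect, so that "interrupts disabled" means disabled everywhere
rather than just on the local CPU.  The names (soft_cli, soft_sti)
and the C11 atomics are assumptions of this sketch; the code the
kernel eventually grew out of this idea was considerably more careful
about recursion and fairness.

    #include <stdatomic.h>
    #include <stdio.h>

    /* One global software "interrupts are off" token shared by every
     * CPU, standing in for the per-CPU hardware interrupt flag. */
    static atomic_flag global_irq_flag = ATOMIC_FLAG_INIT;

    static void soft_cli(void)
    {
            /* Whoever owns the token has interrupts "off" everywhere;
             * other CPUs (including interrupt entry paths) spin here. */
            while (atomic_flag_test_and_set_explicit(&global_irq_flag,
                                                     memory_order_acquire))
                    ;       /* busy wait; a real kernel would relax the CPU */
    }

    static void soft_sti(void)
    {
            atomic_flag_clear_explicit(&global_irq_flag,
                                       memory_order_release);
    }

    static int shared_driver_state;

    int main(void)
    {
            /* Driver code keeps its familiar cli()/sti() bracketing;
             * only the meaning has become machine-wide. */
            soft_cli();
            shared_driver_state++;
            soft_sti();
            printf("%d\n", shared_driver_state);
            return 0;
    }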
Device driver writers just block out "interrupts" when they need to
protect their top and bottom half code from some data accesses.
Sounds nice, right?

Anyways, these are the sorts of "clever" things you want to look for.

There are some cases where I will say just do the traditional locking
for SMP.  Per-process signal state is a good example.  This is a case
where there are enough "inherent races" that the only thing you
really need to protect against is multiple updates; reads are
essentially safe in all cases.  In fact I've coded the necessary
parallelization for the signal stuff already; it was all
straightforward and took around 45 minutes to implement and test.  It
was cool because it made 12 or 13 lock/unlock kernel sequences
"disappear".  ;-)

For the networking, I want to entertain an idea.  What would people
say if I told them that 75% or more of the locking in the networking
could be done via bh atomics?  Here is a case where my suggestion of
keeping locking to a minimum does not hold, and it shouldn't.  The
bh's can be made to do the bulk of the work, and are assured
single-threaded semantics at all times.  I could be (actually,
probably am) entirely wrong here, but I wanted to at least put that
sort of idea to people.  And I suppose trying to make everything
protected via bh's could suck for performance; essentially
start_bh_atomic() means "single thread me", and that's not so swift
for scalability at all.

Could the folks with references to lockless data structures and
similar please share any references you may have?  It would be
appreciated.

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                      ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><
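For readers unfamiliar with the bh-atomic idiom mentioned above:
roughly, start_bh_atomic() bumps a count that the bottom-half
dispatcher checks, so bottom halves (net_bh, the timer bh) stay out
of the way while the bracket is held, which is what gives the
bracketed code single-threaded semantics.  The sketch below is only
an approximation of that idiom, with invented function bodies and a
dispatcher that skips rather than defers the bottom half; it is not
the actual 2.x kernel code.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Global count of outstanding bh-atomic sections.  While it is
     * non-zero, bottom halves do not run, so the bracketed code is
     * effectively single threaded with respect to them. */
    static atomic_int bh_atomic_count;

    static void start_bh_atomic(void) { atomic_fetch_add(&bh_atomic_count, 1); }
    static void end_bh_atomic(void)   { atomic_fetch_sub(&bh_atomic_count, 1); }

    /* A real dispatcher would remember the bh and run it later; this
     * simplified one just declines to run it while protection is held. */
    static void maybe_run_bh(void (*bh)(void))
    {
            if (atomic_load(&bh_atomic_count) == 0)
                    bh();
    }

    static void net_bh(void) { puts("net_bh ran"); }

    int main(void)
    {
            start_bh_atomic();
            maybe_run_bh(net_bh);   /* does not run: we are protected */
            end_bh_atomic();
            maybe_run_bh(net_bh);   /* runs now */
            return 0;
    }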