From owner-freebsd-hackers@FreeBSD.ORG Sat Oct 25 15:13:14 2003
Date: Sat, 25 Oct 2003 15:13:12 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200310252213.h9PMDCHq032546@apollo.backplane.com>
To: Kip Macy, Marcel Moolenaar, John-Mark Gurney, hackers@freebsd.org
References: <200310230143.32244.wes@softweyr.com> <20031025175948.GF683@funkthat.com> <20031025194135.GA790@dhcp01.pn.xcllnt.net> <20031025135752.U84860@demos.bsdclusters.com>
Subject: Re: FreeBSD mail list etiquette
List-Id: Technical Discussions relating to FreeBSD

    Sheesh, you think you guys (*ALL* you guys) have enough time on your
    hands?  There are better places to direct all that brainpower.

    I don't really need to defend DragonFly... I believe it stands on its
    own very well, not only with what we have already accomplished but with
    what we are about to accomplish.  Jeffrey is very close to decoupling
    the NETIF and network protocol drivers from Giant, and Hiten has been
    playing with the APICs with regard to distributing interrupts to
    particular CPUs (something DragonFly is particularly good at due to the
    way the lightweight kernel threading system works).  As soon as I get
    this namecache mess rewritten (and assuming David Rhodus doesn't keep
    pulling obscure panics out of his hat :-), but to be fair our NFS is
    already gobs faster than 4.x)... I am going to start cleaning up loose
    ends in the networking code, and we will have the critical path entirely
    decoupled and mostly (or completely) mutexless.

    We are taking a somewhat different approach to BGL removal than 5.x.
    Instead of haphazardly locking up subsystems with mutexes, we are
    locking up subsystems by moving them into their own threads, then
    scaling through the use of multiple threads, and leaving everything
    that hasn't been locked up under the BGL.  That way we are able to skip
    the intermediate step of determining where all the contention is,
    because the only contention will be in the BGL'd areas which haven't
    been converted yet, and we simply assume contention there.  This way we
    can focus on optimizing the critical path, which will get us 80% of the
    scalability we need, and tackle the other things, like, say, the route
    table, after we have the topology in place and can see clearly what
    needs to be done for it (e.g. using RCU and passive IPI messaging
    instead of mutexes for updates).

    So, for example, take the TCP stack.  It's already mostly in its own
    thread simply by virtue of being a software interrupt.  Softints, like
    interrupts, are threads in DragonFly.
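    [Editor's illustration: a minimal userland sketch of the serialize-by-
    thread idea described above.  This is not DragonFly's lwkt/netisr code;
    a pthread mutex/condvar queue stands in for lwkt message passing, and
    the names (pkt_msg, pkt_deliver, proto_thread, pcb_state) are invented
    for the example.  The point is that every packet is handed to one
    dedicated protocol thread, so the "PCB" state is only ever touched from
    that thread and needs no lock of its own.]

/* Illustrative sketch only -- not DragonFly code.  Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct pkt_msg {                        /* one queued "packet" */
        struct pkt_msg *next;
        int             payload;
};

static struct pkt_msg *q_head, *q_tail; /* the hand-off queue */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv   = PTHREAD_COND_INITIALIZER;

static long pcb_state;                  /* "PCB": owned by proto_thread only */

/* Producer side: any thread may enqueue; it never touches pcb_state. */
static void
pkt_deliver(int payload)
{
        struct pkt_msg *m = malloc(sizeof(*m));

        m->payload = payload;
        m->next = NULL;
        pthread_mutex_lock(&q_lock);
        if (q_tail != NULL)
                q_tail->next = m;
        else
                q_head = m;
        q_tail = m;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
}

/*
 * The protocol thread: the only place pcb_state is read or written, so it
 * carries no mutex -- serialization comes from the thread itself.
 */
static void *
proto_thread(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&q_lock);
                while (q_head == NULL)
                        pthread_cond_wait(&q_cv, &q_lock);
                struct pkt_msg *m = q_head;
                q_head = m->next;
                if (q_head == NULL)
                        q_tail = NULL;
                pthread_mutex_unlock(&q_lock);

                pcb_state += m->payload;        /* lock-free "PCB" update */
                printf("pcb_state = %ld\n", pcb_state);
                free(m);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tid;
        int i;

        pthread_create(&tid, NULL, proto_thread, NULL);
        for (i = 1; i <= 5; i++)
                pkt_deliver(i);
        pthread_join(tid, NULL);        /* the toy consumer never exits */
        return (0);
}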
    After the first lockup phase, external APIs such as mbuf allocation and
    freeing, and route table lookups, will still be under the BGL, but PCBs
    and packet manipulation will be serialized in the protocol thread(s)
    and require no mutexes or locks whatsoever.

    Then we will move most of the mbuf API out of the BGL simply by adding
    a per-cpu layer (and since there is no cpu-hopping preemption we can
    depend on the per-cpu globaldata area without acquiring and releasing
    mutexes, which would just waste cycles since the whole idea is for
    there to be no contention in the first place).  But just like our
    current slab allocator, things that miss the per-cpu globaldata cache
    will either use the BGL to access the kernel_map or will queue the
    operation (if it does not need to be synchronous) for later execution.
    After all, who cares if free() can't release a chunk of memory to the
    kernel_map instantly for reuse?

    It's a lot easier lockup path than the direction 5.x is going, and a
    whole lot more maintainable IMHO, because most of the coding doesn't
    have to worry about mutexes or LORs or anything like that.

    If I were to recommend anything to the folks working on
    FreeBSD-current, it would be:

    * get rid of priority borrowing, and stop depending on it to fix all
      your woes with interrupt threads accessing mutexes that non-interrupt
      threads might also be accessing in the critical path.  Fix the
      interrupt code instead.

    * get rid of *NON*-interrupt thread preemption while in the kernel.

    * get rid of preemptive cpu migration, even across normal blocking
      points inside the kernel, unless you tell the API with a flag that
      it is ok.

    * formalize critical sections to use just the counter mechanism
      (similar to spls in 4.x), which they almost do now, and require that
      hardware interrupts conform to the mechanism on all architectures.

    * Port our IPI messaging code (which isn't optimized yet, but works
      and can theoretically be very nicely optimized).

    * separate the userland scheduler from the kernel thread scheduler
      using a designated P_CURPROC approach, which completely fixes the
      priority inversion issues that, I might add, ULE only 'fake fixes'
      right now.  Make the kernel thread scheduler a fixed-priority
      scheduler (e.g. highest priority being interrupts, then softints,
      then threads operating in the kernel, then user-associated threads
      operating in the kernel, then user-associated threads operating in
      userland).  Fix the userland scheduler API to conform to the
      designated P_CURPROC approach, where the userland scheduler is
      responsible for maintaining a single user process's thread or
      threads on each cpu in the system at a time.

    If you did the above you would be a lot happier.  Once the schedulers
    are separated I would also make the kernel thread scheduler per-cpu and
    remove *ALL* mutex dependencies from it, which in turn will allow you
    to trivially integrate BGL requirements with a per-thread lock counter
    and directly integrate it into the kernel thread scheduler, which I do
    in DragonFly; look at kern/lwkt_thread.c.  It actually optimizes the
    use of the BGL such that you can avoid doing BGL operations when
    switching between threads with the same BGL locked/not-locked state
    (see the sketch appended below).

                                        -Matt
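    [Editor's illustration: a minimal userland model of the per-thread BGL
    counter trick referenced above.  This is not the actual
    kern/lwkt_thread.c code; the names here (struct lwkt_model, td_mpcount
    as used below, switch_bgl, giant) are invented for the example.  The
    switch path only releases or acquires the stand-in giant lock when the
    outgoing and incoming threads disagree about holding it, so switching
    between two BGL-holding threads, or two MP-safe threads, touches no
    lock at all.]

/* Illustrative sketch only -- not DragonFly code.  Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>

struct lwkt_model {
        const char *td_name;
        int         td_mpcount;     /* >0: thread logically holds the BGL */
};

static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;  /* stand-in BGL */

/*
 * Model "thread switch": only touch the giant lock when the BGL state
 * actually changes across the switch.
 */
static void
switch_bgl(struct lwkt_model *old, struct lwkt_model *new)
{
        if (old->td_mpcount != 0 && new->td_mpcount == 0) {
                pthread_mutex_unlock(&giant);
                printf("%s -> %s: released giant\n", old->td_name, new->td_name);
        } else if (old->td_mpcount == 0 && new->td_mpcount != 0) {
                pthread_mutex_lock(&giant);
                printf("%s -> %s: acquired giant\n", old->td_name, new->td_name);
        } else {
                printf("%s -> %s: giant untouched\n", old->td_name, new->td_name);
        }
}

int
main(void)
{
        struct lwkt_model a = { "bgl_thread_a", 1 };
        struct lwkt_model b = { "bgl_thread_b", 2 };
        struct lwkt_model c = { "mpsafe_thread", 0 };

        pthread_mutex_lock(&giant);     /* "a" starts out holding the BGL */
        switch_bgl(&a, &b);             /* both hold it: no lock traffic */
        switch_bgl(&b, &c);             /* leaving BGL'd code: release */
        switch_bgl(&c, &a);             /* re-entering BGL'd code: acquire */
        pthread_mutex_unlock(&giant);
        return (0);
}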