Date: Tue, 3 Dec 1996 11:43:00 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: davem@jenolan.rutgers.edu (David S. Miller)
Cc: avalon@coombs.anu.edu.au, dyson@freebsd.org, dennis@etinc.com, kpneal@pobox.com, hackers@freebsd.org, sparclinux@vger.rutgers.edu
Subject: Re: TCP/IP bandwidth bragging
Message-ID: <199612031843.LAA14375@phaeton.artisoft.com>
In-Reply-To: <199612030410.XAA18471@jenolan.caipgeneral> from "David S. Miller" at Dec 2, 96 11:10:51 pm
Of shoes, and ships, and sealing wax...
Of SMP, and streams...
Of the RT kernel threading message push...
And the concurrency of Things...

> Tell me, does Linux implement STREAMS in the kernel with a properly
> stacked network implementation, using DLPI and TLPI with fine grain
> mutexes and locks?
>
> Oh yes, then we'll have real performance.  Take a look at how many
> fastpaths and shortcuts the Solaris folks have to do to overcome the
> performance problems associated with streams.  The Solaris
> performance people are constantly breaking their necks to find new
> ways to overcome these issues.

If Ritchie couldn't get it right, perhaps that is cause enough to say that it isn't such a hot idea, and that either the implementation needs to be done differently (see below) or the entire idea should be trashed.

The problem with streams is that it runs in user context in most (bad) implementations.  Moving it to user space won't fix this problem; it will codify it for all time.  A streams implementation in an RT-aware kernel, or in a kernel running it as a kernel thread and supporting kernel preemption and prioritization, would tell a significantly different story.

The speed of streams is governed by the number of stack/boundary traversals and by the overall packet assembly overhead.  UnixWare lost 15% of its network performance when 2.x moved from "monolithic" drivers to ODI-based streams drivers, mostly because the change added two boundary crossings.  Each boundary crossing required that the "runstreams" entry point be called to propagate messages over the boundary.  That is equivalent to two additional context switch overheads, plus whatever latency accrues before the context switch is triggered by a blocking call or quantum expiration.

Having trivially "fixed" the streams in UnixWare by running the interrupt push to completion (at the expense of serializing interrupts for the duration), I can tell you that the problems with streams are *purely* related to the execution architecture, and not to the architecture of streams itself.  We can examine a similar architecture, the FICUS VFS interface, which was integrated into 4.4BSD-Lite, to see that good performance for a file system is not dependent on a monolithic design.

If, in fact, you were truly worried about boundary crossing overhead, you would build a multiheaded monolithic module, or what some research papers have called a "collapsed stack".  This would be a monolithic module that nevertheless exported separate stream heads for IP, UDP, and TCP, even though internally no message buffer pushes across module boundaries were taking place (a rough sketch of what I mean follows below).

I suggest you look at the streams implementation for AIX, which was done by Mentat, Inc.  I have been deep into the Mentat code (as one of the three Novell engineers who worked on the "Pathworks for VMS (NetWare)" product), and I was able to save three additional copies under DEC MTS (MultiThreading Services).  The Mentat services under AIX run as a kernel process (thread) and do not suffer the context-switch-based push latency of "normal" (idiotic) streams implementations.

Now that Linux supports kernel "threads", if you could also support kernel preemption, it would behoove you to try streams again.

I suggest you contact Jim Freeman at Caldera, since he is a seasoned, professional programmer, used to working on streams in SMP environments.  Tell him "hi" for me (I used to work with him -- I'm also a seasoned professional programmer with experience working on kernel code in SMP environments, though I'm more of an FS/VM/threading guy than a streams guy).
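Here is the rough sketch of the "collapsed stack" I promised above.  This is illustrative code only, mine and nobody else's -- it is not from UnixWare, Mentat, or any other tree, and every name in it is made up.  The point is simply that one monolithic module can export what look like separate protocol entry points while a packet travels down by plain function call, with nothing queued across an internal boundary and therefore no extra scheduling pass per layer:

    #include <stddef.h>

    struct pkt {
            unsigned char hdr[64];      /* room for stacked headers */
            size_t        hdrlen;
            const void   *payload;
            size_t        paylen;
    };

    /* Internal layers: ordinary functions, not separately scheduled
     * streams modules; no message is queued between them. */
    static int driver_output(struct pkt *p)
    {
            (void)p;                    /* hand the frame to the NIC; stubbed */
            return 0;
    }

    static int ip_output(struct pkt *p)
    {
            p->hdrlen += 20;            /* pretend to prepend an IP header */
            return driver_output(p);    /* direct call, no boundary push */
    }

    /* Exported "stream heads": one per protocol, all in one module. */
    int tcp_write(const void *data, size_t len)
    {
            struct pkt p = { .hdrlen = 20, .payload = data, .paylen = len };
            return ip_output(&p);
    }

    int udp_write(const void *data, size_t len)
    {
            struct pkt p = { .hdrlen = 8, .payload = data, .paylen = len };
            return ip_output(&p);
    }

    int rawip_write(const void *data, size_t len)
    {
            struct pkt p = { .hdrlen = 0, .payload = data, .paylen = len };
            return ip_output(&p);
    }

From the outside it still looks like three stream heads; from the inside a packet crosses every "layer" in one pass, which is exactly where the ODI-style boundary crossings in UnixWare lost their 15%.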
> Streams can be done at the user level with minimal kernel support.

And with protection domain crossing overhead out the wazoo for service requests which should, rightfully, be turned around in the kernel.  Like NFS RPCs.

> The only reason it is completely in the kernel in most commercial
> systems is that someone let it in there in the first place.  It is
> very hard to "take out" something like that once it is in, especially
> in a commercial source tree.

Bullshit.  XKernel first ran on SVR4 systems.  Linux was a definite late-bloomer.

> Fine grained mutexes and locks, yes that will indeed get you scaling
> better than a master lock implementation (which is what Linux has at
> the moment).  But it is not a reasonable way to implement scalable SMP
> systems.

I suggest you go to work in industry and implement a commercial SMP system before you make that judgement.  That's what I did.  The *entire* game is *concurrency*.  The *ENTIRE* game.  And you spell that "increased blocking granularity".

Of course, I have a somewhat unfair advantage, having worked on code which was designed to run in the SVR4 ES/MP (Unisys), UnixWare 2.x, Sequent, and Solaris SMP kernels.  I happen to know where these systems made their mistakes.

I can tell you for a fact that the 8 processor limitation touted in most of these companies' literature (except Sequent's) is utter bullshit; it is an artifact of a global pool allocator.  I suggest you read both "UNIX Systems for Modern Architectures" and "UNIX Internals: The New Frontiers" and pay attention to the modified SLAB allocators employed by SVR4 and derived systems, which originated at Sun, and to the per-CPU pool architecture used in Sequent's code (and the limitations there).  A rough sketch of what I mean by a per-CPU pool is in the P.S. below.

If you contend for the bus, you reduce concurrency.  If you contend for the bus more than you absolutely have to, your design is inherently flawed and needs to be corrected.

> For how I think it should be done, investigate the numerous papers
> available on non-blocking synchronization and (harder to find) self
> locking data structures.

Data structure locking was the biggest, stupid-ass mistake that Sun made.  They have no hierarchically intermediate granularity to keep them from having to lock the world to get the lock onto their data structures.  Without this, they cannot establish per-processor domains of authority, and we're back to beating our heads against the bus to engender some form of synchronization.

If you hit the bus, you are stupidly reducing concurrency for no good reason.  You will never get better than .85 per additional CPU (and that factor compounds exponentially, until you run out of bus).

If you want to look at a good locking example, look at the Unisys SVR4 ES/MP implementation of VFS locking on the 60x0 series of machines (the VFS locking was one place where Sequent screwed up, Big Time).

I think you will find it as difficult to go back and fix your mistakes as the commercial companies have found it... and as, in fact, FreeBSD and the other free UNIX implementations have found it.  That is the problem with bulling ahead without considering the ramifications of your "right" decisions.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.
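P.S.: Here is the rough sketch of a per-CPU pool allocator I referred to above.  It is my own illustrative code, not Sequent's and not Sun's, and the names and primitives are stand-ins for whatever the host kernel would provide.  The point is that the allocation fast path touches only memory owned by the allocating CPU, so it takes no lock and generates no bus traffic; the shared pool and its lock are hit only when a local free list runs dry or overflows:

    #include <pthread.h>
    #include <stdlib.h>

    #define NCPUS      8
    #define LOCAL_MAX  64                       /* cap on a per-CPU free list */

    struct obj { struct obj *next; };

    struct cpu_pool {
            struct obj *freelist;               /* touched only by the owning CPU */
            int         count;
    };

    static struct cpu_pool  pool[NCPUS];
    static struct obj      *global_freelist;    /* shared, slow path only */
    static pthread_mutex_t  global_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fast path: no lock, no shared cache line, no bus contention. */
    struct obj *obj_alloc(int cpu)
    {
            struct cpu_pool *p = &pool[cpu];
            struct obj *o = p->freelist;

            if (o != NULL) {
                    p->freelist = o->next;
                    p->count--;
                    return o;
            }

            /* Slow path: local pool is dry, refill from the shared pool. */
            pthread_mutex_lock(&global_lock);
            o = global_freelist;
            if (o != NULL)
                    global_freelist = o->next;
            pthread_mutex_unlock(&global_lock);

            return o != NULL ? o : malloc(sizeof(*o));
    }

    void obj_free(int cpu, struct obj *o)
    {
            struct cpu_pool *p = &pool[cpu];

            if (p->count < LOCAL_MAX) {         /* fast path: keep it local */
                    o->next = p->freelist;
                    p->freelist = o;
                    p->count++;
                    return;
            }

            pthread_mutex_lock(&global_lock);   /* slow path: spill to global */
            o->next = global_freelist;
            global_freelist = o;
            pthread_mutex_unlock(&global_lock);
    }

Contrast that with a single global pool, where every allocation on every CPU serializes on the same lock and bounces the same cache lines across the bus -- which is exactly the .85-per-additional-CPU wall I was talking about.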