Date:      Tue, 3 Dec 1996 11:43:00 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        davem@jenolan.rutgers.edu (David S. Miller)
Cc:        avalon@coombs.anu.edu.au, dyson@freebsd.org, dennis@etinc.com, kpneal@pobox.com, hackers@freebsd.org, sparclinux@vger.rutgers.edu
Subject:   Re: TCP/IP bandwidth bragging
Message-ID:  <199612031843.LAA14375@phaeton.artisoft.com>
In-Reply-To: <199612030410.XAA18471@jenolan.caipgeneral> from "David S. Miller" at Dec 2, 96 11:10:51 pm


Of shoes, and ships, and sealing wax...
Of SMP, and streams...
Of the RT kernel threading message push...
And the concurrency of Things...



>    Tell me, does Linux implement STREAMS in the kernel with a properly
>    stacked network implementation, using DLPI and TLPI with fine grain
>    mutexes and locks ?
> 
> Oh yes, then we'll have real performance.  Take a look at how many
> fastpaths and shortcuts the Solaris folks have to do to overcome the
> performance problems associated with streams.  The Solaris
> performance people are constantly breaking their necks to find new
> ways to overcome these issues.  If Ritchie couldn't get it right,
> perhaps this is good enough cause that it isn't such a hot idea, and
> that the implementation needs to be done differently (see below) or
> the entire idea trashed.

The problem with streams is that it runs in user context in most
(bad) implementations.  Moving it to user space won't fix this
problem; it will codify it for all time.

A streams implementation in an RT-aware kernel, or in a kernel
running it as a kernel thread with support for kernel preemption
and prioritization, would tell a significantly different story.

The speed of streams is proportional to the number of stack/boundary
traversals, and to the overall packet assembly overhead.

UnixWare lost 15% of its network performance when 2.x moved from
"monolithic" drivers to ODI-based streams drivers, mostly because
the change added two boundary crossings.

Each boundary crossing required that the "runstreams" entry point
be called to propagate messages over the boundary.  This is
equivalent to paying two additional context switches, plus whatever
latency accrues before the context switch is triggered by a blocking
call or quantum expiration.
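
To make the cost concrete, here is a minimal sketch in SVR4 STREAMS
DDI terms; putq()/getq()/putnext() are the real DDI calls, and the
module entry points are hypothetical:

    /*
     * Deferred path: the put procedure queues the message and waits
     * for the streams scheduler ("runstreams") to call the service
     * routine later.  Every boundary costs a trip through the
     * scheduler.
     */
    static int
    mod_rput(queue_t *q, mblk_t *mp)
    {
            putq(q, mp);            /* defer to mod_rsrv() */
            return (0);
    }

    static int
    mod_rsrv(queue_t *q)
    {
            mblk_t *mp;

            while ((mp = getq(q)) != NULL)
                    putnext(q, mp); /* next queue's put procedure */
            return (0);
    }

    /*
     * Run-to-completion path: the put procedure does its work and
     * calls putnext() directly, so the message crosses the boundary
     * in the same context, with no scheduler round trip.
     */
    static int
    mod_rput_fast(queue_t *q, mblk_t *mp)
    {
            /* ... per-module processing of mp ... */
            putnext(q, mp);
            return (0);
    }

The first discipline is what buys you the two extra context switch
equivalents above; the second is the run-to-completion fix I describe
below.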


Having trivially "fixed" the streams in UnixWare by running the
interrupt push to completion (at the expense of serializing the
interrupts during the execution), I can tell you that the problems
with streams are *purely* related to the execution architecture,
and not to the architecture of streams itself.

We can examine a similar architecture, the FICUS VFS interface,
which was integrated into 4.4BSD-Lite, to see that good file system
performance is not dependent on a monolithic design.


If, in fact, you were truly worried about boundary crossing overhead,
you would build a multiheaded monolithic module, or what some research
papers have called a "collapsed stack".  This would be a monolithic
module that nevertheless exported separate stream heads for IP, UDP,
and TCP, even though internally, no msgbuf boundary pushes would be
taking place.
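
As a sketch of the shape of the thing (tcp_wput(), udp_wput(), and
ip_output() are hypothetical names; queue_t and mblk_t are the
standard STREAMS types):

    /*
     * One monolithic module, multiple stream heads.  The user still
     * opens /dev/tcp or /dev/udp and sees a stream, but the internal
     * layer boundaries are crossed by plain function calls, not by
     * putq()/service-routine message pushes.
     */
    static void ip_output(mblk_t *mp);  /* shared internal IP layer */

    static int
    tcp_wput(queue_t *q, mblk_t *mp)
    {
            /* ... TCP header assembly on mp ... */
            ip_output(mp);      /* function call, not a msgbuf push */
            return (0);
    }

    static int
    udp_wput(queue_t *q, mblk_t *mp)
    {
            /* ... UDP header assembly on mp ... */
            ip_output(mp);      /* same shared path, same stack */
            return (0);
    }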


I suggest you look at the streams implementation for AIX, which was
done by Mentat, Inc.  I have been deep into the Mentat code (as one
of the three Novell engineers who worked on the "Pathworks for VMS
(NetWare)" product), and I was able to save 3 additional copies
under DEC MTS (MultiThreading Services).  The Mentat services under
AIX run as a kernel process (thread) and do not suffer the
context-switch-based push latency of "normal" (idiotic) streams
implementations.


Now that Linux supports kernel "threads", if you could also support
kernel preemption, it would behoove you to try streams again.  I
suggest you contact Jim Freeman at Caldera, since he is a seasoned,
professional programmer, used to working on streams in SMP environments.
Tell him "hi" for me (I used to work with him -- I'm also a seasoned
professional programmer with experience working on kernel code in
SMP environments, though I'm more a FS/VM/threading guy than a streams
guy).


> Streams can be done at the user level with minimal kernel support.

And with protection domain crossing overhead out the wazoo for service
requests which should, rightfully, be turned around in the kernel.  Like
NFS RPCs.


> The only reason it is completely in the kernel in most commercial
> systems is that someone let it in there in the first place.  It is
> very hard to "take out" something like that once it is in, especially
> in a commercial source tree.

Bullshit.  XKernel first ran on SVR4 systems.  Linux was a definite
late-bloomer.


> Fine grained mutexes and locks, yes that will indeed get you scaling
> better than a master lock implementation (which is what Linux has at
> the moment).  But it is not a reasonable way to implement scalable SMP
> systems.

I suggest you go to work in industry and implement a commercial SMP
system before you make that judgement.  That's what I did.  The
*entire* game is *concurrency*.  The *ENTIRE* game.  And you spell
that "increased blocking granularity".

Of course, I have a somewhat unfair advantage, having worked on code
which was designed to run in SVR4 ES/MP (*Unisys), UnixWare 2.x, Sequent,
and Solaris SMP kernels.  I happen to know where these systems made their
mistakes.

I can tell you for a fact that the 8 processor limitation touted in
most of these companies' literature (except Sequent's) is utter
bullshit: it is an artifact of a global pool allocator.  I suggest
you read both "UNIX Systems for Modern Architectures" and "UNIX
Internals: The New Frontiers" and pay attention to the modified SLAB
allocators employed by SVR4 and derived systems, which originated at
Sun, and to the per-CPU pool architecture used in Sequent's code (and
the limitations there).  If you contend for the bus, you reduce
concurrency.  If you contend for the bus more than you absolutely
have to, your design is inherently flawed and needs to be corrected.
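
The shape of the per-CPU pool fix, as a minimal sketch -- cpu_id(),
global_alloc(), and global_free() are hypothetical stand-ins, and the
fast path assumes preemption is disabled so the caller cannot migrate
processors mid-operation:

    #define NCPU        8
    #define POOL_MAX    64

    extern int   cpu_id(void);          /* current processor number */
    extern void *global_alloc(void);    /* global pool, internally locked */
    extern void  global_free(void *);

    static struct pcpu_pool {
            void *free[POOL_MAX];
            int   nfree;
    } pools[NCPU];

    void *
    pool_alloc(void)
    {
            struct pcpu_pool *p = &pools[cpu_id()];

            if (p->nfree > 0)
                    return (p->free[--p->nfree]);  /* no lock, no bus */
            return (global_alloc());               /* slow path only */
    }

    void
    pool_free(void *obj)
    {
            struct pcpu_pool *p = &pools[cpu_id()];

            if (p->nfree < POOL_MAX)
                    p->free[p->nfree++] = obj;     /* no lock, no bus */
            else
                    global_free(obj);              /* slow path only */
    }

The common case touches only CPU-local memory and takes no lock; the
global pool and its lock are hit only on local under- or overflow.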


> For how I think it should be done, investigate the numerous papers
> available on non-blocking synchronization and (harder to find) self
> locking data structures.

Data structure locking was the biggest, stupid-ass mistake that Sun
made.  They have no hierarchically intermediate granularity to prevent
having to lock the world to get the lock onto their data structures.
Without this, they cannot establish per-processor domains of authority,
and we're back to beating our heads against the bus to engender some
form of synchronization.  If you hit the bus, you are stupidly reducing
concurrency for no good reason.  You will never get better than .85 per
additional CPU (compounded exponentially until you run out of bus --
at that rate, eight CPUs buy you roughly the throughput of five).
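
What a per-processor domain of authority looks like, as a sketch (all
names here are hypothetical):

    #define NCPU    8

    struct lock { volatile int busy; };
    struct job;                         /* opaque work item */

    extern void        lock(struct lock *);    /* spin lock primitives */
    extern void        unlock(struct lock *);
    extern struct job *dequeue(struct job **);
    extern struct job *steal_from_neighbor(int cpu); /* locks a remote
                                                        domain */

    static struct domain {
            struct lock  lk;    /* contended only by owner + stealers */
            struct job  *queue;
    } domains[NCPU];

    struct job *
    get_work(int cpu)
    {
            struct domain *d = &domains[cpu];
            struct job *j;

            lock(&d->lk);       /* local lock: usually uncontended */
            j = dequeue(&d->queue);
            unlock(&d->lk);

            if (j == NULL)
                    j = steal_from_neighbor(cpu);  /* rare slow path */
            return (j);
    }

The local lock stays in the local cache; only the steal path generates
bus traffic, so the per-CPU decay never gets a chance to compound on
the common case.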

If you want to look at a good locking example, look at the Unisys SVR4
ES/MP implementation of VFS locking on the 60x0 series of machines (the
VFS locking was one place where Sequent screwed up, Big Time).


I think you will find it as difficult to go back and fix your mistakes
as the commercial companies have found it... and as, in fact, FreeBSD
and the other free UNIX implementations have found it.  That is the
problem with bulling ahead without considering the ramifications of
your "right" decisions.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


