From owner-freebsd-hackers Thu Jun 20 14:12:47 2002
Message-ID: <3D1244F1.8020900@tibco.com>
Date: Thu, 20 Jun 2002 14:11:13 -0700
From: "Aram Compeau"
To: Terry Lambert
Cc: hackers@FreeBSD.ORG
Subject: Re: projects?

Great list, thanks for that. While I think LRP and TCP Rate Halving are
quite interesting, I think tackling the SMP Safe Queues makes the best
use of my resources. I fear that testing some of the other items
requires setups that are not feasible for me.

Cheers,

Aram


Terry Lambert wrote:
Aram Compeau wrote:
Too bad he's sick of networking. There is a lot of interesting code
that could be implemented in mainline FreeBSD that would be
really beneficial, overall.
Could you elaborate briefly on what you'd like to see worked on with
respect to this? I don't want you to spend a lot of time describing
anything, but I am curious. I don't generally have large blocks of spare
time, but could work on something steadily with a low flame.

---
LRP
---

I would like FreeBSD to support LRP (Lazy Receiver Processing),
an idea which came from the Scala Server Project at Rice
University.

LRP removes the need to run network processing in kernel
threads in order to get parallel operation on SMP systems;
so long as the interrupt processing load is balanced,
interrupts can be handled in an overlapped fashion.

Right now, there are four sets of source code: SunOS 4.1.3_U1,
FreeBSD 2.2-BETA, FreeBSD 4.0, FreeBSD 4.3. The first three
are from Rice University. The fourth is from Duke University,
and is a port forward of the 4.0 Rice code.

The Rice code, other than the FreeBSD 2.2-BETA, is unusable.
It mixes in an idea called "Resource Containers" (RESCON),
that is really not very useful (I can go into great detail on
this, if necessary). It also has a restrictive license. The
FreeBSD 2.2-BETA implementation has a CMU Mach-style license
(same as some FreeBSD code already has).

The LRP implementation in all these cases is flawed, in that
it assumes that the LRP processing will be universal across
an entire address family, and the experimental implementation
loads a full copy of the AF_INET stack under another family
name. A real integration is tricky: it involves credentials on
accept calls; an attribute on the family struct to indicate
that it's LRP'ed, so that common subsystems can behave very
differently; support for accept filters and other kevents; etc.

LRP gives a minimum of a factor of 3 improvement in connections
per second, without the SYN cache code involved at all, through
an overall reduction in processing latency. It also has the
effect of preventing "receiver livelock".

http://www.cs.rice.edu/CS/Systems/LRP/
http://www.cs.duke.edu/~anderson/freebsd/muse-sosp/readme.txt

----------------
TCP Rate Halving
----------------

I would like to see FreeBSD support TCP Rate Halving, an idea
from the Pittsburgh Supercomputing Center (PSC) at Carnegie
Mellon University (CMU).

These are the people who invented "traceroute".

TCP Rate halving is an alternative to the RFC-2581 Fast Recovery
algorithm for congestion control. It effectively causes the
congestion recovery to be self-clocked by ACKs, which has the
overall effect of avoiding the normal burstiness of TCP recovery
following congestion.

This builds on work by Van Jacobson, J. Hoe, and Sally Floyd.

Their current implementation is for NetBSD 1.3.2.

http://www.psc.edu/networking/rate_halving.html

---------------
SACK, FACK, ECN
---------------

Also from PSC at CMU.

SACK and FACK are well known. It's annoying that Luigi Rizzo's
code from 1997 or so was never integrated into FreeBSD.

ECN is an implementation of Explicit Congestion Notification.

http://www.psc.edu/networking/tcp.html


----
VRRP
----

There is an implementation of a real VRRP for FreeBSD available;
it is in ports.

This is a real VRRP (Virtual Router Redundancy Protocol), not
like the Linux version which uses the multicast mask and thus
loses multicast capability.

There are interesting issues in actual deployment of this code;
specifically, the VMAC that needs to be used in order to
logically separate virtual routers is not really implemented
well, so there are common ARP issues.

There are a couple of projects that one could take on here; by
far, the most interesting (IMO) would be to support multiple
virtual network cards on a single physical network card. Most
of the Gigabit Ethernet cards, and some of the 10/100Mbit cards,
can support multiple MAC addresses (the Intel Gigabit card can
support 16, the Tigon III supports 4, and the Tigon II supports
2).

The work required would be to support the ability to have a
single driver, single NIC, multiple virtual NICs.

There are also interesting issues, like being able to selectively
control ARP response from a VRRP interface which is not the
master interface. This has interesting implications for the
routing code, and for the initialization code, which normally
handles the gratuitous ARP. More information can be found in
the VRRP RFC, RFC-2338.

----------
TCP Timers
----------

I've discussed this before in depth. Basically, the timer code
is very poor for a large number of connections, and increasing
the size of the callout wheel is not a real/reasonable answer.

I would like to see the code go back to the BSD 4.2 model, which
is a well known model. There is plenty of prior art in this area,
but the main thing that needs to be taken from BSD 4.2 is
per-interval timer lists, so that the scan, for the most part,
visits only those timers that have expired (plus one). Basically,
a TAILQ per interval for fixed-interval timers.

A very obvious way to measure the performance improvement here is
to establish a very large number of connections. If you have 4G
of memory in an IA32 machine, you should have no problem getting
to 300,000 connections. If you really work at it, I have been
able to push this number to 1.6 Million simultaneous connections.


---------------
SMP Safe Queues
---------------

For simple queue types, it should be possible to make queueing
and dequeuing an intrinsically atomic operation.

This basically means that the queue locking being added to make
the networking code SMP safe is largely unnecessary; it is needed
only because the queue macros themselves are not implemented with
careful ordering of operations, and so must be locked around
instead.

In theory, this is also possible for a "counted queue". A
"counted queue" is a necessary construct for RED queueing,
which needs to maintain a moving average for comparison to
the actual queue depth, so that it can do RED (Random Early
Drop) of packets.

---
WFS
---

Weighted fair share queueing is a method of scheduling processes
so that each receives a share of processing time proportional to
its assigned weight.

This isn't technically a networking issue. However, if the
programs in user space which are intended to operate (or, if
you are a 5.x purist, the kernel threads in kernel space, if
you pull an Ingo Molnar and cram everything that shouldn't
be in the kernel, into the kernel) do not remove data from
the input processing queue fast enough, you will still suffer
from receiver livelock. Basically, you need to be able to
run the programs with a priority, relative to interrupt processing.

Some of the work that Jon Lemon and Luigi Rizzo have done in
this area is interesting, but it's not sufficient to resolve
the problem (sorry guys). Unfortunately, they don't tend to
run their systems under breaking point stress, so they don't
see the drop-off in performance that happens at high enough
load. To be able to test this, you would have to have a lab
with the ability to throw a large number of clients against
a large number of servers, with the packets transiting an
application in a FreeBSD box, at close to wire speeds. We
are talking at least 32 clients and servers, unless you have
access to purpose built code (it's easier to just throw the
machines at it, and be done with it).

---
---
---

Basically, that's my short list. There are actually a lot more
things that could be done in the networking area; there are things
to do in the routing area, and things to do with RED queueing, and
things to do with resource tuning, etc., and, of course, there's
the bugs that you normally see in the BSD stack only when you try
to do things like open more than 65535 outbound connections from a
single box, etc..

Personally, I'm tired of solving the same problems over and over
again, so I'd like to see the code in FreeBSD proper, so that it
becomes part of the intellectual commons.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message


