FreeBSD Mail Archives

Date:      Wed, 12 Mar 1997 22:35:45 -0600
From:      Chris Csanady <ccsanady@nyx.pr.mcs.net>
To:        "David S. Miller" <davem@jenolan.rutgers.edu>
Cc:        hackers@FreeBSD.ORG
Subject:   Re: Solaris TPC-C benchmarks (with Oracle) 
Message-ID:  <199703130435.WAA11627@nyx.pr.mcs.net>
In-Reply-To: Your message of Wed, 12 Mar 1997 21:21:34 -0500. <199703130221.VAA21577@jenolan.caipgeneral>


>   Date: Wed, 12 Mar 1997 19:42:03 -0600
>   From: Chris Csanady <ccsanady@nyx.pr.mcs.net>
>
>   For starters, I'd like to get rid of the usage of mbuf chains.  This is mostly
>   a simple, if time consuming task.  (i think)  It will save a bunch of copying 
>   around the net code, as well as simplifying things.  The only part I'm not
>   really sure about is how to do memory management with the new "pbuf's."  I
>   looked at the linux code, and they call their generic kmalloc() to allocate a
>   buffer the size of the packet.  This would be easier, but I dont like it. :)
>   In Van Jacobsons slides from his talk, he mentions that routines call the
>   output driver to get packet buffers (pbufs), not a generic allocator.
>
>The drivers do the buffer managment in Jacobsons pbuf kernel.  So you
>go:
>
>tcp_send_fin(struct netdevice *dev, int len)
>{
>	struct pbuf *p;
>
>	p = dev->alloc(len);
>	[ ... ]
>}
>
>Later on you'd go:
>
>	p->dev_free(p);

Ok, this is pretty much as I thought.  But is it worth it to do more
complicated memory management, or just eat the wasted space of fixed
size buffers?  I mean, it won't waste any more space than mbuf clusters
do for ethernet.  If your using ATM, or HIPPI, you can afford the extra
memory. :)

>
>One huge tip, do _not_ just implement Jacobson's pbuf code blindly.
>Anyone who even glances at those slides immediately goes "Geese he's
>ignoring all issues of flow control"  I find this rather ironic for
>someone who is effectively the godfather of TCP flow control.

I was curious about this--in his slides, he mentions that sockbuf's
go away.  :\  Can you elaborate more on whats going on?

>Secondly, his fast paths for input bank on the fact that you can get
>right into user context when you detect a header prediction hit.  The
>only way to do this effectively on a system you'd ever want anyone to
>actually run is the following:

I think that the header prediction code is called from a user context,
so you would already be there.  

>
>1) Device drivers loan pbufs (ie. possibly pieces of device memory of
>   driver private fixed dma buffering areas) to the networking code
>   on receive.

pbufs would essentially be the same as mbufs to the drivers i would
think--except less complicated.  Right now, I think that the drivers
just dma into an mbuf cluster.  I don't see why it can't loan them out
for a while.

>
>2) Once the protocol layer detects that this pbuf can go right into
>   user space, it jacks up the receiving application processes
>   priority such that it becomes a real time thread.  This is because
>   you must to guarentee extremely low latencies to the driver whose
>   resources you are holding onto.  If the pbuf cannot be processed
>   now the pbuf is copied into a new buffer and finally the orig pbuf
>   is given back to the device before splnet is left via dev->free(p)

See above.

>3) If we got a hit and this can go right into userspace, then when
>   splnet gets left the kernel sees that whoever is currently on the
>   cpu should get off such that any real time networking processes can
>   eat the pbufs.
>
>4) tcp_receive() runs in the applications context, csum_copy()'s the
>   pbuf right into user space (or perhaps does a flip, this makes the
>   driver-->net pbuf method interface slightly more intricate), and
>   then calls p->free(p), the applications priority is lowered back
>   down to what it was before the new pbuf came in.
>
>This is all nontrivial to pull of.  One nice effect is that you
>actually then have a chance of doing real networking page flipping
>with the device buffer method scheme.

Does Van Jacobsons kernel to page flipping?  I thought he just did a
checksum and copy to a user buffer.  I remember John saying something
about it being more expensive to do this than a copy, although it was
in a different context. (with regard to the pipe code i think)..  I
dont know.  If it would work, it would sure be nice, but my simple
pbuf allocator would definately not work..

>
>   In the new implementation, sorecieve, and sosend go away. :)
>
>See my comments about flow control above, some of the code must stay.

Yes, it turns into a bunch of protocol specific routines.

>
>   The new architecture also seems as if it would scale nicely with SMP.  This
>   is also one of the reasons im interested in doing it.
>
>No one has quantified that pbufs can be made to scale on SMP, it may
>(and I think it will) have the same scalability problems that SLAB
>allocators can have.  At a minimum you'd have to grab a per device
>lock to keep track of the device pbuf pool properly, since any of the
>networking code can call upon the code which needs to acquire this
>lock you're probably going to need to make it a sleeping lock to get
>decent performance.  Guess what?  Then you need to implement what
>Solaris does which is allow interrupt handlers to sleep, in order for
>it to work at all.

I should have said viability rather than scalability. :)  The fact is
that protocol and interrupt processing with the new model is more
oriented toward doing things a packet at a time.  Slapping mbuf chains
on queues, and all that layered processing would be hell to do, and
not very efficient.

>
>I'd suggest fixing the TCP timers first, they are a much larger
>scalability problem than the buffering in BSD.  (IRIX scales to 2,000
>connections per second, thats real connections, not some bogus Zeus
>benchmark exploiting http connection reuse features etc., and they're
>still using mbufs)  Then go to the time wait problem (much harder to
>solve than the timers, but less painful to fix than redoing the
>buffering), then fix select(), then think about pbufs.

I'd like to finish volume 2 before I even think about the timers or
such..

--Chris Csanady
>
>---------------------------------------------////
>Yow! 11.26 MB/s remote host TCP bandwidth & ////
>199 usec remote TCP latency over 100Mb/s   ////
>ethernet.  Beat that!                     ////
>-----------------------------------------////__________  o
>David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199703130435.WAA11627>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation