Date: Wed, 12 Mar 1997 22:35:45 -0600
From: Chris Csanady <ccsanady@nyx.pr.mcs.net>
To: "David S. Miller" <davem@jenolan.rutgers.edu>
Cc: hackers@FreeBSD.ORG
Subject: Re: Solaris TPC-C benchmarks (with Oracle)
Message-ID: <199703130435.WAA11627@nyx.pr.mcs.net>
In-Reply-To: Your message of Wed, 12 Mar 1997 21:21:34 -0500. <199703130221.VAA21577@jenolan.caipgeneral>
> Date: Wed, 12 Mar 1997 19:42:03 -0600
> From: Chris Csanady <ccsanady@nyx.pr.mcs.net>
>
> For starters, I'd like to get rid of the usage of mbuf chains. This is mostly
> a simple, if time-consuming, task. (I think.) It will save a bunch of copying
> around the net code, as well as simplifying things. The only part I'm not
> really sure about is how to do memory management with the new "pbufs." I
> looked at the Linux code, and they call their generic kmalloc() to allocate a
> buffer the size of the packet. This would be easier, but I don't like it. :)
> In Van Jacobson's slides from his talk, he mentions that routines call the
> output driver to get packet buffers (pbufs), not a generic allocator.
>
>The drivers do the buffer management in Jacobson's pbuf kernel. So you
>go:
>
>tcp_send_fin(struct netdevice *dev, int len)
>{
> struct pbuf *p;
>
> p = dev->alloc(len);
> [ ... ]
>}
>
>Later on you'd go:
>
> p->dev_free(p);
OK, this is pretty much as I thought. But is it worth it to do more
complicated memory management, or should we just eat the wasted space of
fixed-size buffers? I mean, it won't waste any more space than mbuf clusters
do for Ethernet. If you're using ATM or HIPPI, you can afford the extra
memory. :)
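
To make that concrete, here is roughly the shape I have in mind--just a
throwaway user-space sketch of a per-device pool of fixed-size pbufs,
following the dev->alloc()/p->dev_free() interface from your example.
The struct layouts, names, and sizes are all guesses on my part, not
anything that exists today:

    #include <stddef.h>
    #include <stdlib.h>

    #define PBUF_SIZE  2048             /* fixed size, like an mbuf cluster */
    #define POOL_PBUFS 256              /* buffers reserved per device */

    struct netdevice;

    struct pbuf {
        struct pbuf *next;                  /* free-list linkage */
        struct netdevice *dev;              /* owning device */
        size_t len;                         /* bytes of packet data */
        void (*dev_free)(struct pbuf *);    /* give the buffer back */
        char data[PBUF_SIZE];
    };

    struct netdevice {
        struct pbuf *free_list;             /* per-device pool; no global allocator */
        struct pbuf *(*alloc)(struct netdevice *, size_t);
    };

    static void pbuf_free(struct pbuf *p)
    {
        /* Return the buffer straight to its owning device's pool. */
        p->next = p->dev->free_list;
        p->dev->free_list = p;
    }

    static struct pbuf *pbuf_alloc(struct netdevice *dev, size_t len)
    {
        struct pbuf *p;

        if (len > PBUF_SIZE || dev->free_list == NULL)
            return NULL;                    /* caller drops the packet */
        p = dev->free_list;
        dev->free_list = p->next;
        p->len = len;
        return p;
    }

    /* Fill the per-device pool once, at attach time. */
    static int netdevice_init(struct netdevice *dev)
    {
        int i;

        dev->free_list = NULL;
        dev->alloc = pbuf_alloc;
        for (i = 0; i < POOL_PBUFS; i++) {
            struct pbuf *p = malloc(sizeof(*p));

            if (p == NULL)
                return -1;
            p->dev = dev;
            p->dev_free = pbuf_free;
            p->next = dev->free_list;
            dev->free_list = p;
        }
        return 0;
    }

The only cost is the unused tail of each fixed buffer, which is no worse
than what mbuf clusters waste today.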
>
>One huge tip, do _not_ just implement Jacobson's pbuf code blindly.
>Anyone who even glances at those slides immediately goes "Geez, he's
>ignoring all issues of flow control." I find this rather ironic for
>someone who is effectively the godfather of TCP flow control.
I was curious about this--in his slides, he mentions that sockbufs
go away. :\ Can you elaborate more on what's going on?
>Secondly, his fast paths for input bank on the fact that you can get
>right into user context when you detect a header prediction hit. The
>only way to do this effectively on a system you'd ever want anyone to
>actually run is the following:
I think that the header prediction code is called from a user context,
so you would already be there.
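
(For reference, the "hit" being talked about is essentially the header
prediction test from 4.4BSD's tcp_input(). A stripped-down version, with
simplified stand-in fields rather than the real tcpcb/tcpiphdr layout,
looks about like this:)

    #include <stdint.h>

    #define TH_FIN 0x01
    #define TH_SYN 0x02
    #define TH_RST 0x04
    #define TH_ACK 0x10
    #define TH_URG 0x20

    struct tcb {                    /* only the fields the test looks at */
        uint32_t rcv_nxt;           /* next sequence number we expect */
        uint32_t snd_wnd;           /* send window we last recorded */
        uint32_t snd_nxt, snd_max;  /* equal => nothing being retransmitted */
        int established;
    };

    /*
     * Nonzero means the segment is the next in-sequence data segment or a
     * pure ACK with no surprises, so the fast path may be taken.
     */
    static int header_predicted(const struct tcb *tp, uint8_t flags,
                                uint32_t seq, uint32_t win)
    {
        return tp->established &&
            (flags & (TH_SYN | TH_FIN | TH_RST | TH_URG | TH_ACK)) == TH_ACK &&
            seq == tp->rcv_nxt &&
            win != 0 && win == tp->snd_wnd &&
            tp->snd_nxt == tp->snd_max;
    }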
>
>1) Device drivers loan pbufs (i.e. possibly pieces of device memory or
> driver-private fixed DMA buffering areas) to the networking code
> on receive.
pbufs would essentially be the same as mbufs to the drivers, I would
think--except less complicated. Right now, I think that the drivers
just DMA into an mbuf cluster. I don't see why they can't loan them out
for a while.
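
Building on the pool sketch above, the loan at receive interrupt time
might look something like the following. The rx_desc layout,
example_rxintr(), and pbuf_input() are all made up for illustration:

    struct rx_desc {
        struct pbuf *p;             /* buffer the hardware DMAs into */
    };

    /* Hands a filled buffer up the stack; made-up entry point. */
    extern void pbuf_input(struct netdevice *dev, struct pbuf *p);

    static void example_rxintr(struct netdevice *dev, struct rx_desc *ring,
                               int slot, size_t pktlen)
    {
        struct pbuf *done = ring[slot].p;
        struct pbuf *fresh = dev->alloc(dev, PBUF_SIZE);

        if (fresh == NULL)
            return;                 /* pool exhausted: keep old buffer, drop packet */

        ring[slot].p = fresh;       /* hardware keeps receiving into the pool */
        done->len = pktlen;
        pbuf_input(dev, done);      /* stack owns "done" until p->dev_free(p) */
    }

The filled buffer goes up the stack as-is and the ring slot is refilled
from the device's own pool, so the hardware never waits on the protocol
code.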
>
>2) Once the protocol layer detects that this pbuf can go right into
> user space, it jacks up the receiving application process's
> priority such that it becomes a real-time thread. This is because
> you must guarantee extremely low latencies to the driver whose
> resources you are holding onto. If the pbuf cannot be processed
> now, the pbuf is copied into a new buffer and finally the original
> pbuf is given back to the device before splnet is left, via dev->free(p).
See above.
>3) If we got a hit and this can go right into userspace, then when
> splnet is left the kernel sees that whoever is currently on the
> CPU should get off so that any real-time networking processes can
> eat the pbufs.
>
>4) tcp_receive() runs in the application's context, csum_copy()'s the
> pbuf right into user space (or perhaps does a flip; this makes the
> driver-->net pbuf method interface slightly more intricate), and
> then calls p->free(p); the application's priority is lowered back
> down to what it was before the new pbuf came in.
>
>This is all nontrivial to pull off. One nice effect is that you
>actually then have a chance of doing real networking page flipping
>with the device buffer method scheme.
Does Van Jacobson's kernel do page flipping? I thought he just did a
checksum and copy to a user buffer. I remember John saying something
about it being more expensive to do this than a copy, although it was
in a different context (with regard to the pipe code, I think)... I
don't know. If it would work, it would sure be nice, but my simple
pbuf allocator would definitely not work...
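
Just so we're talking about the same thing, by checksum-and-copy I mean
a single pass that computes the Internet checksum while moving the bytes
to the user's buffer, instead of touching the data twice. A dumb
byte-at-a-time user-space sketch (the real thing would be word-at-a-time,
probably in assembly):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Copy len bytes from src to dst, returning the folded 16-bit ones'
     * complement sum of src (the caller complements it and adds in the
     * pseudo-header as usual).
     */
    static uint16_t csum_copy(void *dst, const void *src, size_t len)
    {
        const uint8_t *s = src;
        uint8_t *d = dst;
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {
            sum += (uint16_t)((s[i] << 8) | s[i + 1]);
            d[i] = s[i];
            d[i + 1] = s[i + 1];
        }
        if (i < len) {                      /* odd trailing byte */
            sum += (uint16_t)(s[i] << 8);
            d[i] = s[i];
        }
        while (sum >> 16)                   /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }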
>
> In the new implementation, soreceive and sosend go away. :)
>
>See my comments about flow control above; some of the code must stay.
Yes, it turns into a bunch of protocol-specific routines.
>
> The new architecture also seems as if it would scale nicely with SMP. This
> is also one of the reasons I'm interested in doing it.
>
>No one has quantified that pbufs can be made to scale on SMP; it may
>(and I think it will) have the same scalability problems that SLAB
>allocators can have. At a minimum you'd have to grab a per-device
>lock to keep track of the device pbuf pool properly, and since any of
>the networking code can call upon the code which needs to acquire this
>lock, you're probably going to need to make it a sleeping lock to get
>decent performance. Guess what? Then you need to implement what
>Solaris does, which is to allow interrupt handlers to sleep, in order
>for it to work at all.
I should have said viability rather than scalability. :) The fact is
that protocol and interrupt processing with the new model is more
oriented toward doing things a packet at a time. Slapping mbuf chains
on queues, and all of that layered processing, would be hell to do and
not very efficient.
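
On your locking point, though: in the user-space pool sketch from
earlier, the per-device lock would amount to something like the
following, with a pthread mutex standing in for whatever kernel lock
(sleeping or otherwise) would actually be used:

    #include <pthread.h>

    struct locked_netdevice {
        struct netdevice dev;               /* pool sketch from earlier */
        pthread_mutex_t pool_lock;          /* one lock per device, not global */
    };

    static struct pbuf *locked_alloc(struct locked_netdevice *ldev, size_t len)
    {
        struct pbuf *p;

        pthread_mutex_lock(&ldev->pool_lock);
        p = pbuf_alloc(&ldev->dev, len);
        pthread_mutex_unlock(&ldev->pool_lock);
        return p;
    }

    static void locked_free(struct locked_netdevice *ldev, struct pbuf *p)
    {
        pthread_mutex_lock(&ldev->pool_lock);
        pbuf_free(p);
        pthread_mutex_unlock(&ldev->pool_lock);
    }

Whether contention on that per-device lock matters is exactly the part
nobody has quantified yet.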
>
>I'd suggest fixing the TCP timers first; they are a much larger
>scalability problem than the buffering in BSD. (IRIX scales to 2,000
>connections per second, and that's real connections, not some bogus Zeus
>benchmark exploiting HTTP connection reuse features etc., and they're
>still using mbufs.) Then go to the TIME_WAIT problem (much harder to
>solve than the timers, but less painful to fix than redoing the
>buffering), then fix select(), then think about pbufs.
I'd like to finish volume 2 before I even think about the timers or
such...
--Chris Csanady
>
>---------------------------------------------////
>Yow! 11.26 MB/s remote host TCP bandwidth & ////
>199 usec remote TCP latency over 100Mb/s ////
>ethernet. Beat that! ////
>-----------------------------------------////__________ o
>David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><
