Date:         Wed, 12 Mar 1997 22:35:45 -0600
From:         Chris Csanady <ccsanady@nyx.pr.mcs.net>
To:           "David S. Miller" <davem@jenolan.rutgers.edu>
Cc:           hackers@FreeBSD.ORG
Subject:      Re: Solaris TPC-C benchmarks (with Oracle)
Message-ID:   <199703130435.WAA11627@nyx.pr.mcs.net>
In-Reply-To:  Your message of Wed, 12 Mar 1997 21:21:34 -0500. <199703130221.VAA21577@jenolan.caipgeneral>
> Date: Wed, 12 Mar 1997 19:42:03 -0600
> From: Chris Csanady <ccsanady@nyx.pr.mcs.net>
>
> For starters, I'd like to get rid of the usage of mbuf chains. This is
> mostly a simple, if time-consuming, task (I think). It will save a bunch
> of copying around the net code, as well as simplifying things. The only
> part I'm not really sure about is how to do memory management with the
> new "pbufs." I looked at the Linux code, and they call their generic
> kmalloc() to allocate a buffer the size of the packet. This would be
> easier, but I don't like it. :) In Van Jacobson's slides from his talk,
> he mentions that routines call the output driver to get packet buffers
> (pbufs), not a generic allocator.

>The drivers do the buffer management in Jacobson's pbuf kernel. So you
>go:
>
>tcp_send_fin(struct netdevice *dev, int len)
>{
>	struct pbuf *p;
>
>	p = dev->alloc(len);
>	[ ... ]
>}
>
>Later on you'd go:
>
>	p->dev_free(p);

Ok, this is pretty much as I thought.  But is it worth it to do more
complicated memory management, or just eat the wasted space of fixed-size
buffers?  I mean, it won't waste any more space than mbuf clusters do for
ethernet.  If you're using ATM or HIPPI, you can afford the extra memory. :)

>One huge tip: do _not_ just implement Jacobson's pbuf code blindly.
>Anyone who even glances at those slides immediately goes "Geez, he's
>ignoring all issues of flow control."  I find this rather ironic for
>someone who is effectively the godfather of TCP flow control.

I was curious about this--in his slides, he mentions that sockbufs go
away. :\  Can you elaborate more on what's going on?

>Secondly, his fast paths for input bank on the fact that you can get
>right into user context when you detect a header prediction hit.  The
>only way to do this effectively on a system you'd ever want anyone to
>actually run is the following:

I think that the header prediction code is called from a user context,
so you would already be there.

>1) Device drivers loan pbufs (i.e. possibly pieces of device memory or
>   driver-private fixed DMA buffering areas) to the networking code
>   on receive.

pbufs would essentially be the same as mbufs to the drivers, I would
think--except less complicated.  Right now, I think that the drivers just
DMA into an mbuf cluster.  I don't see why they can't loan them out for a
while.

>2) Once the protocol layer detects that this pbuf can go right into
>   user space, it jacks up the receiving application process's
>   priority such that it becomes a real-time thread.  This is because
>   you must guarantee extremely low latencies to the driver whose
>   resources you are holding onto.  If the pbuf cannot be processed
>   now, it is copied into a new buffer and the original pbuf is given
>   back to the device via dev->free(p) before splnet is left.

See above.

>3) If we got a hit and this can go right into userspace, then when
>   splnet gets left, the kernel sees that whoever is currently on the
>   cpu should get off so that any real-time networking processes can
>   eat the pbufs.
>
>4) tcp_receive() runs in the application's context, csum_copy()'s the
>   pbuf right into user space (or perhaps does a flip, which makes the
>   driver-->net pbuf method interface slightly more intricate), and
>   then calls p->free(p); the application's priority is lowered back
>   down to what it was before the new pbuf came in.
>
>This is all nontrivial to pull off.  One nice effect is that you
>actually then have a chance of doing real networking page flipping
>with the device buffer method scheme.
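Just to make sure I follow steps 2-4, here is a rough sketch of how that
fast path might look.  Everything in it is made up for illustration--the
struct layout, header_prediction_hit(), boost_to_realtime(),
restore_priority(), csum_copy_to_user(), and enqueue_slow_path() are
hypothetical placeholders, not code from Jacobson's kernel or from any
existing stack:

struct netdevice;

struct pbuf {
	struct netdevice *dev;		/* driver that owns the buffer */
	void (*free)(struct pbuf *p);	/* give the buffer back to the driver */
	unsigned char *data;		/* start of TCP payload */
	unsigned int len;		/* payload length */
};

/* Hypothetical stand-ins for the scheduler and user-copy primitives. */
extern int  header_prediction_hit(const struct pbuf *p);
extern void boost_to_realtime(void);		/* step 2: raise app priority */
extern void restore_priority(void);		/* step 4: drop it back down  */
extern int  csum_copy_to_user(void *uaddr, const unsigned char *src,
			      unsigned int len);
extern void enqueue_slow_path(struct pbuf *p);	/* ordinary protocol path */

/*
 * Runs in the receiving application's context.  The driver still owns
 * the buffer, so finish quickly and return it via p->free().
 */
int tcp_receive_fast(struct pbuf *p, void *user_buf)
{
	int err;

	if (!header_prediction_hit(p)) {
		/* Not a fast-path segment: fall back to normal processing. */
		enqueue_slow_path(p);
		return 0;
	}

	boost_to_realtime();			/* steps 2/3 */
	err = csum_copy_to_user(user_buf, p->data, p->len);
	p->free(p);				/* pbuf goes straight back to the driver */
	restore_priority();

	return err;
}

The point is just the ordering: check the prediction, hold the priority
boost only while the driver's buffer is held, checksum-and-copy in the
application's context, and hand the pbuf right back.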
Does Van Jacobson's kernel do page flipping?  I thought he just did a
checksum and copy to a user buffer.  I remember John saying something
about that being more expensive than a copy, although it was in a
different context (with regard to the pipe code, I think).  I don't know.
If it would work, it would sure be nice, but my simple pbuf allocator
would definitely not work.

> In the new implementation, soreceive and sosend go away. :)

>See my comments about flow control above; some of the code must stay.

Yes, it turns into a bunch of protocol-specific routines.

> The new architecture also seems as if it would scale nicely with SMP.
> This is also one of the reasons I'm interested in doing it.

>No one has quantified that pbufs can be made to scale on SMP; it may
>(and I think it will) have the same scalability problems that SLAB
>allocators can have.  At a minimum you'd have to grab a per-device
>lock to keep track of the device pbuf pool properly.  Since any of the
>networking code can call upon the code which needs to acquire this
>lock, you're probably going to need to make it a sleeping lock to get
>decent performance.  Guess what?  Then you need to implement what
>Solaris does, which is allow interrupt handlers to sleep, in order for
>it to work at all.

I should have said viability rather than scalability. :)  The fact is
that protocol and interrupt processing with the new model is more
oriented toward doing things a packet at a time.  Slapping mbuf chains
on queues and all that layered processing would be hell to do, and not
very efficient.

>I'd suggest fixing the TCP timers first; they are a much larger
>scalability problem than the buffering in BSD.  (IRIX scales to 2,000
>connections per second--that's real connections, not some bogus Zeus
>benchmark exploiting HTTP connection reuse features, etc.--and they're
>still using mbufs.)  Then go to the TIME_WAIT problem (much harder to
>solve than the timers, but less painful to fix than redoing the
>buffering), then fix select(), then think about pbufs.

I'd like to finish volume 2 before I even think about the timers or such.

--Chris Csanady

>---------------------------------------------////
>Yow! 11.26 MB/s remote host TCP bandwidth &  ////
>199 usec remote TCP latency over 100Mb/s    ////
>ethernet.  Beat that!                       ////
>-----------------------------------------////__________ o
>David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><
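For what it's worth, below is a tiny userland sketch of the "just eat the
wasted space of fixed-size buffers" option discussed above: every device
keeps a pool of MTU-sized pbufs behind a per-device lock, which is exactly
the lock David is worried about on SMP.  The structures and names are
invented for illustration, and a pthread mutex stands in for whatever
spinlock or sleeping lock a kernel would actually use; this is not code
from any existing stack.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct netdevice;

struct pbuf {
	struct netdevice *dev;		/* device the buffer belongs to */
	struct pbuf *next;		/* free-list link while pooled   */
	void (*free)(struct pbuf *p);
	char data[];			/* fixed-size payload area       */
};

struct netdevice {
	size_t bufsize;			/* one size fits all packets (MTU-ish) */
	struct pbuf *freelist;		/* per-device pool of idle buffers     */
	pthread_mutex_t lock;		/* the per-device lock in question     */
};

/* p->free(): push the buffer back onto its device's pool. */
static void pbuf_free(struct pbuf *p)
{
	struct netdevice *dev = p->dev;

	pthread_mutex_lock(&dev->lock);
	p->next = dev->freelist;
	dev->freelist = p;
	pthread_mutex_unlock(&dev->lock);
}

/* dev->alloc(): hand out a pooled buffer, growing the pool if it is empty. */
static struct pbuf *pbuf_alloc(struct netdevice *dev, size_t len)
{
	struct pbuf *p = NULL;

	if (len > dev->bufsize)
		return NULL;		/* bigger than one fixed-size buffer */

	pthread_mutex_lock(&dev->lock);
	if (dev->freelist != NULL) {
		p = dev->freelist;
		dev->freelist = p->next;
	}
	pthread_mutex_unlock(&dev->lock);

	if (p == NULL) {		/* pool empty: add one more buffer */
		p = malloc(sizeof(*p) + dev->bufsize);
		if (p == NULL)
			return NULL;
		p->dev = dev;
		p->free = pbuf_free;
	}
	return p;
}

int main(void)
{
	struct netdevice eth = { .bufsize = 1536, .freelist = NULL };
	struct pbuf *p;

	pthread_mutex_init(&eth.lock, NULL);

	p = pbuf_alloc(&eth, 512);	/* wastes ~1K, as an mbuf cluster would */
	if (p != NULL) {
		printf("got a %zu-byte buffer for a 512-byte packet\n",
		       eth.bufsize);
		p->free(p);		/* back to the device's pool */
	}
	return 0;
}

Every pbuf_alloc()/pbuf_free() pair takes that one per-device lock, which
is where the contention David mentions would show up once several CPUs
are feeding the same device.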