Date: Wed, 12 Mar 1997 21:21:34 -0500
From: "David S. Miller" <davem@jenolan.rutgers.edu>
To: ccsanady@nyx.pr.mcs.net
Cc: hackers@FreeBSD.ORG
Subject: Re: Solaris TPC-C benchmarks (with Oracle)
Message-ID: <199703130221.VAA21577@jenolan.caipgeneral>
In-Reply-To: <199703130142.TAA11137@nyx.pr.mcs.net> (message from Chris Csanady on Wed, 12 Mar 1997 19:42:03 -0600)
   Date: Wed, 12 Mar 1997 19:42:03 -0600
   From: Chris Csanady <ccsanady@nyx.pr.mcs.net>

   For starters, I'd like to get rid of the usage of mbuf chains.  This
   is mostly a simple, if time consuming, task.  (I think)  It will save
   a bunch of copying around the net code, as well as simplifying things.

   The only part I'm not really sure about is how to do memory
   management with the new "pbuf's."  I looked at the Linux code, and
   they call their generic kmalloc() to allocate a buffer the size of
   the packet.  This would be easier, but I don't like it.  :)

In Van Jacobson's slides from his talk, he mentions that routines call
the output driver to get packet buffers (pbufs), not a generic
allocator.  The drivers do the buffer management in Jacobson's pbuf
kernel.

So you go:

	tcp_send_fin(struct netdevice *dev, int len)
	{
		struct pbuf *p;

		p = dev->alloc(len);
		[ ... ]
	}

Later on you'd go:

	p->dev_free(p);

One huge tip: do _not_ just implement Jacobson's pbuf code blindly.
Anyone who even glances at those slides immediately goes "Geez, he's
ignoring all issues of flow control."  I find this rather ironic for
someone who is effectively the godfather of TCP flow control.

Secondly, his fast paths for input bank on the fact that you can get
right into user context when you detect a header prediction hit.  The
only way to do this effectively, on a system you'd ever want anyone to
actually run, is the following:

1) Device drivers loan pbufs (i.e. possibly pieces of device memory or
   driver-private fixed DMA buffering areas) to the networking code on
   receive.

2) Once the protocol layer detects that this pbuf can go right into
   user space, it jacks up the receiving application process's priority
   such that it becomes a real-time thread.  This is because you must
   guarantee extremely low latencies to the driver whose resources you
   are holding onto.  If the pbuf cannot be processed now, it is copied
   into a new buffer and the original pbuf is given back to the device,
   via dev->free(p), before splnet is left.

3) If we got a hit and this can go right into user space, then when
   splnet is left the kernel sees that whoever is currently on the cpu
   should get off, such that any real-time networking processes can eat
   the pbufs.

4) tcp_receive() runs in the application's context, csum_copy()'s the
   pbuf right into user space (or perhaps does a flip, which makes the
   driver-->net pbuf method interface slightly more intricate), then
   calls p->free(p), and the application's priority is lowered back
   down to what it was before the new pbuf came in.

This is all nontrivial to pull off.  One nice effect is that you then
actually have a chance of doing real networking page flipping with the
device buffer method scheme.

   In the new implementation, soreceive() and sosend() go away.  :)

See my comments about flow control above; some of the code must stay.

   The new architecture also seems as if it would scale nicely with
   SMP.  This is also one of the reasons I'm interested in doing it.

No one has quantified that pbufs can be made to scale on SMP; it may
(and I think it will) have the same scalability problems that SLAB
allocators can have.  At a minimum you'd have to grab a per-device lock
to keep track of the device pbuf pool properly.  Since any of the
networking code can call upon the code which needs to acquire this
lock, you're probably going to need to make it a sleeping lock to get
decent performance.  Guess what?  Then you need to implement what
Solaris does, which is allow interrupt handlers to sleep, in order for
it to work at all.
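As a rough illustration of the driver-owned buffer interface being
described (dev->alloc() handing out driver-private buffers, and the
pbuf carrying a callback to give itself back to its device), here is a
small user-space sketch.  Every name in it (struct pbuf, struct
netdevice, pbuf_alloc, and so on) is invented for the example; it is
not code from Jacobson's slides or from any existing kernel, and a
pthread mutex merely stands in for the per-device lock discussed above.

/*
 * Standalone user-space sketch of a driver-owned pbuf pool.
 * All names and field layouts here are invented for illustration.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define POOL_PBUFS  32          /* fixed buffers the "driver" owns   */
#define PBUF_SIZE   2048        /* big enough for one ethernet frame */

struct netdevice;

struct pbuf {
	struct netdevice *dev;              /* owning device            */
	void (*free)(struct pbuf *);        /* give buffer back to dev  */
	size_t len;                         /* bytes of packet data     */
	unsigned char data[PBUF_SIZE];      /* the packet itself        */
	struct pbuf *next;                  /* free-list linkage        */
};

struct netdevice {
	const char *name;
	pthread_mutex_t lock;               /* per-device pool lock     */
	struct pbuf *free_list;             /* driver-private pool      */
	struct pbuf pool[POOL_PBUFS];
};

static void pbuf_free(struct pbuf *p)
{
	struct netdevice *dev = p->dev;

	pthread_mutex_lock(&dev->lock);
	p->next = dev->free_list;           /* back onto the pool       */
	dev->free_list = p;
	pthread_mutex_unlock(&dev->lock);
}

/* The dev->alloc() of the text: hand out one driver-owned buffer. */
static struct pbuf *pbuf_alloc(struct netdevice *dev, size_t len)
{
	struct pbuf *p;

	if (len > PBUF_SIZE)
		return NULL;

	pthread_mutex_lock(&dev->lock);
	p = dev->free_list;
	if (p)
		dev->free_list = p->next;
	pthread_mutex_unlock(&dev->lock);

	if (p) {
		p->dev = dev;
		p->free = pbuf_free;
		p->len = len;
	}
	return p;
}

static void netdevice_init(struct netdevice *dev, const char *name)
{
	int i;

	dev->name = name;
	dev->free_list = NULL;
	pthread_mutex_init(&dev->lock, NULL);
	for (i = 0; i < POOL_PBUFS; i++) {
		dev->pool[i].next = dev->free_list;
		dev->free_list = &dev->pool[i];
	}
}

int main(void)
{
	struct netdevice dev;
	struct pbuf *p;

	netdevice_init(&dev, "de0");
	p = pbuf_alloc(&dev, 60);            /* driver hands out buffer */
	if (p) {
		memset(p->data, 0, p->len);  /* build the packet here   */
		printf("%s: got a %zu byte pbuf\n", dev.name, p->len);
		p->free(p);                  /* give it back to the dev */
	}
	return 0;
}

The point of the per-device free list is that the buffers never leave
the driver's ownership; the protocol code only borrows them, which is
exactly why the latency concerns in points 1-4 above come up.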
I'd suggest fixing the TCP timers first; they are a much larger
scalability problem than the buffering in BSD.  (IRIX scales to 2,000
connections per second, and that's real connections, not some bogus
Zeus benchmark exploiting HTTP connection reuse features etc., and
they're still using mbufs.)  Then go to the TIME_WAIT problem (much
harder to solve than the timers, but less painful to fix than redoing
the buffering), then fix select(), then think about pbufs.
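For a concrete picture of what a more scalable timer scheme looks like,
here is a tiny user-space sketch of a hashed timing wheel (in the style
of Varghese and Lauck), the kind of O(1) arrangement that avoids
touching every connection's timers on every tick.  It is only an
illustration of the general technique, not the specific fix being
suggested for BSD's TCP timers, and all of the names in it are made up.

/*
 * Minimal sketch of a hashed timing wheel: O(1) insertion, and each
 * tick only looks at the one slot the current time hashes to.
 */
#include <stdio.h>

#define WHEEL_SLOTS 256                  /* must be a power of two    */

struct conn_timer {
	unsigned long expires;           /* absolute tick of expiry   */
	void (*fn)(struct conn_timer *); /* e.g. a retransmit handler */
	struct conn_timer *next;
};

static struct conn_timer *wheel[WHEEL_SLOTS];
static unsigned long now_tick;           /* advanced by the clock     */

/* O(1): hang the timer off the slot its expiry tick hashes to. */
static void timer_add(struct conn_timer *t, unsigned long expires)
{
	unsigned slot = expires & (WHEEL_SLOTS - 1);

	t->expires = expires;
	t->next = wheel[slot];
	wheel[slot] = t;
}

/* Fire everything in the current slot that is actually due. */
static void timer_tick(void)
{
	unsigned slot = now_tick & (WHEEL_SLOTS - 1);
	struct conn_timer **pp = &wheel[slot];

	while (*pp) {
		struct conn_timer *t = *pp;

		if (t->expires <= now_tick) {
			*pp = t->next;   /* unlink, then fire         */
			t->fn(t);
		} else {
			pp = &t->next;   /* not due for another lap   */
		}
	}
	now_tick++;
}

static void rexmit(struct conn_timer *t)
{
	(void)t;
	printf("timer expired at tick %lu\n", now_tick);
}

int main(void)
{
	struct conn_timer t = { 0, rexmit, NULL };
	unsigned long i;

	timer_add(&t, 5);                /* fire five ticks from now  */
	for (i = 0; i < 10; i++)
		timer_tick();
	return 0;
}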
---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                     ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><