Date: Mon, 12 Jul 2010 13:33:48 +0200
From: Pieter de Goeje <pdegoeje@service2media.com>
To: freebsd-hackers@freebsd.org
Cc: Sergey Babkin <babkin@verizon.net>
Subject: Re: TCP over UDP
Message-ID: <201007121333.49017.pdegoeje@service2media.com>
In-Reply-To: <4C386208.291D2FB5@verizon.net>
References: <4C386208.291D2FB5@verizon.net>
On Saturday 10 July 2010 14:05:29 Sergey Babkin wrote:
> Hi guys,
>
> I've got this idea, and I wonder if anyone has done it already,
> and if not, then why. The idea is to put the TCP logic over UDP.
>
> I've done some googling and all I've found is some academic
> user-space implementations of TCP that actually try to interoperate
> with "real" TCP. What I'm thinking about is different. It's to use
> the TCP-derived logic as a portable library that would do the good
> flow control, retransmitting, delivery confirmations etc. over UDP.
>
> Basically, every time you use UDP, you've got to reinvent your own
> retransmission and reliability protocol. And these protocols are
> typically no good at all, as the story of NFS switching from UDP to
> TCP and improving performance shows. At the same time TCP provides
> very good transport control logic, so why not just reuse this logic
> in a library to solve the UDP issues once and for all?
>
> Then of course, why not just use TCP? The problem with TCP is that
> it's expensive. It uses kernel memory for its contexts. It also
> requires a file descriptor per connection. File descriptors are an
> expensive resource, and besides, even if the limit is raised, there
> is the issue of the historic select() fd_set allocating only 1024
> bits and nobody checking for overflow. Even if your own code is
> carefully designed to avoid select() altogether and/or to create
> large enough bitmasks, it can always end up using some careless
> library that doesn't, causing interesting one-bit memory corruptions.
>
> Moving the connection logic to user space makes the connections
> cheap. A hundred bytes or so of per-connection state is no big deal;
> you can easily create a million such connections in the same
> process. All the state stays in user-space pageable memory. Well,
> all of them sending data at the same time might not work so well,
> but caching a large number of currently inactive connections becomes
> cheap. Think of XMLRPC or SOAP or anything else over HTTP reusing
> the same TCP connection for multiple sequential requests. Now there
> is a painful balance of inactivity timeouts: make them too long and
> you overload the server; make them too short and the connections get
> dropped all the time. Cheap connections would allow keeping much
> longer timeouts.
>
> Then there are other interesting possibilities arising from easy
> access to the protocol state. The underlying datagram nature can be
> exposed to the top level, and this immediately gives us transactional
> TCP. Or we could look at the state and find out whether the data has
> actually been delivered to and confirmed by the other side. Or we
> could even drop inactive connections at the server without notifying
> the client; then, if the client sends more requests on such a
> connection, the server could semi-transparently re-establish it
> (OK, this would require an extension to TCP). Or we could do better
> keep-alives, not TCP's hour-long ones, but something within a few
> seconds (that would not work too well with millions of connections,
> but that's a different use case, where we want to detect a lost peer
> fast). Or we could have "sub-channels", each with its own sequence
> number. If the data gets transferred over 100 parallel logical
> connections, a few bytes at a time for each of them, combining the
> whole bunch into one datagram would be much more efficient than
> sending 100 datagrams.
>
> These are just the ideas off the bat; there's got to be more of
> these interesting usages.
>
> It all looks like such an obviously good idea that I wonder: why
> didn't anyone try it before? Or have they tried it and found that
> it's not such a good idea after all?
>
> -SB
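[To make the quoted "sub-channels" idea concrete, here is a minimal
sketch of what such a multiplexed datagram might look like on the
wire. The struct name, field widths, and helper function are
illustrative assumptions, not any existing protocol; a real
implementation would also serialize the fields explicitly in network
byte order rather than copying the struct.]

#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/*
 * Sketch only: one UDP datagram carries fragments for several
 * logical sub-channels, each with its own sequence number.
 */
struct subchan_frag {
	uint32_t seq;		/* per-channel sequence number */
	uint16_t channel;	/* logical connection id */
	uint16_t len;		/* payload bytes that follow */
};

/*
 * Append one fragment to the datagram being built; returns the new
 * write offset, or -1 if the fragment would not fit.
 */
static ssize_t
subchan_append(uint8_t *dgram, size_t cap, size_t off,
    uint16_t channel, uint32_t seq, const void *payload, uint16_t len)
{
	struct subchan_frag h = { seq, channel, len };

	if (off + sizeof h + len > cap)
		return -1;
	memcpy(dgram + off, &h, sizeof h);
	memcpy(dgram + off + sizeof h, payload, len);
	return off + sizeof h + len;
}

[A hundred small writes then become a single sendto() of one
datagram, costing a few header bytes per fragment instead of a full
UDP/IP header per packet.]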
TCP actually scales pretty well. All modern operating systems provide
a way to do efficient select()-style operations, for example FreeBSD's
kqueue. With a bit of tuning one can effectively deal with 100k+ TCP
connections on a single system. This mainly comes down to increasing
the maximum number of file descriptors and decreasing the maximum
send/receive buffer sizes to conserve memory.

TCP provides very good throughput, and it achieves this by using large
send and receive buffers. Your userspace implementation will need to
implement something similar; a few hundred bytes per connection is
simply not enough. If you want to deal with millions of clients, your
protocol had better not have any state at all. A good example of this
is DNS.

I think most applications can either use TCP directly, with or without
tuning, or they have such specialized needs that a custom protocol is
the only solution.

Regards,
Pieter de Goeje
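[For reference, the kqueue mechanism mentioned above looks roughly
like the sketch below: a minimal read-event loop, assuming a listening
socket lfd has already been created, bound, and marked listening, with
all error handling trimmed for brevity. Unlike select(), there is no
fixed FD_SETSIZE bitmask to overflow, which is exactly the hazard the
original post describes.]

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal kqueue read-event loop (sketch; no error handling). */
void
event_loop(int lfd)
{
	struct kevent ev, out[64];
	int kq = kqueue();

	EV_SET(&ev, lfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
	kevent(kq, &ev, 1, NULL, 0, NULL);	/* register listener */

	for (;;) {
		int n = kevent(kq, NULL, 0, out, 64, NULL);	/* block */

		for (int i = 0; i < n; i++) {
			int fd = (int)out[i].ident;

			if (fd == lfd) {
				/* New connection: watch it too. */
				int cfd = accept(lfd, NULL, NULL);
				EV_SET(&ev, cfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
				kevent(kq, &ev, 1, NULL, 0, NULL);
			} else {
				char buf[4096];
				ssize_t r = read(fd, buf, sizeof buf);

				if (r <= 0)
					close(fd);	/* close() also removes fd from the kqueue */
				/* else: process r bytes from buf */
			}
		}
	}
}

[The kernel returns only ready descriptors, so the per-wakeup cost
does not grow with the total number of watched connections, which is
what makes the 100k+ figure realistic.]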