From: Sergey Babkin <babkin@verizon.net>
To: hackers@freebsd.org
Date: Sat, 10 Jul 2010 08:05:29 -0400
Subject: TCP over UDP

Hi guys,

I've got this idea, and I wonder if anyone has done it already, and if
not, then why. The idea is to put the TCP logic on top of UDP. I've done
some googling, and all I've found is some academic user-space
implementations of TCP that actually try to interoperate with "real"
TCP. What I'm thinking about is different: reuse the TCP-derived logic
as a portable library that provides the proven flow control,
retransmission, delivery confirmation and so on over UDP (a rough sketch
of what such an interface might look like follows below).

Basically, every time you use UDP, you have to reinvent your own
retransmission and reliability protocol. And these protocols are
typically no good at all, as the story of NFS switching from UDP to TCP
and improving performance shows. At the same time, TCP provides very
good transport control logic, so why not reuse that logic in a library
and solve the UDP issues once and for all?

Then, of course, why not just use TCP? The problem with TCP is that it's
expensive. It keeps its per-connection context in kernel memory, and it
requires a file descriptor per connection. File descriptors are an
expensive resource, and besides, even if the limit is raised, there is
the issue of the historic select() fd_set allocating only 1024 bits,
with nobody checking for overflow. Even if your own code is carefully
designed to avoid select() entirely and/or to allocate large enough
bitmasks, it can always end up pulling in some careless library that
doesn't, causing interesting one-bit memory corruptions (illustrated
below).

Moving the connection logic to user space makes connections cheap. A
hundred bytes or so of per-connection state is no big deal; you can
easily create a million such connections to the same process, with all
the state in pageable user-space memory. Well, all of them sending data
at the same time might not work so well, but caching a large number of
currently inactive connections becomes cheap. Think of XML-RPC or SOAP
or anything else over HTTP reusing the same TCP connection for multiple
sequential requests.
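To make the library idea concrete, here is a minimal sketch of what
such an interface could look like. All the names (the rudp_* prefix,
the struct fields) are hypothetical, made up just to show the shape of
it: the per-connection state is a small user-space struct, and many
connections multiplex over a single UDP descriptor.

    /* All names hypothetical; a sketch, not a design. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdint.h>

    struct rudp_conn {
        struct sockaddr_storage peer; /* remote address */
        uint32_t snd_next;  /* next sequence number to send */
        uint32_t snd_una;   /* oldest unacknowledged sequence */
        uint32_t rcv_next;  /* next sequence number expected */
        uint32_t snd_wnd;   /* peer's advertised window */
        uint32_t rto_msec;  /* current retransmission timeout */
        uint8_t  state;     /* position in the state machine */
        /* plus retransmit queue and timer links */
    };

    /* Many connections share one UDP file descriptor. */
    struct rudp_conn *rudp_connect(int udp_fd,
                                   const struct sockaddr *peer,
                                   socklen_t peerlen);
    ssize_t rudp_send(struct rudp_conn *c, const void *buf, size_t len);
    ssize_t rudp_recv(struct rudp_conn *c, void *buf, size_t len);
    void    rudp_close(struct rudp_conn *c);

The point is that nothing here needs a kernel socket per connection;
the whole struct is on the order of the hundred bytes mentioned above.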
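And for the record, the select() hazard mentioned above is easy to
demonstrate; this deliberately broken program compiles and runs without
any complaint:

    #include <sys/select.h>
    #include <stdio.h>

    int main(void)
    {
        fd_set set;     /* holds exactly FD_SETSIZE (1024 by default) bits */
        int fd = 1100;  /* a descriptor number past that limit */

        FD_ZERO(&set);
        FD_SET(fd, &set);  /* no bounds check: silently sets one bit
                            * past the end of 'set', corrupting whatever
                            * happens to live next to it on the stack */
        printf("no error was reported\n");
        return 0;
    }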
Now there is the painful balance of inactivity timeouts: make them too
long and you overload the server; make them too short and the
connections get dropped all the time. Cheap connections would allow
much longer timeouts.

Then there are other interesting possibilities arising from the easy
access to the protocol state. The underlying datagramness can be
exposed to the top level, and this immediately gives transactional TCP.
Or we could look at the state and find out whether the data has
actually been delivered to and confirmed by the other side. Or we could
drop inactive connections at the server without notifying the client;
then if the client sends more requests on such a connection, the server
could semi-transparently re-establish it (OK, this would require an
extension to TCP). Or we could do better keep-alives: not TCP's
hour-long ones, but something within a few seconds (this would not work
too well with millions of connections, but that's a different use case,
where we want to detect a lost peer fast). Or we could have
"sub-channels", each with its own sequence number: if the data gets
transferred over 100 parallel logical connections, a few bytes at a
time for each of them, combining the whole bunch into one datagram
would be much more efficient than sending 100 datagrams (a possible
wire layout is sketched below).

These are just the ideas off the bat; there have got to be more
interesting uses. It all looks like such an obviously good idea that I
wonder why nobody has tried it before. Or have they tried it and found
that it's not such a good idea after all?

-SB
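To put some numbers on the sub-channel point, suppose each datagram
carried a small bundle header followed by one chunk per sub-channel.
This layout is made up purely for illustration, and a real protocol
would serialize the fields explicitly rather than send raw structs:

    #include <stdint.h>

    /* One per datagram. */
    struct bundle_hdr {
        uint32_t conn_id;   /* the underlying connection */
        uint16_t nchunks;   /* chunk records that follow */
    };

    /* One per sub-channel carried in this datagram. */
    struct chunk_hdr {
        uint16_t channel;   /* sub-channel id */
        uint16_t length;    /* payload bytes after this header */
        uint32_t seq;       /* per-channel sequence number */
        /* 'length' bytes of payload follow */
    };

With 100 sub-channels sending, say, 4 bytes each, that is
100 * (8 + 4) + 6 = 1206 bytes, comfortably one datagram; sent
separately it would be 100 datagrams, each paying its own 28 bytes of
IP/UDP headers plus the per-packet processing cost.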