From: Sergey Babkin <babkin@verizon.net>
To: hackers@freebsd.org
Date: Sat, 10 Jul 2010 08:05:29 -0400
Subject: TCP over UDP

Hi guys,

I've got this idea, and I wonder if anyone has done it already, and if
not, then why. The idea is to put the TCP logic on top of UDP. I've done
some googling, and all I've found is some academic user-space
implementations of TCP that actually try to interoperate with "real"
TCP. What I'm thinking about is different: reuse the TCP-derived logic
as a portable library that provides the proven flow control,
retransmission, delivery confirmation and so on over UDP (a rough sketch
of what such an interface might look like follows below).

Basically, every time you use UDP, you have to reinvent your own
retransmission and reliability protocol. And these protocols are
typically no good at all, as the story of NFS switching from UDP to TCP
and improving performance shows. At the same time, TCP provides very
good transport control logic, so why not reuse that logic in a library
and solve the UDP issues once and for all?

Then, of course, why not just use TCP? The problem with TCP is that it's
expensive. It keeps its per-connection context in kernel memory, and it
requires a file descriptor per connection. File descriptors are an
expensive resource, and besides, even if the limit is raised, there is
the issue of the historic select() fd_set allocating only 1024 bits,
with nobody checking for overflow. Even if your own code is carefully
designed to avoid select() entirely and/or to allocate large enough
bitmasks, it can always end up pulling in some careless library that
doesn't, causing interesting one-bit memory corruptions (illustrated
below).

Moving the connection logic to user space makes connections cheap. A
hundred bytes or so of per-connection state is no big deal; you can
easily create a million such connections to the same process, with all
the state in pageable user-space memory. Well, all of them sending data
at the same time might not work so well, but caching a large number of
currently inactive connections becomes cheap. Think of XML-RPC or SOAP
or anything else over HTTP reusing the same TCP connection for multiple
sequential requests.
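To make the library idea concrete, here is a minimal sketch of what
such an interface could look like. All the names (the rudp_* prefix,
the struct fields) are hypothetical, made up just to show the shape of
it: the per-connection state is a small user-space struct, and many
connections multiplex over a single UDP descriptor.

    /* All names hypothetical; a sketch, not a design. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdint.h>

    struct rudp_conn {
        struct sockaddr_storage peer; /* remote address */
        uint32_t snd_next;  /* next sequence number to send */
        uint32_t snd_una;   /* oldest unacknowledged sequence */
        uint32_t rcv_next;  /* next sequence number expected */
        uint32_t snd_wnd;   /* peer's advertised window */
        uint32_t rto_msec;  /* current retransmission timeout */
        uint8_t  state;     /* position in the state machine */
        /* plus retransmit queue and timer links */
    };

    /* Many connections share one UDP file descriptor. */
    struct rudp_conn *rudp_connect(int udp_fd,
                                   const struct sockaddr *peer,
                                   socklen_t peerlen);
    ssize_t rudp_send(struct rudp_conn *c, const void *buf, size_t len);
    ssize_t rudp_recv(struct rudp_conn *c, void *buf, size_t len);
    void    rudp_close(struct rudp_conn *c);

The point is that nothing here needs a kernel socket per connection;
the whole struct is on the order of the hundred bytes mentioned above.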
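And for the record, the select() hazard mentioned above is easy to
demonstrate; this deliberately broken program compiles and runs without
any complaint:

    #include <sys/select.h>
    #include <stdio.h>

    int main(void)
    {
        fd_set set;     /* holds exactly FD_SETSIZE (1024 by default) bits */
        int fd = 1100;  /* a descriptor number past that limit */

        FD_ZERO(&set);
        FD_SET(fd, &set);  /* no bounds check: silently sets one bit
                            * past the end of 'set', corrupting whatever
                            * happens to live next to it on the stack */
        printf("no error was reported\n");
        return 0;
    }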
Now there is the painful balance of inactivity timeouts: make them too
long and you overload the server; make them too short and the
connections get dropped all the time. Cheap connections would allow
much longer timeouts.

Then there are other interesting possibilities arising from the easy
access to the protocol state. The underlying datagramness can be
exposed to the top level, and this immediately gives transactional TCP.
Or we could look at the state and find out whether the data has
actually been delivered to and confirmed by the other side. Or we could
drop inactive connections at the server without notifying the client;
then if the client sends more requests on such a connection, the server
could semi-transparently re-establish it (OK, this would require an
extension to TCP). Or we could do better keep-alives: not TCP's
hour-long ones, but something within a few seconds (this would not work
too well with millions of connections, but that's a different use case,
where we want to detect a lost peer fast). Or we could have
"sub-channels", each with its own sequence number: if the data gets
transferred over 100 parallel logical connections, a few bytes at a
time for each of them, combining the whole bunch into one datagram
would be much more efficient than sending 100 datagrams (a possible
wire layout is sketched below).

These are just the ideas off the bat; there have got to be more
interesting uses. It all looks like such an obviously good idea that I
wonder why nobody has tried it before. Or have they tried it and found
that it's not such a good idea after all?

-SB
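To put some numbers on the sub-channel point, suppose each datagram
carried a small bundle header followed by one chunk per sub-channel.
This layout is made up purely for illustration, and a real protocol
would serialize the fields explicitly rather than send raw structs:

    #include <stdint.h>

    /* One per datagram. */
    struct bundle_hdr {
        uint32_t conn_id;   /* the underlying connection */
        uint16_t nchunks;   /* chunk records that follow */
    };

    /* One per sub-channel carried in this datagram. */
    struct chunk_hdr {
        uint16_t channel;   /* sub-channel id */
        uint16_t length;    /* payload bytes after this header */
        uint32_t seq;       /* per-channel sequence number */
        /* 'length' bytes of payload follow */
    };

With 100 sub-channels sending, say, 4 bytes each, that is
100 * (8 + 4) + 6 = 1206 bytes, comfortably one datagram; sent
separately it would be 100 datagrams, each paying its own 28 bytes of
IP/UDP headers plus the per-packet processing cost.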