From owner-freebsd-current@FreeBSD.ORG  Wed Sep 15 15:48:08 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E3E06106564A
	for <freebsd-current@freebsd.org>; Wed, 15 Sep 2010 15:48:07 +0000 (UTC)
	(envelope-from oppermann@networx.ch)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 555BE8FC14
	for <freebsd-current@freebsd.org>; Wed, 15 Sep 2010 15:48:06 +0000 (UTC)
Received: (qmail 72458 invoked from network); 15 Sep 2010 15:42:53 -0000
Received: from localhost (HELO [127.0.0.1]) ([127.0.0.1])
	(envelope-sender <oppermann@networx.ch>)
	by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
	for <bzeeb-lists@lists.zabbadoz.net>; 15 Sep 2010 15:42:53 -0000
Message-ID: <4C90EAB7.2000902@networx.ch>
Date: Wed, 15 Sep 2010 17:48:07 +0200
From: Andre Oppermann <oppermann@networx.ch>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
	rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
References: <4C8E0C1E.2020707@networx.ch>
	<20100915151632.E31898@maildrop.int.zabbadoz.net>
In-Reply-To: <20100915151632.E31898@maildrop.int.zabbadoz.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Mailman-Approved-At: Wed, 15 Sep 2010 17:08:05 +0000
Cc: freebsd-net@freebsd.org, freebsd-current@freebsd.org
Subject: Re: TCP loopback socket fusing
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 15 Sep 2010 15:48:08 -0000

On 15.09.2010 17:19, Bjoern A. Zeeb wrote:
> On Mon, 13 Sep 2010, Andre Oppermann wrote:
>
> Hey,
>
>> When a TCP connection via loopback back to localhost is made the whole
>> send, segmentation and receive path (with larger packets though) is still
>> executed. This has some considerable overhead.
>>
>> To short-circuit the send and receive sockets on localhost TCP connections
>> I've made a proof-of-concept patch that directly places the data in the
>> other side's socket buffer without doing any packetization and other protocol
>> overhead (like UNIX domain sockets). The connection setup (SYN, SYN-ACK,
>> ACK) and shutdown are still handled by normal TCP segments via loopback so
>> that firewalling still works. The actual payload data during the session
>> won't be seen and the sequence numbers don't move other than for SYN and FIN.
>> The sequence numbers remain valid though. Obviously tcpdump won't see any data
>> transfers either if the connection has fused sockets.
>>
>> Preliminary testing (with WITNESS and INVARIANTS enabled) has shown stable
>> operation and a rough doubling of the throughput on loopback connections.
>> I've tested most socket teardown cases and it behaves fine. I'm not entirely
>> sure I've got all possible paths but the way it is integrated should properly
>> defuse the sockets in all situations.
>
> Three comments in reverse order:
>
> 1 If S/S+A/A and shutdown aren't shortcut, can you always rely on proper
> payload order, especially in the shutdown case?

Yes.  The payload is always directly placed in the receive socket buffer
of the other socket, never in the send buffer.  There is never any unsent
data left in the send buffer that could become reordered.
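
To illustrate the ordering argument, here is a minimal user-space sketch.
It is not the patch and not kernel code: every name in it (toy_socket,
toy_fuse, toy_fused_send, toy_recv, TOY_SB_MAX) is hypothetical, and it
assumes a fixed-size flat buffer with no locking.  It only models the
property described above: a fused send appends to the peer's receive
buffer in call order and never touches the sender's own send buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only; these are NOT the FreeBSD kernel structures. */
#define TOY_SB_MAX 4096

struct toy_sockbuf {
	char	data[TOY_SB_MAX];
	size_t	len;			/* bytes currently queued */
};

struct toy_socket {
	struct toy_sockbuf snd;		/* stays empty while fused */
	struct toy_sockbuf rcv;
	struct toy_socket *peer;	/* fused peer, or NULL */
};

/* Reset both sockets and point them at each other. */
static void
toy_fuse(struct toy_socket *a, struct toy_socket *b)
{
	memset(a, 0, sizeof(*a));
	memset(b, 0, sizeof(*b));
	a->peer = b;
	b->peer = a;
}

/*
 * Fused send: append directly to the peer's receive buffer.
 * Returns the number of bytes accepted, which is short when the
 * peer's buffer is full, and 0 when the socket is not fused.
 */
static size_t
toy_fused_send(struct toy_socket *so, const void *buf, size_t n)
{
	struct toy_sockbuf *rcv;
	size_t space;

	if (so->peer == NULL)
		return (0);
	rcv = &so->peer->rcv;
	space = TOY_SB_MAX - rcv->len;
	if (n > space)
		n = space;
	memcpy(rcv->data + rcv->len, buf, n);
	rcv->len += n;
	return (n);
}

/* Receive: drain up to n bytes from the front of our receive buffer. */
static size_t
toy_recv(struct toy_socket *so, void *buf, size_t n)
{
	struct toy_sockbuf *rcv = &so->rcv;

	if (n > rcv->len)
		n = rcv->len;
	memcpy(buf, rcv->data, n);
	memmove(rcv->data, rcv->data + n, rcv->len - n);
	rcv->len -= n;
	return (n);
}
```

Two back-to-back toy_fused_send() calls on one socket followed by a single
toy_recv() on the peer return the bytes in the order they were sent, and
the sender's snd.len stays 0 throughout, so there is nothing left behind
that a later FIN could overtake.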

> 2 Given my experience with epairs, which are basically a loop with two
> interfaces and even interface queues, any significant delay you are
> seeing is _not_ due to longer code paths through the stack but
> simply because of the netisr.

I haven't measured delay, only bandwidth, and that was with WITNESS and
INVARIANTS enabled.  You are probably right that the netisr is taking its
toll.  In particular, the TCP_INFO lock may see some contention in the
loopback case on SMP.  Even so, fusing the sockets together avoids a lot
of mbuf allocations, packet manipulations and instructions (less
instruction cache pressure).

> 3 If properly doing this for TCP, we should probably also do it for
> other protocols.

UNIX domain sockets already do this.  This implementation is specific
to TCP and only touches the protocol-specific parts.  It's not done at
the socket layer.  For UDP it's not that easy to do, as most UDP traffic
consists of one-off packets and no permanent binding between two sockets
exists.  For SCTP I don't know.  From glancing over the code, it seems they
have, at least partially, their own socket buffer code.  I can't say how
difficult a fused socket would be there.

-- 
Andre