From owner-freebsd-current@FreeBSD.ORG  Sat Mar  3 23:20:04 2007
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: freebsd-current@freebsd.org
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 33B4E16A404
	for <freebsd-current@freebsd.org>; Sat,  3 Mar 2007 23:20:04 +0000 (UTC)
	(envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 9461C13C4B5
	for <freebsd-current@freebsd.org>; Sat,  3 Mar 2007 23:20:03 +0000 (UTC)
	(envelope-from andre@freebsd.org)
Received: (qmail 2824 invoked from network); 3 Mar 2007 22:52:21 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
	(envelope-sender <andre@freebsd.org>)
	by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
	for <rwatson@FreeBSD.org>; 3 Mar 2007 22:52:21 -0000
Message-ID: <45EA02AA.8030800@freebsd.org>
Date: Sun, 04 Mar 2007 00:20:10 +0100
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Thunderbird 1.5.0.9 (Windows/20061207)
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <45E8276D.60105@freebsd.org>
	<20070303222356.S60688@fledge.watson.org>
In-Reply-To: <20070303222356.S60688@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-net@freebsd.org, gallatin@freebsd.org, freebsd-current@freebsd.org,
	kmacy@freebsd.org
Subject: Re: New optimized soreceive_stream() for TCP sockets,
	proof of concept
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2007 23:20:04 -0000

Robert Watson wrote:
> 
> On Fri, 2 Mar 2007, Andre Oppermann wrote:
> 
>> Instead of the unlock-lock dance soreceive_stream() pulls a properly 
>> sized (relative to the receive system call buffer space) from the 
>> socket buffer drops the lock and gives copyout as much time as it 
>> needs.  In the mean time the lower half can happily add as many new 
>> packets as it wants without having to wait for a lock.  It also allows 
>> the upper and lower halfs to run on different CPUs without much 
>> interference.  There is a unsolved nasty race condition in the patch 
>> though. When the socket closes and we still have data around or the 
>> copyout failed it tries to put the data back into the socket buffer 
>> which is gone already by then leading to a panic.  Work is underway to 
>> find a realiable fix for this.  I wanted to get this out to the 
>> community nonetheless to give it some more exposure.
> 
> I'll try to take a look at this in the next few days.
> 
> However, I find the description above of soreceive() a bit odd -- I'm 
> pretty sure it doesn't do some of the things you're describing.  For 
> example, soreceive() does release the locks acquired by the network 
> input processing path while copying to user space: there should be no 
> contention during the copyout(), only while processing the socket buffer 
> between copyout() calls. This is possible because the socket receive 
> sleep lock (not the mutex) holds sb_mb constant if it is non-NULL, 
> making copyout() of sb_mb->m_data safe while not holding the socket 
> buffer mutex in the current implementation.

The copyout is done without holding the lock.  However for every mbuf
in the socket buffer it unlocks, does the copyout and then locks it again
for the next.  I was referring to that unlock-lock pair for every mbuf.

> In my experience, soreceive() is an incredibly complicated function, and 
> could stand significant simplification.  However, it has to be done very 
> carefully for exactly this reason :-).  There are some existing bugs in 
> soreceive(), one involving incorrect handling of interlaced I/O due to a 
> label being in the wrong place, that we should resolve.

It's damn complex.  That's one of the reasons I started the soreceive_stream()
function and related stuff.  To try to understand it and to document all the
evil edge cases right.  I'm pretty sure I've not accounted for some yet.

> BTW, the point of not pulling the data out of the socket buffer until 
> copyout() is complete is not error handling reversion so much as not 
> changing the advertised window size until the copy is done, since the 
> data isn't delivered to user space.  Copyout() can take a very long time 
> to run, due to page faults, for example, and the socket buffer 
> represents a maximum bound on in-flight traffic as specified by the 
> application.  Whether this is a property we want to keep is another 
> question, but I believe that's the rationale.

Haven't thought of that rationale yet.  So far it appeared to me that it
was done for sanity reasons and there wasn't really a need in the spl()
days to do it otherwise.  I'll think some more about it and whether it
is good, bad or doesn't matter.  Mind you this patch is just a pretty
advanced proof of concept thing.  Certainly not something to kick into
the tree by tomorrow or the day after.

-- 
Andre