Date:      Sun, 04 Mar 2007 00:20:10 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        freebsd-net@freebsd.org, gallatin@freebsd.org, freebsd-current@freebsd.org, kmacy@freebsd.org
Subject:   Re: New optimized soreceive_stream() for TCP sockets, proof of concept
Message-ID:  <45EA02AA.8030800@freebsd.org>
In-Reply-To: <20070303222356.S60688@fledge.watson.org>
References:  <45E8276D.60105@freebsd.org> <20070303222356.S60688@fledge.watson.org>

Robert Watson wrote:
> 
> On Fri, 2 Mar 2007, Andre Oppermann wrote:
> 
>> Instead of the unlock-lock dance, soreceive_stream() pulls a properly 
>> sized mbuf chain (relative to the receive system call's buffer space) 
>> from the socket buffer, drops the lock, and gives copyout() as much 
>> time as it needs.  In the meantime the lower half can happily add as 
>> many new packets as it wants without having to wait for the lock.  It 
>> also allows the upper and lower halves to run on different CPUs without 
>> much interference.  There is an unsolved nasty race condition in the 
>> patch though: when the socket closes and we still have data around, or 
>> the copyout failed, it tries to put the data back into a socket buffer 
>> that is already gone by then, leading to a panic.  Work is underway to 
>> find a reliable fix for this.  I wanted to get this out to the 
>> community nonetheless to give it some more exposure.
> 
> I'll try to take a look at this in the next few days.
> 
> However, I find the description above of soreceive() a bit odd -- I'm 
> pretty sure it doesn't do some of the things you're describing.  For 
> example, soreceive() does release the locks acquired by the network 
> input processing path while copying to user space: there should be no 
> contention during the copyout(), only while processing the socket buffer 
> between copyout() calls. This is possible because the socket receive 
> sleep lock (not the mutex) holds sb_mb constant if it is non-NULL, 
> making copyout() of sb_mb->m_data safe while not holding the socket 
> buffer mutex in the current implementation.

The copyout is done without holding the lock.  However, for every mbuf
in the socket buffer it unlocks, does the copyout, and then locks it again
for the next one.  I was referring to that unlock-lock pair for every mbuf.

> In my experience, soreceive() is an incredibly complicated function, and 
> could stand significant simplification.  However, it has to be done very 
> carefully for exactly this reason :-).  There are some existing bugs in 
> soreceive(), one involving incorrect handling of interlaced I/O due to a 
> label being in the wrong place, that we should resolve.

It's damn complex.  That's one of the reasons I started the soreceive_stream()
function and related stuff: to try to understand it and to document all the
evil edge cases properly.  I'm pretty sure there are some I haven't accounted
for yet.

> BTW, the point of not pulling the data out of the socket buffer until 
> copyout() is complete is not error handling reversion so much as not 
> changing the advertised window size until the copy is done, since the 
> data isn't delivered to user space.  Copyout() can take a very long time 
> to run, due to page faults, for example, and the socket buffer 
> represents a maximum bound on in-flight traffic as specified by the 
> application.  Whether this is a property we want to keep is another 
> question, but I believe that's the rationale.

I hadn't thought of that rationale yet.  So far it appeared to me that it
was done for sanity reasons, and in the spl() days there wasn't really a
need to do it otherwise.  I'll think some more about it and whether it
is good, bad, or doesn't matter.  Mind you, this patch is just a pretty
advanced proof of concept.  It's certainly not something to kick into
the tree by tomorrow or the day after.

-- 
Andre



