Date:      Fri, 20 Jul 2007 18:04:34 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Eygene Ryabinkin <rea-fbsd@codelabs.ru>
Cc:        Julian Elischer <julian@ironport.com>, FreeBSD Net <freebsd-net@freebsd.org>, Julian Elischer <julian@elischer.org>
Subject:   Re: Wierd networking.
Message-ID:  <20070720175713.S39675@fledge.watson.org>
In-Reply-To: <20070719084812.GS4053@void.codelabs.ru>
References:  <469D4C9D.7090302@ironport.com> <469D4FB6.9040609@elischer.org> <3DBBD4E3-ABEA-451A-8E6A-02E9CBAD6A37@mac.com> <20070718055228.GA4053@void.codelabs.ru> <469E660F.8000109@ironport.com> <20070719084812.GS4053@void.codelabs.ru>

On Thu, 19 Jul 2007, Eygene Ryabinkin wrote:

> Another way to deal with the problem is not to send the FINs after the one 
> provoked by the closed descriptor.  As I understand, the SS_NOFDREF check is 
> an optimization to avoid processing unneeded data in the TCP stack.  So we 
> may just silently blackhole the successive packets, at least some of them.

While it may also have that effect, SS_NOFDREF is actually part of the socket 
state cycle, and is used in part to determine when it is appropriate to free a 
socket.  As you observe, the key here is that there are actually three 
separate and somewhat independent state cycles going on here: the file 
descriptor state cycle, the socket state cycle, and the TCP state cycle. 
This is further complicated by the fact that we actually have a three-part 
state model for TCP, allowing reduced state to be maintained during the 
three-way handshake on the server, and during the TIMEWAIT state.  The trick 
is to properly manage the API/protocol interactions and the data structures.

In FreeBSD 6.x and earlier, we have a moderately large number of bugs relating 
to mishandling of freed TCP state.  In FreeBSD 7, in order to reduce 
complexity and locking requirements, we moved to a model in which it is an 
invariant of the socket<->pcb relationship that a valid PCB is present for all 
"live" sockets.  As such, the so->so_pcb pointer is always valid, and any 
valid socket will always have valid TCP state.  However, the inverse is only 
sometimes true: we may free socket state when in the final stages of the TCP 
connection in order to avoid keeping around the memory overhead of the socket 
and socket buffers during, for example, TIMEWAIT.  If you look at sofree() in 
7.x, you'll see the logic we use to determine whether it's time to free the 
socket itself or not:

         if ((so->so_state & SS_NOFDREF) == 0 || so->so_count != 0 ||
             (so->so_state & SS_PROTOREF) || (so->so_qstate & SQ_COMP)) {
                 SOCK_UNLOCK(so);
                 ACCEPT_UNLOCK();
                 return;
         }

Notice that we have both an explicit reference count and several flags that 
are effectively also references.  SS_NOFDREF is set when a file descriptor, if 
there has ever been one for the socket, has its reference removed. 
SS_PROTOREF means that the protocol has asserted a reference on the socket -- 
for example, if the socket is closed but there is still pending data to be 
sent out, so the socket buffers are required.  SQ_COMP is set if the socket is 
in a listen queue.  Over the last two years, I've been gradually attempting to 
move to explicit reference models, strong and well-documented invariants about 
the stability of the pointers that span layers (e.g., inp_ppcb, inp_socket, 
so_pcb, etc.), as well as gradually simplifying the model.  It wouldn't 
surprise me if issues remain.
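
To make the "flags as references" point concrete, here is a minimal user-space 
sketch of the same test, with each condition split out.  It is not the kernel 
code: struct model_socket and model_can_free() are hypothetical names invented 
for illustration, the bit values are placeholders rather than the ones in the 
kernel headers, and all locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values for this sketch; not copied from kernel headers. */
#define SS_NOFDREF  0x01    /* the file descriptor reference is gone */
#define SS_PROTOREF 0x02    /* the protocol holds a reference */
#define SQ_COMP     0x01    /* the socket is in a listen queue */

/* Hypothetical, stripped-down stand-in for struct socket. */
struct model_socket {
        int so_state;       /* SS_* flags */
        int so_qstate;      /* SQ_* flags */
        int so_count;       /* explicit reference count */
};

/*
 * Returns true only when none of the four kinds of reference remains.
 * This is the negation of the early-return condition quoted above.
 */
static bool
model_can_free(const struct model_socket *so)
{
        assert(so != NULL);

        if ((so->so_state & SS_NOFDREF) == 0)
                return (false);     /* a file descriptor still refers to it */
        if (so->so_count != 0)
                return (false);     /* explicit references outstanding */
        if (so->so_state & SS_PROTOREF)
                return (false);     /* protocol still needs the socket */
        if (so->so_qstate & SQ_COMP)
                return (false);     /* still in a listen queue */
        return (true);
}
```

In this model, a socket with SS_NOFDREF set, a zero count, and neither 
SS_PROTOREF nor SQ_COMP set is the only combination for which 
model_can_free() returns true; setting or clearing any one of the four 
conditions keeps the socket alive.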

Robert N M Watson
Computer Laboratory
University of Cambridge


