From: Andre Oppermann <andre@freebsd.org>
Date: Tue, 23 Jun 2009 10:29:41 +0200
To: Robert Watson
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r194672 - in head/sys: kern netinet sys
Message-ID: <4A409275.30705@freebsd.org>
References: <200906222308.n5MN856I055711@svn.freebsd.org>

Robert Watson wrote:
>
> On Mon, 22 Jun 2009, Andre Oppermann wrote:
>
>> Add soreceive_stream(), an optimized version of soreceive() for
>> stream (TCP) sockets.
>
> While this sounds like very interesting work, I was struck by the lack of:
>
> Reviewed by:    ?
> Tested by:      ?
>
> Have you tested this change with some of our TCP consumer edge cases,
> especially in-kernel consumers:
>
> - Accept filters
> - NFS client
> - NFS server
> - smbfs
> - iscsi initiator

No, yes, yes, no, no.  Plus a large number of SSH copies (it will trip
over every corrupted byte) and fetching the source code of half the
programs in the ports collection and comparing the checksums.

> I also assume that the plan is that this will not be enabled by default
> in 8.0?  With soreceive_dgram, it took three months to shake out the
> nits, and the datagram case is far simpler than the stream case.  Given
> that experience, I'd guess it will take longer for the new stream code
> to shake out, and probably involve painfully tracking subtle bugs in
> blocking code, data corruption in NFS, etc.  I've identified one such
> possible bug for iscsi.

No, it is not supposed to be enabled by default in 8.0.

> To provide easier access for early adopters, we shipped with
> soreceive_dgram available for UDP, but optionally enabled by a loader
> tunable, for at least one minor rev.  I would recommend doing something
> similar here.

A good hint, thank you.

> A few inline comments below, including one serious bug (assuming my
> reading is right).  This doesn't constitute a full review because it's
> too early in the morning to reason fully about the socket buffer code.

OK.  Thanks.  I will look into the specifics later today.
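
Regarding the loader tunable: something along these lines is what I
have in mind, as a completely untested sketch (the knob name is only a
placeholder).  It would switch pru_soreceive at TCP initialization time
instead of relying on the #if 0 blocks in tcp_usrreq.c:

static int tcp_soreceive_stream = 0;
TUNABLE_INT("net.inet.tcp.soreceive_stream", &tcp_soreceive_stream);

/* To be called from tcp_init(), before the first TCP socket exists. */
static void
tcp_soreceive_stream_init(void)
{

        if (tcp_soreceive_stream) {
                tcp_usrreqs.pru_soreceive = soreceive_stream;
#ifdef INET6
                tcp6_usrreqs.pru_soreceive = soreceive_stream;
#endif
        }
}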
--
Andre

> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
>> It is functionally identical to generic soreceive() but has a
>> number of stream specific optimizations:
>> o does only one sockbuf unlock/lock per receive independent of
>>   the length of data to be moved into the uio compared to
>>   soreceive() which unlocks/locks per *mbuf*.
>
> Hmm.  I count four locks and unlocks per receive, assuming no blocking
> and an I/O to user memory:
>
> - sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing.
> - start/stop socket buffer mutex lock/unlock at beginning/end of function
> - unlock/relock around m_mbuftouio()
> - unlock/relock around pru_rcvd
>
> One idea I've futzed with a little is whether or not we could drop
> sblock() in certain cases for receive and transmit.  As far as I'm
> aware, it is used for at least four purposes in receive:
>
> - Prevent I/O interlacing from concurrent system calls
> - Provide consistency for blocking logic during concurrent system calls
> - Stabilize non-NULL sb_mb while the socket buffer mutex is dropped
> - Minimize contention on the socket buffer mutex during concurrent
>   system calls
>
> Some of these are more important than others -- in particular, the
> function of preventing I/O interlacing for stream system calls seems
> something we might drop, as it's not a guarantee provided by other OS's,
> so no portable application should depend on it.  Have you looked at all
> at this?
>
>> o uses m_mbuftouio() instead of its own copy(out) variant.
>> o much more compact code flow as a large number of special
>>   cases is removed.
>> o much improved readability.
>>
>> It offers significantly reduced CPU usage and lock contention
>> when receiving fast TCP streams.  Additional gains are obtained
>> when the receiving application is using SO_RCVLOWAT to batch up
>> some data before a read (and wakeup) is done.
>>
>> This function was written by "reverse engineering" and is not
>> just a stripped down variant of soreceive().
>>
>> It is not yet enabled by default on TCP sockets.  Instead it is
>> commented out in the protocol initialization in tcp_usrreq.c
>> until more widespread testing has been done.
>
> Have you looked at using this with SCTP and UNIX domain sockets?  UNIX
> domain socket performance is quite important for database workloads,
> and SCTP is going to be increasingly important for Internet
> applications.  In the UNIX domain socket case, not supporting control
> mbufs will be a potential problem (see comments below about either
> implementing it or falling back, and if falling back m_mbuftouio needs
> to know how to properly free them).
>
>> Testers, especially with 10GigE gear, are welcome.
>>
>> MFP4:         r164817 //depot/user/andre/soreceive_stream/
>>
>> Modified:
>>   head/sys/kern/uipc_socket.c
>>   head/sys/netinet/tcp_usrreq.c
>>   head/sys/sys/socketvar.h
>>
>> Modified: head/sys/kern/uipc_socket.c
>> ==============================================================================
>>
>> --- head/sys/kern/uipc_socket.c    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/kern/uipc_socket.c    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -1857,6 +1857,202 @@ release:
>>  }
>>
>>  /*
>> + * Optimized version of soreceive() for stream (TCP) sockets.
>> + */
>> +int
>> +soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio,
>> +    struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
>> +{
>> +        int len = 0, error = 0, flags, oresid;
>> +        struct sockbuf *sb;
>> +        struct mbuf *m, *n = NULL;
>> +
>> +        /* We only do stream sockets. */
>> +        if (so->so_type != SOCK_STREAM)
>> +                return (EINVAL);
>
> This should be a KASSERT().  You should also assert !PR_ADDR and
> !PR_ATOMIC, and possibly a few other things to ensure that
> soreceive_stream is not used with protocols that require features it
> does not implement.
>
> Much better to discover that via sanity-checking in the kernel than
> through subtle application bugs later when people start throwing
> switches without giving things full consideration.
>
>> +        if (psa != NULL)
>> +                *psa = NULL;
>> +        if (controlp != NULL)
>> +                return (EINVAL);
>
> This would seem to preclude easy future use of ancillary data with TCP
> -- one type I've been considering adding is TCP_INFO updates as part of
> the stream.  How would you feel about making this call into
> soreceive_generic() instead?
>
> My reading of the control code was that it actually was fairly simple
> and I did include it in soreceive_dgram() (especially given that it's
> popular for UDP so you can get the receive IP address per-datagram).
>
>> +        if (flagsp != NULL)
>> +                flags = *flagsp &~ MSG_EOR;
>> +        else
>> +                flags = 0;
>> +        if (flags & MSG_OOB)
>> +                return (soreceive_rcvoob(so, uio, flags));
>> +        if (mp0 != NULL)
>> +                *mp0 = NULL;
>> +
>> +        sb = &so->so_rcv;
>> +
>> +        /* Prevent other readers from entering the socket. */
>> +        error = sblock(sb, SBLOCKWAIT(flags));
>> +        if (error)
>> +                goto out;
>> +        SOCKBUF_LOCK(sb);
>> +
>> +        /* Easy one, no space to copyout anything. */
>> +        if (uio->uio_resid == 0) {
>> +                error = EINVAL;
>> +                goto out;
>> +        }
>
> Didn't we previously allow a resid of 0 to be used to probe for socket
> errors without receiving data?  I've never used this semantic but it
> seems to be supported, which undoubtedly means someone is using it :-).
> I notice that previously we didn't allow receiving EWOULDBLOCK using a
> 0-byte buffer, but perhaps we should have.
>
>> +        oresid = uio->uio_resid;
>> +
>> +        /* We will never ever get anything unless we are connected. */
>> +        if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
>> +                /* When disconnecting there may be still some data left. */
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                if (!(so->so_state & SS_ISDISCONNECTED))
>> +                        error = ENOTCONN;
>> +                goto out;
>> +        }
>> +
>> +        /* Socket buffer is empty and we shall not block. */
>> +        if (sb->sb_cc == 0 &&
>> +            ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
>> +                error = EAGAIN;
>> +                goto out;
>> +        }
>
> I think you've changed the error number reporting here in subtle ways,
> but it's early in the morning.  It appears we now prefer ENOTCONN to
> so_error once the socket has entered SS_ISDISCONNECTED.  We need to
> report and clear so_error if non-0 so that ECONNRESET, EPERM, etc, can
> be reported properly.
>
> The new so_state logic is markedly different from what came before, but
> I'm not sure what the impact of that is.  Certainly the comment needs to
> change to "... are or have been connected".  The previous logic appears
> to have handled SS_ISCONNECTING differently, and possibly
> SS_ISDISCONNECTING.
>
>> +restart:
>> +        SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> +
>> +        /* Abort if socket has reported problems. */
>> +        if (so->so_error) {
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                if (oresid > uio->uio_resid)
>> +                        goto out;
>> +                error = so->so_error;
>> +                if (!(flags & MSG_PEEK))
>> +                        so->so_error = 0;
>> +                goto out;
>> +        }
>> +
>> +        /* Door is closed.  Deliver what is left, if any. */
>> +        if (sb->sb_state & SBS_CANTRCVMORE) {
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                else
>> +                        goto out;
>> +        }
>> +
>> +        /* Socket buffer got some data that we shall deliver now. */
>> +        if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
>> +            ((sb->sb_flags & SS_NBIO) ||
>> +             (flags & (MSG_DONTWAIT|MSG_NBIO)) ||
>> +             sb->sb_cc >= sb->sb_lowat ||
>> +             sb->sb_cc >= uio->uio_resid ||
>> +             sb->sb_cc >= sb->sb_hiwat) ) {
>> +                goto deliver;
>> +        }
>> +
>> +        /* On MSG_WAITALL we must wait until all data or error arrives. */
>> +        if ((flags & MSG_WAITALL) &&
>> +            (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat))
>> +                goto deliver;
>> +
>> +        /*
>> +         * Wait and block until (more) data comes in.
>> +         * NB: Drops the sockbuf lock during wait.
>> +         */
>> +        error = sbwait(sb);
>> +        if (error)
>> +                goto out;
>> +        goto restart;
>> +
>> +deliver:
>> +        SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> +        KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__));
>> +        KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__));
>> +
>> +        /* Statistics. */
>> +        if (uio->uio_td)
>> +                uio->uio_td->td_ru.ru_msgrcv++;
>> +
>> +        /* Fill uio until full or current end of socket buffer is reached. */
>> +        len = min(uio->uio_resid, sb->sb_cc);
>> +        if (mp0 != NULL) {
>> +                /* Dequeue as many mbufs as possible. */
>> +                if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) {
>> +                        for (*mp0 = m = sb->sb_mb;
>> +                             m != NULL && m->m_len <= len;
>> +                             m = m->m_next) {
>> +                                len -= m->m_len;
>> +                                uio->uio_resid -= m->m_len;
>> +                                sbfree(sb, m);
>> +                                n = m;
>> +                        }
>
> Since control mbufs aren't supported, you should assert !MT_CONTROL.  Or
> add control mbuf support per above.
>
> Again, early in the morning, but: does (mp0 != NULL) still work with
> (flags & MSG_WAITALL)?  In soreceive_generic(), mp0 is updated to point
> to &m->m_next as the loop iterates, so that as you go back to 'repeat:',
> the chain pointed to by mp0 is appended to rather than replaced, whereas
> your code appears to replace it in each loop (which seems wrong).
> Perhaps it's just too early in the morning.
>
> I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsi
> initiator, but breaking that would be bad.  smbfs probably *should* use
> it... :-)
>
>> +                        sb->sb_mb = m;
>> +                        if (sb->sb_mb == NULL)
>> +                                SB_EMPTY_FIXUP(sb);
>> +                        n->m_next = NULL;
>> +                }
>> +                /* Copy the remainder. */
>> +                if (len > 0) {
>> +                        KASSERT(sb->sb_mb != NULL,
>> +                            ("%s: len > 0 && sb->sb_mb empty", __func__));
>> +
>> +                        m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT);
>> +                        if (m == NULL)
>> +                                len = 0;        /* Don't flush data from sockbuf. */
>> +                        else
>> +                                uio->uio_resid -= m->m_len;
>> +                        if (*mp0 != NULL)
>> +                                n->m_next = m;
>> +                        else
>> +                                *mp0 = m;
>> +                        if (*mp0 == NULL) {
>> +                                error = ENOBUFS;
>> +                                goto out;
>> +                        }
>> +                }
>> +        } else {
>> +                /* NB: Must unlock socket buffer as uiomove may sleep. */
>> +                SOCKBUF_UNLOCK(sb);
>> +                error = m_mbuftouio(uio, sb->sb_mb, len);
>
> m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn't
> support control mbufs.  See also above. :-)
>
>> +                SOCKBUF_LOCK(sb);
>> +                if (error)
>> +                        goto out;
>> +        }
>> +        SBLASTRECORDCHK(sb);
>> +        SBLASTMBUFCHK(sb);
>> +
>> +        /*
>> +         * Remove the delivered data from the socket buffer unless we
>> +         * were only peeking.
>> +         */
>> +        if (!(flags & MSG_PEEK)) {
>> +                if (len > 0)
>> +                        sbdrop_locked(sb, len);
>> +
>> +                /* Notify protocol that we drained some data. */
>> +                if ((so->so_proto->pr_flags & PR_WANTRCVD) &&
>> +                    (((flags & MSG_WAITALL) && uio->uio_resid > 0) ||
>> +                     !(flags & MSG_SOCALLBCK))) {
>> +                        SOCKBUF_UNLOCK(sb);
>> +                        (*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags);
>> +                        SOCKBUF_LOCK(sb);
>> +                }
>> +        }
>> +
>> +        /*
>> +         * For MSG_WAITALL we may have to loop again and wait for
>> +         * more data to come in.
>> +         */
>> +        if ((flags & MSG_WAITALL) && uio->uio_resid > 0)
>> +                goto restart;
>> +out:
>> +        SOCKBUF_LOCK_ASSERT(sb);
>> +        SBLASTRECORDCHK(sb);
>> +        SBLASTMBUFCHK(sb);
>> +        SOCKBUF_UNLOCK(sb);
>> +        sbunlock(sb);
>> +        return (error);
>> +}
>> +
>> +/*
>>   * Optimized version of soreceive() for simple datagram cases from userspace.
>>   * Unlike in the stream case, we're able to drop a datagram if copyout()
>>   * fails, and because we handle datagrams atomically, we don't need to use a
>>
>> Modified: head/sys/netinet/tcp_usrreq.c
>> ==============================================================================
>>
>> --- head/sys/netinet/tcp_usrreq.c    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/netinet/tcp_usrreq.c    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = {
>>          .pru_send =             tcp_usr_send,
>>          .pru_shutdown =         tcp_usr_shutdown,
>>          .pru_sockaddr =         in_getsockaddr,
>> +#if 0
>> +        .pru_soreceive =        soreceive_stream,
>> +#endif
>>          .pru_sosetlabel =       in_pcbsosetlabel,
>>          .pru_close =            tcp_usr_close,
>>  };
>> @@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = {
>>          .pru_send =             tcp_usr_send,
>>          .pru_shutdown =         tcp_usr_shutdown,
>>          .pru_sockaddr =         in6_mapped_sockaddr,
>> +#if 0
>> +        .pru_soreceive =        soreceive_stream,
>> +#endif
>>          .pru_sosetlabel =       in_pcbsosetlabel,
>>          .pru_close =            tcp_usr_close,
>>  };
>>
>> Modified: head/sys/sys/socketvar.h
>> ==============================================================================
>>
>> --- head/sys/sys/socketvar.h    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/sys/socketvar.h    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -345,6 +345,9 @@ int sopoll_generic(struct socket *so, in
>>              struct ucred *active_cred, struct thread *td);
>>  int     soreceive(struct socket *so, struct sockaddr **paddr, struct uio *uio,
>>              struct mbuf **mp0, struct mbuf **controlp, int *flagsp);
>> +int     soreceive_stream(struct socket *so, struct sockaddr **paddr,
>> +            struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>> +            int *flagsp);
>>  int     soreceive_dgram(struct socket *so, struct sockaddr **paddr,
>>              struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>>              int *flagsp);
>>
>
>
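
A quick note on the KASSERT suggestion above while it is fresh: the
sanity checks would look roughly like this.  This is an untested
sketch; the exact set of pr_flags to test still needs thought:

        /* Replaces the EINVAL return at the top of soreceive_stream(). */
        KASSERT(so->so_type == SOCK_STREAM,
            ("%s: only SOCK_STREAM sockets supported", __func__));
        KASSERT(!(so->so_proto->pr_flags & (PR_ADDR | PR_ATOMIC)),
            ("%s: protocol with PR_ADDR or PR_ATOMIC", __func__));

        /* And per mbuf m in the dequeue loop and in m_mbuftouio(): */
        KASSERT(m->m_type != MT_CONTROL,
            ("%s: control mbufs are not supported", __func__));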
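
On the SO_RCVLOWAT remark in the commit message: the batching needs
nothing more from the application than setting the receive low water
mark.  A minimal userland example (the 64k value is arbitrary and only
for illustration):

#include <sys/types.h>
#include <sys/socket.h>

/*
 * Delay receive wakeups until at least 64k is queued on the socket.
 * recv() then blocks until the lesser of this mark and the requested
 * amount is available, and select()/poll() do not report the socket
 * readable before the mark is reached (EOF and errors still wake up
 * the caller immediately).
 */
static int
set_rcvlowat(int s)
{
        int lowat = 64 * 1024;

        return (setsockopt(s, SOL_SOCKET, SO_RCVLOWAT,
            &lowat, sizeof(lowat)));
}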