From: Andre Oppermann <andre@freebsd.org>
Date: Tue, 23 Jun 2009 10:29:41 +0200
To: Robert Watson
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r194672 - in head/sys: kern netinet sys
Message-ID: <4A409275.30705@freebsd.org>
References: <200906222308.n5MN856I055711@svn.freebsd.org>

Robert Watson wrote:
>
> On Mon, 22 Jun 2009, Andre Oppermann wrote:
>
>> Add soreceive_stream(), an optimized version of soreceive() for
>> stream (TCP) sockets.
>
> While this sounds like very interesting work, I was struck by the lack of:
>
> Reviewed by:    ?
> Tested by:      ?
>
> Have you tested this change with some of our TCP consumer edge cases,
> especially in-kernel consumers:
>
> - Accept filters
> - NFS client
> - NFS server
> - smbfs
> - iscsi initiator

No, yes, yes, no, no.  Plus a large number of SSH copies (it will trip
over every corrupted byte) and fetching the source code of half the
programs in the ports collection and comparing the checksums.

> I also assume that the plan is that this will not be enabled by default
> in 8.0?  With soreceive_dgram, it took three months to shake out the
> nits, and the datagram case is far simpler than the stream case.  Given
> that experience, I'd guess it will take longer for the new stream code
> to shake out, and probably involve painfully tracking subtle bugs in
> blocking code, data corruption in NFS, etc.  I've identified one such
> possible bug for iscsi.

No, it is not supposed to be enabled by default in 8.0.

> To provide easier access for early adopters, we shipped with
> soreceive_dgram available for UDP, but optionally enabled by a loader
> tunable, for at least one minor rev.  I would recommend doing something
> similar here.

A good hint, thank you.

> A few inline comments below, including one serious bug (assuming my
> reading is right).  This doesn't constitute a full review because it's
> too early in the morning to reason fully about the socket buffer code.

OK.  Thanks.  I will look into the specifics later today.
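
Regarding the loader tunable: something along these lines is what I
have in mind, as a completely untested sketch (the knob name is only a
placeholder).  It would switch pru_soreceive at TCP initialization time
instead of relying on the #if 0 blocks in tcp_usrreq.c:

static int tcp_soreceive_stream = 0;
TUNABLE_INT("net.inet.tcp.soreceive_stream", &tcp_soreceive_stream);

/* To be called from tcp_init(), before the first TCP socket exists. */
static void
tcp_soreceive_stream_init(void)
{

        if (tcp_soreceive_stream) {
                tcp_usrreqs.pru_soreceive = soreceive_stream;
#ifdef INET6
                tcp6_usrreqs.pru_soreceive = soreceive_stream;
#endif
        }
}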
--
Andre

> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
>> It is functionally identical to generic soreceive() but has a
>> number of stream specific optimizations:
>> o does only one sockbuf unlock/lock per receive independent of
>>   the length of data to be moved into the uio compared to
>>   soreceive() which unlocks/locks per *mbuf*.
>
> Hmm.  I count four locks and unlocks per receive, assuming no blocking
> and an I/O to user memory:
>
> - sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing.
> - start/stop socket buffer mutex lock/unlock at beginning/end of function
> - unlock/relock around m_mbuftouio()
> - unlock/relock around pru_rcvd
>
> One idea I've futzed with a little is whether or not we could drop
> sblock() in certain cases for receive and transmit.  As far as I'm
> aware, it is used for at least four purposes in receive:
>
> - Prevent I/O interlacing from concurrent system calls
> - Provide consistency for blocking logic during concurrent system calls
> - Stabilize non-NULL sb_mb while the socket buffer mutex is dropped
> - Minimize contention on the socket buffer mutex during concurrent
>   system calls
>
> Some of these are more important than others -- in particular, the
> function of preventing I/O interlacing for stream system calls seems
> something we might drop, as it's not a guarantee provided by other OS's,
> so no portable application should depend on it.  Have you looked at all
> at this?
>
>> o uses m_mbuftouio() instead of its own copy(out) variant.
>> o much more compact code flow as a large number of special
>>   cases is removed.
>> o much improved readability.
>>
>> It offers significantly reduced CPU usage and lock contention
>> when receiving fast TCP streams.  Additional gains are obtained
>> when the receiving application is using SO_RCVLOWAT to batch up
>> some data before a read (and wakeup) is done.
>>
>> This function was written by "reverse engineering" and is not
>> just a stripped down variant of soreceive().
>>
>> It is not yet enabled by default on TCP sockets.  Instead it is
>> commented out in the protocol initialization in tcp_usrreq.c
>> until more widespread testing has been done.
>
> Have you looked at using this with SCTP and UNIX domain sockets?  UNIX
> domain socket performance is quite important for database workloads,
> and SCTP is going to be increasingly important for Internet
> applications.  In the UNIX domain socket case, not supporting control
> mbufs will be a potential problem (see comments below about either
> implementing it or falling back, and if falling back m_mbuftouio needs
> to know how to properly free them).
>
>> Testers, especially with 10GigE gear, are welcome.
>>
>> MFP4:         r164817 //depot/user/andre/soreceive_stream/
>>
>> Modified:
>>   head/sys/kern/uipc_socket.c
>>   head/sys/netinet/tcp_usrreq.c
>>   head/sys/sys/socketvar.h
>>
>> Modified: head/sys/kern/uipc_socket.c
>> ==============================================================================
>>
>> --- head/sys/kern/uipc_socket.c    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/kern/uipc_socket.c    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -1857,6 +1857,202 @@ release:
>>  }
>>
>>  /*
>> + * Optimized version of soreceive() for stream (TCP) sockets.
>> + */
>> +int
>> +soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio,
>> +    struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
>> +{
>> +        int len = 0, error = 0, flags, oresid;
>> +        struct sockbuf *sb;
>> +        struct mbuf *m, *n = NULL;
>> +
>> +        /* We only do stream sockets. */
>> +        if (so->so_type != SOCK_STREAM)
>> +                return (EINVAL);
>
> This should be a KASSERT().  You should also assert !PR_ADDR and
> !PR_ATOMIC, and possibly a few other things to ensure that
> soreceive_stream is not used with protocols that require features it
> does not implement.
>
> Much better to discover that via sanity-checking in the kernel than
> through subtle application bugs later when people start throwing
> switches without giving things full consideration.
>
>> +        if (psa != NULL)
>> +                *psa = NULL;
>> +        if (controlp != NULL)
>> +                return (EINVAL);
>
> This would seem to preclude easy future use of ancillary data with TCP
> -- one type I've been considering adding is TCP_INFO updates as part of
> the stream.  How would you feel about making this call into
> soreceive_generic() instead?
>
> My reading of the control code was that it actually was fairly simple
> and I did include it in soreceive_dgram() (especially given that it's
> popular for UDP so you can get the receive IP address per-datagram).
>
>> +        if (flagsp != NULL)
>> +                flags = *flagsp &~ MSG_EOR;
>> +        else
>> +                flags = 0;
>> +        if (flags & MSG_OOB)
>> +                return (soreceive_rcvoob(so, uio, flags));
>> +        if (mp0 != NULL)
>> +                *mp0 = NULL;
>> +
>> +        sb = &so->so_rcv;
>> +
>> +        /* Prevent other readers from entering the socket. */
>> +        error = sblock(sb, SBLOCKWAIT(flags));
>> +        if (error)
>> +                goto out;
>> +        SOCKBUF_LOCK(sb);
>> +
>> +        /* Easy one, no space to copyout anything. */
>> +        if (uio->uio_resid == 0) {
>> +                error = EINVAL;
>> +                goto out;
>> +        }
>
> Didn't we previously allow a resid of 0 to be used to probe for socket
> errors without receiving data?  I've never used this semantic but it
> seems to be supported, which undoubtedly means someone is using it :-).
> I notice that previously we didn't allow receiving EWOULDBLOCK using a
> 0-byte buffer, but perhaps we should have.
>
>> +        oresid = uio->uio_resid;
>> +
>> +        /* We will never ever get anything unless we are connected. */
>> +        if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
>> +                /* When disconnecting there may be still some data left. */
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                if (!(so->so_state & SS_ISDISCONNECTED))
>> +                        error = ENOTCONN;
>> +                goto out;
>> +        }
>> +
>> +        /* Socket buffer is empty and we shall not block. */
>> +        if (sb->sb_cc == 0 &&
>> +            ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
>> +                error = EAGAIN;
>> +                goto out;
>> +        }
>
> I think you've changed the error number reporting here in subtle ways,
> but it's early in the morning.  It appears we now prefer ENOTCONN to
> so_error once the socket has entered SS_ISDISCONNECTED.  We need to
> report and clear so_error if non-0 so that ECONNRESET, EPERM, etc, can
> be reported properly.
>
> The new so_state logic is markedly different from what came before, but
> I'm not sure what the impact of that is.  Certainly the comment needs to
> change to "... are or have been connected".  The previous logic appears
> to have handled SS_ISCONNECTING differently, and possibly
> SS_ISDISCONNECTING.
>
>> +restart:
>> +        SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> +
>> +        /* Abort if socket has reported problems. */
>> +        if (so->so_error) {
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                if (oresid > uio->uio_resid)
>> +                        goto out;
>> +                error = so->so_error;
>> +                if (!(flags & MSG_PEEK))
>> +                        so->so_error = 0;
>> +                goto out;
>> +        }
>> +
>> +        /* Door is closed.  Deliver what is left, if any. */
>> +        if (sb->sb_state & SBS_CANTRCVMORE) {
>> +                if (sb->sb_cc > 0)
>> +                        goto deliver;
>> +                else
>> +                        goto out;
>> +        }
>> +
>> +        /* Socket buffer got some data that we shall deliver now. */
>> +        if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
>> +            ((sb->sb_flags & SS_NBIO) ||
>> +             (flags & (MSG_DONTWAIT|MSG_NBIO)) ||
>> +             sb->sb_cc >= sb->sb_lowat ||
>> +             sb->sb_cc >= uio->uio_resid ||
>> +             sb->sb_cc >= sb->sb_hiwat) ) {
>> +                goto deliver;
>> +        }
>> +
>> +        /* On MSG_WAITALL we must wait until all data or error arrives. */
>> +        if ((flags & MSG_WAITALL) &&
>> +            (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat))
>> +                goto deliver;
>> +
>> +        /*
>> +         * Wait and block until (more) data comes in.
>> +         * NB: Drops the sockbuf lock during wait.
>> +         */
>> +        error = sbwait(sb);
>> +        if (error)
>> +                goto out;
>> +        goto restart;
>> +
>> +deliver:
>> +        SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> +        KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__));
>> +        KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__));
>> +
>> +        /* Statistics. */
>> +        if (uio->uio_td)
>> +                uio->uio_td->td_ru.ru_msgrcv++;
>> +
>> +        /* Fill uio until full or current end of socket buffer is reached. */
>> +        len = min(uio->uio_resid, sb->sb_cc);
>> +        if (mp0 != NULL) {
>> +                /* Dequeue as many mbufs as possible. */
>> +                if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) {
>> +                        for (*mp0 = m = sb->sb_mb;
>> +                             m != NULL && m->m_len <= len;
>> +                             m = m->m_next) {
>> +                                len -= m->m_len;
>> +                                uio->uio_resid -= m->m_len;
>> +                                sbfree(sb, m);
>> +                                n = m;
>> +                        }
>
> Since control mbufs aren't supported, you should assert !MT_CONTROL.  Or
> add control mbuf support per above.
>
> Again, early in the morning, but: does (mp0 != NULL) still work with
> (flags & MSG_WAITALL)?  In soreceive_generic(), mp0 is updated to point
> to &m->m_next as the loop iterates, so that as you go back to 'repeat:',
> the chain pointed to by mp0 is appended to rather than replaced, whereas
> your code appears to replace it in each loop (which seems wrong).
> Perhaps it's just too early in the morning.
>
> I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsi
> initiator, but breaking that would be bad.  smbfs probably *should* use
> it... :-)
>
>> +                        sb->sb_mb = m;
>> +                        if (sb->sb_mb == NULL)
>> +                                SB_EMPTY_FIXUP(sb);
>> +                        n->m_next = NULL;
>> +                }
>> +                /* Copy the remainder. */
>> +                if (len > 0) {
>> +                        KASSERT(sb->sb_mb != NULL,
>> +                            ("%s: len > 0 && sb->sb_mb empty", __func__));
>> +
>> +                        m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT);
>> +                        if (m == NULL)
>> +                                len = 0;        /* Don't flush data from sockbuf. */
>> +                        else
>> +                                uio->uio_resid -= m->m_len;
>> +                        if (*mp0 != NULL)
>> +                                n->m_next = m;
>> +                        else
>> +                                *mp0 = m;
>> +                        if (*mp0 == NULL) {
>> +                                error = ENOBUFS;
>> +                                goto out;
>> +                        }
>> +                }
>> +        } else {
>> +                /* NB: Must unlock socket buffer as uiomove may sleep. */
>> +                SOCKBUF_UNLOCK(sb);
>> +                error = m_mbuftouio(uio, sb->sb_mb, len);
>
> m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn't
> support control mbufs.  See also above. :-)
>
>> +                SOCKBUF_LOCK(sb);
>> +                if (error)
>> +                        goto out;
>> +        }
>> +        SBLASTRECORDCHK(sb);
>> +        SBLASTMBUFCHK(sb);
>> +
>> +        /*
>> +         * Remove the delivered data from the socket buffer unless we
>> +         * were only peeking.
>> +         */
>> +        if (!(flags & MSG_PEEK)) {
>> +                if (len > 0)
>> +                        sbdrop_locked(sb, len);
>> +
>> +                /* Notify protocol that we drained some data. */
>> +                if ((so->so_proto->pr_flags & PR_WANTRCVD) &&
>> +                    (((flags & MSG_WAITALL) && uio->uio_resid > 0) ||
>> +                     !(flags & MSG_SOCALLBCK))) {
>> +                        SOCKBUF_UNLOCK(sb);
>> +                        (*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags);
>> +                        SOCKBUF_LOCK(sb);
>> +                }
>> +        }
>> +
>> +        /*
>> +         * For MSG_WAITALL we may have to loop again and wait for
>> +         * more data to come in.
>> +         */
>> +        if ((flags & MSG_WAITALL) && uio->uio_resid > 0)
>> +                goto restart;
>> +out:
>> +        SOCKBUF_LOCK_ASSERT(sb);
>> +        SBLASTRECORDCHK(sb);
>> +        SBLASTMBUFCHK(sb);
>> +        SOCKBUF_UNLOCK(sb);
>> +        sbunlock(sb);
>> +        return (error);
>> +}
>> +
>> +/*
>>   * Optimized version of soreceive() for simple datagram cases from userspace.
>>   * Unlike in the stream case, we're able to drop a datagram if copyout()
>>   * fails, and because we handle datagrams atomically, we don't need to use a
>>
>> Modified: head/sys/netinet/tcp_usrreq.c
>> ==============================================================================
>>
>> --- head/sys/netinet/tcp_usrreq.c    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/netinet/tcp_usrreq.c    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = {
>>          .pru_send =             tcp_usr_send,
>>          .pru_shutdown =         tcp_usr_shutdown,
>>          .pru_sockaddr =         in_getsockaddr,
>> +#if 0
>> +        .pru_soreceive =        soreceive_stream,
>> +#endif
>>          .pru_sosetlabel =       in_pcbsosetlabel,
>>          .pru_close =            tcp_usr_close,
>>  };
>> @@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = {
>>          .pru_send =             tcp_usr_send,
>>          .pru_shutdown =         tcp_usr_shutdown,
>>          .pru_sockaddr =         in6_mapped_sockaddr,
>> +#if 0
>> +        .pru_soreceive =        soreceive_stream,
>> +#endif
>>          .pru_sosetlabel =       in_pcbsosetlabel,
>>          .pru_close =            tcp_usr_close,
>>  };
>>
>> Modified: head/sys/sys/socketvar.h
>> ==============================================================================
>>
>> --- head/sys/sys/socketvar.h    Mon Jun 22 22:54:44 2009    (r194671)
>> +++ head/sys/sys/socketvar.h    Mon Jun 22 23:08:05 2009    (r194672)
>> @@ -345,6 +345,9 @@ int sopoll_generic(struct socket *so, in
>>              struct ucred *active_cred, struct thread *td);
>>  int     soreceive(struct socket *so, struct sockaddr **paddr, struct uio *uio,
>>              struct mbuf **mp0, struct mbuf **controlp, int *flagsp);
>> +int     soreceive_stream(struct socket *so, struct sockaddr **paddr,
>> +            struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>> +            int *flagsp);
>>  int     soreceive_dgram(struct socket *so, struct sockaddr **paddr,
>>              struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>>              int *flagsp);
>>
>
>
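
A quick note on the KASSERT suggestion above while it is fresh: the
sanity checks would look roughly like this.  This is an untested
sketch; the exact set of pr_flags to test still needs thought:

        /* Replaces the EINVAL return at the top of soreceive_stream(). */
        KASSERT(so->so_type == SOCK_STREAM,
            ("%s: only SOCK_STREAM sockets supported", __func__));
        KASSERT(!(so->so_proto->pr_flags & (PR_ADDR | PR_ATOMIC)),
            ("%s: protocol with PR_ADDR or PR_ATOMIC", __func__));

        /* And per mbuf m in the dequeue loop and in m_mbuftouio(): */
        KASSERT(m->m_type != MT_CONTROL,
            ("%s: control mbufs are not supported", __func__));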
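
On the SO_RCVLOWAT remark in the commit message: the batching needs
nothing more from the application than setting the receive low water
mark.  A minimal userland example (the 64k value is arbitrary and only
for illustration):

#include <sys/types.h>
#include <sys/socket.h>

/*
 * Delay receive wakeups until at least 64k is queued on the socket.
 * recv() then blocks until the lesser of this mark and the requested
 * amount is available, and select()/poll() do not report the socket
 * readable before the mark is reached (EOF and errors still wake up
 * the caller immediately).
 */
static int
set_rcvlowat(int s)
{
        int lowat = 64 * 1024;

        return (setsockopt(s, SOL_SOCKET, SO_RCVLOWAT,
            &lowat, sizeof(lowat)));
}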