Date: Tue, 23 Jun 2009 08:07:26 +0100 (BST) From: Robert Watson <rwatson@FreeBSD.org> To: Andre Oppermann <andre@FreeBSD.org> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r194672 - in head/sys: kern netinet sys Message-ID: <alpine.BSF.2.00.0906230631420.91891@fledge.watson.org> In-Reply-To: <200906222308.n5MN856I055711@svn.freebsd.org> References: <200906222308.n5MN856I055711@svn.freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 22 Jun 2009, Andre Oppermann wrote: > Add soreceive_stream(), an optimized version of soreceive() for > stream (TCP) sockets. While this sounds like very interesting work, I was struck by the lack of: Reviewed by: ? Tested by: ? Have you tested this change with some of our TCP conumer edge cases, especially in-kernel consumers: - Accept filters - NFS client - NFS server - smbfs - iscsi initiator I also assume that the plan is that this will not be enabled by default in 8.0? With soreceive_dgram, it took three months to shake out the nits, and the datagram case is far simpler than the stream case. Given that experience, I'd guess it will take longer for the new stream code to shake out, and probably involve painfully tracking subtle bugs in blocking code, data corruption in NFS, etc. I've identified one such possible bug for iscsi. To provide easier access for early adopters, we shipped with soreceive_dgram available for UDP, but optionally enabled by a loader tunable, for at least one minor rev. I would recommend doing something similar here. A few inline comments below, including one serious bug (assuming my reading is right). This doesn't consistute a full review because it's too early in the morning to reason fully about the socket buffer code. Robert N M Watson Computer Laboratory University of Cambridge > It is functionally identical to generic soreceive() but has a > number stream specific optimizations: > o does only one sockbuf unlock/lock per receive independent of > the length of data to be moved into the uio compared to > soreceive() which unlocks/locks per *mbuf*. Hmm. I count four locks and unlocks per receive, assuming no blocking and a I/O to user memory: - sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing. - start/stop socket buffer mutex lock/unlock at beginning/end of function - unlock/relock around m_mbuftouio() - unlock/relock around pru_rcvd One idea I've futzed with a little is whether or not we could drop sblock() in certain cases for receive and transmit. As far as I'm aware, it is used for at least four purposes in receive: - Prevent I/O interlacing from concurrent system calls - Provide consistency for blocking logic during concurrent system calls - Stabilize non-NULL sb_mb while the socket buffer mutex is dropped - Minimize contention on the socket buffer mutex during concurrent system calls Some of these are more important than others -- in particular, the function of preventing I/O interlacing for stream system calls seems something we might drop, as it's not a guarantee provided by other OS's, so no portable application should depend on it. Have you looked at all at this? > o uses m_mbuftouio() instead of its own copy(out) variant. > o much more compact code flow as a large number of special > cases is removed. > o much improved reability. > > It offers significantly reduced CPU usage and lock contention > when receiving fast TCP streams. Additional gains are obtained > when the receiving application is using SO_RCVLOWAT to batch up > some data before a read (and wakeup) is done. > > This function was written by "reverse engineering" and is not > just a stripped down variant of soreceive(). > > It is not yet enabled by default on TCP sockets. Instead it is > commented out in the protocol initialization in tcp_usrreq.c > until more widespread testing has been done. Have you looked at using this with SCTP and UNIX domain sockets? UNIX domain socket performance is quite important for database workloads, and SCTP is going to be increasingly important for Internet applications. In the UNIX domain socket case, not supporting control mbuf will be a potential problem (see comments below about either implementing it or falling back, and if falling back m_mbuftouio needs to know how to properly free them). > Testers, especially with 10GigE gear, are welcome. > > MFP4: r164817 //depot/user/andre/soreceive_stream/ > > Modified: > head/sys/kern/uipc_socket.c > head/sys/netinet/tcp_usrreq.c > head/sys/sys/socketvar.h > > Modified: head/sys/kern/uipc_socket.c > ============================================================================== > --- head/sys/kern/uipc_socket.c Mon Jun 22 22:54:44 2009 (r194671) > +++ head/sys/kern/uipc_socket.c Mon Jun 22 23:08:05 2009 (r194672) > @@ -1857,6 +1857,202 @@ release: > } > > /* > + * Optimized version of soreceive() for stream (TCP) sockets. > + */ > +int > +soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio, > + struct mbuf **mp0, struct mbuf **controlp, int *flagsp) > +{ > + int len = 0, error = 0, flags, oresid; > + struct sockbuf *sb; > + struct mbuf *m, *n = NULL; > + > + /* We only do stream sockets. */ > + if (so->so_type != SOCK_STREAM) > + return (EINVAL); This should be a KASSERT(). You should also assert !PR_ADDR and !PR_ATOMIC, and possibly a few other things to ensure that soreceive_stream is not used with protocols that require features it does not implement. Much better to discover on sanity-checking the kernel than through subtle application bugs later when people start throwing switches without giving things full consideration. > + if (psa != NULL) > + *psa = NULL; > + if (controlp != NULL) > + return (EINVAL); This would seem to preclude easy future use of ancillary data with TCP -- one type I've been considering adding is TCP_INFO updates as part of the stream. How would you feel about making this call into soreceive_generic() instead? My reading of the control code was that it actually was fairly simple and I did include it in soreceive_dgram() (especially given that it's popular for UDP so you can get the receive IP address per-datagram). > + if (flagsp != NULL) > + flags = *flagsp &~ MSG_EOR; > + else > + flags = 0; > + if (flags & MSG_OOB) > + return (soreceive_rcvoob(so, uio, flags)); > + if (mp0 != NULL) > + *mp0 = NULL; > + > + sb = &so->so_rcv; > + > + /* Prevent other readers from entering the socket. */ > + error = sblock(sb, SBLOCKWAIT(flags)); > + if (error) > + goto out; > + SOCKBUF_LOCK(sb); > + > + /* Easy one, no space to copyout anything. */ > + if (uio->uio_resid == 0) { > + error = EINVAL; > + goto out; > + } Didn't we previously allow a resid of 0 to be used to probe for socket errors without receiving data? I've never used this semantic but it seems to be supported, which undoubtably means someone is using it :-). I notice that previously we didn't allow receiving EWOULDBLOCK using a 0-byte buffer, but perhaps we should have. > + oresid = uio->uio_resid; > + > + /* We will never ever get anything unless we are connected. */ > + if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) { > + /* When disconnecting there may be still some data left. */ > + if (sb->sb_cc > 0) > + goto deliver; > + if (!(so->so_state & SS_ISDISCONNECTED)) > + error = ENOTCONN; > + goto out; > + } > + > + /* Socket buffer is empty and we shall not block. */ > + if (sb->sb_cc == 0 && > + ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) { > + error = EAGAIN; > + goto out; > + } I think you've changed the error number reporting here in subtle ways, but it's early in the morning. It appears we now prefer ENOTCONN to so_error once the socket has entered SS_ISDISCONNECTED. We need to report and clear so_error if non-0 so that ECONNRESET, EPERM, etc, can be reported properly. The new so_state logic is markedly different from what came before, but I'm not sure what the impact of that is. Certainly the comment needs to change to "... are or have been connected". The previous logic appears to have handled SS_ISCONNECTING differently, and possibly SS_ISDISCONNECTING. > +restart: > + SOCKBUF_LOCK_ASSERT(&so->so_rcv); > + > + /* Abort if socket has reported problems. */ > + if (so->so_error) { > + if (sb->sb_cc > 0) > + goto deliver; > + if (oresid > uio->uio_resid) > + goto out; > + error = so->so_error; > + if (!(flags & MSG_PEEK)) > + so->so_error = 0; > + goto out; > + } > + > + /* Door is closed. Deliver what is left, if any. */ > + if (sb->sb_state & SBS_CANTRCVMORE) { > + if (sb->sb_cc > 0) > + goto deliver; > + else > + goto out; > + } > + > + /* Socket buffer got some data that we shall deliver now. */ > + if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) && > + ((sb->sb_flags & SS_NBIO) || > + (flags & (MSG_DONTWAIT|MSG_NBIO)) || > + sb->sb_cc >= sb->sb_lowat || > + sb->sb_cc >= uio->uio_resid || > + sb->sb_cc >= sb->sb_hiwat) ) { > + goto deliver; > + } > + > + /* On MSG_WAITALL we must wait until all data or error arrives. */ > + if ((flags & MSG_WAITALL) && > + (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat)) > + goto deliver; > + > + /* > + * Wait and block until (more) data comes in. > + * NB: Drops the sockbuf lock during wait. > + */ > + error = sbwait(sb); > + if (error) > + goto out; > + goto restart; > + > +deliver: > + SOCKBUF_LOCK_ASSERT(&so->so_rcv); > + KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__)); > + KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__)); > + > + /* Statistics. */ > + if (uio->uio_td) > + uio->uio_td->td_ru.ru_msgrcv++; > + > + /* Fill uio until full or current end of socket buffer is reached. */ > + len = min(uio->uio_resid, sb->sb_cc); > + if (mp0 != NULL) { > + /* Dequeue as many mbufs as possible. */ > + if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) { > + for (*mp0 = m = sb->sb_mb; > + m != NULL && m->m_len <= len; > + m = m->m_next) { > + len -= m->m_len; > + uio->uio_resid -= m->m_len; > + sbfree(sb, m); > + n = m; > + } Since control mbufs aren't supported, you should assert !MT_CONTROL. Or add control mbuf support per above. Again, early in the morning, but: does (mp0 != NULL) still work with (flags & MSG_WAITALL)? In soreceive_generic(), mp0 is updated to point to &m->m_next as the loop iterates, so that as you go back to 'repeat:', the chain pointed to by mp0 is appended to rather than replaced, whereas your code appears to replace it in each loop (which seems wrong). Perhaps it's just too early in the morning. I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsi initiator, but breaking that would be bad. smbfs probably *should* use it... :-) > + sb->sb_mb = m; > + if (sb->sb_mb == NULL) > + SB_EMPTY_FIXUP(sb); > + n->m_next = NULL; > + } > + /* Copy the remainder. */ > + if (len > 0) { > + KASSERT(sb->sb_mb != NULL, > + ("%s: len > 0 && sb->sb_mb empty", __func__)); > + > + m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT); > + if (m == NULL) > + len = 0; /* Don't flush data from sockbuf. */ > + else > + uio->uio_resid -= m->m_len; > + if (*mp0 != NULL) > + n->m_next = m; > + else > + *mp0 = m; > + if (*mp0 == NULL) { > + error = ENOBUFS; > + goto out; > + } > + } > + } else { > + /* NB: Must unlock socket buffer as uiomove may sleep. */ > + SOCKBUF_UNLOCK(sb); > + error = m_mbuftouio(uio, sb->sb_mb, len); m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn't support control mbufs. See also above. :-) > + SOCKBUF_LOCK(sb); > + if (error) > + goto out; > + } > + SBLASTRECORDCHK(sb); > + SBLASTMBUFCHK(sb); > + > + /* > + * Remove the delivered data from the socket buffer unless we > + * were only peeking. > + */ > + if (!(flags & MSG_PEEK)) { > + if (len > 0) > + sbdrop_locked(sb, len); > + > + /* Notify protocol that we drained some data. */ > + if ((so->so_proto->pr_flags & PR_WANTRCVD) && > + (((flags & MSG_WAITALL) && uio->uio_resid > 0) || > + !(flags & MSG_SOCALLBCK))) { > + SOCKBUF_UNLOCK(sb); > + (*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags); > + SOCKBUF_LOCK(sb); > + } > + } > + > + /* > + * For MSG_WAITALL we may have to loop again and wait for > + * more data to come in. > + */ > + if ((flags & MSG_WAITALL) && uio->uio_resid > 0) > + goto restart; > +out: > + SOCKBUF_LOCK_ASSERT(sb); > + SBLASTRECORDCHK(sb); > + SBLASTMBUFCHK(sb); > + SOCKBUF_UNLOCK(sb); > + sbunlock(sb); > + return (error); > +} > + > +/* > * Optimized version of soreceive() for simple datagram cases from userspace. > * Unlike in the stream case, we're able to drop a datagram if copyout() > * fails, and because we handle datagrams atomically, we don't need to use a > > Modified: head/sys/netinet/tcp_usrreq.c > ============================================================================== > --- head/sys/netinet/tcp_usrreq.c Mon Jun 22 22:54:44 2009 (r194671) > +++ head/sys/netinet/tcp_usrreq.c Mon Jun 22 23:08:05 2009 (r194672) > @@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = { > .pru_send = tcp_usr_send, > .pru_shutdown = tcp_usr_shutdown, > .pru_sockaddr = in_getsockaddr, > +#if 0 > + .pru_soreceive = soreceive_stream, > +#endif > .pru_sosetlabel = in_pcbsosetlabel, > .pru_close = tcp_usr_close, > }; > @@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = { > .pru_send = tcp_usr_send, > .pru_shutdown = tcp_usr_shutdown, > .pru_sockaddr = in6_mapped_sockaddr, > +#if 0 > + .pru_soreceive = soreceive_stream, > +#endif > .pru_sosetlabel = in_pcbsosetlabel, > .pru_close = tcp_usr_close, > }; > > Modified: head/sys/sys/socketvar.h > ============================================================================== > --- head/sys/sys/socketvar.h Mon Jun 22 22:54:44 2009 (r194671) > +++ head/sys/sys/socketvar.h Mon Jun 22 23:08:05 2009 (r194672) > @@ -345,6 +345,9 @@ int sopoll_generic(struct socket *so, in > struct ucred *active_cred, struct thread *td); > int soreceive(struct socket *so, struct sockaddr **paddr, struct uio *uio, > struct mbuf **mp0, struct mbuf **controlp, int *flagsp); > +int soreceive_stream(struct socket *so, struct sockaddr **paddr, > + struct uio *uio, struct mbuf **mp0, struct mbuf **controlp, > + int *flagsp); > int soreceive_dgram(struct socket *so, struct sockaddr **paddr, > struct uio *uio, struct mbuf **mp0, struct mbuf **controlp, > int *flagsp); >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?alpine.BSF.2.00.0906230631420.91891>