Date: Sat, 15 Dec 2007 10:56:46 +0000 (GMT) From: Robert Watson <rwatson@FreeBSD.org> To: Kip Macy <kip.macy@gmail.com> Cc: FreeBSD Current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org Subject: Re: pending changes for TOE support Message-ID: <20071215100351.Q70617@fledge.watson.org> In-Reply-To: <b1fa29170712150057m690bd36bm7a1910969e92293b@mail.gmail.com> References: <b1fa29170712121303x537fd11fj4b8827bb353ad8e4@mail.gmail.com> <b1fa29170712150057m690bd36bm7a1910969e92293b@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 15 Dec 2007, Kip Macy wrote: > The updated patch is at: http://www.fsmware.com/tcp/tcp_offload.diff > > I've only had structural feedback from one person. Please let me know if you > intend to provide feedback I'd like to get this in without much further > delay. Per our out-of-band communication, I won't have the opportunity to do any serious reviewing of this work until later next week, as I have other personal obligations in the mean time that preclude that. However, skimming the comments, I have a couple of areas where I'd like some clarification to help guide my reading. I've only got fifteen minutes so I'll cut off when I run out. + * A driver publishes that it provides offload services + * by setting IFCAP_TOE in the ifnet. The offload connect + * will bypass any further work if the interface that a + * connection would use does not support TCP offload. My initial feeling is that, even if an interface supports TOE, we shouldn't enable the capability in the enabled vector by default, as TOE bypasses firewall behavior, etc, and would certainly be a surprise if an admin swapped a chelsio card for a non-TOE supporting card. What's your feeling on this? + * The TOE API assumes that the tcp offload engine can offload the + * the entire connection from set up to teardown, with some provision + * being made to allowing the software stack to handle time wait. If + * the device does not meet these criteria, it is the driver's responsibility + * to overload the functions that it needs to in tcp_usrreqs and make + * its own calls to tcp_output if it needs to do so. While I'm familiar with TCP, I'm less familiar with the scope of what cards support for TOE. Do we know of any cards that are less capable than the chelsio card in this respect, or are they all sort of on-par on that front? I.e., do we think the above eventuality is likely? If we don't, then one of the things I'd like to see us do is fairly carefully assert, at least for a few months, that TCP never "slips" into any transmission-related paths that could lead to truly odd and hard-to-diagnose behavior when runnning with TOE. I.e., tcp_output, etc. If we do think it's likely, we don't need to address this immediately, but we should make sure that before we ship TOE in a release, we've thought somewhat more thoroughly about that case. As long as TOE remains un-MFC'd, we don't find ourselves with an obligation to maintain guarantees about the interfaces, and that includes dealing with incompatibility :-). Do we know if any of the current 10gbps vendors other than chelsio are actively looking at TOE on FreeBSD, and could be engaged in discussion? + * The toe_usrreqs structure constitutes the TOE driver's + * interface to the TCP stack for functionality that doesn't + * interact directly with userspace. If one wants to provide + * (optional) functionality to do zero-copy to/from + * userspace one still needs to override soreceive/sosend + * with functions that fault in and pin the user buffers. And this is an issue we also should work out in order to properly fix ZERO_COPY_SOCKETS anyway. I think it might be useful to add a couple of paragraphs here on three topics: (1) Clarify the way in which windows are updated between the device driver and the socket code, both for sending/receiving. You talk a bit about "credit", but introducing it up-front would be useful. (2) One of the issues I've run into in the TCP and socket code generally is that there was significant lack of clarity on the "life cycle" of the set of related data structures. Could you write a bit of text about when drivers will allocate state and when they will free it? I.e., tu_attach allocates state, tu_{abort,detach} free it, and TCP promises not to call anything before attach or anything after abort/detach. (3) Could you talk at a high level about the ways in which TOE drivers will interact with TCP? You do it a bit in each of the sections, but if there's a principle, pulling it out would be useful. Also, you should indicate whether the driver is allowed to drop the inpcb lock or not. Doing this would address a few of the comments I have below also. + * + tu_send + * - tells the driver that new data may have been added to the + * socket's send buffer - the driver should not fail if the + * buffer is in fact unchanged I'm a bit confused by the description of the error condition here. Could you clarify when a driver should return an error, and what the impact of an error returned will be on the connection state? In fact, it probably makes sense to have an up-front comment on conventions for error-handling -- if TOE returns an error will that generally lead to a TCP tear-down? + * - The driver expects the inpcb lock to be held and This comment is truncated -- is there an and? We should specify that drivers are not allowed to drop the inpcb lock if that is the case, FYI. + * + tu_rcvd + * - returns credits to the driver and triggers window updates + * to the peer (a credit is a byte in the user's receive window) Might begin with a sentence defining the notion of credit. Is it possible to use tu_rcvd to reduce credit to the card if the socket buffer size is changed, or just increase it? + * - the driver is expected to determine how many bytes have been + * consumed and credit that back to the card so that it can grow + * the window again Could you provide an example of how it is to do that -- i.e., is it just going to inspect so_rcv in the same way native TCP does? + * - this function needs to correctly handle being called any number of + * times without any bytes being consumed from the receive buffer. + * - the driver expects the inpcb lock to be held + * + * + tu_disconnect + * - tells the driver to send FIN to peer + * - driver is expected to send the remaining data and then do a clean half close + * - disconnect implies at least half-close so only send, abort, and detach + * are legal Could you clarify this a bit? Do you mean that TCP guarangees that only tu_send, tu_abort, and tu_detach will be delivered to the driver in the future? + * - the driver is expected to handle transition through the shutdown + * state machine and allow the stack to support SO_LINGER. Probably worth commenting that the device driver won't detach the toe state. + * + * + tu_abort + * - closes the connection and sends a RST to peer + * - driver is expectd to trigger an RST and detach the toepcb In regular TCP, the pru_abort method is only called on pending connections while still in the listen queues of a listen socket. Is this true of tu_abort, or is tu_abort a more general method to be used to cancel connections? If so, probably worth commenting on that. + * - no further calls are legal after abort + * - the driver expects the inpcb lock to be held + * + * + tu_detach + * - tells driver that the socket is going away so disconnect + * the toepcb and free appropriate resources + * - allows the driver to cleanly handle the case of connection state + * outliving the socket + * - no further calls are legal after detach + * - the driver acquires the tcbinfo lock For this call, you haven't specified whether the inpcb lock is held. If it is, the driver acquiring the tcbinfo lock without first dropping the inpcb lock would be a lock order reversal. Should the caller instead acquire/hold it? For the above calls, what guarantees does the TCP stack make about the presence of the socket, if any? These interfaces all pass the tcpcb, but in our regular TCP stack, the invariant is the existence of the inpcb, not the tcpcb, which may be replaced with a tcptw (or in one edge case, inp_ppcb may be NULL). If there will be drivers in the future that implement timewait, perhaps we should be passing in the inpcb? + * + tu_syncache_event + * - even if it is not actually needed, the driver is expected to + * call syncache_add for the initial SYN and then syncache_expand + * for the SYN,ACK + * - tells driver that a connection either has not been added or has + * been dropped from the syncache + * - the driver is expected to maintain state that lives outside the + * software stack so the syncache needs to be able to notify the + * toe driver that the software stack is not going to create a connection + * for a received SYN + * - the driver is responsible for any synchronization required Presumably tu_syncache_event is called from the syncache and locks will be held when that happens...? How will the race between the syncache deciding to drop a connection of its own accord and the hardware/driver deciding to accept be addressed, generally speaking? + +extern struct toe_usrreqs tcp_offload_usrreqs; What is the purpose of this global? Presumably we can have two drivers that both implement offload at once? More comments to follow. Robert N M Watson Computer Laboratory University of Cambridge > > -Kip > > > On 12/12/07, Kip Macy <kip.macy@gmail.com> wrote: >> After review by Mike Silbersack I've committed the hooks that provide >> a driver independent interface to TCP offload. >> >> I would like to commit the changes to tcp_subr.c and tcp_usrreq.c to >> actually make use of the new interface. Please review the following: >> >> http://www.fsmware.com/freebsd/tcp/tcp_subr.c.diff >> >> http://www.fsmware.com/freebsd/tcp/tcp_usrreq.c.diff >> >> The new KPI is provided by the following 2 files: >> http://www.fsmware.com/freebsd/tcp/tcp_ofld.c >> http://www.fsmware.com/freebsd/tcp/tcp_ofld.h >> >> >> Thank you for taking the time to review and provide feedback. >> >> -Kip >> >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071215100351.Q70617>