Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 15 Dec 2007 10:56:46 +0000 (GMT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Kip Macy <kip.macy@gmail.com>
Cc:        FreeBSD Current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: pending changes for TOE support
Message-ID:  <20071215100351.Q70617@fledge.watson.org>
In-Reply-To: <b1fa29170712150057m690bd36bm7a1910969e92293b@mail.gmail.com>
References:  <b1fa29170712121303x537fd11fj4b8827bb353ad8e4@mail.gmail.com> <b1fa29170712150057m690bd36bm7a1910969e92293b@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 15 Dec 2007, Kip Macy wrote:

> The updated patch is at: http://www.fsmware.com/tcp/tcp_offload.diff
>
> I've only had structural feedback from one person. Please let me know if you 
> intend to provide feedback I'd like to get this in without much further 
> delay.

Per our out-of-band communication, I won't have the opportunity to do any 
serious reviewing of this work until later next week, as I have other personal 
obligations in the mean time that preclude that.  However, skimming the 
comments, I have a couple of areas where I'd like some clarification to help 
guide my reading.  I've only got fifteen minutes so I'll cut off when I run 
out.

+ * A driver publishes that it provides offload services
+ * by setting IFCAP_TOE in the ifnet. The offload connect
+ * will bypass any further work if the interface that a
+ * connection would use does not support TCP offload.

My initial feeling is that, even if an interface supports TOE, we shouldn't 
enable the capability in the enabled vector by default, as TOE bypasses 
firewall behavior, etc, and would certainly be a surprise if an admin swapped 
a chelsio card for a non-TOE supporting card.  What's your feeling on this?

+ * The TOE API assumes that the tcp offload engine can offload the
+ * the entire connection from set up to teardown, with some provision
+ * being made to allowing the software stack to handle time wait. If
+ * the device does not meet these criteria, it is the driver's responsibility
+ * to overload the functions that it needs to in tcp_usrreqs and make
+ * its own calls to tcp_output if it needs to do so.

While I'm familiar with TCP, I'm less familiar with the scope of what cards 
support for TOE.  Do we know of any cards that are less capable than the 
chelsio card in this respect, or are they all sort of on-par on that front? 
I.e., do we think the above eventuality is likely?

If we don't, then one of the things I'd like to see us do is fairly carefully 
assert, at least for a few months, that TCP never "slips" into any 
transmission-related paths that could lead to truly odd and hard-to-diagnose 
behavior when runnning with TOE.  I.e., tcp_output, etc.

If we do think it's likely, we don't need to address this immediately, but we 
should make sure that before we ship TOE in a release, we've thought somewhat 
more thoroughly about that case.  As long as TOE remains un-MFC'd, we don't 
find ourselves with an obligation to maintain guarantees about the interfaces, 
and that includes dealing with incompatibility :-).  Do we know if any of the 
current 10gbps vendors other than chelsio are actively looking at TOE on 
FreeBSD, and could be engaged in discussion?

+ * The toe_usrreqs structure constitutes the TOE driver's
+ * interface to the TCP stack for functionality that doesn't
+ * interact directly with userspace. If one wants to provide
+ * (optional) functionality to do zero-copy to/from
+ * userspace one still needs to override soreceive/sosend
+ * with functions that fault in and pin the user buffers.

And this is an issue we also should work out in order to properly fix 
ZERO_COPY_SOCKETS anyway.

I think it might be useful to add a couple of paragraphs here on three topics:

(1) Clarify the way in which windows are updated between the device driver and
     the socket code, both for sending/receiving.  You talk a bit about
     "credit", but introducing it up-front would be useful.

(2) One of the issues I've run into in the TCP and socket code generally is
     that there was significant lack of clarity on the "life cycle" of the set
     of related data structures.  Could you write a bit of text about when
     drivers will allocate state and when they will free it?  I.e., tu_attach
     allocates state, tu_{abort,detach} free it, and TCP promises not to call
     anything before attach or anything after abort/detach.

(3) Could you talk at a high level about the ways in which TOE drivers will
     interact with TCP?  You do it a bit in each of the sections, but if
     there's a principle, pulling it out would be useful.  Also, you should
     indicate whether the driver is allowed to drop the inpcb lock or not.

Doing this would address a few of the comments I have below also.

+ * + tu_send
+ *   - tells the driver that new data may have been added to the
+ *     socket's send buffer - the driver should not fail if the
+ *     buffer is in fact unchanged

I'm a bit confused by the description of the error condition here.  Could you 
clarify when a driver should return an error, and what the impact of an error 
returned will be on the connection state?  In fact, it probably makes sense to 
have an up-front comment on conventions for error-handling -- if TOE returns 
an error will that generally lead to a TCP tear-down?

+ *   - The driver expects the inpcb lock to be held and

This comment is truncated -- is there an and?

We should specify that drivers are not allowed to drop the inpcb lock if that 
is the case, FYI.

+ * + tu_rcvd
+ *   - returns credits to the driver and triggers window updates
+ *     to the peer (a credit is a byte in the user's receive window)

Might begin with a sentence defining the notion of credit.  Is it possible to 
use tu_rcvd to reduce credit to the card if the socket buffer size is changed, 
or just increase it?

+ *   - the driver is expected to determine how many bytes have been
+ *     consumed and credit that back to the card so that it can grow
+ *     the window again

Could you provide an example of how it is to do that -- i.e., is it just going 
to inspect so_rcv in the same way native TCP does?

+ *   - this function needs to correctly handle being called any number of
+ *     times without any bytes being consumed from the receive buffer.
+ *   - the driver expects the inpcb lock to be held
+ *
+ * + tu_disconnect
+ *   - tells the driver to send FIN to peer
+ *   - driver is expected to send the remaining data and then do a clean half 
close
+ *   - disconnect implies at least half-close so only send, abort, and detach
+ *     are legal

Could you clarify this a bit?  Do you mean that TCP guarangees that only 
tu_send, tu_abort, and tu_detach will be delivered to the driver in the 
future?

+ *   - the driver is expected to handle transition through the shutdown
+ *     state machine and allow the stack to support SO_LINGER.

Probably worth commenting that the device driver won't detach the toe state.

+ *
+ * + tu_abort
+ *   - closes the connection and sends a RST to peer
+ *   - driver is expectd to trigger an RST and detach the toepcb

In regular TCP, the pru_abort method is only called on pending connections 
while still in the listen queues of a listen socket.  Is this true of 
tu_abort, or is tu_abort a more general method to be used to cancel 
connections?  If so, probably worth commenting on that.

+ *   - no further calls are legal after abort
+ *   - the driver expects the inpcb lock to be held
+ *
+ * + tu_detach
+ *   - tells driver that the socket is going away so disconnect
+ *     the toepcb and free appropriate resources
+ *   - allows the driver to cleanly handle the case of connection state
+ *     outliving the socket
+ *   - no further calls are legal after detach
+ *   - the driver acquires the tcbinfo lock

For this call, you haven't specified whether the inpcb lock is held.  If it 
is, the driver acquiring the tcbinfo lock without first dropping the inpcb 
lock would be a lock order reversal.  Should the caller instead acquire/hold 
it?

For the above calls, what guarantees does the TCP stack make about the 
presence of the socket, if any?

These interfaces all pass the tcpcb, but in our regular TCP stack, the 
invariant is the existence of the inpcb, not the tcpcb, which may be replaced 
with a tcptw (or in one edge case, inp_ppcb may be NULL).  If there will be 
drivers in the future that implement timewait, perhaps we should be passing in 
the inpcb?

+ * + tu_syncache_event
+ *   - even if it is not actually needed, the driver is expected to
+ *     call syncache_add for the initial SYN and then syncache_expand
+ *     for the SYN,ACK
+ *   - tells driver that a connection either has not been added or has
+ *     been dropped from the syncache
+ *   - the driver is expected to maintain state that lives outside the
+ *     software stack so the syncache needs to be able to notify the
+ *     toe driver that the software stack is not going to create a connection
+ *     for a received SYN
+ *   - the driver is responsible for any synchronization required

Presumably tu_syncache_event is called from the syncache and locks will be 
held when that happens...?

How will the race between the syncache deciding to drop a connection of its 
own accord and the hardware/driver deciding to accept be addressed, generally 
speaking?

+
+extern struct toe_usrreqs tcp_offload_usrreqs;

What is the purpose of this global?  Presumably we can have two drivers that 
both implement offload at once?

More comments to follow.

Robert N M Watson
Computer Laboratory
University of Cambridge


>
>   -Kip
>
>
> On 12/12/07, Kip Macy <kip.macy@gmail.com> wrote:
>> After review by Mike Silbersack I've committed the hooks that provide
>> a driver independent interface to TCP offload.
>>
>> I would like to commit the changes to tcp_subr.c and tcp_usrreq.c to
>> actually make use of the new interface. Please review the following:
>>
>> http://www.fsmware.com/freebsd/tcp/tcp_subr.c.diff
>>
>> http://www.fsmware.com/freebsd/tcp/tcp_usrreq.c.diff
>>
>> The new KPI is provided by the following 2 files:
>> http://www.fsmware.com/freebsd/tcp/tcp_ofld.c
>> http://www.fsmware.com/freebsd/tcp/tcp_ofld.h
>>
>>
>> Thank you for taking the time to review and provide feedback.
>>
>> -Kip
>>
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071215100351.Q70617>