From: Robert Watson
Date: Sat, 15 Dec 2007 19:16:22 +0000 (GMT)
To: Kip Macy
Cc: FreeBSD Current, freebsd-arch@freebsd.org
Subject: Re: pending changes for TOE support
Message-ID: <20071215190252.I85668@fledge.watson.org>
References: <20071215100351.Q70617@fledge.watson.org>

On Sat, 15 Dec 2007, Kip Macy wrote:

> The current implementation bypasses the firewall. This and likely other
> hardware has extensive filtering support so it isn't necessarily intrinsic.

I'm not sure I agree when it comes to features like DUMMYNET, NAT, BPF, etc.
TCP offload completely bypasses, by its very intent, most of the network stack.

> The usage model at this moment is that the customer makes a conscious
> decision to load the TOE driver and understands the implications. I think
> this is quite adequate for 10GigE cards currently.
> However, this will need to be revisited when these chips start showing up
> on mainstream motherboards.

I think I would prefer that our policy switch be the capenable flag, so that
compiling things in or out (or loading, which is the logical equivalent)
doesn't change functional behavior for existing interfaces.

>> While I'm familiar with TCP, I'm less familiar with the scope of what cards
>> support for TOE. Do we know of any cards that are less capable than the
>> chelsio card in this respect, or are they all sort of on-par on that front?
>> I.e., do we think the above eventuality is likely?
>
> I don't have any way of knowing. I think it is probably safe to say that any
> vendors that don't meet that criteria now will in the future as transistor
> density increases.

I think it behooves us to find out, given that we're designing a KPI for those
cards also. I agree with the transistor argument, and given that TOE is a
fairly undeployed technology at this point, it may quickly resolve itself if
it hasn't already.

>> If we don't, then one of the things I'd like to see us do is fairly
>> carefully assert, at least for a few months, that TCP never "slips" into
>> any transmission-related paths that could lead to truly odd and
>> hard-to-diagnose behavior when running with TOE. I.e., tcp_output, etc.
>
> I'm happy to do that. However, I see problems introduced by offloading
> connections as being driver bugs much the same as problems caused by the
> driver's TCP segmentation offload or checksum offload. The problems will be
> isolated to connections using a specific interface.

Interesting point -- it's amazing how broken checksum processing is, and TCP
is many orders of magnitude more complex.

>> the socket code, both for sending/receiving. You talk a bit about
>> "credit", but introducing it up-front would be useful.
>
> I didn't realize a definition was necessary. To the best of my knowledge
> this is the common term used when discussing flow control.
> I've seen it used for Fibre Channel and IB. The one ambiguity that arises
> is whether or not it refers to bytes or segments.

I think a phrase wouldn't hurt; also, I notice you only addressed flow control
in one direction in the comments, which is why I mentioned both sending and
receiving. The clearer we make this, the happier we'll be. I suspect we'll
actually want to move a lot of this text from the include file to the man page
for the TOE interface...

>> (3) Could you talk at a high level about the ways in which TOE drivers will
>>     interact with TCP? You do it a bit in each of the sections, but if
>>     there's a principle, pulling it out would be useful. Also, you should
>>     indicate whether the driver is allowed to drop the inpcb lock or not.
>
> I've done my best to minimize changes to TCP. It is safe to assume that the
> invariants are the same as those for tcp_output. I think we should ask the
> author of tcp_output to document the interface, expected state transitions,
> and its invariants (joke).

:-P

Documenting locking semantics such as "You can rely on lock X being held, but
do not drop it" takes an extra phrase and can save someone a lot of time.

>> I'm a bit confused by the description of the error condition here. Could
>> you clarify when a driver should return an error, and what the impact of an
>> error returned will be on the connection state? In fact, it probably makes
>> sense to have an up-front comment on conventions for error-handling -- if
>> TOE returns an error will that generally lead to a TCP tear-down?
>
> The offload routines are substituted for tcp_output and thus should interact
> with the stack in the same way. By extension they should have the same
> failure modes and invariants.

Most driver authors will not be intimately familiar with tcp_output()'s
subtleties, and documenting error-handling for a KPI is always a good idea.

> The interface is intended to drop in the place of tcp_output.
<"see what tcp_output does" repeated many times>

tcp_output() was previously an internal function of the TCP code, and now the
semantics are being exposed to device drivers. Let's not perpetuate poorly
documented driver interfaces by adding another one. I think it would be a
reasonable expectation of a driver author to have consistent documentation of
the life cycle of data structures and objects, locking expectations and
requirements, and the semantics for error values from functions. Certainly,
they need to look at TCP a fair amount because they'll be pulling things out of
inpcb, tcpcb, etc., but I'd rather we limit that requirement to simple things
(addresses, socket options) that are relatively static and avoid it being for
complex things (locking, error handling) that tend to be more subject to
change. Also, if you document what you think the behavior is or should be, we
can then check to see if we agree.

Robert N M Watson
Computer Laboratory
University of Cambridge
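
As a rough illustration of the kind of up-front documentation discussed above,
here is a minimal sketch of a transmit hook whose comment states the locking
rule, the unit of "credit", and the meaning of each error value. Everything in
it is hypothetical: struct toe_conn, toe_conn_send() and
toe_conn_credit_return() are made-up names and are not taken from the proposed
TOE interface or from the FreeBSD sources, and the lock is modelled by a plain
flag rather than the real inpcb lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state, for illustration only. */
struct toe_conn {
	bool	lock_held;	/* stands in for the inpcb lock */
	size_t	tx_credits;	/* bytes the device will currently accept */
};

/*
 * toe_conn_send: hand up to len bytes to the (imaginary) offload device.
 *
 * Locking:  the caller holds the connection lock; this function relies on
 *           it being held and never drops it.
 * Credits:  counted in bytes (not segments), transmit direction only.
 * Returns:  0            success; *sent holds the bytes consumed, which
 *                        may be less than len.
 *           EWOULDBLOCK  no credit available; connection state unchanged
 *                        and the connection is NOT torn down.
 *           EINVAL       bad arguments; connection state unchanged.
 */
int
toe_conn_send(struct toe_conn *tc, size_t len, size_t *sent)
{
	if (tc == NULL || sent == NULL)
		return (EINVAL);
	assert(tc->lock_held);	/* documented precondition */

	*sent = 0;
	if (len == 0)
		return (0);
	if (tc->tx_credits == 0)
		return (EWOULDBLOCK);

	/* Consume as much credit as the request allows. */
	*sent = (len < tc->tx_credits) ? len : tc->tx_credits;
	tc->tx_credits -= *sent;
	return (0);
}

/*
 * toe_conn_credit_return: the device has drained n bytes, so n bytes of
 * transmit credit become available again. Same locking rule as above.
 */
void
toe_conn_credit_return(struct toe_conn *tc, size_t n)
{
	assert(tc != NULL && tc->lock_held);
	tc->tx_credits += n;
}
```

The point of the sketch is the comment block rather than the code: a driver
author can read off the lock rule and the consequence of each return value
without having to study tcp_output().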