From: Navdeep Parhar
Date: Mon, 11 Aug 2014 18:52:37 -0700
To: Adrian Chadd
Cc: Alan Cox, Victor Balada Diaz, Sushanth Rai,
    "freebsd-hackers@freebsd.org"
Subject: Re: Support for zero copy sockets
Message-ID: <53E97365.6040405@FreeBSD.org>
References: <1407171616.44440.YahooMailBasic@web181702.mail.ne1.yahoo.com>
    <20140811082610.GF7828@equilibrium.bsdes.net>
    <53E91578.3060209@FreeBSD.org>
List-Id: Technical Discussions relating to FreeBSD

On 08/11/14 17:42, Adrian Chadd wrote:
> On 11 August 2014 12:11, Navdeep Parhar wrote:
>> There is zero copy receive (aka Direct Data Placement -- DDP) in the TOE
>> driver that accompanies cxgbe(4). I have a tx zero copy implementation
>> for it as well (this is not in -current right now). But all this code
>> is chip specific and applies only to TCP connections that are handled
>> by the TOE driver. It doesn't rely on COW or page flipping.
>>
>> The reason I'm mentioning all of this here is that if anyone is thinking
>> of working on proper zero copy awareness (and APIs) at the socket layer
>> then count me in as an interested party.
>
> I'm not going to get into it just for now, as I have enough on my
> FreeBSD plate to do already.

I'm in the same situation.

>
> However, the thing that always irked me about the hardware based
> solutions is that they're great for a subset of problems - typically
> small sets of sockets. The real interesting problem for me is how to
> make it work for say, 500,000 or more concurrent TCP sessions.

The hardware based solutions that I'm familiar with can handle tens of
thousands of TCP sockets concurrently. The protocol processing is
entirely on the chip and when DDP is active the chip can DMA the payload
straight to its final destination -- typically a userspace buffer. The
only VM operation involved is wiring and then unwiring the uio.

The complication is that the driver (cxgbe's t4_tom in this case) has
absolutely no idea what an application does (blocking read vs.
poll/select+read vs. aio_read vs. ...) so it makes some safe but
suboptimal choices. It would be nice if there were an API (very vaguely
along the lines of madvise but for sockets, or maybe a sockopt knob)
that an application could use to provide hints about its behavior.

We could also do with separate zero-copy flavors of the sosend/soreceive
usrreqs. And more hints (per read/write operation) that might let us
avoid even the wire/unwire operation.

Anyway, let's save this discussion for later, when either of us has the
time to come up with a specific set of proposals for -net and -arch.

Regards,
Navdeep

>
> I can see a method of doing zero-copy writes to the network stack -
> look at what the AIO code does in the physical IO path for doing
> writes. It wires down the memory and stuffs it into the buffer.
>
> The thing I haven't yet sorted out is what to do about mappings in
> case kernel code wants to peek at the socket data payload for whatever
> reason.
>
> (And yes, reads are still a problem.)
>
>
> -a
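
For illustration only, the "sockopt knob" idea from the reply above might
look roughly like the sketch below from the application's side. Nothing
here is an existing FreeBSD interface: the option name
SO_RX_HINT_HYPOTHETICAL, its number, the rx_hint values, and the helper
sock_set_rx_hint() are all made up for this example. The only real call
used is setsockopt(2). The sketch assumes an application would want to
treat an unsupported hint as harmless and carry on with the normal copy
path.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/*
 * HYPOTHETICAL option number: no such socket option exists. It is
 * chosen only so that a kernel is expected to reject it as unknown.
 */
#define SO_RX_HINT_HYPOTHETICAL 0x7ffe

/* HYPOTHETICAL hint values describing how the application reads. */
enum rx_hint {
	RX_HINT_NONE = 0,	/* no information given */
	RX_HINT_BLOCKING_READ,	/* blocking read()/recv() */
	RX_HINT_POLL_THEN_READ,	/* poll/select followed by read */
	RX_HINT_AIO_READ	/* aio_read() */
};

/*
 * Offer a receive-behavior hint for a socket.
 *
 * Returns 0 if the option was accepted, 1 if it was rejected as an
 * unknown option (the caller can simply continue without it), and
 * -1 on any other error, with errno left as set by setsockopt().
 */
int
sock_set_rx_hint(int fd, enum rx_hint hint)
{
	int value = (int)hint;

	/* Passing a value outside the enum is a programming error. */
	assert(value >= RX_HINT_NONE && value <= RX_HINT_AIO_READ);

	if (setsockopt(fd, SOL_SOCKET, SO_RX_HINT_HYPOTHETICAL,
	    &value, sizeof(value)) == 0)
		return (0);
	if (errno == ENOPROTOOPT)
		return (1);	/* hint not supported; not fatal */
	return (-1);
}
```

An application would call sock_set_rx_hint(fd, RX_HINT_BLOCKING_READ)
once after creating the socket. Since the option is invented, a call on
an existing system should come back as "not supported" rather than
succeed. A per-operation hint, as also suggested above, would need a
different mechanism than a per-socket option and is not sketched here.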