From owner-freebsd-hackers@FreeBSD.ORG Sun Mar 29 08:19:04 2015
Date: Sun, 29 Mar 2015 11:19:02 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Adrian Chadd
Cc: "freebsd-hackers@freebsd.org"
Subject: Re: irq cpu binding
Message-ID: <20150329081902.GN23643@zxy.spb.ru>
References: <20150328221621.GG23643@zxy.spb.ru>
 <20150328224634.GH23643@zxy.spb.ru>
 <20150328230533.GI23643@zxy.spb.ru>
 <20150328234116.GJ23643@zxy.spb.ru>
 <20150329003354.GK23643@zxy.spb.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
List-Id: Technical Discussions relating to FreeBSD

On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:

> >> * It turns out that fragments were being 100% handled out of order
> >> (compared to non-fragments in the same stream) when doing fragment
> >> reassembly, because the current system was assuming direct dispatch
> >> netisr and not checking any packet contents for whether they're on
> >> the wrong CPU. I checked. It's not noticeable unless you go digging,
> >> but it's absolutely happening. That's why I spun a lot of cycles
> >> looking at the IP fragment reassembly path and which methods get
> >> called on the frames as they're reinjected.
> >
> > In the case of a fragmented packet, the first fragment (which may not
> > arrive first) contains the L4 information and is dispatched to the
> > correct bucket; the other fragments don't contain this information
> > and are dispatched anywhere. As I understand it, the IP stack gathers
> > the whole packet before processing it. All we need is to do the
> > processing on the CPU where the first fragment arrived.
>
> I'm pretty sure that wasn't what was happening when I went digging. I
> was using UDP and varying the transmit size so I had exact control
> over the fragmentation.
>
> The driver rx path does direct dispatch netisr processing, and for
> fragments it was hashed on only L3 details, not L4. Even the first
> frame is hashed on L3 only. So it'd go to a different queue compared
> to L4 hashing, and subsequent fragments would come in on the same
> queue. Once it was completed, it was processed up inline - it wasn't
> going back into netisr and getting re-checked for the right queue.

There are two options: 1) keep this behavior, or 2) rewrite it to do
rescheduling. I think 1) is acceptable -- fragmented packets are very
rare compared to the target data rate (2 Mpps and more).
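To illustrate the hashing point (a minimal sketch only, not the in-tree
netisr/RSS code; placeholder_hash() is a stand-in for the real keyed
Toeplitz hash): a non-first fragment simply has no L4 ports to hash, so
the dispatcher can only fall back to an L3 (address-only) hash, and that
bucket is in general not the one the flow's 4-tuple hash points at.

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>

/*
 * Placeholder hash so the sketch is self-contained (FNV-1a); the real
 * RSS code uses a Toeplitz hash with a configured key.
 */
static uint32_t
placeholder_hash(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t h = 2166136261u;

	while (len-- > 0)
		h = (h ^ *p++) * 16777619u;
	return (h);
}

static uint32_t
rx_pick_bucket(const struct ip *ip, u_int nbuckets)
{
	uint16_t off = ntohs(ip->ip_off);

	if ((off & (IP_MF | IP_OFFMASK)) == 0 &&
	    (ip->ip_p == IPPROTO_UDP || ip->ip_p == IPPROTO_TCP)) {
		/*
		 * Unfragmented datagram: the L4 ports are present, hash
		 * the 4-tuple.  (TCP ports sit at the same offsets as
		 * UDP ports, so struct udphdr is enough here.)
		 */
		const struct udphdr *uh = (const struct udphdr *)
		    ((const char *)ip + (ip->ip_hl << 2));
		struct {
			struct in_addr src, dst;
			uint16_t sport, dport;
		} k4 = { ip->ip_src, ip->ip_dst, uh->uh_sport, uh->uh_dport };

		return (placeholder_hash(&k4, sizeof(k4)) % nbuckets);
	}

	/*
	 * Fragment (and, in the behavior Adrian describes, even the first
	 * fragment): hash on the addresses only, so the chosen bucket
	 * generally differs from the 4-tuple bucket of the same flow.
	 */
	struct {
		struct in_addr src, dst;
	} k2 = { ip->ip_src, ip->ip_dst };

	return (placeholder_hash(&k2, sizeof(k2)) % nbuckets);
}

Option 2) above would mean re-running this selection on the reassembled
datagram, once the ports are known again, and rescheduling it if the
bucket differs.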
> > What's the problem there?
> > I'm not interested in how the NIC does the hashing (anyway, the
> > hashing for forward and reverse traffic is different -- this is not
> > Tilera). All I need is to distribute flows across CPUs, to balance
> > the load and reduce lock contention.
>
> Right, but you assume all packets in a flow go to the same CPU, and I
> discovered this wasn't the case.
> That's why I went down the path with RSS to make it right.

Is that only the fragmented-packets case, or other cases too?

> >> * For applications - I'm not sure yet, but at the minimum the librss
> >> API I have vaguely sketched out and coded up in a git branch lets you
> >> pull out the list of buckets and which CPU it's on. I'm going to
> >> extend that a bit more, but it should be enough for things like nginx
> >> to say "ok, start up one nginx process per RSS bucket, and here's the
> >> CPU set for it to bind to." You said it has worker groups - that's
> >> great; I want that to be auto configured.
> >
> > For applications the minimum is that (per socket) select/kqueue/accept
> > only returns flows that arrived on the CPU the process is on at the
> > time of the select/kqueue/accept call (yes, for this to work correctly
> > the application must be pinned to that CPU).
> >
> > And the application doesn't need to know anything about buckets, etc.
> >
> > After that, an arriving packet activates the IRQ handler, the ithread,
> > the driver interrupt thread, the TCP stack, select/accept, read,
> > write, tcp_output -- all on the same CPU. I may be wrong, but this
> > should preserve the L2/L3 caches.
> >
> > Where am I misunderstanding?
>
> The other half of the network stack - the sending side - also needs to
> be either on the same or a nearby CPU, or you still end up with lock
> contention and cache thrashing.

For incoming connections this will be automatic -- sending will happen
from the CPU bound to the receiving queue.

Outgoing connections are the more complex case, yes. They need FD
transfer (with re-binding) and signaling (from kernel to application)
about the preferred CPU; the preferred CPU is the one on which the
SYN-ACK arrived. And this needs assistance from the application. But at
the moment I can't think of an application that serves a massive number
of outgoing connections.
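To make the application-side model above concrete, here is a minimal
sketch of "one worker process per RSS bucket, pinned to that bucket's
CPU". The bucket-to-CPU table is invented for the example (that is the
part librss is supposed to report); only the pinning itself uses the
standard cpuset_setaffinity(2) interface.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

static void
worker_main(int bucket, int cpu)
{
	cpuset_t mask;

	/* Pin this worker to the CPU that services its RSS bucket. */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	/*
	 * From here on the worker would open its listen socket and run
	 * its kqueue/accept/read/write loop; with the pinning in place,
	 * that all happens on the same CPU that handles the queue's
	 * interrupts and TCP processing.
	 */
	printf("worker for bucket %d pinned to CPU %d\n", bucket, cpu);
	pause();
}

int
main(void)
{
	/* Invented bucket -> CPU map; librss would provide the real one. */
	const int bucket_cpu[] = { 0, 1, 2, 3 };
	const int nbuckets = sizeof(bucket_cpu) / sizeof(bucket_cpu[0]);
	int i;

	for (i = 0; i < nbuckets; i++) {
		pid_t pid = fork();

		if (pid == -1)
			err(1, "fork");
		if (pid == 0) {
			worker_main(i, bucket_cpu[i]);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return (0);
}

For incoming connections that is all that is needed; the outgoing
case described above (re-binding the FD to the CPU where the SYN-ACK
arrived) would have to sit on top of this with kernel assistance.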