From: Adrian Chadd <adrian.chadd@gmail.com>
Date: Sun, 29 Mar 2015 08:20:25 -0700
Subject: Re: irq cpu binding
To: Slawa Olhovchenkov
Cc: "freebsd-hackers@freebsd.org"
List-Id: Technical Discussions relating to FreeBSD

On 29 March 2015 at 01:19, Slawa Olhovchenkov wrote:
> On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:
>
>> >> * It turns out that fragments were being 100% handled out of order
>> >> (compared to non-fragments in the same stream) when doing fragment
>> >> reassembly, because the current system was assuming direct dispatch
>> >> netisr and not checking any packet contents for whether they're on
>> >> the wrong CPU. I checked. It's not noticeable unless you go digging,
>> >> but it's absolutely happening. That's why I spent a lot of cycles
>> >> looking at the IP fragment reassembly path and which methods get
>> >> called on the frames as they're reinjected.
>> >
>> > In the case of a fragmented packet, the first fragment (which may not
>> > arrive first) contains the L4 information and is dispatched to the
>> > correct bucket; the other fragments don't contain this information
>> > and are dispatched anywhere. As I understand it, the IP stack gathers
>> > the whole packet before processing. All we need is to do the
>> > processing on the CPU where the first fragment arrived.
>>
>> I'm pretty sure that wasn't what was happening when I went digging. I
>> was using UDP and varying the transmit size so I had exact control
>> over the fragmentation.
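(For illustration only: a minimal sketch in C of that kind of test, not the actual tool used in the thread. The 1500-byte MTU, the TEST-NET destination address and the discard port are assumptions; with a 1500-byte MTU anything up to about 1472 bytes of UDP payload goes out as a single frame, and anything larger forces the stack to emit IP fragments.)

/*
 * Sketch: send UDP datagrams of increasing size so that payloads larger
 * than the path MTU are split into IP fragments by the sending stack.
 * Destination address and port are placeholders.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s == -1)
		err(1, "socket");

	struct sockaddr_in dst;
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);			/* discard; placeholder */
	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);	/* TEST-NET; placeholder */

	char buf[8192];
	memset(buf, 'x', sizeof(buf));

	/*
	 * Below ~1472 bytes the datagram fits in one 1500-byte MTU frame;
	 * above that the stack has to fragment it.
	 */
	for (size_t len = 512; len <= sizeof(buf); len += 512) {
		if (sendto(s, buf, len, 0,
		    (struct sockaddr *)&dst, sizeof(dst)) == -1)
			warn("sendto(%zu)", len);
	}
	close(s);
	return (0);
}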
>>
>> The driver rx path does direct dispatch netisr processing, and for
>> fragments it was hashed on only L3 details, not L4. Even the first
>> frame is hashed on L3 only. So it'd go to a different queue compared
>> to L4 hashing, and subsequent fragments would come in on the same
>> queue. Once it was completed, it was processed up inline - it wasn't
>> going back into netisr and getting re-checked for the right queue.
>
> Two cases:
> 1) keep this behaviour;
> 2) rewrite it to reschedule.
>
> I think 1) is acceptable -- fragmented packets are very rare compared
> to the target data rate (2 Mpps and more).
>
>> > What's the problem there?
>> > I'm not interested in how the NIC does the hashing (anyway, the
>> > hashing for direct and return traffic is different -- this is not
>> > Tilera). All I need is to distribute flows across CPUs, to balance
>> > load and reduce lock contention.
>>
>> Right, but you assume all packets in a flow go to the same CPU, and I
>> discovered this wasn't the case.
>> That's why I went down the path with RSS to make it right.
>
> Only in the fragmented-packets case, or in other cases too?
>
>> >> * For applications - I'm not sure yet, but at the minimum the librss
>> >> API I have vaguely sketched out and coded up in a git branch lets you
>> >> pull out the list of buckets and which CPU each one is on. I'm going
>> >> to extend that a bit more, but it should be enough for things like
>> >> nginx to say "ok, start up one nginx process per RSS bucket, and
>> >> here's the CPU set for it to bind to." You said it has worker groups
>> >> - that's great; I want that to be auto-configured.
>> >
>> > For applications the minimum is that (per socket) select/kqueue/accept
>> > only returns flows that arrived on the CPU the application is running
>> > on at select/kqueue/accept time (yes, for this to work correctly the
>> > application must be pinned to that CPU).
>> >
>> > And the application doesn't need to know anything about buckets etc.
>> >
>> > After that, an arriving packet drives the IRQ handler, ithread, driver
>> > interrupt thread, TCP stack, select/accept, read, write, tcp_output --
>> > all on the same CPU. I may be wrong, but this should keep the L2/L3
>> > caches warm.
>> >
>> > Where am I misunderstanding?
>>
>> The other half of the network stack - the sending side - also needs to
>> be either on the same or a nearby CPU, or you still end up with lock
>> contention and cache thrashing.
>
> For incoming connections this is automatic -- sending will happen from
> the CPU bound to the receiving queue.
>
> Outgoing connections are the more complex case, yes.
> They need the FD to be transferred (with re-binding) and signalling
> (from kernel to application) about the preferred CPU; the preferred CPU
> is the one the SYN-ACK arrives on. And this needs assistance from the
> application. But I can't currently think of an application serving
> massive numbers of outgoing connections.

Or you realise you need to rewrite your userland application so it
doesn't have to do this, and instead uses an IOCP/libdispatch-style IO
API to register for IO events and have IO completions occur in any
given completion thread. Then it doesn't have to care about moving
descriptors around - it just creates an outbound socket, and the IO
completion callbacks will happen wherever they need to happen. If that
needs to shuffle around due to RSS rebalancing, then it'll "just
happen".

And yeah, I know of plenty of applications doing massive numbers of
outbound connections - anything acting as an intermediary HTTP proxy. :)

-adrian
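To make the "one worker per RSS bucket, bound to that bucket's CPU set" idea above concrete, here is a minimal userland sketch in C. It assumes a kernel built with "options RSS" (so the net.inet.rss.buckets sysctl is present) and, purely as a placeholder, pins bucket i to CPU i; a real program would take the bucket-to-CPU mapping from the librss branch mentioned above (or whatever mapping the kernel exposes) rather than guessing.

/*
 * Sketch: spawn one pinned worker process per RSS bucket.
 * Assumptions (not from the original mail): "options RSS" kernel,
 * and a placeholder bucket->CPU map of bucket i -> CPU i.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
worker(int bucket)
{
	cpuset_t mask;

	CPU_ZERO(&mask);
	CPU_SET(bucket, &mask);		/* placeholder bucket->CPU map */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) == -1)
		err(1, "cpuset_setaffinity");

	/*
	 * Here the worker would open its listen socket, opt into its RSS
	 * bucket, and run its own kqueue()/accept() loop so the whole
	 * rx -> application -> tx path for its flows stays on this CPU.
	 */
	printf("worker %d pinned, pid %d\n", bucket, (int)getpid());
	pause();
	_exit(0);
}

int
main(void)
{
	int nbuckets;
	size_t len = sizeof(nbuckets);

	if (sysctlbyname("net.inet.rss.buckets", &nbuckets, &len,
	    NULL, 0) == -1)
		err(1, "net.inet.rss.buckets (kernel built with options RSS?)");

	for (int i = 0; i < nbuckets; i++) {
		pid_t pid = fork();
		if (pid == -1)
			err(1, "fork");
		if (pid == 0)
			worker(i);	/* child never returns */
	}
	return (0);
}

Each child then owns one bucket's flows end to end, which is exactly the lock-contention and cache-locality win discussed in the thread; the remaining open question from the mail is the outbound-connection case, where the preferred CPU is only known once the SYN-ACK arrives.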