Date: Sun, 29 Mar 2015 11:19:02 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Adrian Chadd <adrian@freebsd.org>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: irq cpu binding
Message-ID: <20150329081902.GN23643@zxy.spb.ru>
In-Reply-To: <CAJ-VmonmeeTaSOpSJCJTP7yTeno1LEt-dt9wEhNHk36oY6yY7Q@mail.gmail.com>
References: <20150328221621.GG23643@zxy.spb.ru>
 <CAJ-Vmomd6Z5Ou7cvV1Kg4m=X2907507hqKMWiz6ssZ45Pi_-Dg@mail.gmail.com>
 <20150328224634.GH23643@zxy.spb.ru>
 <CAJ-VmokwGgHGP6AjBcGbyJShBPX6dyJjjNeCBcjxLi1obaiRtQ@mail.gmail.com>
 <20150328230533.GI23643@zxy.spb.ru>
 <CAJ-VmongWE_z7Rod8-SoFmyiLqiTbHtSaAwjgAs05L_Z3jrWXA@mail.gmail.com>
 <20150328234116.GJ23643@zxy.spb.ru>
 <CAJ-VmokSHHm3kMwz=bp7VbgZwADD2_pEr27NdzUfkGq1U=x_sw@mail.gmail.com>
 <20150329003354.GK23643@zxy.spb.ru>
 <CAJ-VmonmeeTaSOpSJCJTP7yTeno1LEt-dt9wEhNHk36oY6yY7Q@mail.gmail.com>
On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:

> >> * It turns out that fragments were being 100% handled out of order
> >> (compared to non-fragments in the same stream) when doing fragment
> >> reassembly, because the current system was assuming direct dispatch
> >> netisr and not checking any packet contents for whether they're on the
> >> wrong CPU. I checked. It's not noticeable unless you go digging, but
> >> it's absolutely happening. That's why I spun a lot of cycles looking
> >> at the IP fragment reassembly path and which methods get called on the
> >> frames as they're reinjected.
> >
> > In the case of a fragmented packet, the first fragment (which may not
> > arrive first) contains the L4 information and is dispatched to the
> > correct bucket; the other fragments don't contain this information and
> > are dispatched anywhere. As I understand it, the IP stack gathers the
> > whole packet before processing. All we need is to do the processing on
> > the CPU where the first fragment arrived.
>
> I'm pretty sure that wasn't what was happening when I went digging. I
> was using UDP and varying the transmit size so I had exact control
> over the fragmentation.
>
> The driver rx path does direct dispatch netisr processing, and for
> fragments it was hashed on only L3 details not L4. Even the first
> frame is hashed on L3 only. So it'd go to a different queue compared
> to L4 hashing, and subsequent fragments would come in on the same
> queue. Once it was completed, it was processed up inline - it wasn't
> going back into netisr and getting re-checked for the right queue.

There are two options: 1) keep this behavior, or 2) rewrite it to
reschedule. I think 1) is acceptable -- fragmented packets are very rare
compared to the target data rate (2 Mpps and more).

> > What's the problem there?
> > I am not interested in how the NIC does the hashing (anyway, hashing
> > for direct and return traffic is different -- this is not Tilera).
> > All I need is to distribute flows to CPUs, to balance the load and
> > reduce lock contention.
>
> Right, but you assume all packets in a flow go to the same CPU, and I
> discovered this wasn't the case.
> That's why I went down the path with RSS to make it right.

Only in the fragmented-packet case, or in other cases too?

> >> * For applications - I'm not sure yet, but at the minimum the librss
> >> API I have vaguely sketched out and coded up in a git branch lets you
> >> pull out the list of buckets and which CPU it's on. I'm going to
> >> extend that a bit more, but it should be enough for things like nginx
> >> to say "ok, start up one nginx process per RSS bucket, and here's the
> >> CPU set for it to bind to." You said it has worker groups - that's
> >> great; I want that to be auto configured.
> >
> > For applications the minimum is that (per socket) select/kqueue/accept
> > only returns flows that arrived on the CPU matching the CPU doing the
> > select/kqueue/accept (yes, for this to work correctly the application
> > must be pinned to that CPU).
> >
> > And the application doesn't need to know anything about buckets etc.
> >
> > After that, an arriving packet activates the IRQ handler, ithread,
> > driver interrupt thread, TCP stack, select/accept, read, write,
> > tcp_output -- all on the same CPU. I may be wrong, but this should
> > preserve the L2/L3 caches.
> >
> > Where am I misunderstanding?
>
> The other half of the network stack - the sending side - also needs to
> be either on the same or nearby CPU, or you still end up with lock
> contention and cache thrashing.

For incoming connections this will be automatic -- sending will happen
from the CPU bound to the receiving queue. Outgoing connections are the
more complex case, yes.
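To be concrete about the incoming case, this is roughly the userland
side I have in mind (rough, untested sketch using plain
cpuset_setaffinity(2) and kqueue(2), not your librss API; the "only
flows delivered to this CPU show up here" part is the kernel behaviour
I'm proposing, not what the stock stack guarantees today):

/*
 * Untested sketch: one worker thread per RX queue/CPU.  Each worker
 * pins itself to its CPU and then does kqueue/accept/read/write there.
 * The per-CPU flow delivery is what the kernel side would provide;
 * this only shows the application keeping all its work on one CPU.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <err.h>
#include <unistd.h>

static void
rx_worker(int cpu, int lsock)
{
	cpuset_t mask;
	struct kevent kev, ev;
	int kq;

	/* Pin this thread to the CPU that services the RX queue. */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	kq = kqueue();
	if (kq < 0)
		err(1, "kqueue");
	EV_SET(&kev, lsock, EVFILT_READ, EV_ADD, 0, 0, NULL);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) != 0)
		err(1, "kevent");

	for (;;) {
		if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)
			continue;
		int fd = accept(lsock, NULL, NULL);
		if (fd < 0)
			continue;
		/*
		 * accept/read/write/close all run on the same CPU the
		 * packets were delivered to, so the whole receive and
		 * transmit path stays on one cache.  Real work goes here.
		 */
		close(fd);
	}
}

One such worker per RX queue/CPU, all on the listen socket passed in as
lsock.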
For outgoing connections you would need to transfer the FD (with
re-binding) and signaling (from kernel to application) about the
preferred CPU. The preferred CPU is the one where the SYN-ACK arrived.
And this needs assistance from the application. But at the moment I
can't think of an application that serves a massive number of outgoing
connections.
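The FD hand-off itself is just the usual descriptor passing over a
unix-domain socketpair; a rough, untested sketch below (the
kernel-to-application "preferred CPU" notification doesn't exist, so
how the sender picks the target worker is left out):

/*
 * Untested sketch: handing a connected socket to the worker pinned on
 * the preferred CPU via SCM_RIGHTS over a unix-domain socket.
 */
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Pass fd to the peer worker over unix-domain socket 'chan'. */
static int
send_fd(int chan, int fd)
{
	struct msghdr msg;
	struct cmsghdr *cmsg;
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} cbuf;
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf.buf;
	msg.msg_controllen = sizeof(cbuf.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return (sendmsg(chan, &msg, 0) == 1 ? 0 : -1);
}

/* Receive a descriptor sent with send_fd(); returns it, or -1. */
static int
recv_fd(int chan)
{
	struct msghdr msg;
	struct cmsghdr *cmsg;
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} cbuf;
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf.buf;
	msg.msg_controllen = sizeof(cbuf.buf);

	if (recvmsg(chan, &msg, 0) <= 0)
		return (-1);
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
	    cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return (fd);
}

The receiving worker is already pinned to the preferred CPU (as in the
earlier sketch), so after recv_fd() all further read/write/tcp_output
for that connection happens there.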