Date: Tue, 7 Apr 2009 05:11:49 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Ivan Voras <ivoras@freebsd.org>, Robert Watson <rwatson@FreeBSD.org>
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?
Message-ID: <952316.35609.qm@web63906.mail.re1.yahoo.com>
In-Reply-To: <alpine.BSF.2.00.0904061238250.34905@fledge.watson.org>

--- On Mon, 4/6/09, Robert Watson <rwatson@FreeBSD.org> wrote:

> From: Robert Watson <rwatson@FreeBSD.org>
> Subject: Re: Advice on a multithreaded netisr patch?
> To: "Ivan Voras" <ivoras@freebsd.org>
> Cc: freebsd-net@freebsd.org
> Date: Monday, April 6, 2009, 7:59 AM
>
> On Mon, 6 Apr 2009, Ivan Voras wrote:
>
> >>> I'd like to understand more. If (in netisr) I have a mbuf with
> >>> headers, is this data already transferred from the card or is it
> >>> magically "not here yet"?
> >>
> >> A lot depends on the details of the card and driver. The driver will
> >> take cache misses on the descriptor ring entry, if it's not already
> >> in cache, and the link layer will take a cache miss on the front of
> >> the ethernet frame in the cluster pointed to by the mbuf header as
> >> part of its demux. What happens next depends on your dispatch model
> >> and cache line size. Let's make a few simplifying assumptions that
> >> are mostly true:
> >
> > So, a mbuf can reference data not yet copied from the NIC hardware?
> > I'm specifically trying to understand what m_pullup() does.
>
> I think we're talking slightly at cross purposes. There are two
> transfers of interest:
>
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
>
> By the time you receive an interrupt, the DMA is complete, so once you
> believe a packet referenced by the descriptor ring is done, you don't
> have to wait for DMA. However, the packet data is in main memory
> rather than your CPU cache, so you'll need to take a cache miss in
> order to retrieve it. You don't want to prefetch before you know the
> packet data is there, or you may prefetch stale data from the previous
> packet sent or received from the cluster.
>
> m_pullup() has to do with mbuf chain memory contiguity during packet
> processing.
> The usual usage is something along the following lines:
>
> 	struct whatever *w;
>
> 	m = m_pullup(m, sizeof(*w));
> 	if (m == NULL)
> 		return;
> 	w = mtod(m, struct whatever *);
>
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data
> are contiguously stored so that the cast of w to m's data will point
> at a complete structure we can use to interpret packet data. In the
> common case in the receipt path, m_pullup() should be a no-op, since
> almost all drivers receive data in a single cluster.
>
> However, there are cases where it might not happen, such as loopback
> traffic where unusual encapsulation is used, leading to a call to
> M_PREPEND() that inserts a new mbuf on the front of the chain, which
> is later m_defrag()'d leading to a higher level header crossing a
> boundary or the like.
>
> This issue is almost entirely independent from things like the cache
> line miss issue, unless you hit the uncommon case of having to do work
> in m_pullup(), in which case life sucks.
>
> It would be useful to use DTrace to profile a number of the workfull
> m_foo() functions to make sure we're not hitting them in normal
> workloads, btw.
>
> >>> As the card and the OS can already process many packets per second
> >>> for something fairly complex as routing
> >>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100%
> >>> of a core, isn't this indication there's certainly more space for
> >>> improvement even with a single-queue old-fashioned NICs?
> >>
> >> Maybe. It depends on the relative costs of local processing vs
> >> redistributing the work, which involves schedulers, IPIs,
> >> additional cache misses, lock contention, and so on. This means
> >> there's a period where it can't possibly be a win, and then at some
> >> point it's a win as long as the stack scales.
> >> This is essentially the usual trade-off in using threads and
> >> parallelism: does the benefit of multiple parallel execution units
> >> make up for the overheads of synchronization and data migration?
> >
> > Do you have any idea at all why I'm seeing the weird difference of
> > netstat packets per second (250,000) and my application's TCP
> > performance (< 1,000 pps)? Summary: each packet is guaranteed to be
> > a whole message causing a transaction in the application - without
> > the changes I see pps almost identical to tps. Even if the source of
> > netstat statistics somehow manages to count packets multiple times
> > (I don't see how that can happen), no relation can describe
> > differences this huge. It almost looks like something in the upper
> > layers is discarding packets (also not likely: TCP timeouts would
> > occur and the application wouldn't be able to push 250,000 pps) -
> > but what? Where to look?
>
> Is this for the loopback workload? If so, remember that there may be
> some other things going on:
>
> - Every packet is processed at least two times: once when sent, and
>   then again when it's received.
>
> - A TCP segment will need to be ACK'd, so if you're sending data in
>   chunks in one direction, the ACKs will not be piggy-backed on
>   existing data transfers, and instead be sent independently, hitting
>   the network stack two more times.
>
> - Remember that TCP works to expand its window, and then maintains the
>   highest performance it can by bumping up against the top of
>   available bandwidth continuously. This involves detecting buffer
>   limits by generating packets that can't be sent, adding to the
>   packet count. With loopback traffic, the drop point occurs when you
>   exceed the size of the netisr's queue for IP, so you might try
>   bumping that from the default to something much larger.
>
> And nothing beats using tcpdump -- have you tried tcpdumping the
> loopback to see what is actually being sent?
> If not, that's always educational -- perhaps something weird is going
> on with delayed ACKs, etc.
>
> > You mean for the general code? I purposely don't lock my statistics
> > variables because I'm not that interested in exact numbers (orders
> > of magnitude are relevant). As far as I understand, unlocked "x++"
> > should be trivially fast in this case?
>
> No. x++ is massively slow if executed in parallel across many cores
> on a variable in a single cache line. See my recent commit to
> kern_tc.c for an example: the updating of trivial statistics for the
> kernel time calls reduced 30m syscalls/second to 3m syscalls/second
> due to heavy contention on the cache line holding the statistic. One
> of my goals for 8.0 is to fix this problem for IP and TCP layers, and
> ideally also ifnet but we'll see. We should be maintaining those
> stats per-CPU and then aggregating to report them to userspace. This
> is what we already do for a number of system stats -- UMA and kernel
> malloc, syscall and trap counters, etc.
>
> >> - Use cpuset to pin ithreads, the netisr, and whatever else, to
> >>   specific cores so that they don't migrate, and if your system
> >>   uses HTT, experiment with pinning the ithread and the netisr on
> >>   different threads on the same core, or at least, different cores
> >>   on the same die.
> >
> > I'm using em hardware; I still think there's a possibility I'm
> > fighting the driver in some cases but this has priority #2.
>
> Have you tried LOCK_PROFILING? It would quickly tell you if driver
> locks were a source of significant contention. It works quite well...

When I enabled LOCK_PROFILING my side modules, such as if_ibg,
stopped working. It seems that the ifnet structure or something
changed with that option enabled. Is there a way to sync this without
having to integrate everything into a specific kernel build?

Barney
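
The per-CPU statistics idea Robert describes in the quoted text (keep one counter per CPU, aggregate only when reporting) can be sketched in userspace C roughly as follows. This is an illustrative sketch only, not the FreeBSD kernel code: the names padded_counter, pkt_count, stat_inc, stat_read and run_workers are hypothetical, worker threads stand in for CPUs, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumption: 64-byte cache lines */
#define NSLOTS     4	/* hypothetical: one slot per worker "CPU" */

/*
 * One counter per slot. The alignment pads each slot out to a full
 * cache line, so two workers never write to the same line.
 */
struct padded_counter {
	_Alignas(CACHE_LINE) uint64_t value;
};

static struct padded_counter pkt_count[NSLOTS];

/* Fast path: a worker touches only its own slot; no lock, no sharing. */
static void
stat_inc(int slot)
{
	pkt_count[slot].value++;
}

/* Slow path: sum every slot when the statistic is reported. */
static uint64_t
stat_read(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < NSLOTS; i++)
		sum += pkt_count[i].value;
	return (sum);
}

struct worker_arg {
	int slot;
	uint64_t iterations;
};

static void *
worker(void *p)
{
	const struct worker_arg *a = p;

	for (uint64_t i = 0; i < a->iterations; i++)
		stat_inc(a->slot);
	return (NULL);
}

/*
 * Start NSLOTS workers, each bumping its own slot `iterations` times,
 * and return the aggregate once all of them have been joined.
 */
uint64_t
run_workers(uint64_t iterations)
{
	pthread_t tid[NSLOTS];
	struct worker_arg args[NSLOTS];

	for (int i = 0; i < NSLOTS; i++) {
		pkt_count[i].value = 0;
		args[i].slot = i;
		args[i].iterations = iterations;
		int rc = pthread_create(&tid[i], NULL, worker, &args[i]);
		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < NSLOTS; i++)
		pthread_join(tid[i], NULL);
	return (stat_read());
}
```

Calling run_workers(n) returns NSLOTS * n once every worker has been joined. In this sketch the total is read only after the joins; a reader that summed the slots while workers were still running would get an approximate snapshot at best, which matches the "orders of magnitude" use Ivan describes but would need atomics to be well-defined C.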