Date:        Mon, 6 Apr 2009 12:59:10 +0100 (BST)
From:        Robert Watson <rwatson@FreeBSD.org>
To:          Ivan Voras <ivoras@freebsd.org>
Cc:          freebsd-net@freebsd.org
Subject:     Re: Advice on a multithreaded netisr patch?
Message-ID:  <alpine.BSF.2.00.0904061238250.34905@fledge.watson.org>
In-Reply-To: <grbcfg$poe$1@ger.gmane.org>
References:  <gra7mq$ei8$1@ger.gmane.org>
             <alpine.BSF.2.00.0904051422280.12639@fledge.watson.org>
             <grac1s$p56$1@ger.gmane.org>
             <alpine.BSF.2.00.0904051440460.12639@fledge.watson.org>
             <grappq$tsg$1@ger.gmane.org>
             <alpine.BSF.2.00.0904052243250.34905@fledge.watson.org>
             <grbcfg$poe$1@ger.gmane.org>
On Mon, 6 Apr 2009, Ivan Voras wrote:

>>> I'd like to understand more. If (in netisr) I have a mbuf with headers,
>>> is this data already transferred from the card or is it magically "not
>>> here yet"?
>>
>> A lot depends on the details of the card and driver. The driver will
>> take cache misses on the descriptor ring entry, if it's not already in
>> cache, and the link layer will take a cache miss on the front of the
>> ethernet frame in the cluster pointed to by the mbuf header as part of
>> its demux. What happens next depends on your dispatch model and cache
>> line size. Let's make a few simplifying assumptions that are mostly
>> true:
>
> So, a mbuf can reference data not yet copied from the NIC hardware? I'm
> specifically trying to understand what m_pullup() does.

I think we're talking slightly at cross purposes. There are two transfers
of interest:

(1) DMA of the packet data to main memory from the NIC

(2) Servicing of CPU cache misses to access data in main memory

By the time you receive an interrupt, the DMA is complete, so once you
believe a packet referenced by the descriptor ring is done, you don't
have to wait for DMA. However, the packet data is in main memory rather
than your CPU cache, so you'll need to take a cache miss in order to
retrieve it. You don't want to prefetch before you know the packet data
is there, or you may prefetch stale data from the previous packet sent or
received from the cluster.

m_pullup() has to do with mbuf chain memory contiguity during packet
processing. The usual usage is something along the following lines:

	struct whatever *w;

	m = m_pullup(m, sizeof(*w));
	if (m == NULL)
		return;
	w = mtod(m, struct whatever *);

m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are
stored contiguously, so that casting m's data pointer to w gives us a
pointer to a complete structure we can use to interpret the packet data.

In the common case in the receive path, m_pullup() should be a no-op,
since almost all drivers receive each packet in a single cluster.
However, there are cases where that might not hold, such as loopback
traffic where unusual encapsulation is used: a call to M_PREPEND()
inserts a new mbuf on the front of the chain, which is later
m_defrag()'d, leading to a higher-level header crossing an mbuf boundary
or the like.

This issue is almost entirely independent of things like the cache line
miss issue, unless you hit the uncommon case of having to do work in
m_pullup(), in which case life sucks. It would be useful to use DTrace to
profile the m_foo() functions that do real work, to make sure we're not
hitting them in normal workloads, btw.

>>> As the card and the OS can already process many packets per second for
>>> something as complex as routing (http://www.tancsa.com/blast.html),
>>> and TCP chokes swi:net at 100% of a core, isn't this an indication
>>> that there's certainly more space for improvement even with
>>> single-queue old-fashioned NICs?
>>
>> Maybe. It depends on the relative costs of local processing vs
>> redistributing the work, which involves schedulers, IPIs, additional
>> cache misses, lock contention, and so on. This means there's a period
>> where it can't possibly be a win, and then at some point it's a win as
>> long as the stack scales. This is essentially the usual trade-off in
>> using threads and parallelism: does the benefit of multiple parallel
>> execution units make up for the overheads of synchronization and data
>> migration?
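As a footnote to the m_pullup() example above: in-tree code usually
guards the call with a length check, so the common contiguous case
doesn't even pay for the function call. A minimal sketch, with "struct
whatever" again standing in for a real protocol header:

	struct whatever *w;

	if (m->m_len < sizeof(*w)) {
		/* Header straddles mbufs; compact it into the first. */
		m = m_pullup(m, sizeof(*w));
		if (m == NULL)
			return;		/* m_pullup() freed the chain */
	}
	w = mtod(m, struct whatever *);

Only the slow path pays for the call (and possibly an allocation and a
copy); the fast path is a single comparison against data we were about to
touch anyway.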
> Do you have any idea at all why I'm seeing the weird difference between
> netstat packets per second (250,000) and my application's TCP
> performance (< 1,000 pps)? Summary: each packet is guaranteed to be a
> whole message causing a transaction in the application - without the
> changes I see pps almost identical to tps. Even if the source of netstat
> statistics somehow manages to count packets multiple times (I don't see
> how that can happen), no relation can describe differences this huge. It
> almost looks like something in the upper layers is discarding packets
> (also not likely: TCP timeouts would occur and the application wouldn't
> be able to push 250,000 pps) - but what? Where to look?

Is this for the loopback workload? If so, remember that there may be some
other things going on:

- Every packet is processed at least two times: once when sent, and then
  again when it's received.

- A TCP segment will need to be ACK'd, so if you're sending data in
  chunks in one direction, the ACKs will not be piggy-backed on existing
  data transfers, and will instead be sent independently, hitting the
  network stack two more times.

- Remember that TCP works to expand its window, and then maintains the
  highest performance it can by bumping up against the top of available
  bandwidth continuously. This involves detecting buffer limits by
  generating packets that can't be sent, adding to the packet count.

With loopback traffic, the drop point occurs when you exceed the size of
the netisr's queue for IP, so you might try bumping that from the default
to something much larger. And nothing beats using tcpdump -- have you
tried tcpdumping the loopback interface to see what is actually being
sent? If not, that's always educational -- perhaps something weird is
going on with delayed ACKs, etc.

> You mean for the general code? I purposely don't lock my statistics
> variables because I'm not that interested in exact numbers (orders of
> magnitude are relevant). As far as I understand, unlocked "x++" should
> be trivially fast in this case?

No. x++ is massively slow if executed in parallel across many cores on a
variable in a single cache line. See my recent commit to kern_tc.c for an
example: the updating of trivial statistics for the kernel time calls
reduced 30m syscalls/second to 3m syscalls/second due to heavy contention
on the cache line holding the statistic.

One of my goals for 8.0 is to fix this problem for the IP and TCP layers,
and ideally also for ifnet, but we'll see. We should be maintaining those
stats per-CPU and then aggregating them when reporting to userspace. This
is what we already do for a number of system stats -- UMA and kernel
malloc, syscall and trap counters, etc. (There's a rough sketch of the
pattern at the end of this message.)

>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
>>   specific cores so that they don't migrate, and if your system uses
>>   HTT, experiment with pinning the ithread and the netisr on different
>>   threads on the same core, or at least, different cores on the same
>>   die.
>
> I'm using em hardware; I still think there's a possibility I'm fighting
> the driver in some cases but this has priority #2.

Have you tried LOCK_PROFILING? It would quickly tell you if driver locks
were a source of significant contention. It works quite well...

>> - If your card supports RSS, pass the flowid up the stack in the mbuf
>>   packet header flowid field, and use that instead of the hash for work
>>   placement.
>
> Don't know about em. Don't really want to touch it if I don't have to :)

if_em doesn't support it, but if_igb does.
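Here is that per-CPU statistics sketch. The type and function names are
made up for illustration (this is not an existing kernel interface), but
curcpu, mp_maxid, and MAXCPU are the real MI symbols from sys/pcpu.h,
sys/smp.h, and sys/param.h:

	/* Assumed cache line size; the real value is MD. */
	#define	STAT_CACHE_LINE	64

	struct stat_pcpu {
		uint64_t	sp_count;
		uint8_t		sp_pad[STAT_CACHE_LINE - sizeof(uint64_t)];
	};

	/* One cache line per CPU, so increments never contend. */
	static struct stat_pcpu ip_pkts[MAXCPU];

	static __inline void
	ip_pkts_inc(void)
	{
		/*
		 * No lock, no atomic: each CPU writes only its own slot.
		 * A migration between reading curcpu and storing can lose
		 * the odd increment, which is fine for statistics.
		 */
		ip_pkts[curcpu].sp_count++;
	}

	static uint64_t
	ip_pkts_fetch(void)
	{
		uint64_t total;
		u_int i;

		total = 0;
		for (i = 0; i <= mp_maxid; i++)
			total += ip_pkts[i].sp_count;
		return (total);	/* a snapshot, not an exact count */
	}

The reader is slightly racy, but it never contends with the fast path,
which is the trade you want for counters.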
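And to make the flowid point concrete, here's roughly what a driver that
does have an RSS hash would do on receive, assuming the 8.x-style
M_FLOWID flag; rxd->rss_hash and sw_hash_mbuf() are stand-ins for a real
descriptor field and a software fallback, not actual interfaces:

	/* RX path: the NIC already hashed the flow's 4-tuple for us. */
	m->m_pkthdr.flowid = rxd->rss_hash;	/* rxd is hypothetical */
	m->m_flags |= M_FLOWID;			/* mark the flowid valid */

	/* Work placement: prefer the NIC's hash to hashing in software. */
	if (m->m_flags & M_FLOWID)
		cpu = m->m_pkthdr.flowid % mp_ncpus;
	else
		cpu = sw_hash_mbuf(m) % mp_ncpus; /* touches packet data */

The software fallback has to read the headers to compute a hash; the
flowid path doesn't touch the packet data at all.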
If this saves you a minimum of one and possibly two cache misses per
packet, it could be a huge performance improvement.

Robert N M Watson
Computer Laboratory
University of Cambridge