From: Julian Elischer
Date: Tue, 07 Apr 2009 09:48:11 -0700
To: barney_cordoba@yahoo.com
Cc: freebsd-net@freebsd.org, Robert Watson, Ivan Voras
Subject: Re: Advice on a multithreaded netisr patch?
Message-ID: <49DB83CB.9070707@elischer.org>
In-Reply-To: <952316.35609.qm@web63906.mail.re1.yahoo.com>

Barney Cordoba wrote:
>
> --- On Mon, 4/6/09, Robert Watson wrote:
>
>> From: Robert Watson
>> Subject: Re: Advice on a multithreaded netisr patch?
>> To: "Ivan Voras"
>> Cc: freebsd-net@freebsd.org
>> Date: Monday, April 6, 2009, 7:59 AM
>>
>> On Mon, 6 Apr 2009, Ivan Voras wrote:
>>
>>>>> I'd like to understand more. If (in netisr) I have an mbuf with
>>>>> headers, is this data already transferred from the card, or is it
>>>>> magically "not here yet"?
>>>>
>>>> A lot depends on the details of the card and driver. The driver will
>>>> take cache misses on the descriptor ring entry, if it's not already
>>>> in cache, and the link layer will take a cache miss on the front of
>>>> the ethernet frame in the cluster pointed to by the mbuf header as
>>>> part of its demux. What happens next depends on your dispatch model
>>>> and cache line size. Let's make a few simplifying assumptions that
>>>> are mostly true:
>>>
>>> So, an mbuf can reference data not yet copied from the NIC hardware?
>>> I'm specifically trying to understand what m_pullup() does.
>>
>> I think we're talking slightly at cross purposes. There are two
>> transfers of interest:
>>
>> (1) DMA of the packet data to main memory from the NIC
>> (2) Servicing of CPU cache misses to access data in main memory
>>
>> By the time you receive an interrupt, the DMA is complete, so once you
>> believe a packet referenced by the descriptor ring is done, you don't
>> have to wait for DMA. However, the packet data is in main memory
>> rather than your CPU cache, so you'll need to take a cache miss in
>> order to retrieve it. You don't want to prefetch before you know the
>> packet data is there, or you may prefetch stale data from the previous
>> packet sent or received from the cluster.
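To make that ordering concrete, here is a minimal sketch of an RX path
that only prefetches once the NIC has marked the descriptor done. The
descriptor layout, the rx_softc structure, the RXD_STAT_DD bit and
rx_next_packet() are all invented for illustration (real layouts are
driver-specific, e.g. em/igb); only bus_dmamap_sync(9), mtod() and the
compiler's prefetch builtin are real interfaces.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/bus.h>
	#include <sys/mbuf.h>
	#include <machine/bus.h>

	struct rx_desc {			/* hypothetical descriptor */
		uint32_t	status;		/* NIC sets RXD_STAT_DD when DMA completes */
		uint32_t	length;
	};
	#define	RXD_STAT_DD	0x01

	struct rx_softc {			/* hypothetical per-ring state */
		bus_dma_tag_t	desc_tag;
		bus_dmamap_t	desc_map;
		struct rx_desc	*ring;
		struct mbuf	**mbufs;
	};

	static struct mbuf *
	rx_next_packet(struct rx_softc *sc, int i)
	{
		struct rx_desc *rxd = &sc->ring[i];
		struct mbuf *m;

		/* Make the NIC's descriptor writeback visible to the CPU. */
		bus_dmamap_sync(sc->desc_tag, sc->desc_map,
		    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
		if ((rxd->status & RXD_STAT_DD) == 0)
			return (NULL);	/* DMA not complete: do not prefetch */

		/* Safe now: warm the cache line holding the frame header. */
		m = sc->mbufs[i];
		m->m_len = m->m_pkthdr.len = rxd->length;
		__builtin_prefetch(mtod(m, char *));
		return (m);
	}

A real driver would also sync the payload buffer's own DMA map
(BUS_DMASYNC_POSTREAD) before handing the mbuf up the stack; that is
omitted here to keep the sketch short.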
>> m_pullup() has to do with mbuf chain memory contiguity during packet
>> processing. The usual usage is something along the following lines:
>>
>> 	struct whatever *w;
>>
>> 	m = m_pullup(m, sizeof(*w));
>> 	if (m == NULL)
>> 		return;
>> 	w = mtod(m, struct whatever *);
>>
>> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data
>> are contiguously stored, so that the cast of m's data to w will point
>> at a complete structure we can use to interpret the packet data. In
>> the common case in the receive path, m_pullup() should be a no-op,
>> since almost all drivers receive data in a single cluster.
>>
>> However, there are cases where it might not happen, such as loopback
>> traffic where unusual encapsulation is used, leading to a call to
>> M_PREPEND() that inserts a new mbuf on the front of the chain, which
>> is later m_defrag()'d, leading to a higher-level header crossing a
>> boundary or the like.
>>
>> This issue is almost entirely independent from things like the cache
>> line miss issue, unless you hit the uncommon case of having to do work
>> in m_pullup(), in which case life sucks.
>>
>> It would be useful to use DTrace to profile a number of the work-heavy
>> m_foo() functions to make sure we're not hitting them in normal
>> workloads, btw.
>>
>>>>> As the card and the OS can already process many packets per second
>>>>> for something fairly complex like routing
>>>>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100%
>>>>> of a core, isn't this an indication that there's certainly more
>>>>> room for improvement even with single-queue, old-fashioned NICs?
>>>>
>>>> Maybe. It depends on the relative costs of local processing vs
>>>> redistributing the work, which involves schedulers, IPIs, additional
>>>> cache misses, lock contention, and so on. This means there's a
>>>> period where it can't possibly be a win, and then at some point it's
>>>> a win as long as the stack scales. This is essentially the usual
>>>> trade-off in using threads and parallelism: does the benefit of
>>>> multiple parallel execution units make up for the overheads of
>>>> synchronization and data migration?
>>>
>>> Do you have any idea at all why I'm seeing the weird difference
>>> between netstat packets per second (250,000) and my application's TCP
>>> performance (< 1,000 pps)? Summary: each packet is guaranteed to be a
>>> whole message causing a transaction in the application - without the
>>> changes I see pps almost identical to tps. Even if the source of the
>>> netstat statistics somehow manages to count packets multiple times (I
>>> don't see how that can happen), no relation can describe differences
>>> this huge. It almost looks like something in the upper layers is
>>> discarding packets (also not likely: TCP timeouts would occur and the
>>> application wouldn't be able to push 250,000 pps) - but what? Where
>>> to look?
>>
>> Is this for the loopback workload? If so, remember that there may be
>> some other things going on:
>>
>> - Every packet is processed at least two times: once when sent, and
>>   then again when it's received.
>>
>> - A TCP segment will need to be ACK'd, so if you're sending data in
>>   chunks in one direction, the ACKs will not be piggy-backed on
>>   existing data transfers, and will instead be sent independently,
>>   hitting the network stack two more times.
>>
>> - Remember that TCP works to expand its window, and then maintains the
>>   highest performance it can by bumping up against the top of
>>   available bandwidth continuously. This involves detecting buffer
>>   limits by generating packets that can't be sent, adding to the
>>   packet count. With loopback traffic, the drop point occurs when you
>>   exceed the size of the netisr's queue for IP, so you might try
>>   bumping that from the default to something much larger.
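To put rough numbers on that list: for a single request/response message
pair over loopback, assuming one segment each way and no delayed-ACK
coalescing (both assumptions of this tally, not something stated above):

	request segment    1 packet   2 stack traversals (send + receive)
	ACK of request     1 packet   2 stack traversals
	response segment   1 packet   2 stack traversals
	ACK of response    1 packet   2 stack traversals
	                   --------   -----------------------------------
	                   4 packets  8 traversals per transaction

So a packet rate that is a small multiple of the transaction rate is
expected on loopback, but a factor of 250 is not, which is why the
queue-drop and tcpdump suggestions are worth chasing first. (If I recall
the knob names correctly, the netisr IP queue depth mentioned above was
tunable via the net.inet.ip.intr_queue_maxlen sysctl, with drops visible
in net.inet.ip.intr_queue_drops.)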
>> And nothing beats using tcpdump -- have you tried tcpdumping the
>> loopback to see what is actually being sent? If not, that's always
>> educational -- perhaps something weird is going on with delayed ACKs,
>> etc.
>>
>>> You mean for the general code? I purposely don't lock my statistics
>>> variables because I'm not that interested in exact numbers (orders of
>>> magnitude are relevant). As far as I understand, an unlocked "x++"
>>> should be trivially fast in this case?
>>
>> No. x++ is massively slow if executed in parallel across many cores on
>> a variable in a single cache line. See my recent commit to kern_tc.c
>> for an example: the updating of trivial statistics for the kernel time
>> calls reduced 30M syscalls/second to 3M syscalls/second due to heavy
>> contention on the cache line holding the statistic. One of my goals
>> for 8.0 is to fix this problem for the IP and TCP layers, and ideally
>> also ifnet, but we'll see. We should be maintaining those stats
>> per-CPU and then aggregating them to report to userspace. This is what
>> we already do for a number of system stats -- UMA and kernel malloc,
>> syscall and trap counters, etc.
>>
>>>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
>>>>   specific cores so that they don't migrate, and if your system uses
>>>>   HTT, experiment with pinning the ithread and the netisr on
>>>>   different threads on the same core, or at least, on different
>>>>   cores on the same die.
>>>
>>> I'm using em hardware; I still think there's a possibility I'm
>>> fighting the driver in some cases, but this has priority #2.
>>
>> Have you tried LOCK_PROFILING? It would quickly tell you if driver
>> locks were a source of significant contention. It works quite well...
>
> When I enabled LOCK_PROFILING, my side modules, such as if_igb, stopped
> working. It seems that the ifnet structure or something changed with
> that option enabled. Is there a way to sync this without having to
> integrate everything into a specific kernel build?

No, I don't think there is any other way.. last time I checked, the
mutex structure changed size, which meant that almost everything else
that included a mutex changed size. That may not be true now, but I
haven't checked..

> Barney
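Since the per-CPU statistics pattern comes up above, here is a minimal
sketch of the idea Robert describes (not the actual 8.0 work): each CPU
increments its own cache-line-padded slot via curcpu, with no locks or
atomics, and the slots are only summed on the rare reporting path. The
pktstat names are invented for illustration.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/pcpu.h>		/* curcpu */
	#include <sys/smp.h>		/* mp_maxid */

	/*
	 * One counter slot per CPU, padded to its own cache line so that
	 * increments on different CPUs never contend for the same line.
	 */
	struct pktstat_pcpu {
		uint64_t	ps_count;
	} __aligned(CACHE_LINE_SIZE);

	static struct pktstat_pcpu pktstat[MAXCPU];

	static __inline void
	pktstat_inc(void)
	{
		/*
		 * Unlocked and non-atomic: a thread that migrates between
		 * reading curcpu and storing can occasionally clobber another
		 * CPU's in-flight increment.  For statistics that is usually
		 * an acceptable trade; wrapping the increment in
		 * critical_enter()/critical_exit() would close the window.
		 */
		pktstat[curcpu].ps_count++;
	}

	static uint64_t
	pktstat_fetch(void)
	{
		uint64_t total;
		u_int i;

		/* Aggregate only when someone asks for the number. */
		total = 0;
		for (i = 0; i <= mp_maxid; i++)
			total += pktstat[i].ps_count;
		return (total);
	}

This is the same idea later formalized by the counter(9) facility in
newer FreeBSD releases; at the time of this thread, a hand-rolled
per-CPU array like the one above was the usual approach.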