From: Barney Cordoba <barney_cordoba@yahoo.com>
Date: Tue, 7 Apr 2009 05:11:49 -0700 (PDT)
To: Ivan Voras, Robert Watson
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?

--- On Mon, 4/6/09, Robert Watson wrote:

> From: Robert Watson
> Subject: Re: Advice on a multithreaded netisr patch?
> To: "Ivan Voras"
> Cc: freebsd-net@freebsd.org
> Date: Monday, April 6, 2009, 7:59 AM
>
> On Mon, 6 Apr 2009, Ivan Voras wrote:
>
> >>> I'd like to understand more. If (in netisr) I have a mbuf with headers, is this data already transferred from the card or is it magically "not here yet"?
> >>
> >> A lot depends on the details of the card and driver. The driver will take cache misses on the descriptor ring entry, if it's not already in cache, and the link layer will take a cache miss on the front of the ethernet frame in the cluster pointed to by the mbuf header as part of its demux. What happens next depends on your dispatch model and cache line size. Let's make a few simplifying assumptions that are mostly true:
> >
> > So, a mbuf can reference data not yet copied from the NIC hardware? I'm specifically trying to understand what m_pullup() does.
>
> I think we're talking slightly at cross purposes. There are two transfers of interest:
>
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
>
> By the time you receive an interrupt, the DMA is complete, so once you believe a packet referenced by the descriptor ring is done, you don't have to wait for DMA. However, the packet data is in main memory rather than your CPU cache, so you'll need to take a cache miss in order to retrieve it. You don't want to prefetch before you know the packet data is there, or you may prefetch stale data from the previous packet sent or received from the cluster.
>
> m_pullup() has to do with mbuf chain memory contiguity during packet processing. The usual usage is something along the following lines:
>
> struct whatever *w;
>
> m = m_pullup(m, sizeof(*w));
> if (m == NULL)
>         return;
> w = mtod(m, struct whatever *);
>
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are contiguously stored, so that w points at a complete structure we can use to interpret packet data. In the common case in the receive path, m_pullup() should be a no-op, since almost all drivers receive data in a single cluster.
>
> However, there are cases where it might not happen, such as loopback traffic where unusual encapsulation is used, leading to a call to M_PREPEND() that inserts a new mbuf on the front of the chain, which is later m_defrag()'d, leading to a higher-level header crossing a boundary or the like.
>
> This issue is almost entirely independent from things like the cache line miss issue, unless you hit the uncommon case of having to do work in m_pullup(), in which case life sucks.
>
> It would be useful to use DTrace to profile a number of the workfull m_foo() functions to make sure we're not hitting them in normal workloads, btw.
>
> >>> As the card and the OS can already process many packets per second for something as complex as routing (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100% of a core, isn't this an indication that there's certainly more space for improvement even with single-queue, old-fashioned NICs?
> >>
> >> Maybe. It depends on the relative costs of local processing vs redistributing the work, which involves schedulers, IPIs, additional cache misses, lock contention, and so on. This means there's a period where it can't possibly be a win, and then at some point it's a win as long as the stack scales. This is essentially the usual trade-off in using threads and parallelism: does the benefit of multiple parallel execution units make up for the overheads of synchronization and data migration?
> >
> > Do you have any idea at all why I'm seeing the weird difference between netstat packets per second (250,000) and my application's TCP performance (< 1,000 pps)? Summary: each packet is guaranteed to be a whole message causing a transaction in the application - without the changes I see pps almost identical to tps. Even if the source of netstat statistics somehow manages to count packets multiple times (I don't see how that can happen), no relation can describe differences this huge. It almost looks like something in the upper layers is discarding packets (also not likely: TCP timeouts would occur and the application wouldn't be able to push 250,000 pps) - but what? Where to look?
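
(To make the m_pullup() idiom quoted above a bit more concrete, here is a rough sketch of how that contiguity check usually looks in an input routine, using struct ip as the header type. The function name demo_input is made up for illustration; nothing below is code from the thread.)

/*
 * Sketch of the m_pullup() contiguity check as it commonly appears in
 * an input routine.  In the normal receive case the whole header is
 * already in the first mbuf, so m_pullup() is never actually called.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip.h>

static void
demo_input(struct mbuf *m)
{
        struct ip *ip;

        /* Make sure the first sizeof(struct ip) bytes are contiguous. */
        if (m->m_len < sizeof(struct ip) &&
            (m = m_pullup(m, sizeof(struct ip))) == NULL)
                return;                 /* m_pullup() freed the chain */
        ip = mtod(m, struct ip *);

        /* ip now points at a complete, contiguous IP header. */
        (void)ip;
        m_freem(m);
}

(When the m_len test succeeds, as it does for almost every received packet, m_pullup() is never reached at all, which is the "no-op in the common case" behaviour described above.)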

>
> Is this for the loopback workload? If so, remember that there may be some other things going on:
>
> - Every packet is processed at least two times: once when sent, and then again when it's received.
>
> - A TCP segment will need to be ACK'd, so if you're sending data in chunks in one direction, the ACKs will not be piggy-backed on existing data transfers, and instead be sent independently, hitting the network stack two more times.
>
> - Remember that TCP works to expand its window, and then maintains the highest performance it can by bumping up against the top of available bandwidth continuously. This involves detecting buffer limits by generating packets that can't be sent, adding to the packet count. With loopback traffic, the drop point occurs when you exceed the size of the netisr's queue for IP, so you might try bumping that from the default to something much larger.
>
> And nothing beats using tcpdump -- have you tried tcpdumping the loopback to see what is actually being sent? If not, that's always educational -- perhaps something weird is going on with delayed ACKs, etc.
>
> > You mean for the general code? I purposely don't lock my statistics variables because I'm not that interested in exact numbers (orders of magnitude are relevant). As far as I understand, unlocked "x++" should be trivially fast in this case?
>
> No. x++ is massively slow if executed in parallel across many cores on a variable in a single cache line. See my recent commit to kern_tc.c for an example: the updating of trivial statistics for the kernel time calls reduced 30m syscalls/second to 3m syscalls/second due to heavy contention on the cache line holding the statistic. One of my goals for 8.0 is to fix this problem for the IP and TCP layers, and ideally also ifnet, but we'll see. We should be maintaining those stats per-CPU and then aggregating them to report them to userspace. This is what we already do for a number of system stats -- UMA and kernel malloc, syscall and trap counters, etc.
>
> >> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific cores so that they don't migrate, and if your system uses HTT, experiment with pinning the ithread and the netisr on different threads on the same core, or at least, different cores on the same die.
> >
> > I'm using em hardware; I still think there's a possibility I'm fighting the driver in some cases, but this has priority #2.
>
> Have you tried LOCK_PROFILING? It would quickly tell you if driver locks were a source of significant contention. It works quite well...

When I enabled LOCK_PROFILING, my separately built modules, such as if_igb, stopped working. It seems that the ifnet structure or something changed with that option enabled. Is there a way to keep these in sync without having to integrate everything into a specific kernel build?

Barney
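
P.S. For anyone who wants to experiment with the per-CPU statistics approach described above, here is a rough sketch of the general idea. The names demo_pcpu_stat, demo_stat_inc and demo_stat_fetch are made up for illustration and are not existing kernel interfaces; treat this as a sketch of the technique, not a drop-in patch.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>

/*
 * One cache-line-padded counter per CPU, so that concurrent increments
 * on different CPUs never touch the same cache line.
 */
struct demo_pcpu_stat {
        uint64_t        packets;
} __aligned(CACHE_LINE_SIZE);

static struct demo_pcpu_stat demo_stats[MAXCPU];

static __inline void
demo_stat_inc(void)
{
        /* No lock and no shared cache line: each CPU updates its own slot. */
        demo_stats[curcpu].packets++;
}

static uint64_t
demo_stat_fetch(void)
{
        uint64_t total;
        int i;

        /* Aggregate lazily, e.g. from a sysctl handler when userland asks. */
        total = 0;
        for (i = 0; i < MAXCPU; i++)
                total += demo_stats[i].packets;
        return (total);
}

The increment never contends with other CPUs, so it stays cheap; the aggregation is racy, but for counters that only need to be accurate to an order of magnitude that is usually acceptable.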