From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Ivan Voras, Robert Watson
Cc: freebsd-net@freebsd.org
Date: Mon, 6 Apr 2009 03:37:19 -0700 (PDT)
Subject: Re: Advice on a multithreaded netisr patch?

--- On Sun, 4/5/09, Robert Watson wrote:

> On Sun, 5 Apr 2009, Ivan Voras wrote:
>
> > > The argument is not that they are slower (although they probably
> > > are a bit slower), rather that they introduce serialization
> > > bottlenecks by requiring synchronization between CPUs in order
> > > to distribute the work.  Certainly some of the scalability
> > > issues in the stack are not a result of that, but a good number
> > > are.
> >
> > I'd like to understand more.  If (in netisr) I have an mbuf with
> > headers, is this data already transferred from the card, or is it
> > magically "not here yet"?
>
> A lot depends on the details of the card and driver.  The driver
> will take cache misses on the descriptor ring entry, if it's not
> already in cache, and the link layer will take a cache miss on the
> front of the ethernet frame in the cluster pointed to by the mbuf
> header as part of its demux.
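To make the demux step concrete, here is a minimal sketch of a
link-layer demux that picks a protocol handler by reading only the
ethernet header at the front of the cluster -- the single cache line
described above.  This is not FreeBSD's actual ether_input(); the
structure and names (my_ether_header, demux_frame) are illustrative
only:

/*
 * Illustrative only: everything needed to choose a protocol handler
 * sits in the first 14 bytes of the frame, so this step costs at
 * most one cache miss on the cluster.
 */
#include <stdint.h>
#include <arpa/inet.h>			/* ntohs() */

#define MY_ETHERTYPE_IP	0x0800		/* IPv4, standard encapsulation */

struct my_ether_header {
	uint8_t		dhost[6];	/* destination MAC */
	uint8_t		shost[6];	/* source MAC */
	uint16_t	type;		/* ethertype: bytes 12-13 */
};

/* Returns 1 for IPv4 frames, 0 for anything else. */
static int
demux_frame(const void *cluster)
{
	const struct my_ether_header *eh = cluster;

	/* The only packet-data read taken in this step. */
	return (ntohs(eh->type) == MY_ETHERTYPE_IP);
}

int
main(void)
{
	/* Frame front with ethertype 0x0800 at bytes 12-13. */
	uint8_t frame[64] = { [12] = 0x08, [13] = 0x00 };

	return (demux_frame(frame) ? 0 : 1);
}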
> What happens next depends on your dispatch model and cache line
> size.  Let's make a few simplifying assumptions that are mostly
> true:
>
> - The driver associates a single cluster with each receive ring
>   entry for each packet to be stored in, and the cluster is
>   cacheline-aligned.  No header splitting is enabled.
>
> - Standard ethernet encapsulation of IP is used, without additional
>   VLAN headers or other encapsulation, etc.  There are no IP
>   options.
>
> - We don't need to validate any checksums because the hardware has
>   done it for us, so there's no need to take cache misses on data
>   that doesn't matter until we reach higher layers.
>
> In the device driver/ithread code, we'll now proceed to take some
> cache misses, unless we're pretty lucky:
>
> (1) The descriptor ring entry
> (2) The mbuf packet header
> (3) The first cache line in the cluster
>
> This is sufficient to figure out what protocol we're going to
> dispatch to, and depending on dispatch model, we now either enqueue
> the packet for delivery to a netisr, or we directly dispatch the
> handler for IP.
>
> If the packet is processed on the current CPU and we're direct
> dispatching, or if we've dispatched to a netisr on the same CPU and
> we're quite lucky, the mbuf packet header and front of the cluster
> will be in the cache.
>
> However, what happens next depends on the cache fetch and line
> size.  If things happen in 32-byte cache lines or smaller, we cache
> miss on the end of the IP header, because the last two bytes of the
> destination IP address start at offset 32 into the cluster.  If we
> have 64-byte fetching and line size, things go better, because both
> the full IP and TCP headers should be in that first cache line.
>
> One big advantage to direct dispatch is that it maximizes the
> chances that we don't blow out the low-level CPU caches between
> link-layer and IP-layer processing, meaning that we might actually
> get through all the IP and TCP headers without a cache miss on a
> 64-byte line size.  If we netisr dispatch to another CPU without a
> shared cache, or we netisr dispatch to the current CPU but there's
> a scheduling delay, other packets queued first, etc., we'll take a
> number of the same cache misses over again as things get pulled
> into the right cache.
>
> This presents a strong cache motivation to keep a packet "on" a
> CPU, and even in the same thread, once you've started processing
> it.  If you have to enqueue, you take locks, take a context switch,
> deal with the fact that LRU on cache lines isn't going to like your
> queue depth, and potentially pay a number of additional cache
> misses on the same data.  There are also some other good reasons to
> use direct dispatch, such as avoiding doing work on packets that
> will later be dropped if the netisr queue overflows.
>
> This is why we direct dispatch by default, and why this is quite a
> good strategy for multiple input queue network cards, where it also
> buys us parallelism.
>
> Note that if the flow RSS hash is in the same cache line as the
> rest of the receive descriptor ring entry, you may be able to avoid
> the cache miss on the cluster and simply redirect the packet to
> another CPU's netisr without ever reading packet data, which avoids
> at least one and possibly two cache misses, but it also means that
> you have to run the link layer in the remote netisr, rather than
> locally in the ithread.
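The 32- versus 64-byte line arithmetic above is easy to check.
Under the stated assumptions (cacheline-aligned cluster, no VLAN
tag, no IP options), the destination IP address occupies bytes 30-33
of the frame and so straddles a 32-byte boundary, while the
ethernet, IP, and TCP headers together end at byte 54 and fit in a
single 64-byte line.  A small userland program to verify the
offsets:

/* Worked offsets behind the cache-line discussion; field positions
 * follow the standard ethernet/IPv4/TCP header layouts under the
 * assumptions stated above. */
#include <stdio.h>

int
main(void)
{
	const int ether_len  = 14;		    /* ethernet header */
	const int ip_dst_off = ether_len + 16;	    /* dst IP: byte 30 */
	const int ip_dst_end = ip_dst_off + 4;	    /* ...through byte 33 */
	const int hdrs_end   = ether_len + 20 + 20; /* IP + TCP: byte 54 */

	printf("dst IP: bytes %d-%d (%s the 32-byte line boundary)\n",
	    ip_dst_off, ip_dst_end - 1,
	    ip_dst_off < 32 && ip_dst_end > 32 ? "straddles" : "inside");
	printf("ether+IP+TCP headers end at byte %d (%s one 64-byte line)\n",
	    hdrs_end, hdrs_end <= 64 ? "fit in" : "exceed");
	return (0);
}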
> > In the first case, the packet reception code path is not changed
> > until it's queued on a thread, where it's handled in the future
> > (or is the influence of "other" data, like timers and internal
> > TCP reassembly buffers, so large?).  In the second case, why?
>
> The good news about TCP reassembly is that we don't have to look at
> the data, only mbuf headers and reassembly buffer entries, so with
> any luck we've avoided actually taking a cache miss on the data.
> If things go well, we can avoid looking at anything but mbuf and
> packet headers until the socket copies out, but I'm not sure how
> well we do that in practice.
>
> > As the card and the OS can already process many packets per
> > second for something as fairly complex as routing
> > (http://www.tancsa.com/blast.html), and TCP chokes swi:net at
> > 100% of a core, isn't this an indication that there's certainly
> > more space for improvement, even with single-queue, old-fashioned
> > NICs?
>
> Maybe.  It depends on the relative costs of local processing vs.
> redistributing the work, which involves schedulers, IPIs,
> additional cache misses, lock contention, and so on.  This means
> there's a period where it can't possibly be a win, and then at some
> point it's a win as long as the stack scales.  This is essentially
> the usual trade-off in using threads and parallelism: does the
> benefit of multiple parallel execution units make up for the
> overheads of synchronization and data migration?
>
> There are some previous e-mail threads where people have observed
> that for some workloads, switching to netisr wins over direct
> dispatch.  For example, if you have a number of cores and are doing
> firewall processing, offloading work to the netisr from the input
> ithread may improve performance.  However, this appears not to be
> the common case for end-host workloads on the hardware we mostly
> target, and this is increasingly true as multiple input queues come
> into play, as the card itself will allow us to use multiple CPUs
> without any interactions between the CPUs.
>
> This isn't to say that work redistribution using a netisr-like
> scheme isn't a good idea: in a world where CPU threads are weak
> compared to the wire workload, and there's cache locality across
> threads on the same core, or NUMA is present, there may be a
> potential for a big win when the available work significantly
> exceeds what a single CPU thread/core can handle.  In that case, we
> want to place the work as close as possible, to take advantage of
> shared caches or of the memory being local to the CPU thread/core
> doing the deferred work.
>
> FYI, the localhost case is a bit weird -- I think we have some
> scheduling issues that are causing loopback netisr stuff to be
> pessimally scheduled.  Here are some suggestions for things to try,
> though, to see if they help:
>
> - Comment out all ifnet, IP, and TCP global statistics in your
>   local stack -- especially look for things like
>   tcpstat.whatever++;.
>
> - Use cpuset to pin ithreads, the netisr, and whatever else to
>   specific cores so that they don't migrate, and if your system
>   uses HTT, experiment with pinning the ithread and the netisr on
>   different threads on the same core, or at least on different
>   cores on the same die.
>
> - Experiment with using just the source IP, the source +
>   destination IPs, and both IPs plus TCP ports in your hash.
>
> - If your card supports RSS, pass the flowid up the stack in the
>   mbuf packet header flowid field, and use that instead of the
>   hash for work placement.
>
> - If you're doing pure PPS tests with UDP (or the like), and your
>   test can tolerate disordering, try hashing based on the mbuf
>   header address or something else that will distribute the work
>   but not take a cache miss.
>
> - If you have a flowid, or the above disordered condition applies,
>   try shifting the link-layer dispatch to the netisr, rather than
>   doing the demux in the ithread, as that will avoid cache misses
>   in the ithread and do all the demux in the netisr.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
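On the hashing suggestion above, here is a sketch of the three
levels -- source IP only, both IPs, and both IPs plus TCP ports --
with an illustrative mixing step.  flow_hash() and mix32() are
made-up names, not kernel interfaces:

/*
 * Sketch of the three hashing levels: 0 = source IP only,
 * 1 = source + destination IPs, 2 = both IPs plus TCP ports.
 * The mixing step is illustrative, not the kernel's.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t
mix32(uint32_t h)
{
	/* Cheap avalanche so nearby addresses spread across threads. */
	h ^= h >> 16;
	h *= 0x85ebca6bU;
	h ^= h >> 13;
	return (h);
}

static uint32_t
flow_hash(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
    uint16_t dport, int level)
{
	uint32_t h = src_ip;				/* level 0 */

	if (level >= 1)
		h ^= dst_ip;				/* level 1 */
	if (level >= 2)
		h ^= ((uint32_t)sport << 16) | dport;	/* level 2 */
	return (mix32(h));
}

int
main(void)
{
	/* Example: 10.0.0.1:12345 -> 10.0.0.2:80, four netisr threads. */
	uint32_t h = flow_hash(0x0a000001, 0x0a000002, 12345, 80, 2);

	printf("placed on netisr thread %u\n", h % 4);
	return (0);
}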
Is there a way to give a kernel thread exclusive use of a core?  I
know you can pin a kernel thread with sched_bind(), but is there a
way to keep other threads from using that core?  On an 8-core system
it almost seems that the randomness of more cores is a negative in
some situations.

Also, I've noticed that calling sched_bind() during bootup is a bad
thing, in that it locks up the system.  I'm not certain, but I
suspect the thread_lock is the culprit.  Is there a clean way to
determine that it's safe to lock curthread and do a cpu bind?
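For concreteness, the pattern in question looks something like the
sketch below -- assuming a check of smp_started is a reasonable
guard, which is the part I'm unsure about.  bind_curthread_to() is a
made-up wrapper; sched_bind(), thread_lock(), and smp_started are
the existing kernel interfaces:

/*
 * Sketch: pin the current kernel thread to a CPU, refusing to bind
 * until the other CPUs have been launched, since binding too early
 * in boot can wedge the system.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

static int
bind_curthread_to(int cpu)
{
	struct thread *td = curthread;

	/* Don't bind before the APs are up and the scheduler is ready. */
	if (!smp_started)
		return (EAGAIN);

	thread_lock(td);
	sched_bind(td, cpu);	/* runs only on 'cpu' until sched_unbind() */
	thread_unlock(td);
	return (0);
}

Note that this only pins the thread to the core; it does nothing to
keep other threads off of it.

Barney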