Date: Thu, 12 Jan 2006 14:06:50 -0700
From: Scott Long <scottl@samsco.org>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Scott Long <scottl@FreeBSD.org>, cvs-all@FreeBSD.org
Subject: Re: cvs commit: src/sys/dev/em if_em.c if_em.h
Message-ID: <43C6C4EA.20303@samsco.org>
In-Reply-To: <20060112152119.A6776@grasshopper.cs.duke.edu>
References: <200601110030.k0B0UPOx009098@repoman.freebsd.org> <20060112152119.A6776@grasshopper.cs.duke.edu>
Andrew Gallatin wrote:
> Scott Long [scottl@FreeBSD.org] wrote:
>
>> scottl      2006-01-11 00:30:25 UTC
>>
>>   FreeBSD src repository
>>
>>   Modified files:
>>     sys/dev/em           if_em.c if_em.h
>>   Log:
>>   Significant performance improvements for the if_em driver:
>
> Very cool.
>
>> - If possible, use a fast interrupt handler instead of an ithread handler.
>>   Use the interrupt handler to check and squelch the interrupt, then
>>   schedule a taskqueue to do the actual work.  This has three benefits:
>>   - Eliminates the 'interrupt aliasing' problem found in many chipsets by
>>     allowing the driver to mask the interrupt in the NIC instead of the
>>     OS masking the interrupt in the APIC.
>
> Neat.  Just like Windows..
>
> <....>
>
>> - Don't hold the driver lock in the RX handler.  The handler and all data
>>   associated with it are effectively serialized already.  This eliminates
>>   the cost of dropping and reacquiring the lock for every received packet.
>>   The result is much lower contention for the driver lock, resulting in
>>   lower CPU usage and lower latency for interactive workloads.
>
> This seems orthogonal to using a fastintr/taskqueue, or am I missing
> something?
>
> Assuming a system where interrupt aliasing is not a problem, how much
> does using a fastintr/taskqueue change interrupt latency as compared
> to using an ithread?  I would (naively) assume that using an ithread
> would be faster & cheaper.  Or is disabling/enabling interrupts in the
> apic really expensive?

Touching the APIC is tricky.  First, you have to pay the cost of a
spinlock.  Then you have to pay the cost of at least one read and one
write across the FSB.  Even though the APIC registers are memory mapped,
they are still uncached.  It's not terribly expensive, but it does add
up.  Bypassing this and using a fast interrupt means that you pay the
cost of one PCI read, which you would have to do anyway with either
method, and one PCI write, which will be posted at the host-PCI bridge
and is thus only as expensive as an FSB write.  Overall, I don't think
the cost difference is a whole lot, but when you are talking about
thousands of interrupts per second, especially with multiple interfaces
running under load, it might be important.  And the 750x and 752x
chipsets are so common that it is worthwhile to deal with them (and
there are reports that the aliasing problem is happening on more
chipsets than just these now).

As for latency, the taskqueue runs at the same PI_NET priority as the
ithread would.  I thought that there was an optimization on some
platforms to encourage quick preemption for ithreads when they are
scheduled, but I can't find it now.  So the taskqueue shouldn't be all
that different from an ithread, and it even means that there won't be
any sharing between instances even if the interrupt vector is shared.

Another advantage is that you get adaptive polling for free.  Interface
polling only works well when you have a consistently high workload.
For spiky workloads you get higher latency at the leading edge of the
spike, since the polling thread is asleep waiting for the next tick.
Trying to estimate workload and latency in the polling loop is a pain,
while letting the hardware trigger you directly is a whole lot easier.
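To make the fast handler / taskqueue split concrete, it looks roughly
like this (a paraphrased sketch, not a cut-and-paste from if_em.c; the
function names, softc fields, and register macros are approximations,
and em_rxeof()/em_txeof() stand in for the driver's rx/tx cleanup
routines):

/*
 * Paraphrased sketch only.  Assumes the usual driver headers
 * (sys/param.h, sys/bus.h, sys/taskqueue.h, the e1000 register
 * definitions, and the driver's own softc).
 */

/* Fast handler: runs at interrupt time, so it does the bare minimum. */
static void
em_intr_fast(void *arg)
{
        struct adapter *adapter = arg;
        uint32_t icr;

        /* One PCI read: is this interrupt ours?  (The vector may be shared.) */
        icr = E1000_READ_REG(&adapter->hw, ICR);
        if (icr == 0)
                return;

        /*
         * One posted PCI write: mask further interrupts in the NIC itself
         * instead of having the OS mask the vector in the APIC.
         */
        E1000_WRITE_REG(&adapter->hw, IMC, 0xffffffff);

        /* Defer the real work to a private taskqueue thread at PI_NET. */
        taskqueue_enqueue(adapter->tq, &adapter->rxtx_task);
}

/* Taskqueue handler: does the rx/tx work, then unmasks the NIC. */
static void
em_handle_rxtx(void *context, int pending)
{
        struct adapter *adapter = context;

        em_rxeof(adapter);              /* rx cleanup, no driver lock needed */
        EM_LOCK(adapter);
        em_txeof(adapter);              /* tx cleanup, still under the lock */
        EM_UNLOCK(adapter);

        /* Re-enable interrupts in the NIC, not in the APIC. */
        E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
}

/* Attach-time setup of the task and the fast handler. */
static int
em_setup_fast_intr(struct adapter *adapter)
{
        TASK_INIT(&adapter->rxtx_task, 0, em_handle_rxtx, adapter);
        adapter->tq = taskqueue_create_fast("em_taskq", M_NOWAIT,
            taskqueue_thread_enqueue, &adapter->tq);
        taskqueue_start_threads(&adapter->tq, 1, PI_NET, "%s taskq",
            device_get_nameunit(adapter->dev));
        return (bus_setup_intr(adapter->dev, adapter->res_interrupt,
            INTR_TYPE_NET | INTR_FAST, em_intr_fast, adapter,
            &adapter->int_handler_tag));
}

The only things that happen at interrupt time are the ICR read and the
IMC write; everything else, including re-enabling interrupts in the NIC,
happens in the taskqueue thread.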
However, taskqueues are really just a proof of concept for what I really
want, which is to allow drivers to register both a fast handler and an
ithread handler.  For drivers doing this, the ithread would be private
to the driver and would only be activated if the fast handler signals
it.  Drivers without fast handlers would still get ithreads that act the
way they do now.  If an interrupt vector is shared among multiple
handlers, the fast handlers would all get run, but the only ithreads
that would run would be those for drivers without a fast handler and
those signaled from their fast handler.  Anyways, John and I have
discussed this quite a bit over the last year; we just need time to
implement it.

> Do you have a feel for how much of the increase was due to the other
> changes (rx lock, avoiding register reads)?

Both of those do make a difference, but I didn't introduce them into
testing until Andre had already done some tests showing that the
taskqueue helped.  I don't recall what the difference was, but I think
it was in the low 10% range.

Another thing that I want to do is to get the tx-complete path to run
without a lock.  For if_em, this means killing the shortcut where
em_encap calls into the tx-complete path to clean up the tx ring.  It
also means being careful with updating and checking the tx ring counters
between the two sides of the driver (see the P.S. below for a rough
sketch).  But if it can be made to work, then almost all top/bottom
contention in the driver can be eliminated.

Scott
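P.S.  To make the tx-side idea a little more concrete, here is the sort
of counter handling I have in mind (hypothetical sketch, names made up,
not from if_em.c):

/*
 * Hypothetical kernel-style fragment.  Each counter has exactly one
 * writer: the encap (top) side only advances next_avail, the
 * tx-complete (bottom) side only advances next_to_clean.  A stale read
 * of the other side's counter is harmless; it just makes that side act
 * conservatively for one pass.
 */

#define EM_TXD_STAT_DD  0x01            /* "descriptor done", set by hardware */

struct em_txd {                         /* stand-in for the real tx descriptor */
        volatile uint8_t        status;
        /* ... */
};

struct em_tx_ring {
        volatile uint32_t       next_avail;     /* written only by the encap side */
        volatile uint32_t       next_to_clean;  /* written only by the cleanup side */
        struct em_txd           *txd;
        uint32_t                num_desc;
};

/* Top side: how many slots can the encap path use right now? */
static uint32_t
em_tx_free(struct em_tx_ring *txr)
{
        uint32_t used;

        used = (txr->next_avail + txr->num_desc - txr->next_to_clean) %
            txr->num_desc;
        return (txr->num_desc - used - 1);
}

/* Bottom side: reclaim completed descriptors, no driver lock held. */
static void
em_tx_clean(struct em_tx_ring *txr)
{
        uint32_t i = txr->next_to_clean;

        while (i != txr->next_avail &&
            (txr->txd[i].status & EM_TXD_STAT_DD)) {
                /* ...unmap and free the mbuf for slot i... */
                i = (i + 1) % txr->num_desc;
        }
        txr->next_to_clean = i;         /* single writer for this counter */
}

The tricky part is exactly what I said above: the order in which the two
sides update and check those counters relative to the descriptor and
mbuf state, which is why it needs some care before it can go in.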