Date: Thu, 19 Oct 2006 18:01:40 -0600
From: Scott Long <scottl@samsco.org>
To: John Polstra <jdp@polstra.com>
Cc: freebsd-net <freebsd-net@freebsd.org>
Subject: Re: em network issues
Message-ID: <453811E4.9010304@samsco.org>
In-Reply-To: <XFMail.20061019152433.jdp@polstra.com>
References: <XFMail.20061019152433.jdp@polstra.com>
John Polstra wrote:
> On 19-Oct-2006 Scott Long wrote:
>> The performance measurements that Andre and I did early this year showed
>> that the INTR_FAST handler provided a very large benefit.
>
> I'm trying to understand why that's the case.  Is it because an
> INTR_FAST interrupt doesn't have to be masked and unmasked in the
> APIC?  I can't see any other reason for much of a performance
> difference in that driver.  With or without INTR_FAST, you've got
> the bulk of the work being done in a background thread -- either the
> ithread or the taskqueue thread.  It's not clear to me that it's any
> cheaper to run a task than it is to run an ithread.
>
> A difference might show up if you had two or more em devices sharing
> the same IRQ.  Then they'd share one ithread, but would each get their
> own taskqueue thread.  But sharing an IRQ among multiple gigabit NICs
> would be avoided by anyone who cared about performance, so it's not a
> very interesting case.  Besides, when you first committed this
> stuff, INTR_FAST interrupts were not sharable.

Interrupt sharing is one consideration.  And sometimes sharing can't be
avoided; I'm working with a product that has 2 dual port e1000 cards,
4 dual port FC cards, and an onboard dual port e1000 chip and dual port
SCSI chip.  All of those devices are only on 2 PCI buses.  So, sharing
is inevitable.  MSI might help mitigate this, once it becomes a reality
in FreeBSD.

The cost of the APIC operation can be a factor.  There is a single
spinlock for all APICs, so you can get contention with multiple CPUs
servicing interrupts and colliding on the APIC lock.  When there isn't
much contention, the cost of the APIC spinlock is no more than the cost
of the taskqueue lock, and the scheduling locks are about the same.

>
> Another change you made in the same commit (if_em.c revision 1.98)
> greatly reduced the number of PCI writes made to the RX ring consumer
> pointer register.  That would yield a significant performance
> improvement.  Did you see gains from INTR_FAST even without this
> independent change?

PCI writes are store-and-forward so they are practically free, unlike
PCI reads.  There was speculation that interrupting the chip less with
fewer writes might be a benefit, but I don't think that this was ever
tested by itself.

The big win came from moving the locking outside of the basic interrupt
handler.  The handler can now run completely free of any other driver
or stack contention.  False interrupts (especially for the shared case)
can be handled without blocking the rest of the driver.  Since there is
no ithread sharing, there is also a fairly deterministic amount of time
to get to the interrupt handler now; so the chance of rx overflows goes
down.  Even if one thread is stuck in a tight loop doing tx or
tx-complete operations, the taskqueue can run the rx path without any
contention.  Also, in the non-shared case, the amount of time needed to
do the register read in the handler no longer delays the upper or lower
half of the driver.  Andre measured about a 20% gain with just this,
IIRC.

I then took the next step and pushed locking out of the RX path.  That
dramatically increased forwarding performance, as the rx path of one
driver could drive the tx path of the other driver without any
contention from either driver.  We came close to getting 1mpps in the
fast-forward case, and we suspect that the only reason that we didn't
reach the 1m mark is because prefetching was turned off accidentally on
the chip.  This last change brought the overall performance gain to
about 50% more than the unmodified driver.

I certainly don't attribute that all to the INTR_FAST change, but like
I stated above, a good portion of it is.  Note that I started the work
on if_em not to improve performance, but to prototype a way to eliminate
the interrupt aliasing that was happening on the Intel chipsets.  I only
asked Andre and others to test it for performance because I didn't want
it to be a pessimization.  Once we discovered that it helped, I went the
extra steps with optimizing the rx path.

Scott
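
For readers who want the shape of the handler/task split described above,
here is a minimal userspace sketch.  It is not the if_em.c code.  Every
name in it (sim_softc, sim_intr_fast, sim_rx_task) is hypothetical, the
"register" is an ordinary struct field, and a boolean stands in for
queueing a task.  The assumptions are that the fast handler does a single
cause-register read, takes no driver lock, dismisses interrupts that are
not ours, and otherwise masks the device and defers all packet work.

```c
/*
 * Userspace model of a lock-free "fast" interrupt handler plus a
 * deferred RX task.  Hypothetical names throughout; not driver code.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sim_softc {
    uint32_t icr;           /* simulated interrupt cause register */
    bool     intr_enabled;  /* simulated device interrupt mask state */
    bool     task_pending;  /* stands in for enqueueing a task */
    int      rx_avail;      /* packets waiting in the simulated RX ring */
    int      rx_done;       /* packets handed up the stack */
    int      stray;         /* interrupts that were not ours */
};

/*
 * Fast handler: one register read, no locks, no packet processing.
 * Returns true if the interrupt belonged to this device.
 */
static bool
sim_intr_fast(struct sim_softc *sc)
{
    uint32_t cause = sc->icr;   /* the only device access made here */

    sc->icr = 0;                /* model a read-to-clear register */
    if (cause == 0) {
        /* Shared-IRQ case: not ours, so return without blocking anyone. */
        sc->stray++;
        return false;
    }
    sc->intr_enabled = false;   /* mask the device until the task runs */
    sc->task_pending = true;    /* defer the real work */
    return true;
}

/*
 * Deferred task: drain up to 'budget' packets.  If the ring is empty
 * afterwards, clear the pending flag and unmask the device; otherwise
 * stay pending so the task runs again.  Returns packets processed.
 */
static int
sim_rx_task(struct sim_softc *sc, int budget)
{
    int n = 0;

    assert(budget > 0);
    if (!sc->task_pending)
        return 0;
    while (sc->rx_avail > 0 && n < budget) {
        sc->rx_avail--;
        sc->rx_done++;
        n++;
    }
    if (sc->rx_avail == 0) {
        sc->task_pending = false;
        sc->intr_enabled = true;
    }
    return n;
}
```

The point of the model is only that sim_intr_fast() touches no state that
the tx or rx paths would need a lock for, so a stray interrupt on a
shared line costs one register read and nothing else.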