From owner-freebsd-net@FreeBSD.ORG Thu Nov 9 08:01:52 2006 Return-Path: X-Original-To: net@freebsd.org Delivered-To: freebsd-net@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C689D16A417 for ; Thu, 9 Nov 2006 08:01:52 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id A309543D94 for ; Thu, 9 Nov 2006 08:01:33 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id 10EB65A1C15 for ; Thu, 9 Nov 2006 19:01:30 +1100 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (Postfix) with ESMTP id 924D527407 for ; Thu, 9 Nov 2006 19:01:28 +1100 (EST) Date: Thu, 9 Nov 2006 19:01:27 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: net@freebsd.org Message-ID: <20061109171151.T60086@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: fix for interrupt handling in bge X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Nov 2006 08:01:52 -0000 bge_intr(), at least for a BCM5705, has races and foot shootings which result in it sometimes returning without handling all the interrupt events that it has acked. Then the interrupt doesn't repeat, and the unhandled events don't get handled a new event (or possibly a timeout) generates a new interrupt. There seems to be no timeout, so on systems where there is not much network activity except by threads waiting for the unhandled events, the unhandled events don't get handled until much later. This activity occurred for benchmarking a kernel build: - build time on ffs: 30 seconds - build time on nfs: 42 seconds (with system mostly idle except for the build) - 38-42 seconds (while doing like editing) - 35 seconds (with ping -i 0.01 to generate new events) - 33 seconds (with the following fix): %%% Index: if_bge.c =================================================================== RCS file: /home/ncvs/src/sys/dev/bge/if_bge.c,v retrieving revision 1.151 diff -u -2 -r1.151 if_bge.c --- if_bge.c 19 Oct 2006 08:03:22 -0000 1.151 +++ if_bge.c 8 Nov 2006 14:05:25 -0000 @@ -2884,11 +2884,17 @@ /* + * Ack the interrupt by writing something to BGE_MBX_IRQ0_LO. Don't + * disable interrupts by writing nonzero like we used to, since with + * our current organization this just gives complications for + * re-enabling interrupts. We used to have races instead of the + * necessary complications. + */ + CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 0); + + /* * Do the mandatory PCI flush as well as get the link status. */ statusword = CSR_READ_4(sc, BGE_MAC_STS) & BGE_MACSTAT_LINK_CHANGED; - /* Ack interrupt and stop others from occuring. */ - CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 1); - /* Make sure the descriptor ring indexes are coherent. */ bus_dmamap_sync(sc->bge_cdata.bge_status_tag, @@ -2910,7 +2916,4 @@ } - /* Re-enable interrupts. */ - CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 0); - if (ifp->if_drv_flags & IFF_DRV_RUNNING && !IFQ_DRV_IS_EMPTY(&ifp->if_snd)) %%% The first race is that the ack was done after the "mandatory PCI flush". This resulted in acking all interrupt events that occur before some indeterminate future time (depending on when the write is flushed). This was relatively harmless, since we acked again later. However, the patch removes the later ack since it is too hard to avoid the races in the later one, so it might now be important to do the first ack correctly. It might be unimportant because if we handle any interrupt event then we will flush the write as a side effect of reading something. Anyway, flushing the write correctly just involves moving the ack before the flush, so it is free. The first foot shooting is that the hardware design apparently (?) encourages this race by requiring (?) a write instead of a read for the ack. The second foot shooting is that we disabled interrupts. Interrupts are already disabled (for all devices sharing the interrupt), so with the current organization disabling them for the device does nothing except encourage the race for re-enabling interrupts. The fix is to not disable interrupts. I think this has the minor disadvantage of generating a new interrupt for any events that occur while we are handling the old ones, even if we handle some of the new ones too. The second race is that re-enabling interrupts also acks interripts. This resulted in acking all interrupt events that occur before the time that the device sees the ack, including any that occurred after we finished looking for events to handle. Since we don't loop, the race window can be almost as large as the total interrupt handling time (e.g., if we handle 0 rx events and many tx events, and a new rx event occurs while we are handling the tx events). The fix is to avoid this complication. The complication would have to be handled to use a task queue. The Linux tg3 driver uses something like a task queue. tg3 also does a status check, so a status check is apparently not too costly like a comment in previous versions of bge said. Without a status check, shared interrupts would work especially poorly, and looping until all events have been handled wouldn't work. With a task queue, the method for re-enabling interrupts would probably need to be: enable_interrupts_and_ack_as_an_unwanted_side effect(); if ((statuscheck() & HAVE_INTERRUPT_EVENTS) == 0) return; disable_interrupts_and_ack_as_an_unnecessary_side effect(); continue; The third race is that we didn't even flush the problematic ack for re-enabling interrupts. This resulted in acking all interrupt events that occur before some indeterminate future time over which we have no control at all. It is possible that bge is depending on these bugs for accidental interrupt moderation. I just noticed that bge (5705) with these patches gets 5-20 times as many interrupts as bge (5701) without these patches and with a much older kernel. The 5701 has about half the ping latency of the 5705 but is otherwise slower including having a lower pps despite or because of being on a faster bus with a newer kernel. Bruce