From owner-freebsd-net@FreeBSD.ORG  Thu Nov  9 08:01:52 2006
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
X-Original-To: net@freebsd.org
Delivered-To: freebsd-net@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C689D16A417
	for <net@freebsd.org>; Thu,  9 Nov 2006 08:01:52 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id A309543D94
	for <net@freebsd.org>; Thu,  9 Nov 2006 08:01:33 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 10EB65A1C15
	for <net@freebsd.org>; Thu,  9 Nov 2006 19:01:30 +1100 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 924D527407
	for <net@freebsd.org>; Thu,  9 Nov 2006 19:01:28 +1100 (EST)
Date: Thu, 9 Nov 2006 19:01:27 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: net@freebsd.org
Message-ID: <20061109171151.T60086@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: fix for interrupt handling in bge
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 09 Nov 2006 08:01:52 -0000

bge_intr(), at least for a BCM5705, has races and foot shootings which
result in it sometimes returning without handling all the interrupt
events that it has acked.  Then the interrupt doesn't repeat, and the
unhandled events don't get handled a new event (or possibly a timeout)
generates a new interrupt.  There seems to be no timeout, so on systems
where there is not much network activity except by threads waiting for
the unhandled events, the unhandled events don't get handled until
much later.  This activity occurred for benchmarking a kernel build:

- build time on ffs:  30 seconds
- build time on nfs:  42 seconds (with system mostly idle except for the build)
-                     38-42 seconds (while doing like editing)
-                     35 seconds (with ping -i 0.01 to generate new events)
-                     33 seconds (with the following fix):

%%%
Index: if_bge.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/bge/if_bge.c,v
retrieving revision 1.151
diff -u -2 -r1.151 if_bge.c
--- if_bge.c	19 Oct 2006 08:03:22 -0000	1.151
+++ if_bge.c	8 Nov 2006 14:05:25 -0000
@@ -2884,11 +2884,17 @@

  	/*
+	 * Ack the interrupt by writing something to BGE_MBX_IRQ0_LO.  Don't
+	 * disable interrupts by writing nonzero like we used to, since with
+	 * our current organization this just gives complications for
+	 * re-enabling interrupts.  We used to have races instead of the
+	 * necessary complications.
+	 */
+	CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 0);
+
+	/*
  	 * Do the mandatory PCI flush as well as get the link status.
  	 */
  	statusword = CSR_READ_4(sc, BGE_MAC_STS) & BGE_MACSTAT_LINK_CHANGED;

-	/* Ack interrupt and stop others from occuring. */
-	CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 1);
-
  	/* Make sure the descriptor ring indexes are coherent. */
  	bus_dmamap_sync(sc->bge_cdata.bge_status_tag,
@@ -2910,7 +2916,4 @@
  	}

-	/* Re-enable interrupts. */
-	CSR_WRITE_4(sc, BGE_MBX_IRQ0_LO, 0);
-
  	if (ifp->if_drv_flags & IFF_DRV_RUNNING &&
  	    !IFQ_DRV_IS_EMPTY(&ifp->if_snd))
%%%

The first race is that the ack was done after the "mandatory PCI flush".
This resulted in acking all interrupt events that occur before some
indeterminate future time (depending on when the write is flushed).
This was relatively harmless, since we acked again later.  However,
the patch removes the later ack since it is too hard to avoid the races
in the later one, so it might now be important to do the first ack
correctly.  It might be unimportant because if we handle any interrupt
event then we will flush the write as a side effect of reading something.
Anyway, flushing the write correctly just involves moving the ack
before the flush, so it is free.

The first foot shooting is that the hardware design apparently (?)
encourages this race by requiring (?) a write instead of a read for
the ack.

The second foot shooting is that we disabled interrupts.  Interrupts
are already disabled (for all devices sharing the interrupt), so with
the current organization disabling them for the device does nothing
except encourage the race for re-enabling interrupts.  The fix is to
not disable interrupts.  I think this has the minor disadvantage of
generating a new interrupt for any events that occur while we are
handling the old ones, even if we handle some of the new ones too.

The second race is that re-enabling interrupts also acks interripts.
This resulted in acking all interrupt events that occur before the
time that the device sees the ack, including any that occurred after
we finished looking for events to handle.  Since we don't loop, the
race window can be almost as large as the total interrupt handling
time (e.g., if we handle 0 rx events and many tx events, and a new rx
event occurs while we are handling the tx events).  The fix is to avoid
this complication.  The complication would have to be handled to use
a task queue.  The Linux tg3 driver uses something like a task queue.
tg3 also does a status check, so a status check is apparently not too
costly like a comment in previous versions of bge said.  Without a
status check, shared interrupts would work especially poorly, and
looping until all events have been handled wouldn't work.  With a
task queue, the method for re-enabling interrupts would probably need
to be:
     enable_interrupts_and_ack_as_an_unwanted_side effect();
     if ((statuscheck() & HAVE_INTERRUPT_EVENTS) == 0)
        return;
     disable_interrupts_and_ack_as_an_unnecessary_side effect();
     continue;

The third race is that we didn't even flush the problematic ack for
re-enabling interrupts.  This resulted in acking all interrupt events
that occur before some indeterminate future time over which we have
no control at all.

It is possible that bge is depending on these bugs for accidental
interrupt moderation.  I just noticed that bge (5705) with these patches
gets 5-20 times as many interrupts as bge (5701) without these patches
and with a much older kernel.  The 5701 has about half the ping latency
of the 5705 but is otherwise slower including having a lower pps despite
or because of being on a faster bus with a newer kernel.

Bruce