From owner-freebsd-net@FreeBSD.ORG  Mon Jun 19 08:04:34 2006
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
X-Original-To: freebsd-net@FreeBSD.org
Delivered-To: freebsd-net@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5095016A479;
	Mon, 19 Jun 2006 08:04:34 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84])
	by mx1.FreeBSD.org (Postfix) with ESMTP id AC700440E8;
	Mon, 19 Jun 2006 08:04:33 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au
	[61.8.2.162])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 8C1CD329330;
	Mon, 19 Jun 2006 18:04:30 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP
	id k5J84QxG025511; Mon, 19 Jun 2006 18:04:27 +1000
Date: Mon, 19 Jun 2006 18:04:26 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
In-Reply-To: <20060618194044.GC1142@funkthat.com>
Message-ID: <20060619162819.F44832@delplex.bde.org>
References: <20060615115738.J2512@fledge.watson.org>
	<XFMail.20060615091807.jdp@polstra.com>
	<20060618194044.GC1142@funkthat.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-net@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>,
	John Polstra <jdp@polstra.com>
Subject: Re: IF_HANDOFF vs. IFQ_HANDOFF

On Sun, 18 Jun 2006, John-Mark Gurney wrote:

> John Polstra wrote this message on Thu, Jun 15, 2006 at 09:18 -0700:
>> in the HW but have not yet completed.  When the completion interrupt
>> comes in, the driver is supposed to check the if_snd queue for more
>> mbufs and process them.  Only when the transmit side of the HW goes
>> totally idle should IFF_OACTIVE be cleared again.  Most of our drivers
>> set the flag only when they run out of transmit descriptors (i.e.,
>> practically never), which is just plain wrong.
>
> But the problem is that for small packets, this can mean that there
> will be a delay in handling the ring if we wait to process packets
> once the tx ring is empty.. if we ever want to max out gige w/ 64byte
> packets, we have to clear OACTIVE whenever tx approaches running out
> of packets before we can send this.. In most cases we don't know how
> long that is (since we don't keep track of packet sizes, etc), so it's
> easiest/best to clear it whenever the tx ring is not full...

To max out the link without unmaxing CPU for other uses, you do have
to know when the tx approaches running out of packets.  This is best
done using watermark stuff.  There should be a nearly-complete interrupt
at low water, and (only after low water is reached and the interrupt
handler doesn't refill the tx ring to be above low water again) a
completion interrupt at actual completion.  My version of the sk driver
does this.  It arranges for the nearly-complete interrupt at about 32
fragments (min 128 uS) before the tx runs dry, and no other tx interrupts
unless the queue length stays below 32, while the -current driver gets
an interrupt after every packet.  It does this mainly to reduce the
tx interrupt load from 1 per packet to (under load) 1 per 480 fragments.
The correct handling of OACTIVE is obtained as a side effect almost
automatically.  It must be decided when to interrupt (sk hardware
allows interrupting or not interrupting after every fragment), and it
would be obviously wrong to interrupt only after the last fragment in
the ring since the tx might run dry then (even if the tx interrupt
occurs when the last fragment is removed by the hardware from the ring
but before it is sent, it only takes a few uS to send it so the tx
would often run dry due to software latency).
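
A minimal sketch of the watermark scheme described above, as a toy model
and not the actual sk code.  The ring size of 512, the names tx_ring,
tx_fill and tx_hw_send_one, and the ability to re-place marks on
fragments that are already queued are all assumptions made for
illustration.  At most two marks are live: one TX_LOWAT fragments before
the end of the queue (nearly-complete) and one on the last fragment
(completion).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_RING_SIZE 512	/* assumed ring size, not taken from the driver */
#define TX_LOWAT      32	/* low-water mark, as in the text above */

struct tx_ring {
	bool   intr[TX_RING_SIZE];	/* "interrupt after this fragment" marks */
	size_t head;			/* next slot the hardware will send */
	size_t count;			/* fragments queued but not yet sent */
};

/*
 * Queue up to nfrags fragments and re-place the two interrupt marks.
 * Returns the number of fragments accepted.
 */
size_t
tx_fill(struct tx_ring *r, size_t nfrags)
{
	size_t room = TX_RING_SIZE - r->count;
	size_t n = nfrags < room ? nfrags : room;
	size_t i;

	r->count += n;
	if (r->count == 0)
		return (0);

	/* Drop stale marks so that at most two are live at any time. */
	for (i = 0; i < r->count; i++)
		r->intr[(r->head + i) % TX_RING_SIZE] = false;

	/* Nearly-complete mark: TX_LOWAT fragments remain after this one. */
	if (r->count > TX_LOWAT)
		r->intr[(r->head + r->count - TX_LOWAT - 1) % TX_RING_SIZE] = true;

	/* Completion mark on the last queued fragment. */
	r->intr[(r->head + r->count - 1) % TX_RING_SIZE] = true;
	return (n);
}

/* Model the hardware sending one fragment; true means "raise tx interrupt". */
bool
tx_hw_send_one(struct tx_ring *r)
{
	bool fire;

	assert(r->count > 0);
	fire = r->intr[r->head];
	r->intr[r->head] = false;
	r->head = (r->head + 1) % TX_RING_SIZE;
	r->count--;
	return (fire);
}
```

In this model, if the interrupt handler refills the ring at every
nearly-complete interrupt, the next interrupt arrives 512 - 32 = 480
fragments later, and the completion mark is only ever reached when the
handler has nothing left to queue.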

I'm not very familiar with NIC hardware and don't know how other NICs
support timing of tx interrupts, but watermark stuff like the above
is routine for serial devices/drivers.  sk's support for interrupting
on any fragment is too flexible to be good (it is painful to program,
and there doesn't seem to be a good way to time out if there is no
good fragment to interrupt on or when you program the interruption on
a wrong fragment).

Related serial device programming: 8250-16650 UARTs interrupt when the
last character is removed from the tx "ring".  This is not programmable,
but the delay is long enough at low speeds (87 uS at 115200 bps).  The
16950 UART has a programmable tx interrupt trigger level which defaults
to 1 character time.  The delay from this is too short at higher speeds
(11 uS at 921600 bps...).  I use 16.  The "tx" ring size of a 16950
is 128 characters.  Timing for characters in a UART at 921600 bps is
similar to timing for normal packets in 1G bps ethernet (1G/921600 ~=
1K ~= 1500+ normal ethernet packet size), so similar ring sizes and
trigger levels are good (smaller ones would be better for smaller
packets).  Strangely, at 921600 bps, the tx trigger levels become more
critical for maxing out the device than the rx trigger levels, since
rx is forced to keep up by the external device (provided that device maxes
out the connection and it is possible to keep up), while poorly chosen
tx trigger levels ensure significant dead time when the tx runs dry.
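
The timing figures above can be reproduced with a little arithmetic.
This is only a sketch: the function names are made up, and it assumes
10 bits on the wire per character (1 start + 8 data + 1 stop) and
ignores ethernet preamble and inter-frame gap.

```c
#include <assert.h>

/* Time in microseconds to shift one character out of a UART. */
double
uart_char_time_us(double bps, int bits_per_char)
{
	assert(bps > 0 && bits_per_char > 0);
	return (bits_per_char * 1e6 / bps);
}

/* Time in microseconds to put frame_bytes on an ethernet link. */
double
ether_frame_time_us(double bps, int frame_bytes)
{
	assert(bps > 0 && frame_bytes > 0);
	return (frame_bytes * 8.0 * 1e6 / bps);
}

/* Warning time given by a tx trigger level of `level' characters. */
double
uart_trigger_slack_us(double bps, int bits_per_char, int level)
{
	assert(level >= 0);
	return (level * uart_char_time_us(bps, bits_per_char));
}
```

Under those assumptions uart_char_time_us(115200, 10) is about 87 uS,
uart_char_time_us(921600, 10) is about 11 uS, and
ether_frame_time_us(1e9, 1500) is about 12 uS, so one character at
921600 bps and one full-size packet at 1G bps take roughly the same
time.  A trigger level of 16 at 921600 bps gives about 174 uS of warning.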

BTW, I can't see any significant effect (good or bad) from sk's
interrupt moderation, at least with tx changed as above.  sk's interrupt
moderation is very primitive compared with that of some NICs (it's
just a single timer for tx and rx).  Interrupting on every packet gives
too many interrupts, and my changes fix this much better than any
simple timeout-based moderation could do.  My changes don't help at
all for rx, and interrupt moderation doesn't seem to help either.
OTOH, fxp's interrupt moderation works well in practice (I don't know
how) and em's interrupt moderation works well in theory (I understand
its documentation but haven't used any em devices).  em has several
independent trigger levels and timeouts, and the problem of using them
effectively for rx is one of predicting future traffic.  IIRC, em has
sysctls to move this problem to the user.

In the current sk driver, I think keeping IFF_OACTIVE set for longer
would work, and you can also keep track of the queue length, because
of the excessive interrupts -- you get an interrupt after every packet
(modulo interrupt moderation), not just on completion, and the interrupt
handler can both keep the h/w queue full while IFF_OACTIVE is set and
keep track of the queue length as needed for deciding when to set
IFF_OACTIVE.  The CPU usage is thus large no matter whether IFF_OACTIVE
is set correctly.  Interrupt moderation complicates things and unmaxes
the link.  The interrupt moderation timeout is normally set to 100 uS.
This allows significant tx-dry times (the worst case (if IFF_OACTIVE
is not set incorrectly) is sending a tinygram in 4uS, idling for ~96
uS, ...) but isn't very moderate since sending or receiving a normal
packet takes about 15uS.  I think the interrupt moderation timeout for
sk is purely periodic, while for better hardware (even 16550 UARTs!)
at least rx timeouts only occur after the device (in the relevant
direction) has been idle for some time.
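
The moderation numbers above work out as follows.  This is a
back-of-envelope sketch with made-up function names; it assumes a purely
periodic timer and, for the worst case, that the tx is refilled with a
single packet once per period.

```c
#include <assert.h>

/* Worst-case time the tx sits idle in each moderation period. */
double
tx_idle_us(double send_us, double period_us)
{
	assert(send_us > 0 && period_us >= send_us);
	return (period_us - send_us);
}

/* Worst-case fraction of each moderation period that the tx is busy. */
double
tx_busy_fraction(double send_us, double period_us)
{
	assert(send_us > 0 && period_us >= send_us);
	return (send_us / period_us);
}

/* Packets handled per interrupt when the link is kept full. */
double
packets_per_intr(double pkt_us, double period_us)
{
	assert(pkt_us > 0 && period_us > 0);
	return (period_us / pkt_us);
}
```

With a 4 uS tinygram and a 100 uS timeout the tx is idle for 96 uS and
busy only 4% of the time in the worst case, while with 15 uS packets
there are fewer than 7 packets per interrupt, which is why the timeout
both unmaxes the link and isn't very moderate.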

Bruce