From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 14:04:49 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4A19616A4E2;
	Sun, 30 Jul 2006 14:04:49 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B771B43D5C;
	Sun, 30 Jul 2006 14:04:48 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 45F3A46BBD;
	Sun, 30 Jul 2006 10:04:48 -0400 (EDT)
Date: Sun, 30 Jul 2006 15:04:48 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: net@FreeBSD.org, arch@FreeBSD.org
Message-ID: <20060730141642.D16341@fledge.watson.org>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-1695162780-1154268288=:16341"
Cc: 
Subject: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 14:04:49 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-1695162780-1154268288=:16341
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed


One of the ideas that I, Scott Long, and a few others have been bouncing 
around for some time is a restructuring of the network interface packet 
transmission API to reduce the number of locking operations and allow network 
device drivers increased control of the queueing behavior.  Right now, it 
works something like the following:

- When a network protocol wants to transmit, it calls the ifnet's link layer
   output routine via ifp->if_output() with the ifnet pointer, packet,
   destination address information, and route information.

- The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
   the packet as necessary, performs a link layer address translation (such as
   ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which
   accepts the ifnet pointer and packet.

- The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd),
   and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs
   to "start" output by the driver.  If the driver is already active, it
   doesn't, and otherwise, it does.

- The driver dequeues the packet from ifp->if_snd, performs any driver
   encapsulation and wrapping, and notifies the hardware.  In modern hardware,
   this consists of hooking the data of the packet up to the descriptor ring
   and notifying the hardware to pick it up via DMA.  In older hardware, the
   driver would perform a series of I/O operations to send the entire packet
   directly to the card via a system bus.
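
To make the current path concrete, here is a rough user-space sketch of the
last two steps above.  It is illustration only and is not kernel code:
struct pkt, struct sim_ifnet, sim_handoff() and sim_driver_start() are
made-up stand-ins for the mbuf, the ifnet, IFQ_HANDOFF() and the driver's
if_start routine, and the "lock" is just a counter so that the per-packet
lock operations are visible.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-ins; these are NOT the kernel definitions. */
struct pkt {
	struct pkt *next;
};

struct sim_lock {
	int held;
	unsigned ops;			/* counts every acquire and release */
};

struct sim_ifnet {
	struct sim_lock snd_lock;	/* models the if_snd queue mutex */
	struct pkt *snd_head, *snd_tail; /* models ifp->if_snd */
	int snd_len, snd_maxlen;
	int oactive;			/* models IFF_DRV_OACTIVE */
	void (*start)(struct sim_ifnet *); /* models ifp->if_start */
	int tx_count;			/* packets given to the "hardware" */
};

static void
lock_acquire(struct sim_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->ops++;
}

static void
lock_release(struct sim_lock *l)
{
	assert(l->held);
	l->held = 0;
	l->ops++;
}

/* The ifnet layer enqueues, then starts the driver if it is idle. */
static int
sim_handoff(struct sim_ifnet *ifp, struct pkt *p)
{
	lock_acquire(&ifp->snd_lock);
	if (ifp->snd_len >= ifp->snd_maxlen) {
		lock_release(&ifp->snd_lock);
		return (-1);	/* queue full; the kernel reports ENOBUFS */
	}
	p->next = NULL;
	if (ifp->snd_tail != NULL)
		ifp->snd_tail->next = p;
	else
		ifp->snd_head = p;
	ifp->snd_tail = p;
	ifp->snd_len++;
	lock_release(&ifp->snd_lock);

	if (!ifp->oactive)
		ifp->start(ifp);
	return (0);
}

/* The driver drains the send queue and feeds the hardware. */
static void
sim_driver_start(struct sim_ifnet *ifp)
{
	struct pkt *p;

	while (ifp->snd_len > 0) {
		lock_acquire(&ifp->snd_lock);
		p = ifp->snd_head;
		ifp->snd_head = p->next;
		if (ifp->snd_head == NULL)
			ifp->snd_tail = NULL;
		ifp->snd_len--;
		lock_release(&ifp->snd_lock);
		ifp->tx_count++;	/* stands in for descriptor setup */
	}
}
```

With an idle driver, one call to sim_handoff() costs two lock operations for
the insert and two for the remove, even though the queue never holds more
than the one packet.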

Why change this?  A few reasons:

- The ifnet layer send queue is becoming decreasingly useful over time.  Most
   modern hardware has a significant number of slots in its transmit descriptor
   ring, tuned for the performance of the hardware, etc, which is the effective
   transmit queue in practice.  The additional queue depth doesn't increase
   throughput substantially (if at all) but does consume memory.

- On extremely fast hardware (with respect to CPU speed), the queue remains
   essentially empty, so we pay the cost of enqueueing and dequeuing a packet
   from an empty queue.

- The ifnet send queue is a separately locked object from the device driver,
   meaning that for a single enqueue/dequeue pair, we pay an extra four lock
   operations (two for insert, two for remove) per packet.

- For synthetic link layer drivers, such as if_vlan, which have no need for
   queueing at all, the cost of queueing is eliminated.

- IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
   driver, which helps eliminate a latent race condition involving use of the
   flag.

The proposed change is simple: right now one or more enqueue operations 
occur, and then a call to ifp->if_start() is made to notify the driver that it 
may need to do something (if the OACTIVE flag isn't set).  In the new world 
order, the driver is directly passed the mbuf, and may then choose to queue it 
or otherwise handle it as it sees fit.  The immediate practical benefit is 
clear: if the queueing at the ifnet layer is unnecessary, it is entirely 
avoided, skipping enqueue, dequeue, and four mutex operations.  This applies 
immediately for VLAN processing, but also means that for modern gigabit cards, 
the hardware queue (which will be used anyway) is the only queue necessary.
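
As a rough illustration of the direct handoff (a user-space sketch, not the
patch itself), the fragment below passes a packet through a VLAN-like
pseudo-interface to a physical driver with no software queue in between.
All names here (struct pkt, struct sim_ifnet, sim_nic_startmbuf(),
sim_vlan_startmbuf(), sim_handoff()) are invented for the example, and the
descriptor ring is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not the kernel's mbuf or ifnet. */
struct pkt {
	int vlan_tag;			/* 0 means untagged */
};

struct sim_ifnet {
	/* models the proposed ifp->if_startmbuf entry point */
	int (*startmbuf)(struct sim_ifnet *, struct pkt *);
	struct sim_ifnet *parent;	/* used by the VLAN-like interface */
	int vlan_id;
	int ring_used, ring_size;	/* models the transmit descriptor ring */
};

/* Physical driver: the descriptor ring is the only queue. */
static int
sim_nic_startmbuf(struct sim_ifnet *ifp, struct pkt *p)
{
	(void)p;
	if (ifp->ring_used == ifp->ring_size)
		return (-1);	/* ring full; a real driver would queue or drop */
	ifp->ring_used++;	/* stands in for filling a descriptor */
	return (0);
}

/* VLAN-like pseudo-driver: tag the packet and pass it straight down. */
static int
sim_vlan_startmbuf(struct sim_ifnet *ifp, struct pkt *p)
{
	p->vlan_tag = ifp->vlan_id;
	return (ifp->parent->startmbuf(ifp->parent, p));
}

/* Link-layer handoff under the proposed model: no ifnet-level queue. */
static int
sim_handoff(struct sim_ifnet *ifp, struct pkt *p)
{
	return (ifp->startmbuf(ifp, p));
}
```

The packet goes from the link layer to the ring through two function calls,
with no enqueue, dequeue, or queue mutex along the way.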

There are a few downsides, of course:

- For older hardware without its own queueing, the queue is still required --
   not only that, but we've now introduced an unconditional function pointer
   invocation, which on older hardware has a more significant relative cost
   than it has on more recent CPUs.

- If drivers still require or use a queue, they must now synchronize access to
   the queue.  The obvious choices are to use the ifq lock (and restore the
   above four lock operations), or to use the driver mutex (and risk higher
   contention).  Right now, if the driver is busy (driver mutex held) then an
   enqueue is still possible, but with this change and a single mutex
   protecting the send queue and driver, that is no longer possible.
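
The contention trade-off in the second point can be shown with a toy model.
This is only a sketch under simplifying assumptions: struct sim_mutex is a
flag rather than a real mutex, and sim_try_enqueue() reports whether an
enqueue could go ahead immediately instead of actually blocking.  None of
these names exist in the kernel.

```c
#include <assert.h>

/* Simplified stand-ins for illustration; not the kernel's struct mtx. */
struct sim_mutex {
	int held;
};

struct sim_driver {
	struct sim_mutex drv_mtx;	/* models the driver mutex */
	struct sim_mutex ifq_mtx;	/* models the separate send-queue mutex */
	int share_lock;			/* 1: the queue is protected by drv_mtx */
	int queued;
};

/* Returns 1 and takes the mutex if it is free, 0 if it is already held. */
static int
mutex_trylock(struct sim_mutex *m)
{
	if (m->held)
		return (0);
	m->held = 1;
	return (1);
}

static void
mutex_unlock(struct sim_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/*
 * Try to enqueue one packet.  Returns 1 if the enqueue went ahead
 * immediately, 0 if the caller would have had to wait for the lock.
 */
static int
sim_try_enqueue(struct sim_driver *d)
{
	struct sim_mutex *m = d->share_lock ? &d->drv_mtx : &d->ifq_mtx;

	if (!mutex_trylock(m))
		return (0);
	d->queued++;
	mutex_unlock(m);
	return (1);
}
```

With drv_mtx held (driver busy), sim_try_enqueue() still succeeds when
share_lock is 0, but has to wait once the queue shares the driver mutex.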

Attached is a patch that maintains the current if_start, but adds 
if_startmbuf.  If a device driver implements if_startmbuf and the global 
sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the 
driver will be used.  Otherwise, if_start is used.  I have modified the if_em 
driver to implement if_startmbuf also.  If there is no packet backlog in the 
if_snd queue, it directly places the packet in the transmit descriptor ring. 
If there is a backlog, it uses the if_snd queue protected by driver mutex, 
rather than a separate ifq mutex.
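
For readers who do not want to decode the attachment, the following
user-space sketch mirrors the behavior described above: the handoff picks
if_startmbuf when it is enabled and implemented, and the driver routine
distinguishes a busy interface, an empty queue, and a backlog.  It is a
simplified model rather than the patch: the sim_* names, the fixed-size
array standing in for if_snd, and the counters standing in for the
descriptor ring and the driver mutex are all assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_QMAX 4

/* Simplified stand-ins for illustration; not the kernel's mbuf or ifnet. */
struct pkt {
	int id;
};

struct sim_ifnet {
	int (*startmbuf)(struct sim_ifnet *, struct pkt *); /* may be NULL */
	int drv_locked;			/* models the driver mutex */
	int oactive;			/* models IFF_DRV_OACTIVE */
	int ring_used, ring_size;	/* models the transmit descriptor ring */
	struct pkt *snd[SIM_QMAX];	/* models if_snd as a small FIFO */
	int snd_len;
	int legacy_calls;		/* times the old path was taken */
};

static int startmbuf_enabled = 1;	/* models net.startmbuf_enabled */

/* Append to the software queue; the caller holds the driver lock. */
static int
sim_enqueue(struct sim_ifnet *ifp, struct pkt *p)
{
	assert(ifp->drv_locked);
	if (ifp->snd_len == SIM_QMAX)
		return (-1);		/* full: drop (ENOBUFS in the kernel) */
	ifp->snd[ifp->snd_len++] = p;
	return (0);
}

/* Move queued packets into the ring, oldest first, while there is room. */
static void
sim_drain(struct sim_ifnet *ifp)
{
	int i, n = 0;

	assert(ifp->drv_locked);
	while (n < ifp->snd_len && ifp->ring_used < ifp->ring_size) {
		ifp->ring_used++;
		n++;
	}
	for (i = n; i < ifp->snd_len; i++)
		ifp->snd[i - n] = ifp->snd[i];
	ifp->snd_len -= n;
	ifp->oactive = (ifp->snd_len > 0);	/* ring full, work pending */
}

/* Driver entry point covering the three cases. */
static int
sim_startmbuf(struct sim_ifnet *ifp, struct pkt *p)
{
	int error = 0;

	ifp->drv_locked = 1;
	if (ifp->oactive) {
		/* Case 1: hardware busy, just queue the packet. */
		error = sim_enqueue(ifp, p);
	} else if (ifp->snd_len == 0 && ifp->ring_used < ifp->ring_size) {
		/* Case 2: no backlog, place it straight into the ring. */
		ifp->ring_used++;
	} else {
		/* Case 3: backlog, queue behind it and drain in order. */
		error = sim_enqueue(ifp, p);
		if (error == 0)
			sim_drain(ifp);
	}
	ifp->drv_locked = 0;
	return (error);
}

/* Stand-in for the existing path: enqueue on if_snd, then if_start(). */
static int
sim_legacy_handoff(struct sim_ifnet *ifp, struct pkt *p)
{
	(void)p;
	ifp->legacy_calls++;
	return (0);
}

/* Handoff: use if_startmbuf when enabled and implemented, else fall back. */
static int
sim_handoff(struct sim_ifnet *ifp, struct pkt *p)
{
	if (startmbuf_enabled && ifp->startmbuf != NULL)
		return (ifp->startmbuf(ifp, p));
	return (sim_legacy_handoff(ifp, p));
}
```

Note that in this model the software queue is only ever touched with the
driver lock held, so there is no second mutex on the transmit path.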

In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte 
payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance 
improvement in the bulk serving of 1k files over HTTP.  These are only 
micro-benchmarks, and reflect a configuration in which the CPU is unable to 
keep up with the output rate of the 1gbps ethernet card in the device, so 
reductions in host CPU usage are immediately visible in increased output as 
the CPU is able to better keep up with the network hardware.  Other 
configurations are also of interest, especially ones in which 
the network device is unable to keep up with the CPU, resulting in more 
queueing.

Conceptual review as well as benchmarking, etc, would be most welcome.

Robert N M Watson
Computer Laboratory
University of Cambridge
--0-1695162780-1154268288=:16341
Content-Type: TEXT/plain; charset=US-ASCII; name=20060730-if_startmbuf.diff
Content-Transfer-Encoding: BASE64
Content-ID: <20060730150448.G16341@fledge.watson.org>
Content-Description: 
Content-Disposition: attachment; filename=20060730-if_startmbuf.diff

LS0tIC8vZGVwb3QvdmVuZG9yL2ZyZWVic2Qvc3JjL3N5cy9kZXYvZW0vaWZf
ZW0uYwkyMDA2LzA3LzI3IDAwOjQ2OjI0DQorKysgLy9kZXBvdC91c2VyL3J3
YXRzb24vaWZuZXQvc3JjL3N5cy9kZXYvZW0vaWZfZW0uYwkyMDA2LzA3LzI5
IDE4OjQzOjE0DQpAQCAtNzM1LDYgKzczNSw5NSBAQA0KIAlFTV9VTkxPQ0so
c2MpOw0KIH0NCiANCitzdGF0aWMgaW50DQorZW1fc3RhcnRtYnVmKHN0cnVj
dCBpZm5ldCAqaWZwLCBzdHJ1Y3QgbWJ1ZiAqbSkNCit7DQorICAgICAgICBz
dHJ1Y3QgbWJ1ZiAgICAqbV9oZWFkOw0KKyAgICAgICAgc3RydWN0IGVtX3Nv
ZnRjICpzYyA9IGlmcC0+aWZfc29mdGM7DQorCXN0cnVjdCBpZnF1ZXVlICpp
ZnEgPSAoc3RydWN0IGlmcXVldWUgKikmaWZwLT5pZl9zbmQ7DQorDQorCS8q
DQorCSAqIFRocmVlIGNhc2VzOg0KKwkgKg0KKwkgKiAoMSkgSW50ZXJmYWNl
IGlzbid0IHJ1bm5pbmcsIGxpbmsgaXMgZG93biwgb3IgaXMgYWxyZWFkeSBh
Y3RpdmUsDQorCSAqICAgICBldGMsIHNpbXBseSBlbnF1ZXVlLg0KKwkgKg0K
KwkgKiAoMikgVGhlIGludGVyZmFjZSBpcyBydW5uaW5nLCBub3QgdG9vIGJ1
c3ksIGFuZCB3ZSBoYXZlIG5vIG1idWZzDQorCSAqICAgICBpbiB0aGUgaWZu
ZXQgc2VuZCBxdWV1ZSwgc28gdHJ5IHRvIGhhbmQgZGlyZWN0bHkgdG8gaGFy
ZHdhcmUuDQorCSAqDQorCSAqICgzKSBUaGUgaW50ZXJmYWNlIGlzIHJ1bm5p
bmcsIGJ1dCB3ZSBoYXZlIGEgYmFja2xvZy4gIEluc2VydCB0aGUNCisJICog
ICAgIGN1cnJlbnQgbWJ1ZiBpbnRvIHRoZSBxdWV1ZSBhbmQgcHJvY2VzcyBp
bi1vcmRlciwgaWYgcG9zc2libGUuDQorCSAqLw0KKwlFTV9MT0NLKHNjKTsN
CisJaWYgKCgoaWZwLT5pZl9kcnZfZmxhZ3MgJiAoSUZGX0RSVl9SVU5OSU5H
fElGRl9EUlZfT0FDVElWRSkpICE9DQorCSAgICBJRkZfRFJWX1JVTk5JTkcp
IHx8ICFzYy0+bGlua19hY3RpdmUpIHsNCisJCWlmIChfSUZfUUZVTEwoaWZx
KSkgew0KKwkJCV9JRl9EUk9QKGlmcSk7DQorCQkJRU1fVU5MT0NLKHNjKTsN
CisJCQltX2ZyZWVtKG0pOw0KKwkJCXJldHVybiAoRU5PQlVGUyk7DQorCQl9
DQorCQlfSUZfRU5RVUVVRShpZnEsIG0pOw0KKwkJRU1fVU5MT0NLKHNjKTsN
CisJCXJldHVybiAoMCk7DQorCX0NCisNCisJLyoNCisJICogWFhYUlc6IFZh
cmlvdXMgY2FzZXMgaGVyZSBoYXZlIGhpc3RvcmljYWxseSBjb3VudGVkIGFz
IHN1Y2Nlc3NlcywNCisJICogYnV0IHBlcmhhcHMgdGhleSBzaG91bGQgcmV0
dXJuIEVOT0JVRlM/DQorCSAqLw0KKwlpZiAoX0lGX1FMRU4oaWZxKSA9PSAw
KSB7DQorCSAJLyoNCisJCSAqIGVtX2VuY2FwKCkgY2FuIG1vZGlmeSBvdXIg
cG9pbnRlciwgYW5kIG9yIG1ha2UgaXQgTlVMTCBvbg0KKwkJICogZmFpbHVy
ZS4gIEluIHRoYXQgZXZlbnQsIHdlIGNhbid0IGVucXVldWUuDQorCQkgKi8N
CisJCWlmIChlbV9lbmNhcChzYywgJm0pKSB7DQorCQkJaWYgKG0gPT0gTlVM
TCkgew0KKwkJCQlFTV9VTkxPQ0soc2MpOw0KKwkJCQlyZXR1cm4gKDApOw0K
KwkJCX0NCisJCQlpZnAtPmlmX2ZsYWdzIHw9IElGRl9EUlZfT0FDVElWRTsN
CisJCQlfSUZfUFJFUEVORChpZnEsIG0pOw0KKwkJCUVNX1VOTE9DSyhzYyk7
DQorCQkJcmV0dXJuICgwKTsNCisJCX0NCisJCUJQRl9NVEFQKGlmcCwgbSk7
DQorCQlpZnAtPmlmX3RpbWVyID0gRU1fVFhfVElNRU9VVDsNCisJCUVNX1VO
TE9DSyhzYyk7DQorCQlyZXR1cm4gKDApOw0KKwl9DQorDQorCWlmIChfSUZf
UUZVTEwoaWZxKSkgew0KKwkJX0lGX0RST1AoaWZxKTsNCisJCUVNX1VOTE9D
SyhzYyk7DQorCQltX2ZyZWVtKG0pOw0KKwkJcmV0dXJuIChFTk9CVUZTKTsN
CisJfQ0KKwlfSUZfRU5RVUVVRShpZnEsIG0pOw0KKw0KKwl3aGlsZSAoIUlG
UV9EUlZfSVNfRU1QVFkoJmlmcC0+aWZfc25kKSkgew0KKwkJSUZRX0RSVl9E
RVFVRVVFKCZpZnAtPmlmX3NuZCwgbV9oZWFkKTsNCisJCWlmIChtX2hlYWQg
PT0gTlVMTCkNCisJCQlicmVhazsNCisJIAkvKg0KKwkJICogZW1fZW5jYXAo
KSBjYW4gbW9kaWZ5IG91ciBwb2ludGVyLCBhbmQgb3IgbWFrZSBpdCBOVUxM
IG9uDQorCQkgKiBmYWlsdXJlLiAgSW4gdGhhdCBldmVudCwgd2UgY2FuJ3Qg
cmVxdWV1ZS4NCisJCSAqLw0KKwkJaWYgKGVtX2VuY2FwKHNjLCAmbV9oZWFk
KSkgew0KKwkJCWlmIChtX2hlYWQgPT0gTlVMTCkNCisJCQkJYnJlYWs7DQor
CQkJaWZwLT5pZl9kcnZfZmxhZ3MgfD0gSUZGX0RSVl9PQUNUSVZFOw0KKwkJ
CUlGUV9EUlZfUFJFUEVORCgmaWZwLT5pZl9zbmQsIG1faGVhZCk7DQorCQkJ
YnJlYWs7DQorCQl9DQorCQlCUEZfTVRBUChpZnAsIG1faGVhZCk7DQorCQlp
ZnAtPmlmX3RpbWVyID0gRU1fVFhfVElNRU9VVDsNCisJfQ0KKw0KKwlFTV9V
TkxPQ0soc2MpOw0KKwlyZXR1cm4gKDApOw0KK30NCisNCiAvKioqKioqKioq
KioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioq
KioqKioqKioqKioqKioqDQogICogIElvY3RsIGVudHJ5IHBvaW50DQogICoN
CkBAIC0yMTU0LDYgKzIyNDMsNyBAQA0KIAlpZnAtPmlmX2ZsYWdzID0gSUZG
X0JST0FEQ0FTVCB8IElGRl9TSU1QTEVYIHwgSUZGX01VTFRJQ0FTVDsNCiAJ
aWZwLT5pZl9pb2N0bCA9IGVtX2lvY3RsOw0KIAlpZnAtPmlmX3N0YXJ0ID0g
ZW1fc3RhcnQ7DQorCWlmcC0+aWZfc3RhcnRtYnVmID0gZW1fc3RhcnRtYnVm
Ow0KIAlpZnAtPmlmX3dhdGNoZG9nID0gZW1fd2F0Y2hkb2c7DQogCUlGUV9T
RVRfTUFYTEVOKCZpZnAtPmlmX3NuZCwgc2MtPm51bV90eF9kZXNjIC0gMSk7
DQogCWlmcC0+aWZfc25kLmlmcV9kcnZfbWF4bGVuID0gc2MtPm51bV90eF9k
ZXNjIC0gMTsNCi0tLSAvL2RlcG90L3ZlbmRvci9mcmVlYnNkL3NyYy9zeXMv
bmV0L2lmLmMJMjAwNi8wNy8wOSAwNjowNjoyNQ0KKysrIC8vZGVwb3QvdXNl
ci9yd2F0c29uL2lmbmV0L3NyYy9zeXMvbmV0L2lmLmMJMjAwNi8wNy8yNiAx
NzozMjo1MA0KQEAgLTI0ODYsMjggKzI0ODYsMTExIEBADQogCShpZnAtPmlm
X3N0YXJ0KShpZnApOw0KIH0NCiANCitzdGF0aWMgaW50CXN0YXJ0bWJ1Zl9l
bmFibGVkOw0KK1NZU0NUTF9JTlQoX25ldCwgT0lEX0FVVE8sIHN0YXJ0bWJ1
Zl9lbmFibGVkLCBDVExGTEFHX1JXLCAmc3RhcnRtYnVmX2VuYWJsZWQsDQor
ICAgIDAsICIiKTsNCisNCisvKg0KKyAqIFhYWFJXOg0KKyAqDQorICogaWZf
dmFyLmggYW5kIHRoZSBpbnRlcmZhY2UgaGFuZG9mZiBhcmUgc29tZSBvZiB0
aGUgbmFzdGllc3QgcGllY2VzIG9mIHRoZQ0KKyAqIEJTRCBuZXR3b3JrIHN0
YWNrLiAgR2VuZXJhdGlvbnMgb2YgaGFja3MsIHZhcmlhbnRzLCBpbmNvbnNp
c3RlbmN5LCBhbmQNCisgKiBmb29saXNobmVzcyBoYXZlIHJlc3VsdGVkIGlu
IGVzc2VudGlhbGx5IHVucmVhZGFibGUgY29kZS4gIEZvciBleGFtcGxlLA0K
KyAqIHdoeSBhcmUgdGhlIGlmcV8qIGludGVyZmFjZXMgdGhlIG9uZXMgdGhh
dCB1c2UgdGhlIGRlZmF1bHQgaWZuZXQgc2VuZA0KKyAqIHF1ZXVlLCBhbmQg
dGhlIGlmXyogaW50ZXJmYWNlcyB0aGUgb25lcyB0aGF0IHVzZSBhbHRlcm5h
dGl2ZSBxdWV1ZXMsDQorICogcG9zc2libHkgd2l0aCBubyBpZm5ldCBhdCBh
bGw/ICBBbmQgd2h5IGRvIHNvbWUgaW50ZXJmYWNlcyByZXR1cm4gZXJybm8N
CisgKiB2YWx1ZXMsIGJ1dCBvdGhlcnMgYm9vbGVhbnM/DQorICovDQorDQor
LyoNCisgKiBIYW5kb2ZmIGZ1bmN0aW9uIGZvciBzaW1wbGUgaWZuZXQgc3Ry
dWN0dXJlcy4gIFJldHVybnMgYW4gZXJybm8gdmFsdWUuDQorICovDQogaW50
DQotaWZfaGFuZG9mZihzdHJ1Y3QgaWZxdWV1ZSAqaWZxLCBzdHJ1Y3QgbWJ1
ZiAqbSwgc3RydWN0IGlmbmV0ICppZnAsIGludCBhZGp1c3QpDQoraWZxX2hh
bmRvZmYoc3RydWN0IGlmbmV0ICppZnAsIHN0cnVjdCBtYnVmICptLCBpbnQg
YWRqdXN0KQ0KK3sNCisJaW50IGVycm9yLCBsZW4sIHN0YXJ0bWJ1ZjsNCisJ
c2hvcnQgbWZsYWdzOw0KKw0KKwlsZW4gPSBtLT5tX3BrdGhkci5sZW47DQor
CW1mbGFncyA9IG0tPm1fZmxhZ3M7DQorDQorCWlmIChzdGFydG1idWZfZW5h
YmxlZCAmJiBpZnAtPmlmX3N0YXJ0bWJ1ZiAhPSBOVUxMKQ0KKwkJc3RhcnRt
YnVmID0gMTsNCisJZWxzZQ0KKwkJc3RhcnRtYnVmID0gMDsNCisNCisJaWYg
KHN0YXJ0bWJ1ZikNCisJCWVycm9yID0gaWZwLT5pZl9zdGFydG1idWYoaWZw
LCBtKTsNCisJZWxzZQ0KKwkJSUZRX0VOUVVFVUUoJmlmcC0+aWZfc25kLCBt
LCBlcnJvcik7DQorCWlmIChlcnJvciA9PSAwKSB7DQorCQlpZnAtPmlmX29i
eXRlcyArPSBsZW4gKyBhZGp1c3Q7DQorCQlpZiAobWZsYWdzICYgKE1fQkNB
U1R8TV9NQ0FTVCkpDQorCQkJaWZwLT5pZl9vbWNhc3RzKys7DQorCX0NCisJ
aWYgKCFzdGFydG1idWYgJiYgKGlmcC0+aWZfZHJ2X2ZsYWdzICYgSUZGX0RS
Vl9PQUNUSVZFKSA9PSAwKQ0KKwkJaWZfc3RhcnQoaWZwKTsNCisJcmV0dXJu
IChlcnJvcik7DQorfQ0KKw0KKy8qDQorICogSGFuZG9mZiBmdW5jdGlvbiBm
b3IgYW4gaWZxdWV1ZSB3aXRoIGFuIG9wdGlvbmFsbHkgYWZmaWxpdGlhdGVk
IGlmbmV0Lg0KKyAqIFJldHVybnMgYSBib29sZWFuLg0KKyAqLw0KK2ludA0K
K2lmX2hhbmRvZmYoc3RydWN0IGlmcXVldWUgKmlmcSwgc3RydWN0IG1idWYg
Km0sIHN0cnVjdCBpZm5ldCAqaWZwLA0KKyAgICBpbnQgYWRqdXN0KQ0KK3sN
CisJaW50IGxlbiwgYWN0aXZlLCBzdGFydG1idWYsIHN1Y2Nlc3M7DQorCXNo
b3J0IG1mbGFnczsNCisNCisJYWN0aXZlID0gMDsNCisJbGVuID0gbS0+bV9w
a3RoZHIubGVuOw0KKwltZmxhZ3MgPSBtLT5tX2ZsYWdzOw0KKw0KKwlpZiAo
c3RhcnRtYnVmX2VuYWJsZWQgJiYgaWZwICE9IE5VTEwgJiYgaWZwLT5pZl9z
dGFydG1idWYgIT0gTlVMTCkNCisJCXN0YXJ0bWJ1ZiA9IDE7DQorCWVsc2UN
CisJCXN0YXJ0bWJ1ZiA9IDA7DQorDQorCWlmIChzdGFydG1idWYpDQorCQlz
dWNjZXNzID0gKGlmcC0+aWZfc3RhcnRtYnVmKGlmcCwgbSkgPT0gMCk7DQor
CWVsc2Ugew0KKwkJSUZfTE9DSyhpZnEpOw0KKwkJaWYgKF9JRl9RRlVMTChp
ZnEpKSB7DQorCQkJX0lGX0RST1AoaWZxKTsNCisJCQltX2ZyZWVtKG0pOw0K
KwkJCXN1Y2Nlc3MgPSAwOw0KKwkJfSBlbHNlIHsNCisJCQlfSUZfRU5RVUVV
RShpZnEsIG0pOw0KKwkJCXN1Y2Nlc3MgPSAxOw0KKwkJfQ0KKwkJSUZfVU5M
T0NLKGlmcSk7DQorCQlpZiAoaWZwICE9IE5VTEwgJiYgIShpZnAtPmlmX2Ry
dl9mbGFncyAmIElGRl9EUlZfT0FDVElWRSkpDQorCQkJaWZfc3RhcnQoaWZw
KTsNCisJfQ0KKwlpZiAoc3VjY2VzcyAmJiBpZnAgIT0gTlVMTCkgew0KKwkJ
aWZwLT5pZl9vYnl0ZXMgKz0gbGVuICsgYWRqdXN0Ow0KKwkJaWYgKG0tPm1f
ZmxhZ3MgJiAoTV9CQ0FTVHxNX01DQVNUKSkNCisJCQlpZnAtPmlmX29tY2Fz
dHMrKzsNCisJfQ0KKwlyZXR1cm4gKHN1Y2Nlc3MpOw0KK30NCisNCisvKg0K
KyAqIFV0aWxpdHkgZnVuY3Rpb24gdG8gYmUgdXNlZCBieSBkZXZpY2UgZHJp
dmVycyB3aGVuIHRoZXkgbmVlZCB0byBlbnF1ZXVlIGENCisgKiBwYWNrZXQg
dG8gYW4gaW50ZXJmYWNlLXJlbGF0ZWQgcXVldWUgcmF0aGVyIHRoYW4gaW1t
ZWRpYXRlbHkgZGVsaXZlcmluZy4NCisgKi8NCitpbnQNCitpZl9zdGFydG1i
dWZfZW5xdWV1ZShzdHJ1Y3QgaWZxdWV1ZSAqaWZxLCBzdHJ1Y3QgbWJ1ZiAq
bSkNCiB7DQotCWludCBhY3RpdmUgPSAwOw0KIA0KLQlJRl9MT0NLKGlmcSk7
DQogCWlmIChfSUZfUUZVTEwoaWZxKSkgew0KIAkJX0lGX0RST1AoaWZxKTsN
Ci0JCUlGX1VOTE9DSyhpZnEpOw0KIAkJbV9mcmVlbShtKTsNCiAJCXJldHVy
biAoMCk7DQogCX0NCi0JaWYgKGlmcCAhPSBOVUxMKSB7DQotCQlpZnAtPmlm
X29ieXRlcyArPSBtLT5tX3BrdGhkci5sZW4gKyBhZGp1c3Q7DQotCQlpZiAo
bS0+bV9mbGFncyAmIChNX0JDQVNUfE1fTUNBU1QpKQ0KLQkJCWlmcC0+aWZf
b21jYXN0cysrOw0KLQkJYWN0aXZlID0gaWZwLT5pZl9kcnZfZmxhZ3MgJiBJ
RkZfRFJWX09BQ1RJVkU7DQotCX0NCiAJX0lGX0VOUVVFVUUoaWZxLCBtKTsN
Ci0JSUZfVU5MT0NLKGlmcSk7DQotCWlmIChpZnAgIT0gTlVMTCAmJiAhYWN0
aXZlKQ0KLQkJaWZfc3RhcnQoaWZwKTsNCiAJcmV0dXJuICgxKTsNCiB9DQog
DQotLS0gLy9kZXBvdC92ZW5kb3IvZnJlZWJzZC9zcmMvc3lzL25ldC9pZl92
YXIuaAkyMDA2LzA2LzE5IDIyOjIxOjIyDQorKysgLy9kZXBvdC91c2VyL3J3
YXRzb24vaWZuZXQvc3JjL3N5cy9uZXQvaWZfdmFyLmgJMjAwNi8wNy8zMCAx
MDoxMTo1NA0KQEAgLTE2Miw3ICsxNjIsOCBAQA0KIAkJKHN0cnVjdCBpZm5l
dCAqLCBzdHJ1Y3Qgc29ja2FkZHIgKiosIHN0cnVjdCBzb2NrYWRkciAqKTsN
CiAJc3RydWN0CWlmYWRkcgkqaWZfYWRkcjsJLyogcG9pbnRlciB0byBsaW5r
LWxldmVsIGFkZHJlc3MgKi8NCiAJdm9pZAkqaWZfc3BhcmUyOwkJLyogc3Bh
cmUgcG9pbnRlciAyICovDQotCXZvaWQJKmlmX3NwYXJlMzsJCS8qIHNwYXJl
IHBvaW50ZXIgMyAqLw0KKwlpbnQJKCppZl9zdGFydG1idWYpCQkvKiBlbnF1
ZXVlIGFuZCBzdGFydCBvdXRwdXQgKi8NCisJCShzdHJ1Y3QgaWZuZXQgKiwg
c3RydWN0IG1idWYgKik7DQogCWludAlpZl9kcnZfZmxhZ3M7CQkvKiBkcml2
ZXItbWFuYWdlZCBzdGF0dXMgZmxhZ3MgKi8NCiAJdV9pbnQJaWZfc3BhcmVf
ZmxhZ3MyOwkvKiBzcGFyZSBmbGFncyAyICovDQogCXN0cnVjdCAgaWZhbHRx
IGlmX3NuZDsJCS8qIG91dHB1dCBxdWV1ZSAoaW5jbHVkZXMgYWx0cSkgKi8N
CkBAIC0zNzAsMTIgKzM3MSwxNSBAQA0KIAkJbXR4X3VubG9jaygmR2lhbnQp
OwkJCQkJXA0KIH0gd2hpbGUgKDApDQogDQoraW50CWlmcV9oYW5kb2ZmKHN0
cnVjdCBpZm5ldCAqaWZwLCBzdHJ1Y3QgbWJ1ZiAqbSwgaW50IGFkanVzdCk7
DQogaW50CWlmX2hhbmRvZmYoc3RydWN0IGlmcXVldWUgKmlmcSwgc3RydWN0
IG1idWYgKm0sIHN0cnVjdCBpZm5ldCAqaWZwLA0KIAkgICAgaW50IGFkanVz
dCk7DQoraW50CWlmX3N0YXJ0bWJ1Zl9lbnF1ZXVlKHN0cnVjdCBpZnF1ZXVl
ICppZnEsIHN0cnVjdCBtYnVmICptKTsNCisNCisjZGVmaW5lCUlGX0hBTkRP
RkZfQURKKGlmcSwgbSwgaWZwLCBhZGopCVwNCisJaWZfaGFuZG9mZigoc3Ry
dWN0IGlmcXVldWUgKilpZnEsIG0sIGlmcCwgYWRqKQ0KICNkZWZpbmUJSUZf
SEFORE9GRihpZnEsIG0sIGlmcCkJCQlcDQogCWlmX2hhbmRvZmYoKHN0cnVj
dCBpZnF1ZXVlICopaWZxLCBtLCBpZnAsIDApDQotI2RlZmluZQlJRl9IQU5E
T0ZGX0FESihpZnEsIG0sIGlmcCwgYWRqKQlcDQotCWlmX2hhbmRvZmYoKHN0
cnVjdCBpZnF1ZXVlICopaWZxLCBtLCBpZnAsIGFkaikNCiANCiB2b2lkCWlm
X3N0YXJ0KHN0cnVjdCBpZm5ldCAqKTsNCiANCkBAIC00NTksMjUgKzQ2Myw4
IEBADQogI2RlZmluZQlJRlFfSU5DX0RST1BTKGlmcSkJCSgoaWZxKS0+aWZx
X2Ryb3BzKyspDQogI2RlZmluZQlJRlFfU0VUX01BWExFTihpZnEsIGxlbikJ
KChpZnEpLT5pZnFfbWF4bGVuID0gKGxlbikpDQogDQotLyoNCi0gKiBUaGUg
SUZGX0RSVl9PQUNUSVZFIHRlc3Qgc2hvdWxkIHJlYWxseSBvY2N1ciBpbiB0
aGUgZGV2aWNlIGRyaXZlciwgbm90IGluDQotICogdGhlIGhhbmRvZmYgbG9n
aWMsIGFzIHRoYXQgZmxhZyBpcyBsb2NrZWQgYnkgdGhlIGRldmljZSBkcml2
ZXIuDQotICovDQotI2RlZmluZQlJRlFfSEFORE9GRl9BREooaWZwLCBtLCBh
ZGosIGVycikJCQkJXA0KLWRvIHsJCQkJCQkJCQlcDQotCWludCBsZW47CQkJ
CQkJCVwNCi0Jc2hvcnQgbWZsYWdzOwkJCQkJCQlcDQotCQkJCQkJCQkJXA0K
LQlsZW4gPSAobSktPm1fcGt0aGRyLmxlbjsJCQkJCVwNCi0JbWZsYWdzID0g
KG0pLT5tX2ZsYWdzOwkJCQkJCVwNCi0JSUZRX0VOUVVFVUUoJihpZnApLT5p
Zl9zbmQsIG0sIGVycik7CQkJCVwNCi0JaWYgKChlcnIpID09IDApIHsJCQkJ
CQlcDQotCQkoaWZwKS0+aWZfb2J5dGVzICs9IGxlbiArIChhZGopOwkJCVwN
Ci0JCWlmIChtZmxhZ3MgJiBNX01DQVNUKQkJCQkJXA0KLQkJCShpZnApLT5p
Zl9vbWNhc3RzKys7CQkJCVwNCi0JCWlmICgoKGlmcCktPmlmX2Rydl9mbGFn
cyAmIElGRl9EUlZfT0FDVElWRSkgPT0gMCkJXA0KLQkJCWlmX3N0YXJ0KGlm
cCk7CQkJCQlcDQotCX0JCQkJCQkJCVwNCisjZGVmaW5lCUlGUV9IQU5ET0ZG
X0FESihpZnAsIG0sIGFkaiwgZXJyKSBkbyB7CQkJCVwNCisJZXJyID0gaWZx
X2hhbmRvZmYoaWZwLCBtLCBhZGopOwkJCQkJXA0KIH0gd2hpbGUgKDApDQog
DQogI2RlZmluZQlJRlFfSEFORE9GRihpZnAsIG0sIGVycikJCQkJCVwNCg==

--0-1695162780-1154268288=:16341--

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 14:59:19 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 39CA816A4DD;
	Sun, 30 Jul 2006 14:59:19 +0000 (UTC)
	(envelope-from max@love2party.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
	[212.227.126.186])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7A51143D46;
	Sun, 30 Jul 2006 14:59:18 +0000 (GMT)
	(envelope-from max@love2party.net)
Received: from [88.64.179.108] (helo=amd64.laiers.local)
	by mrelayeu.kundenserver.de (node=mrelayeu0) with ESMTP (Nemesis),
	id 0MKwh2-1G7ClB1MeV-0001wZ; Sun, 30 Jul 2006 16:59:17 +0200
From: Max Laier <max@love2party.net>
Organization: FreeBSD
To: freebsd-arch@freebsd.org
Date: Sun, 30 Jul 2006 16:59:10 +0200
User-Agent: KMail/1.9.3
References: <20060730141642.D16341@fledge.watson.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
X-Face: ,,8R(x[kmU]tKN@>gtH1yQE4aslGd<hB5S>u+2];
	R]*pL,U>^H?)gW@49@wdJ`H<=?utf-8?q?=25=7D*=5FBD=0A=09U=5For=3D=5CmOZf764=26nYj=3DJYbR1PW0ud?=>|!~,,CPC.1-D$FG@0h3#'5"k{V]a~.<=?utf-8?q?mZ=7D44=23Se=7Em=0A=09Fe=7E=5C=5DX5B=5D=5Fxj?=(ykz9QKMw_l0C2AQ]}Ym8)fU
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="nextPart2120229.eJNeJPqOEV";
	protocol="application/pgp-signature"; micalg=pgp-sha1
Content-Transfer-Encoding: 7bit
Message-Id: <200607301659.16323.max@love2party.net>
X-Provags-ID: kundenserver.de abuse@kundenserver.de
	login:61c499deaeeba3ba5be80f48ecc83056
Cc: Robert Watson <rwatson@freebsd.org>, freeebsd-net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 14:59:19 -0000

--nextPart2120229.eJNeJPqOEV
Content-Type: text/plain;
  charset="iso-8859-6"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

On Sunday 30 July 2006 16:04, Robert Watson wrote:
> One of the ideas that I, Scott Long, and a few others have been bouncing
> around for some time is a restructuring of the network interface packet
> transmission API to reduce the number of locking operations and allow
> network device drivers increased control of the queueing behavior.  Right
> now, it works something like the following:
>
> - When a network protocol wants to transmit, it calls the ifnet's link
> layer output routine via ifp->if_output() with the ifnet pointer, packet,
> destination address information, and route information.
>
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulat=
es
>    the packet as necessary, performs a link layer address translation (su=
ch
> as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(),
> which accepts the ifnet pointer and packet.
>
> - The ifnet layer enqueues the packet in the ifnet send queue
> (ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to
> determine if it needs to "start" output by the driver.  If the driver is
> already active, it doesn't, and otherwise, it does.
>
> - The driver dequeues the packet from ifp->if_snd, performs any driver
>    encapsulation and wrapping, and notifies the hardware.  In modern
> hardware, this consists of hooking the data of the packet up to the
> descriptor ring and notifying the hardware to pick it up via DMA.  In old=
er
> hardware, the driver would perform a series of I/O operations to send the
> entire packet directly to the card via a system bus.
>
> Why change this?  A few reasons:
>
> - The ifnet layer send queue is becoming decreasingly useful over time.=20
> Most modern hardware has a significant number of slots in its transmit
> descriptor ring, tuned for the performance of the hardware, etc, which is
> the effective transmit queue in practice.  The additional queue depth
> doesn't increase throughput substantially (if at all) but does consume
> memory.
>
> - On extremely fast hardware (with respect to CPU speed), the queue remai=
ns
>    essentially empty, so we pay the cost of enqueueing and dequeuing a
> packet from an empty queue.
>
> - The ifnet send queue is a separately locked object from the device
> driver, meaning that for a single enqueue/dequeue pair, we pay an extra
> four lock operations (two for insert, two for remove) per packet.
>
> - For synthetic link layer drivers, such as if_vlan, which have no need f=
or
>    queueing at all, the cost of queueing is eliminated.
>
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
>    driver, which helps eliminate a latent race condition involving use of
> the flag.
>
> The proposed change is simple: right now one or more enqueue operations
> occur, and then a call to ifp->if_start() is made to notify the driver =
that it
> may need to do something (if the OACTIVE flag isn't set).  In the new
> world order, the driver is directly passed the mbuf, and may then choose =
to queue
> it or otherwise handle it as it sees fit.  The immediate practical benefit
> is clear: if the queueing at the ifnet layer is unnecessary, it is entire=
ly
> avoided, skipping enqueue, dequeue, and four mutex operations.  This
> applies immediately for VLAN processing, but also means that for modern
> gigabit cards, the hardware queue (which will be used anyway) is the only
> queue necessary.
>
> There are a few downsides, of course:
>
> - For older hardware without its own queueing, the queue is still required
> -- not only that, but we've now introduced an unconditional function
> pointer invocation, which on older hardware has a more significant
> relative cost than it has on more recent CPUs.
>
> - If drivers still require or use a queue, they must now synchronize acce=
ss
> to the queue.  The obvious choices are to use the ifq lock (and restore t=
he
> above four lock operations), or to use the driver mutex (and risk higher
> contention).  Right now, if the driver is busy (driver mutex held) then an
> enqueue is still possible, but with this change and a single mutex
> protecting the send queue and driver, that is no longer possible.
>
> Attached is a patch that maintains the current if_start, but adds
> if_startmbuf.  If a device driver implements if_startmbuf and the global
> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in t=
he
> driver will be used.  Otherwise, if_start is used.  I have modified the
> if_em driver to implement if_startmbuf also.  If there is no packet backl=
og
> in the if_snd queue, it directly places the packet in the transmit
> descriptor ring. If there is a backlog, it uses the if_snd queue protected
> by driver mutex, rather than a separate ifq mutex.
>
> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte
> payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance
> improvement in the bulk serving of 1k files over HTTP.  These are only
> micro-benchmarks, and reflect a configuration in which the CPU is unable =
to
> keep up with the output rate of the 1gbps ethernet card in the device, so
> reductions in host CPU usage are immediately visible in increased output =
as
> the CPU is able to better keep up with the network hardware.  Other
> configurations are also of interest, especially ones in
> which the network device is unable to keep up with the CPU, resulting in
> more queueing.
>
> Conceptual review as well as benchmarking, etc, would be most welcome.

This begs the question: What about ALTQ?

If we maintain the fallback mechanism in _handoff, we can just add=20
ALTQ_IS_ENABLED() to the test.  Otherwise every driver's startmbuf function=
=20
would have to take care of ALTQ itself, which is not preferable.

I strongly agree with your comment about how messed up ifq_*/if_* in if_var=
.h=20
are - and I'm afraid that's partly my fault for bringing in ALTQ.

=2D-=20
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

--nextPart2120229.eJNeJPqOEV
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.4 (FreeBSD)

iD8DBQBEzMlEXyyEoT62BG0RAvsrAJ4v2m/yc+PHoUM+kPE0ZZUVknJbTgCfeJYN
uQVwRejml24OusLMlSIJV5A=
=OUxd
-----END PGP SIGNATURE-----

--nextPart2120229.eJNeJPqOEV--

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 15:25:48 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AF52A16A4DD
	for <freebsd-arch@freebsd.org>; Sun, 30 Jul 2006 15:25:48 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 590E843D45;
	Sun, 30 Jul 2006 15:25:48 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id F078146BE7;
	Sun, 30 Jul 2006 11:25:47 -0400 (EDT)
Date: Sun, 30 Jul 2006 16:25:47 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Max Laier <max@love2party.net>
In-Reply-To: <200607301659.16323.max@love2party.net>
Message-ID: <20060730160933.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<200607301659.16323.max@love2party.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freeebsd-net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 15:25:48 -0000

On Sun, 30 Jul 2006, Max Laier wrote:

>> Conceptual review as well as benchmarking, etc, would be most welcome.
>
> This begs the question: What about ALTQ?
>
> If we maintain the fallback mechanism in _handoff, we can just add 
> ALTQ_IS_ENABLED() to the test.  Otherwise every driver's startmbuf function 
> would have to take care of ALTQ itself, which is not preferable.

Maxime just asked me the same question, and I realized that I had, of course, 
forgotten to mention ALTQ.  A few observations/questions:

- An underlying assumption of ALTQ is that queueing occurs in software.  This
   turns out to be decreasingly true with modern network hardware, where the
   queueing occurs in a combination of software and hardware thanks to the
   descriptor ring.  Has anyone compared the effectiveness of ALTQ on gig-e
   systems with the effectiveness on 10/100mbps systems?  I would anticipate
   that it would have an effect on the latter, but little or no effect on the
   former?

- ALTQ actually also already does a bit of what I describe --
   IFQ_DRV_DEQUEUE() and friends actually manage two ifnet queues, if_snd (the
   public queue), and if_drv_head, a second queue protected by the device
   driver lock.  Because it bulk dequeues from one queue to the other, it
   already works to amortize if_snd locking, at the cost of maintaining two
   queues and substantially more complicated logic.

- One side effect of this change is that it complicates device drivers,
   especially those that rely on software queueing.  In the example changes
   to if_em in my patch, the start path goes from a simple loop around the
   send queue to considering three cases: needing to queue, handling the
   optimized (empty) queue case, and needing to dequeue.  Quite a bit of this
   logic will be common across device drivers and might be something we can
   abstract out.  There are two things going on in my proposed change:
   ownership of the queue moves (and with it locking and the interface), and
   code moves.  We might be able to transfer the locking (etc) while moving
   less code.  Thoughts on this would be welcome.
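
To make the three cases concrete, here is a minimal userspace sketch in C.
It is not the if_em code from the patch: toy_softc, toy_startmbuf(),
toy_txeof(), the integer packet ids, and the ring and backlog sizes are all
invented for illustration, and a real driver would hold its mutex around
each of these calls.

```c
#define TOY_RING_SLOTS	2	/* hypothetical transmit descriptor ring size */
#define TOY_SWQ_SLOTS	4	/* hypothetical software backlog limit */
#define TOY_LOG_SLOTS	16	/* size of the illustrative transmit log */

struct toy_softc {
	int ring_used;		/* descriptors currently owned by hardware */
	int swq[TOY_SWQ_SLOTS];	/* software backlog, circular FIFO of ids */
	int swq_head;
	int swq_len;
	int sent[TOY_LOG_SLOTS];	/* order in which packets reached the ring */
	int nsent;
};

/* Place a packet in the ring; fails if the ring (or the toy log) is full. */
static int
toy_ring_put(struct toy_softc *sc, int pkt)
{
	if (sc->ring_used == TOY_RING_SLOTS || sc->nsent == TOY_LOG_SLOTS)
		return (-1);
	sc->ring_used++;
	sc->sent[sc->nsent++] = pkt;
	return (0);
}

/* Append a packet to the software backlog; fails if the backlog is full. */
static int
toy_swq_put(struct toy_softc *sc, int pkt)
{
	if (sc->swq_len == TOY_SWQ_SLOTS)
		return (-1);
	sc->swq[(sc->swq_head + sc->swq_len) % TOY_SWQ_SLOTS] = pkt;
	sc->swq_len++;
	return (0);
}

/* "Needing to dequeue": move backlog into the ring while there is room. */
static void
toy_drain(struct toy_softc *sc)
{
	while (sc->swq_len > 0 &&
	    toy_ring_put(sc, sc->swq[sc->swq_head]) == 0) {
		sc->swq_head = (sc->swq_head + 1) % TOY_SWQ_SLOTS;
		sc->swq_len--;
	}
}

/* Direct handoff entry point; returns 0 if accepted, -1 if dropped. */
int
toy_startmbuf(struct toy_softc *sc, int pkt)
{
	/* Optimized case: no backlog and room in the ring, so skip the queue. */
	if (sc->swq_len == 0 && toy_ring_put(sc, pkt) == 0)
		return (0);
	/* "Needing to queue": append behind the backlog to keep ordering. */
	if (toy_swq_put(sc, pkt) != 0)
		return (-1);
	toy_drain(sc);
	return (0);
}

/* Transmit completion: hardware released ndone descriptors. */
void
toy_txeof(struct toy_softc *sc, int ndone)
{
	sc->ring_used -= (ndone < sc->ring_used) ? ndone : sc->ring_used;
	toy_drain(sc);
}
```

With a two-slot ring, handing off packets 1, 2 and 3 puts the first two
straight into the ring and backlogs the third; a later toy_txeof(&sc, 1)
moves packet 3 into the ring, so ordering is preserved.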

Notice that in the patch I do leave backwards compatible support for if_start, 
if only because rewriting all network device drivers is error prone and 
arduous.  I assume this will be a temporary condition, but it could be that we 
could leave it as a permanent one to support older devices that won't be 
updated (ISA ethernet cards, etc), where the existing queueing model works 
reasonably well.  In an earlier iteration of this patch, I had em_startmbuf() 
call a utility routine in if.c to handle enqueueing followed by calling 
em_start().  This meant no optimized fast path, but less code change.

> I strongly agree with your comment about how messed up ifq_*/if_* in
> if_var.h are - and I'm afraid that's partly my fault for bringing in ALTQ.

Heh.  I actually meant to remove that comment before posting the patch, so 
it's a bit more blunt than perhaps entirely wise. :-)  A couple of times in 
the past I've attempted to work on a rename to basically swap the two naming 
schemes and clean things up, but usually I've stalled when it occurs to me 
just how big the task is.  At this point, I think the right thing to do is 
make a decision about the semantics of the interface, and then walk the tree 
and clean up after we've simplified things.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 15:36:21 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C589116A4DD;
	Sun, 30 Jul 2006 15:36:21 +0000 (UTC)
	(envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from a50.ironport.com (a50.ironport.com [63.251.108.112])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6FE0143D6E;
	Sun, 30 Jul 2006 15:36:21 +0000 (GMT)
	(envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from unknown (HELO [192.168.2.4]) ([10.251.60.32])
	by a50.ironport.com with ESMTP; 30 Jul 2006 08:36:21 -0700
Message-ID: <44CCD1F4.1090902@elischer.org>
Date: Sun, 30 Jul 2006 08:36:20 -0700
From: Julian Elischer <julian@elischer.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
	rv:1.7.13) Gecko/20060414
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Robert Watson <rwatson@freebsd.org>
References: <20060730141642.D16341@fledge.watson.org>	<200607301659.16323.max@love2party.net>
	<20060730160933.D16341@fledge.watson.org>
In-Reply-To: <20060730160933.D16341@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Max Laier <max@love2party.net>, freeebsd-net@freebsd.org,
	freebsd-arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 15:36:21 -0000

Robert Watson wrote:

> On Sun, 30 Jul 2006, Max Laier wrote:
>
>
>> I strongly agree with your comment about how messed up ifq_*/if_* in
>> if_var.h are - and I'm afraid that's partly my fault for bringing in 
>> ALTQ.
>
If it becomes standard you'll have to think of another name.. it won't 
be "alt" any more..

:-)

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 18:36:18 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7902B16A4DA;
	Sun, 30 Jul 2006 18:36:18 +0000 (UTC) (envelope-from sam@errno.com)
Received: from ebb.errno.com (ebb.errno.com [69.12.149.25])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0AAF143D46;
	Sun, 30 Jul 2006 18:36:17 +0000 (GMT) (envelope-from sam@errno.com)
Received: from [10.0.0.199] ([10.0.0.199]) (authenticated bits=0)
	by ebb.errno.com (8.13.6/8.12.6) with ESMTP id k6UIaH7v011192
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 30 Jul 2006 11:36:17 -0700 (PDT) (envelope-from sam@errno.com)
Message-ID: <44CCFC2C.20402@errno.com>
Date: Sun, 30 Jul 2006 11:36:28 -0700
From: Sam Leffler <sam@errno.com>
Organization: Errno Consulting
User-Agent: Thunderbird 1.5.0.5 (Macintosh/20060719)
MIME-Version: 1.0
To: Robert Watson <rwatson@freebsd.org>
References: <20060730141642.D16341@fledge.watson.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 18:36:18 -0000

Robert Watson wrote:
> 
> One of the ideas that I, Scott Long, and a few others have been 
> bouncing around for some time is a restructuring of the network 
> interface packet transmission API to reduce the number of locking 
> operations and allow network device drivers increased control of the 
> queueing behavior.  Right now, it works something like the following:
> 
> - When a network protocol wants to transmit, it calls the ifnet's link 
> layer
>   output routine via ifp->if_output() with the ifnet pointer, packet,
>   destination address information, and route information.
> 
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
>   the packet as necessary, performs a link layer address translation 
> (such as
>   ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), 
> which
>   accepts the ifnet pointer and packet.
> 
> - The ifnet layer enqueues the packet in the ifnet send queue 
> (ifp->if_snd),
>   and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it 
> needs
>   to "start" output by the driver.  If the driver is already active, it
>   doesn't, and otherwise, it does.
> 
> - The driver dequeues the packet from ifp->if_snd, performs any driver
>   encapsulation and wrapping, and notifies the hardware.  In modern 
> hardware,
>   this consists of hooking the data of the packet up to the descriptor ring
>   and notifying the hardware to pick it up via DMA.  In older hardware, the
>   driver would perform a series of I/O operations to send the entire packet
>   directly to the card via a system bus.
> 
> Why change this?  A few reasons:
> 
> - The ifnet layer send queue is becoming decreasingly useful over time.  
> Most
>   modern hardware has a significant number of slots in its transmit 
> descriptor
>   ring, tuned for the performance of the hardware, etc, which is the 
> effective
>   transmit queue in practice.  The additional queue depth doesn't increase
>   throughput substantially (if at all) but does consume memory.
> 
> - On extremely fast hardware (with respect to CPU speed), the queue remains
>   essentially empty, so we pay the cost of enqueueing and dequeuing a 
> packet
>   from an empty queue.
> 
> - The ifnet send queue is a separately locked object from the device 
> driver,
>   meaning that for a single enqueue/dequeue pair, we pay an extra four lock
>   operations (two for insert, two for remove) per packet.
> 
> - For synthetic link layer drivers, such as if_vlan, which have no need for
>   queueing at all, the cost of queueing is eliminated.
> 
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
>   driver, which helps eliminate a latent race condition involving use of 
> the
>   flag.
> 
> The proposed change is simple: right now one or more enqueue operations 
> occur, then a call to ifp->if_start() is made to notify the driver that 
> it may need to do something (if the ACTIVE flag isn't set).  In the new 
> world order, the driver is directly passed the mbuf, and may then choose 
> to queue it or otherwise handle it as it sees fit.  The immediate 
> practical benefit is clear: if the queueing at the ifnet layer is 
> unnecessary, it is entirely avoided, skipping enqueue, dequeue, and four 
> mutex operations.  This applies immediately for VLAN processing, but 
> also means that for modern gigabit cards, the hardware queue (which will 
> be used anyway) is the only queue necessary.
> 
> There are a few downsides, of course:
> 
> - For older hardware without its own queueing, the queue is still 
> required --
>   not only that, but we've now introduced an unconditional function pointer
>   invocation, which on older hardware has a more significant relative cost
>   than it has on more recent CPUs.
> 
> - If drivers still require or use a queue, they must now synchronize 
> access to
>   the queue.  The obvious choices are to use the ifq lock (and restore the
>   above four lock operations), or to use the driver mutex (and risk higher
>   contention).  Right now, if the driver is busy (driver mutex held) 
> then an
>   enqueue is still possible, but with this change and a single mutex
>   protecting the send queue and driver, that is no longer possible.
> 

You're headed in the direction of Linux, where the handoff goes through a 
packet scheduling function before it hits the driver.  This is 
equivalent to altq which, as Max pointed out, you didn't mention in this 
note.  But it would be very good to move altq out of the compile-time 
macros with this.

I have a fair amount of experience with the Linux model and it works ok.  
The main complication I've seen is that things get more involved when a 
driver needs to process multiple queues of packets.  This is seen in 
802.11 drivers, where there are two queues, one for data frames and one 
for management frames.  With the current scheme you have two separate 
queues and the start method handles prioritization by polling the mgt 
queue before the data queue.  If instead the packet is passed to the 
start method, then it needs to be tagged in some way so that it's 
prioritized properly.  Otherwise you end up with multiple start methods, 
one per type of packet.  I suspect this will be ok, but the end result 
will be that we'll need to add a priority field to mbufs (unless we pass 
it as an arg to the start method).
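
As a rough illustration of the current two-queue arrangement, the start 
method picks the next frame roughly as in the userspace sketch below.  
The names toy_fifo, toy_wlan_softc and toy_wlan_next() and the queue depth 
are made up for the example and are not taken from any 802.11 driver.

```c
#define TOY_QLEN	8	/* hypothetical queue depth */

struct toy_fifo {
	int slot[TOY_QLEN];	/* frame ids, stored as a circular buffer */
	int head;
	int len;
};

struct toy_wlan_softc {
	struct toy_fifo mgtq;	/* management frames */
	struct toy_fifo dataq;	/* data frames */
};

/* Append a frame; returns 0 on success, -1 if the queue is full. */
int
toy_fifo_put(struct toy_fifo *q, int frame)
{
	if (q->len == TOY_QLEN)
		return (-1);
	q->slot[(q->head + q->len) % TOY_QLEN] = frame;
	q->len++;
	return (0);
}

/* Remove the oldest frame; returns 1 and fills *frame, or 0 if empty. */
int
toy_fifo_get(struct toy_fifo *q, int *frame)
{
	if (q->len == 0)
		return (0);
	*frame = q->slot[q->head];
	q->head = (q->head + 1) % TOY_QLEN;
	q->len--;
	return (1);
}

/*
 * What a start method does today: it is not handed a packet, so it polls
 * the management queue first and only then the data queue.
 */
int
toy_wlan_next(struct toy_wlan_softc *sc, int *frame)
{
	if (toy_fifo_get(&sc->mgtq, frame))
		return (1);
	return (toy_fifo_get(&sc->dataq, frame));
}
```

A management frame queued after two data frames is still returned first, 
which is exactly the ordering information that is lost if the driver is 
handed a bare packet with no indication of its class.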

All this is certainly doable but I think just replacing one mechanism 
with the other (as you specified) is insufficient.

 > Attached is a patch that maintains the current if_start, but adds
 > if_startmbuf.  If a device driver implements if_startmbuf and the global
 > sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in
 > the driver will be used.  Otherwise, if_start is used.  I have modified
 > the if_em driver to implement if_startmbuf also.  If there is no packet
 > backlog in the if_snd queue, it directly places the packet in the
 > transmit descriptor ring. If there is a backlog, it uses the if_snd
 > queue protected by driver mutex, rather than a separate ifq mutex.
 >
 > In some basic local micro-benchmarks, I saw a 5% improvement in UDP
 > 0-byte payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7%
 > performance improvement in the bulk serving of 1k files over HTTP.
 > These are only micro-benchmarks, and reflect a configuration in which
 > the CPU is unable to keep up with the output rate of the 1gbps ethernet
 > card in the device, so reductions in host CPU usage are immediately
 > visible in increased output as the CPU is able to better keep up with
 > the network hardware.  Other configurations are also of interest,
 > especially ones in which the network device is unable to
 > keep up with the CPU, resulting in more queueing.
 >
 > Conceptual review as well as benchmarking, etc, would be most welcome.

Why is the startmbuf knob global and not per-interface?  Seems like you 
want to convert drivers one at a time?

FWIW the original model was driven by the expectation that you could 
raise the spl so the tx path was entirely synchronized from above.  With 
the SMPng work we're synchronizing transfer through each control layer.  
If the driver softc lock (or similar) were exposed to upper layers we 
could possibly return to the "lock the tx path" model we had before and 
eliminate all the locking your changes target.  But that would be a big 
layering violation and would add significant contention in the SMP case.

I think the key observation is that most network hardware today takes 
packets directly from private queues so the fast path needs to push 
things down to those queues w/ minimal overhead.  This includes devices 
that implement QoS in h/w w/ multiple queues.

	Sam

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 19:23:14 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3D6F516A4DA;
	Sun, 30 Jul 2006 19:23:14 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id D333043D45;
	Sun, 30 Jul 2006 19:23:13 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 783F646CCD;
	Sun, 30 Jul 2006 15:23:13 -0400 (EDT)
Date: Sun, 30 Jul 2006 20:23:13 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Sam Leffler <sam@errno.com>
In-Reply-To: <44CCFC2C.20402@errno.com>
Message-ID: <20060730200929.J16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 19:23:14 -0000

On Sun, 30 Jul 2006, Sam Leffler wrote:

> I have a fair amount of experience with the linux model and it works ok. 
> The main complication I've seen is when a driver needs to process multiple 
> queues of packets things get more involved.  This is seen in 802.11 drivers 
> where there are two q's, one for data frames and one for management frames. 
> With the current scheme you have two separate queues and the start method 
> handles prioritization by polling the mgt q before the data q.  If instead 
> the packet is passed to the start method then it needs to be tagged in some 
> way so that it's prioritized properly.  Otherwise you end up with multiple 
> start methods; one per type of packet.  I suspect this will be ok but the 
> end result will be that we'll need to add a priority field to mbufs (unless 
> we pass it as an arg to the start method).
>
> All this is certainly doable but I think just replacing one mechanism with 
> the other (as you specified) is insufficient.

Hmm.  This is something that I had overlooked.  I was loosely aware that the 
if_sl code made use of multiple queues, but was under the impression that the 
classification to queues occurred purely in the SLIP code.  Indeed, it does, 
but structurally, SLIP is split over the link layer (if_output) and driver 
layer (if_start), which I had forgotten.  I take it from your comments that 
802.11 also does this, which I was not aware of.

I'm a little uncomfortable with our current m_tag model, as it requires a 
significant number of additional allocations and frees for each packet, as 
well as walking linked lists.  It's fine for occasional discretionary use 
(e.g., MAC labels), but I worry about cases where it is used with every 
packet, and we start seeing moderately non-zero numbers of tags on every 
packet.  I think I would be more comfortable with an explicit queue 
identifier argument to if_start, where the link layer and driver layer agree 
on how to identify queues.

As a straw man, how would the following strike you:

 	int	if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);

where for most link layers the value would be zero, but for some link 
layer/driver combinations it would identify a specific queue to which the 
link layer believes the mbuf should be assigned, if implemented?
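
A userspace sketch of how a driver might honor such an argument follows.  
Everything in it is hypothetical: toy_mbuf and toy_ifnet stand in for the 
real structures, the TOY_IFQ_* ids are invented, and treating an 
unrecognized id as the default queue is just one reading of "if 
implemented".

```c
#include <stddef.h>

#define TOY_IFQ_DEFAULT	0	/* hypothetical ids agreed on by both layers */
#define TOY_IFQ_MGT	1
#define TOY_NQUEUES	2

struct toy_mbuf {
	int id;
	struct toy_mbuf *next;
};

struct toy_ifnet {
	struct toy_mbuf *q_head[TOY_NQUEUES];
	struct toy_mbuf *q_tail[TOY_NQUEUES];
	int q_len[TOY_NQUEUES];
};

/* Append m to the queue named by ifqid; always returns 0 in this sketch. */
int
toy_if_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m, int ifqid)
{
	/* A queue this driver does not implement falls back to the default. */
	if (ifqid < 0 || ifqid >= TOY_NQUEUES)
		ifqid = TOY_IFQ_DEFAULT;

	m->next = NULL;
	if (ifp->q_tail[ifqid] == NULL)
		ifp->q_head[ifqid] = m;
	else
		ifp->q_tail[ifqid]->next = m;
	ifp->q_tail[ifqid] = m;
	ifp->q_len[ifqid]++;
	return (0);
}
```

Most callers would pass TOY_IFQ_DEFAULT; only a link layer that knows about 
the driver's extra queues would pass anything else, and no per-packet 
allocation is needed to carry the classification.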

>> Attached is a patch that maintains the current if_start, but adds 
>> if_startmbuf.  If a device driver implements if_startmbuf and the global 
>> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the 
>> driver will be used.  Otherwise, if_start is used.  I have modified the 
>> if_em driver to implement if_startmbuf also.  If there is no packet backlog 
>> in the if_snd queue, it directly places the packet in the transmit 
>> descriptor ring. If there is a backlog, it uses the if_snd queue protected 
>> by driver mutex, rather than a separate ifq mutex.
>>
>> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte 
>> payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance 
>> improvement in the bulk serving of 1k files over HTTP. These are only 
>> micro-benchmarks, and reflect a configuration in which the CPU is unable to 
>> keep up with the output rate of the 1gbps ethernet card in the device, so 
>> reductions in host CPU usage are immediately visible in increased output as 
>> the CPU is able to better keep up with the network hardware.  Other 
>> configurations are also of interest, especially ones in 
>> which the network device is unable to keep up with the CPU, resulting in 
>> more queueing.
>>
>> Conceptual review as well as benchmarking, etc, would be most welcome.
>
> Why is the startmbuf knob global and not per-interface?  Seems like you want 
> to convert drivers one at a time?

I may have under-described what I have implemented.  The decision is currently 
made based on two factors: a global frob, and whether the interface defines a 
non-NULL if_startmbuf.  The global frob is intended to make it easy to 
benchmark the difference.  I should modify the patch so that the global frob 
doesn't force the driver back to if_start in the event that if_startmbuf is 
defined and if_start isn't.  The global frob is intended to be removed in the 
long run, and I intend for us to continue to support both the old and new 
start methods for the foreseeable future, since I don't intend to update every 
device driver we have to the new method, at least not personally :-).
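
A minimal sketch of that selection logic, including the fix described 
above, might look as follows.  toy_ifnet, toy_choose_path() and the stub 
methods are illustrative stand-ins rather than the code in the patch, the 
two-argument if_startmbuf signature is assumed for the example, and 
startmbuf_enabled stands for the value of the net.startmbuf_enabled sysctl.

```c
#include <stddef.h>

struct toy_ifnet {
	void (*if_start)(struct toy_ifnet *);		/* old method */
	int (*if_startmbuf)(struct toy_ifnet *, void *);	/* new method */
};

enum toy_path { TOY_PATH_NONE, TOY_PATH_START, TOY_PATH_STARTMBUF };

/* Do-nothing driver methods, present only so the example can be exercised. */
void
toy_stub_start(struct toy_ifnet *ifp)
{
	(void)ifp;
}

int
toy_stub_startmbuf(struct toy_ifnet *ifp, void *m)
{
	(void)ifp;
	(void)m;
	return (0);
}

/*
 * Use if_startmbuf when the driver provides it and either the global knob
 * is on or there is no if_start to fall back to; otherwise use if_start.
 */
enum toy_path
toy_choose_path(const struct toy_ifnet *ifp, int startmbuf_enabled)
{
	if (ifp->if_startmbuf != NULL &&
	    (startmbuf_enabled || ifp->if_start == NULL))
		return (TOY_PATH_STARTMBUF);
	if (ifp->if_start != NULL)
		return (TOY_PATH_START);
	return (TOY_PATH_NONE);
}
```

A driver that implements both methods follows the knob, an unconverted 
driver always gets if_start, and a driver that implements only 
if_startmbuf gets it even with the knob off.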

> FWIW the original model was driven by the expectation that you could raise 
> the spl so the tx path was entirely synchronized from above.  With the SMPng 
> work we're synchronizing transfer through each control layer.  If the driver 
> softc lock (or similar) were exposed to upper layers we could possibly 
> return the "lock the tx path" model we had before and eliminate all the 
> locking your changes target.  But that would be a big layering violation and 
> would add significant contention in the SMP case.

In some ways, what I propose comes to much the same thing: the change I 
propose basically delegates the queueing and synchronization decisions to the 
device driver, which might choose either to use the lock already in the ifq, 
to use its own lock, or to use some other synchronization strategy.  In the 
case of if_em, I've implemented bypass of software queueing entirely in the 
common case, but in the event that the hardware ring backs up, then we still 
fall back to the if_snd queue, only we lock it using the device driver's 
transmit path mutex.  Delegating the synchronization down the stack comes with 
risks, as device driver writers will inevitably take liberties; on the other 
hand, devices are quite diverse, and those liberties have advantages.

> I think the key observation is that most network hardware today takes 
> packets directly from private queues so the fast path needs to push things 
> down to those queues w/ minimal overhead.  This includes devices that 
> implement QoS in h/w w/ multiple queues.

Yes -- however, you're right that the link layer needs to be able to pass more 
information down.  I'd like it to be able to do so without an m_tag 
allocation, though, which suggests (as you point out) an explicit argument to 
if_startmbuf.

Thanks,

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Sun Jul 30 20:40:11 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9008316A4DF;
	Sun, 30 Jul 2006 20:40:11 +0000 (UTC)
	(envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from a50.ironport.com (a50.ironport.com [63.251.108.112])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B4A9C43D55;
	Sun, 30 Jul 2006 20:40:10 +0000 (GMT)
	(envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from unknown (HELO [192.168.2.4]) ([10.251.60.32])
	by a50.ironport.com with ESMTP; 30 Jul 2006 13:40:09 -0700
Message-ID: <44CD1928.6000004@elischer.org>
Date: Sun, 30 Jul 2006 13:40:08 -0700
From: Julian Elischer <julian@elischer.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
	rv:1.7.13) Gecko/20060414
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Robert Watson <rwatson@freebsd.org>
References: <20060730141642.D16341@fledge.watson.org>	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org>
In-Reply-To: <20060730200929.J16341@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: net@freebsd.org, arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 30 Jul 2006 20:40:11 -0000

Robert Watson wrote:

> On Sun, 30 Jul 2006, Sam Leffler wrote:
>
>> I have a fair amount of experience with the linux model and it works 
>> ok. The main complication I've seen is when a driver needs to process 
>> multiple queues of packets things get more involved.  This is seen in 
>> 802.11 drivers where there are two q's, one for data frames and one 
>> for management frames. With the current scheme you have two separate 
>> queues and the start method handles prioritization by polling the mgt 
>> q before the data q.  If instead the packet is passed to the start 
>> method then it needs to be tagged in some way so that it's prioritized 
>> properly.  Otherwise you end up with multiple start methods; one per 
>> type of packet.  I suspect this will be ok but the end result will be 
>> that we'll need to add a priority field to mbufs (unless we pass it 
>> as an arg to the start method).
>>
We have a priority tag in netgraph that we use to keep management frames 
on time in the frame relay code.  It seems to work ok.

>> All this is certainly doable but I think just replacing one mechanism 
>> with the other (as you specified) is insufficient.
>

Linux did a big analysis of what was needed at the time they did most of 
their networking, and their buffer scheme (last I looked) had all sorts of 
fields for this and that.  I wonder how it has held up over time?

>
> Hmm.  This is something that I had overlooked.  I was loosely aware 
> that the if_sl code made use of multiple queues, but was under the 
> impression that the classification to queues occurred purely in the 
> SLIP code.  Indeed, it does, but structurally, SLIP is split over the 
> link layer (if_output) and driver layer (if_start), which I had 
> forgotten.  I take it from your comments that 802.11 also does this, 
> which I was not aware of.
>
> I'm a little uncomfortable with our current m_tag model, as it 
> requires significant numbers of additional allocations and frees for 
> each packet, as well as walking linked lists.  It's fine for occasional 
> discretionary use (i.e., MAC labels), but I worry about cases where it 
> is used with every packet, and we start seeing moderately non-zero 
> numbers of tags on every packet.  I think I would be more comfortable 
> with an explicit queue identifier argument to if_start, where the link 
> layer and driver layer agree on how to identify queues.

It would certainly be possible to (for example) have 2 tags preallocated 
on each mbuf or something, but it is hard to know in advance what will be 
needed.
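
Purely as a sketch of what "2 tags preallocated" could mean (toy_mbuf, 
toy_tag, and both functions are invented here and are not the m_tag API):

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_INLINE_TAGS	2	/* slots reserved in every packet header */

struct toy_tag {
	uint16_t type;		/* 0 means the slot is free */
	uint32_t value;
};

struct toy_mbuf {
	struct toy_tag tags[TOY_INLINE_TAGS];
};

/*
 * Set a tag without allocating.  Returns 0 on success and -1 if type is 0
 * or both inline slots are taken by other types (the point at which a real
 * design would have to fall back to an allocated tag).
 */
int
toy_tag_set(struct toy_mbuf *m, uint16_t type, uint32_t value)
{
	struct toy_tag *free_slot = NULL;
	int i;

	if (type == 0)
		return (-1);
	for (i = 0; i < TOY_INLINE_TAGS; i++) {
		if (m->tags[i].type == type) {
			m->tags[i].value = value;	/* update in place */
			return (0);
		}
		if (m->tags[i].type == 0 && free_slot == NULL)
			free_slot = &m->tags[i];
	}
	if (free_slot == NULL)
		return (-1);
	free_slot->type = type;
	free_slot->value = value;
	return (0);
}

/* Look up a tag by type; returns NULL if it is not present. */
const struct toy_tag *
toy_tag_find(const struct toy_mbuf *m, uint16_t type)
{
	int i;

	if (type == 0)
		return (NULL);
	for (i = 0; i < TOY_INLINE_TAGS; i++)
		if (m->tags[i].type == type)
			return (&m->tags[i]);
	return (NULL);
}
```

The third distinct tag type on a packet is where this runs out, which is 
the "hard to know in advance" problem in miniature.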

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 00:59:59 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 470AB16A4E2;
	Mon, 31 Jul 2006 00:59:59 +0000 (UTC)
	(envelope-from gnn@neville-neil.com)
Received: from mrout1-b.corp.dcn.yahoo.com (mrout1-b.corp.dcn.yahoo.com
	[216.109.112.27])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C898143D46;
	Mon, 31 Jul 2006 00:59:58 +0000 (GMT)
	(envelope-from gnn@neville-neil.com)
Received: from minion.local.neville-neil.com (proxy8.corp.yahoo.com
	[216.145.48.13])
	by mrout1-b.corp.dcn.yahoo.com (8.13.6/8.13.6/y.out) with ESMTP id
	k6V0xnto001240; Sun, 30 Jul 2006 17:59:50 -0700 (PDT)
Date: Mon, 31 Jul 2006 09:59:47 +0900
Message-ID: <m2r702lhbw.wl%gnn@neville-neil.com>
From: gnn@FreeBSD.org
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8
	(=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.0.50
	(i386-apple-darwin8.6.1) MULE/5.0 (SAKAKI)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 00:59:59 -0000

At Sun, 30 Jul 2006 15:04:48 +0100 (BST),
rwatson wrote:
> Conceptual review as well as benchmarking, etc, would be most welcome.
> 

I remember talking about this at BSDCan and certainly for high end
hardware it seems that it's the right way to go. 

Later,
George

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 01:02:41 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7DB2316A4DA;
	Mon, 31 Jul 2006 01:02:41 +0000 (UTC) (envelope-from sam@errno.com)
Received: from ebb.errno.com (ebb.errno.com [69.12.149.25])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8115843D49;
	Mon, 31 Jul 2006 01:02:40 +0000 (GMT) (envelope-from sam@errno.com)
Received: from [10.0.0.199] ([10.0.0.199]) (authenticated bits=0)
	by ebb.errno.com (8.13.6/8.12.6) with ESMTP id k6V12dCh012518
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 30 Jul 2006 18:02:39 -0700 (PDT) (envelope-from sam@errno.com)
Message-ID: <44CD56BB.6080405@errno.com>
Date: Sun, 30 Jul 2006 18:02:51 -0700
From: Sam Leffler <sam@errno.com>
Organization: Errno Consulting
User-Agent: Thunderbird 1.5.0.5 (Macintosh/20060719)
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org>
In-Reply-To: <20060730200929.J16341@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 01:02:41 -0000

Robert Watson wrote:
> On Sun, 30 Jul 2006, Sam Leffler wrote:
> 
>> I have a fair amount of experience with the linux model and it works 
>> ok. The main complication I've seen is when a driver needs to process 
>> multiple queues of packets things get more involved.  This is seen in 
>> 802.11 drivers where there are two q's, one for data frames and one 
>> for management frames. With the current scheme you have two separate 
>> queues and the start method handles prioritization by polling the mgt 
>> q before the data q.  If instead the packet is passed to the start 
>> method then it needs to be tagged in some way so that it's prioritized 
>> properly.  Otherwise you end up with multiple start methods; one per 
>> type of packet.  I suspect this will be ok but the end result will be 
>> that we'll need to add a priority field to mbufs (unless we pass it as 
>> an arg to the start method).
>>
>> All this is certainly doable but I think just replacing one mechanism 
>> with the other (as you specified) is insufficient.
> 
> Hmm.  This is something that I had overlooked.  I was loosely aware that 
> the if_sl code made use of multiple queues, but was under the impression 
> that the classification to queues occurred purely in the SLIP code.  
> Indeed, it does, but structurally, SLIP is split over the link layer 
> (if_output) and driver layer (if_start), which I had forgotten.  I take 
> it from your comments that 802.11 also does this, which I was not aware of.

There are several issues here but the basic one is, I believe, that we 
need to provide a per-packet notion of priority or TOS handling.  The 
distinction between mgt frame and data in 802.11 drivers is a kludge; 
the right thing is to just use priority to get the desired effect.  But 
separately, 802.11 is aware of priority for WME, so independent of mgt 
frame priority we still need a way to pass down an AC (access category).  
For 802.11 I was able to do this by encoding the value in the mbuf 
flags.  If there were a field in the mbuf header this kludge could be 
removed.  For other devices we still want a way to pass around the 
DiffServ bits or similar so things like vlan priority can be set w/o 
resorting to tagging each frame.  Ideally prioritization work like 
what's done inside slip should be pulled out.

Note that just slapping a field in the mbuf is a start, but we also need 
to think about how to handle it up and down the stack so layers can honor 
an existing priority and/or fill in a priority for packets that aren't 
already classified.
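
As a minimal sketch of the "honor or fill in" rule, assuming a made-up 
header struct (pkt_hdr, prio, prio_valid and both function names are 
hypothetical, not the real mbuf pkthdr): a layer keeps a priority that is 
already set, and otherwise derives one from the DiffServ bits of the IP 
TOS byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet header; NOT the real struct pkthdr. */
struct pkt_hdr {
	uint8_t prio;		/* 0 (lowest) .. 7 (highest) */
	uint8_t prio_valid;	/* nonzero once some layer has classified it */
};

/*
 * Map a 6-bit DSCP value to a 3-bit priority by keeping its top three
 * bits (the class selector bits).  This mapping is an assumption made
 * for illustration; a real policy could differ.
 */
static uint8_t
dscp_to_prio(uint8_t dscp)
{
	return ((dscp & 0x3f) >> 3);
}

/*
 * Honor an existing classification; otherwise fill one in from the IP
 * TOS byte, whose upper six bits carry the DSCP.
 */
static uint8_t
pkt_classify(struct pkt_hdr *ph, uint8_t ip_tos)
{
	if (!ph->prio_valid) {
		ph->prio = dscp_to_prio(ip_tos >> 2);
		ph->prio_valid = 1;
	}
	return (ph->prio);
}
```

With this rule a packet that an upper layer already marked keeps its 
marking all the way down, and only unclassified packets pay for the 
lookup.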

> 
> I'm a little uncomfortable with our current m_tag model, as it requires 
> significant numbers of additional allocations and frees for each packet, 
> as well as walking linked lists.  It's fine for occasional discretionary 
> use (e.g., MAC labels), but I worry about cases where it is used with 
> every packet, and we start seeing moderately non-zero numbers of tags on 
> every packet.  I think I would be more comfortable with an explicit 
> queue identifier argument to if_start, where the link layer and driver 
> layer agree on how to identify queues.
> 
> As a straw man, how would the following strike you:
> 
>     int    if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);
> 
> where for most link layers, the value would be zero, but for some link 
> layer/driver combinations, it would identify a specific queue which the 
> link layer believes the mbuf should be assigned, if implemented?

mbuf tags are not a solution; too expensive.  I think we need something 
in the mbuf header.

> 
>>> Attached is a patch that maintains the current if_start, but adds 
>>> if_startmbuf.  If a device driver implements if_startmbuf and the 
>>> global sysctl net.startmbuf_enabled is set to 1, then the 
>>> if_startmbuf path in the driver will be used.  Otherwise, if_start is 
>>> used.  I have modified the if_em driver to implement if_startmbuf 
>>> also.  If there is no packet backlog in the if_snd queue, it directly 
>>> places the packet in the transmit descriptor ring. If there is a 
>>> backlog, it uses the if_snd queue protected by driver mutex, rather 
>>> than a separate ifq mutex.
>>>
>>> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 
>>> 0-byte payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% 
>>> performance improvement in the bulk serving of 1k files over HTTP. 
>>> These are only micro-benchmarks, and reflect a configuration in which 
>>> the CPU is unable to keep up with the output rate of the 1gbps 
>>> ethernet card in the device, so reductions in host CPU usage are 
>>> immediately visible in increased output as the CPU is able to better 
>>> keep up with the network hardware.  Other configurations are also of 
>>> interest, especially ones in which the network device 
>>> is unable to keep up with the CPU, resulting in more queueing.
>>>
>>> Conceptual review as well as benchmarking, etc, would be most welcome.
>>
>> Why is the startmbuf knob global and not per-interface?  Seems like 
>> you want to convert drivers one at a time?
> 
> I may have under-described what I have implemented.  The decision is 
> currently made based on two factors: a global frob, and per-interface 
> definition of if_startmbuf being non-zero.  The global frob is intended 
> to make it easy to benchmark the difference.  I should modify the patch 
> so that the global frob doesn't override the driver back to if_start in 
> the event that if_startmbuf is defined and if_start isn't.  The global 
> frob is intended to be removed in the long run, and I intend for us to 
> continue to support both the old and new start methods for the 
> foreseeable future, since I don't intend to update every device driver we 
> have to the new method, at least not personally :-).
> 
>> FWIW the original model was driven by the expectation that you could 
>> raise the spl so the tx path was entirely synchronized from above.  
>> With the SMPng work we're synchronizing transfer through each control 
>> layer.  If the driver softc lock (or similar) were exposed to upper 
>> layers we could possibly return the "lock the tx path" model we had 
>> before and eliminate all the locking your changes target.  But that 
>> would be a big layering violation and would add significant contention 
>> in the SMP case.
> 
> In some ways, what I propose comes to much the same thing: the change I 
> propose basically delegates the queueing and synchronization decisions 
> to the device driver, which might choose either to use the lock already 
> in the ifq, to use its own lock, or to use some other synchronization 
> strategy.  In the case of if_em, I've implemented bypass of software 
> queueing entirely in the common case, but in the event that the hardware 
> ring backs up, then we still fall back to the if_snd queue, only we lock 
> it using the device driver's transmit path mutex.  Delegating the 
> synchronization down the stack comes with risks, as device driver 
> writers will inevitably take liberties: on the other hand, it appears 
> that devices are quite diverse, and those liberties have advantages.
> 
>> I think the key observation is that most network hardware today takes 
>> packets directly from private queues so the fast path needs to push 
>> things down to those queues w/ minimal overhead.  This includes 
>> devices that implement QoS in h/w w/ multiple queues.
> 
> Yes -- however, you're right that the link layer needs to be able to 
> pass more information down.  I'd like it to be able to do so without an 
> m_tag allocation, though, which suggests (as you point out) an explicit 
> argument to if_startmbuf.

Or an addition to the mbuf header.

	Sam

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 08:24:46 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7A69916A4DA;
	Mon, 31 Jul 2006 08:24:46 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe02.swip.net [212.247.154.33])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 54F5643D45;
	Mon, 31 Jul 2006 08:24:44 +0000 (GMT)
	(envelope-from hselasky@c2i.net)
X-T2-Posting-ID: gvlK0tOCzrqh9CPROFOFPw==
X-Cloudmark-Score: 0.000000 []
Received: from [193.217.133.87] (HELO [10.0.0.249])
	by mailfe02.swip.net (CommuniGate Pro SMTP 5.0.8)
	with ESMTP id 247590042; Mon, 31 Jul 2006 10:24:40 +0200
From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-arch@freebsd.org
Date: Mon, 31 Jul 2006 10:24:48 +0200
User-Agent: KMail/1.7
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org>
In-Reply-To: <20060730200929.J16341@fledge.watson.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200607311024.50537.hselasky@c2i.net>
Cc: net@freebsd.org, Robert Watson <rwatson@freebsd.org>, arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 08:24:46 -0000

On Sunday 30 July 2006 21:23, Robert Watson wrote:
> On Sun, 30 Jul 2006, Sam Leffler wrote:

Just a comment while the iron is hot:

Maybe you can make the network model safe against detach. Currently I see that 
the processor can be stuck in routines like "if_start" after "if_free()" has 
been called. This can be critical for USB ethernet devices.

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 09:53:51 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A4AAA16A4DE;
	Mon, 31 Jul 2006 09:53:51 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4E48B43D45;
	Mon, 31 Jul 2006 09:53:51 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 2B16A46B0F;
	Mon, 31 Jul 2006 05:53:39 -0400 (EDT)
Date: Mon, 31 Jul 2006 10:53:39 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Hans Petter Selasky <hselasky@c2i.net>
In-Reply-To: <200607311024.50537.hselasky@c2i.net>
Message-ID: <20060731105045.X16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org>
	<200607311024.50537.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 09:53:51 -0000


On Mon, 31 Jul 2006, Hans Petter Selasky wrote:

> On Sunday 30 July 2006 21:23, Robert Watson wrote:
>> On Sun, 30 Jul 2006, Sam Leffler wrote:
>
> Just a comment while the iron is hot:
>
> Maybe you can make the network model safe against detach. Currently I see 
> that the processor can be stuck in routines like "if_start" after that 
> "if_free()" has been called. This can be critical for USB ethernet devices.

This is something to fix in the short or long term, but I think we should not 
try to fix everything at once as there is an awful lot to fix.  There are 
really two stages to fixing the ifnet life cycle, which Brooks has been 
working on for some time.  The first is to make it generally make sense -- he 
moved ifnet out of the softc and has been working to normalize things generally 
(adding dead ifnets, etc.).  The second is to add new types of reference and 
tear-down magic.  What Solaris does here, FYI, is basically add a lock around 
entering the device driver via their mac layer in order to prevent it from 
"disappearing" while in use via the ifnet interface.  I'm not sure if we want 
the same solution there or not, but it's worth thinking carefully about.  We 
had (and have) similar problems in a number of other places, where races 
between consumers of an API and a detach of the provider cause problems.
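
A minimal sketch of the "reference and tear-down" idea, assuming a 
reference-count gate built on C11 atomics.  This is neither the Solaris 
mac-layer lock described above nor anything in the FreeBSD tree, and all 
names (drv_gate, gate_enter, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate guarding entry into a driver; names are illustrative. */
struct drv_gate {
	atomic_int	refs;	/* callers currently inside the driver */
	atomic_bool	dying;	/* set once detach has begun */
};

static void
gate_init(struct drv_gate *g)
{
	atomic_init(&g->refs, 0);
	atomic_init(&g->dying, false);
}

/* Try to enter the driver; returns false if detach has already started. */
static bool
gate_enter(struct drv_gate *g)
{
	/* Take the reference first, then check, so detach cannot miss us. */
	atomic_fetch_add(&g->refs, 1);
	if (atomic_load(&g->dying)) {
		/* Lost the race with detach: back out, do not touch the driver. */
		atomic_fetch_sub(&g->refs, 1);
		return (false);
	}
	return (true);
}

/* Leave the driver, dropping the reference taken by gate_enter(). */
static void
gate_exit(struct drv_gate *g)
{
	int old = atomic_fetch_sub(&g->refs, 1);

	assert(old > 0);
	(void)old;
}

/* Begin detach: no new caller is admitted once this has returned. */
static void
gate_begin_detach(struct drv_gate *g)
{
	atomic_store(&g->dying, true);
}

/* True once every caller admitted before detach has left the driver. */
static bool
gate_quiesced(struct drv_gate *g)
{
	return (atomic_load(&g->refs) == 0);
}
```

In this sketch a detach routine would call gate_begin_detach(), wait 
until gate_quiesced() reports true, and only then free the interface, 
while every path into the driver is bracketed by gate_enter() and 
gate_exit().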

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 10:00:29 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 02BBC16A4ED;
	Mon, 31 Jul 2006 10:00:29 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 921A843D5E;
	Mon, 31 Jul 2006 10:00:27 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 2083D46B0D;
	Mon, 31 Jul 2006 06:00:27 -0400 (EDT)
Date: Mon, 31 Jul 2006 11:00:26 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Sam Leffler <sam@errno.com>
In-Reply-To: <44CD56BB.6080405@errno.com>
Message-ID: <20060731105438.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org> <44CD56BB.6080405@errno.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 10:00:29 -0000


On Sun, 30 Jul 2006, Sam Leffler wrote:

>> I'm a little uncomfortable with our current m_tag model, as it requires 
>> significant numbers of additional allocations and frees for each packet, as 
>> well as walking linked lists.  It's fine for occasional discretionary use 
>> (e.g., MAC labels), but I worry about cases where it is used with every 
>> packet, and we start seeing moderately non-zero numbers of tags on every 
>> packet.  I think I would be more comfortable with an explicit queue 
>> identifier argument to if_start, where the link layer and driver layer 
>> agree on how to identify queues.
>> 
>> As a straw man, how would the following strike you:
>>
>>     int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);
>> 
>> where for most link layers, the value would be zero, but for some link 
>> layer/driver combinations, it would identify a specific queue which the 
>> link layer believes the mbuf should be assigned, if implemented?
>
> mbuf tags are not a solution; too expensive.  I think we need something in 
> the mbuf header.

Agreed.  I'm also quite unhappy that we have to use m_tags for VLAN tagging 
for identical reasons: it basically guarantees at least one extra memory 
allocation and free, possibly two, for each frame with encapsulation.  This is 
one of the reasons I have been interested in reworking the ethernet link layer 
parts to increase integration of VLANs into the normal ethernet code, in order 
to avoid having to unnecessarily use expensive mbuf meta-data.

What size field is needed in the mbuf pkthdr to capture all the necessary 
priority information between driver and link layer?  An int?
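
One rough way to gauge the size, as a sketch only: add up the widths of 
the priority values mentioned in this thread.  The 802.1Q priority code 
point is 3 bits, the DiffServ code point is 6 bits, and the four WME 
access categories need 2 bits.  The packing below is hypothetical (the 
PRIO_* names and the layout are made up for illustration), and whether 
all three need to travel together is an open question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing; the widths follow the on-wire field sizes. */
#define	PRIO_PCP_BITS	3	/* 802.1Q priority code point */
#define	PRIO_DSCP_BITS	6	/* DiffServ code point */
#define	PRIO_AC_BITS	2	/* four WME access categories */

#define	PRIO_PCP_SHIFT	0
#define	PRIO_DSCP_SHIFT	(PRIO_PCP_SHIFT + PRIO_PCP_BITS)
#define	PRIO_AC_SHIFT	(PRIO_DSCP_SHIFT + PRIO_DSCP_BITS)
#define	PRIO_TOTAL_BITS	(PRIO_AC_SHIFT + PRIO_AC_BITS)

#define	PRIO_MASK(bits)	((1u << (bits)) - 1u)

/* Pack the three values into one field; out-of-range bits are masked off. */
static uint16_t
prio_pack(unsigned pcp, unsigned dscp, unsigned ac)
{
	return ((uint16_t)(
	    ((pcp & PRIO_MASK(PRIO_PCP_BITS)) << PRIO_PCP_SHIFT) |
	    ((dscp & PRIO_MASK(PRIO_DSCP_BITS)) << PRIO_DSCP_SHIFT) |
	    ((ac & PRIO_MASK(PRIO_AC_BITS)) << PRIO_AC_SHIFT)));
}

static unsigned
prio_pcp(uint16_t v)
{
	return ((v >> PRIO_PCP_SHIFT) & PRIO_MASK(PRIO_PCP_BITS));
}

static unsigned
prio_dscp(uint16_t v)
{
	return ((v >> PRIO_DSCP_SHIFT) & PRIO_MASK(PRIO_DSCP_BITS));
}

static unsigned
prio_ac(uint16_t v)
{
	return ((v >> PRIO_AC_SHIFT) & PRIO_MASK(PRIO_AC_BITS));
}
```

Under these assumptions the total is 11 bits, so a 16-bit field would 
hold all three values with room to spare.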

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 17:05:34 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 6B42816A4DA;
	Mon, 31 Jul 2006 17:05:34 +0000 (UTC) (envelope-from jdp@polstra.com)
Received: from blake.polstra.com (blake.polstra.com [64.81.189.66])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E801843D4C;
	Mon, 31 Jul 2006 17:05:33 +0000 (GMT) (envelope-from jdp@polstra.com)
Received: from strings.polstra.com (strings.polstra.com [64.81.189.67])
	by blake.polstra.com (8.13.6/8.13.6) with ESMTP id k6VH5XKG038776;
	Mon, 31 Jul 2006 10:05:33 -0700 (PDT) (envelope-from jdp@polstra.com)
Message-ID: <XFMail.20060731100533.jdp@polstra.com>
X-Mailer: XFMail 1.5.5 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
Date: Mon, 31 Jul 2006 10:05:33 -0700 (PDT)
From: John Polstra <jdp@polstra.com>
To: Robert Watson <rwatson@freebsd.org>
Cc: arch@freebsd.org, net@freebsd.org
Subject: RE: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 17:05:34 -0000

> Attached is a patch that maintains the current if_start, but adds 
> if_startmbuf.  If a device driver implements if_startmbuf and the global 
> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the 
> driver will be used.  Otherwise, if_start is used.  I have modified the if_em 
> driver to implement if_startmbuf also.  If there is no packet backlog in the 
> if_snd queue, it directly places the packet in the transmit descriptor ring. 
> If there is a backlog, it uses the if_snd queue protected by driver mutex, 
> rather than a separate ifq mutex.

I question whether you need a fallback software if_snd queue at all
for modern devices such as the Intel and Broadcom gigabit chips.  The
hardware transmit descriptor rings typically have sizes of the order
of 256 descriptors.  I think if the ring fills up, you could simply
drop the packet with ENOBUFS.  That's what happens if the if_snd queue
fills up, and its maximum size is comparable to the sizes of modern
descriptor rings.  It would simplify things quite a bit to eliminate
the if_snd queue entirely for such devices.
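
A minimal sketch of the queue-less handoff, assuming a made-up ring 
structure (tx_ring, tx_ring_start and tx_ring_reclaim are hypothetical 
names, not if_em code).  Locking is left out; a real driver would hold 
its transmit mutex around both functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	TX_RING_SIZE	256	/* "of the order of 256 descriptors" */

/* Hypothetical transmit ring; a real one holds DMA descriptors. */
struct tx_ring {
	void		*slot[TX_RING_SIZE];
	unsigned	 head;	/* next slot the driver fills */
	unsigned	 tail;	/* next slot the hardware completes */
	unsigned	 used;	/* descriptors currently owned by hardware */
};

/*
 * Hand one packet straight to the ring.  If the ring is full, report
 * ENOBUFS; in this sketch the caller is assumed to free the packet and
 * count the drop.  There is no software queue to fall back to.
 */
static int
tx_ring_start(struct tx_ring *r, void *pkt)
{
	if (r->used == TX_RING_SIZE)
		return (ENOBUFS);
	r->slot[r->head] = pkt;
	r->head = (r->head + 1) % TX_RING_SIZE;
	r->used++;
	return (0);
}

/* Transmit-complete path: reclaim the oldest descriptor, if any. */
static void *
tx_ring_reclaim(struct tx_ring *r)
{
	void *pkt;

	if (r->used == 0)
		return (NULL);
	pkt = r->slot[r->tail];
	r->tail = (r->tail + 1) % TX_RING_SIZE;
	r->used--;
	return (pkt);
}
```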

In any case, I'm glad you're looking at making this change.  I think
it's the right thing to do.

John

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 17:08:30 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9482316A4DE;
	Mon, 31 Jul 2006 17:08:30 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B4FA943D5C;
	Mon, 31 Jul 2006 17:08:27 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id A8C8346B9B;
	Mon, 31 Jul 2006 13:08:24 -0400 (EDT)
Date: Mon, 31 Jul 2006 18:08:24 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: John Polstra <jdp@polstra.com>
In-Reply-To: <XFMail.20060731100533.jdp@polstra.com>
Message-ID: <20060731180643.E71432@fledge.watson.org>
References: <XFMail.20060731100533.jdp@polstra.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, net@freebsd.org
Subject: RE: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 17:08:30 -0000

On Mon, 31 Jul 2006, John Polstra wrote:

>> Attached is a patch that maintains the current if_start, but adds 
>> if_startmbuf.  If a device driver implements if_startmbuf and the global 
>> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the 
>> driver will be used.  Otherwise, if_start is used.  I have modified the 
>> if_em driver to implement if_startmbuf also.  If there is no packet backlog 
>> in the if_snd queue, it directly places the packet in the transmit 
>> descriptor ring. If there is a backlog, it uses the if_snd queue protected 
>> by driver mutex, rather than a separate ifq mutex.
>
> I question whether you need a fallback software if_snd queue at all for 
> modern devices such as the Intel and Broadcom gigabit chips.  The hardware 
> transmit descriptor rings typically have sizes of the order of 256 
> descriptors.  I think if the ring fills up, you could simply drop the packet 
> with ENOBUFS.  That's what happens if the if_snd queue fills up, and its 
> maximum size is comparable to the sizes of modern descriptor rings.  It 
> would simplify things quite a bit to eliminate the if_snd queue entirely for 
> such devices.

I tend to agree, but implemented full queueing support for if_em to make sure 
I understood the complexity implications of completely removing queueing from 
the ifnet side dispatch.  I guess an interesting question for us is how we 
decide what the right threshold is to implement software queueing.  Do any 
if_em cards need software queueing, or do they all have adequate in-hardware 
queues as is?  Entirely cutting the queue code would significantly simplify 
em_startmbuf.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 17:22:03 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 14FD316A4DD;
	Mon, 31 Jul 2006 17:22:03 +0000 (UTC) (envelope-from pete@he.iki.fi)
Received: from silver.he.iki.fi (helenius.fi [193.64.42.241])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8BB0343D68;
	Mon, 31 Jul 2006 17:21:57 +0000 (GMT) (envelope-from pete@he.iki.fi)
Received: from localhost (localhost [127.0.0.1])
	by silver.he.iki.fi (Postfix) with ESMTP id 8F85DBBFB;
	Mon, 31 Jul 2006 20:21:53 +0300 (EEST)
Received: from silver.he.iki.fi ([127.0.0.1])
	by localhost (silver.he.iki.fi [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 1ZGb9qUh8C8P; Mon, 31 Jul 2006 20:21:50 +0300 (EEST)
Received: from [IPv6:2001:670:84:0:2410:b116:d67f:84b] (unknown
	[IPv6:2001:670:84:0:2410:b116:d67f:84b])
	by silver.he.iki.fi (Postfix) with ESMTP;
	Mon, 31 Jul 2006 20:21:50 +0300 (EEST)
Message-ID: <44CE3C2E.80007@he.iki.fi>
Date: Mon, 31 Jul 2006 20:21:50 +0300
From: Petri Helenius <pete@he.iki.fi>
User-Agent: Thunderbird 1.5.0.5 (Windows/20060719)
MIME-Version: 1.0
To: Robert Watson <rwatson@FreeBSD.org>
References: <XFMail.20060731100533.jdp@polstra.com>
	<20060731180643.E71432@fledge.watson.org>
In-Reply-To: <20060731180643.E71432@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, net@freebsd.org, John Polstra <jdp@polstra.com>
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 17:22:03 -0000

Robert Watson wrote:
>
> I tend to agree, but implemented full queueing support for if_em to 
> make sure I understood the complexity implications of completely 
> removing queueing from the ifnet side dispatch.  I guess an 
> interesting question for us is how we decide what the right threshold 
> is to implement software queuing.  Do any if_em cards need software 
> queueing, or do they all have adequate in-hardware queues as is?  
> Entirely cutting the queue code would significantly simplify 
> em_startmbuf.
Actually most em cards support 4096 descriptors each way.

Pete


From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 17:34:55 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5914116A4DD;
	Mon, 31 Jul 2006 17:34:55 +0000 (UTC) (envelope-from jdp@polstra.com)
Received: from blake.polstra.com (blake.polstra.com [64.81.189.66])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 70D6643D46;
	Mon, 31 Jul 2006 17:34:54 +0000 (GMT) (envelope-from jdp@polstra.com)
Received: from strings.polstra.com (strings.polstra.com [64.81.189.67])
	by blake.polstra.com (8.13.6/8.13.6) with ESMTP id k6VHYrdi039191;
	Mon, 31 Jul 2006 10:34:53 -0700 (PDT) (envelope-from jdp@polstra.com)
Message-ID: <XFMail.20060731103453.jdp@polstra.com>
X-Mailer: XFMail 1.5.5 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <44CE3C2E.80007@he.iki.fi>
Date: Mon, 31 Jul 2006 10:34:53 -0700 (PDT)
From: John Polstra <jdp@polstra.com>
To: Petri Helenius <pete@he.iki.fi>
Cc: arch@freebsd.org, net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 17:34:55 -0000

On 31-Jul-2006 Petri Helenius wrote:
> Robert Watson wrote:
>>
>> I tend to agree, but implemented full queueing support for if_em to 
>> make sure I understood the complexity implications of completely 
>> removing queueing from the ifnet side dispatch.  I guess an 
>> interesting question for us is how we decide what the right threshold 
>> is to implement software queuing.  Do any if_em cards need software 
>> queueing, or do they all have adequate in-hardware queues as is?  
>> Entirely cutting the queue code would significantly simplify 
>> em_startmbuf.
> Actually most em cards support 4096 descriptors each way.

Yes, even the earliest ones supported 4096 descriptors on paper.  In
practice, the early chips had bugs that required the entire descriptor
ring to fit in a single page of memory.  That limited them to 4096/16
= 256 transmit descriptors on x86 hardware at the time.  That chip
bug was fixed a long time ago, though, and in any case 256 transmit
descriptors is a lot for most applications.
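
The arithmetic above can be checked with a small sketch.  It assumes a
4096-byte page and a 16-byte transmit descriptor, which are the figures
implied by the 4096/16 calculation; the function and constant names are
illustrative only.

```python
def max_descriptors_in_one_page(page_size: int, descriptor_size: int) -> int:
    """Number of whole descriptors that fit in a single page of memory."""
    return page_size // descriptor_size


X86_PAGE_SIZE = 4096        # bytes, assumed page size
TX_DESCRIPTOR_SIZE = 16     # bytes, assumed size of one transmit descriptor

print(max_descriptors_in_one_page(X86_PAGE_SIZE, TX_DESCRIPTOR_SIZE))  # 256
```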

John

From owner-freebsd-arch@FreeBSD.ORG  Mon Jul 31 19:09:25 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 344DC16A4E2;
	Mon, 31 Jul 2006 19:09:25 +0000 (UTC)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: from hydrogen.funkthat.com (gate.funkthat.com [69.17.45.168])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8A27443D60;
	Mon, 31 Jul 2006 19:09:24 +0000 (GMT)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: from hydrogen.funkthat.com (pq0v1uefe3wdhse6@localhost.funkthat.com
	[127.0.0.1])
	by hydrogen.funkthat.com (8.13.6/8.13.3) with ESMTP id k6VJ9OGn098109; 
	Mon, 31 Jul 2006 12:09:24 -0700 (PDT)
	(envelope-from jmg@hydrogen.funkthat.com)
Received: (from jmg@localhost)
	by hydrogen.funkthat.com (8.13.6/8.13.3/Submit) id k6VJ9MbM098108;
	Mon, 31 Jul 2006 12:09:22 -0700 (PDT) (envelope-from jmg)
Date: Mon, 31 Jul 2006 12:09:22 -0700
From: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
To: Robert Watson <rwatson@FreeBSD.org>
Message-ID: <20060731190922.GJ96589@funkthat.com>
Mail-Followup-To: Robert Watson <rwatson@FreeBSD.org>,
	John Polstra <jdp@polstra.com>, arch@freebsd.org, net@freebsd.org
References: <XFMail.20060731100533.jdp@polstra.com>
	<20060731180643.E71432@fledge.watson.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20060731180643.E71432@fledge.watson.org>
User-Agent: Mutt/1.4.2.1i
X-Operating-System: FreeBSD 5.4-RELEASE-p6 i386
X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31  96 7A 22 B3 D8 56 36 F4
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
Cc: arch@FreeBSD.org, net@FreeBSD.org, John Polstra <jdp@polstra.com>
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Jul 2006 19:09:25 -0000

Robert Watson wrote this message on Mon, Jul 31, 2006 at 18:08 +0100:
> >I question whether you need a fallback software if_snd queue at all for 
> >modern devices such as the Intel and Broadcom gigabit chips.  The hardware 
> >transmit descriptor rings typically have sizes of the order of 256 
> >descriptors.  I think if the ring fills up, you could simply drop the 
> >packet with ENOBUFS.  That's what happens if the if_snd queue fills up, 
> >and its maximum size is comparable to the sizes of modern descriptor 
> >rings.  It would simplify things quite a bit to eliminate the if_snd queue 
> >entirely for such devices.
> 
> I tend to agree, but implemented full queueing support for if_em to make 
> sure I understood the complexity implications of completely removing 
> queueing from the ifnet side dispatch.  I guess an interesting question for 
> us is how we decide what the right threshold is to implement software 
> queuing.  Do any if_em cards need software queueing, or do they all have 
> adequate in-hardware queues as is?  Entirely cutting the queue code would 
> significantly simplify em_startmbuf.

This work tends to lead to a generic Ethernet card framework that I've
been thinking about, where instead of each driver doing all the handling
of a ring buffer, the card registers a few functions to manipulate a ring
buffer (if it has one), and the framework does the necessary work.  Encoding
all the different styles of ring buffers may be interesting, though: per
packet instead of per segment (if_re)...
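
A minimal sketch of how such a framework might look, combined with the
quoted suggestion of returning ENOBUFS when the hardware ring is full
instead of falling back to a software queue.  Everything here is
hypothetical: `RingOps`, `framework_transmit` and `ToyRing` are made-up
names, not an existing FreeBSD interface.

```python
import errno
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RingOps:
    """Hypothetical callbacks a driver registers with the generic framework."""
    free_slots: Callable[[], int]      # descriptors still available
    post: Callable[[bytes], None]      # place one packet on the hardware ring


def framework_transmit(ops: RingOps, packet: bytes) -> int:
    """Hand a packet to the ring; with no software queue, a full ring means ENOBUFS."""
    if ops.free_slots() == 0:
        return errno.ENOBUFS
    ops.post(packet)
    return 0


class ToyRing:
    """Stand-in for a device's transmit descriptor ring."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.posted: List[bytes] = []

    def ops(self) -> RingOps:
        return RingOps(
            free_slots=lambda: self.size - len(self.posted),
            post=self.posted.append,
        )
```

A driver with a different ring style would supply its own two callbacks;
the framework code above would not change.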

The other part is to split the current monolithic lock structure that
the Ethernet cards have into three (or four) different locks: tx head,
tx tail, rx head & tail...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 12:21:55 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5D4B716A4DD;
	Tue,  1 Aug 2006 12:21:55 +0000 (UTC)
	(envelope-from gallatin@cs.duke.edu)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id DCE3C43D69;
	Tue,  1 Aug 2006 12:21:54 +0000 (GMT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71CLnCd024908
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Tue, 1 Aug 2006 08:21:49 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71CLixZ060991; 
	Tue, 1 Aug 2006 08:21:44 -0400 (EDT) (envelope-from gallatin)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17615.18264.172863.892776@grasshopper.cs.duke.edu>
Date: Tue, 1 Aug 2006 08:21:44 -0400 (EDT)
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060731105045.X16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<44CCFC2C.20402@errno.com>
	<20060730200929.J16341@fledge.watson.org>
	<200607311024.50537.hselasky@c2i.net>
	<20060731105045.X16341@fledge.watson.org>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Cc: arch@FreeBSD.org, net@FreeBSD.org, freebsd-arch@FreeBSD.org,
	Hans Petter Selasky <hselasky@c2i.net>
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 12:21:55 -0000


Robert Watson writes:
 > tear-down magic.  What Solaris does here, FYI, is basically add a lock around 
 > entering the device driver via their mac layer in order to prevent it from 
 > "disappearing" while in use via the ifnet interface.  I'm not sure if we want 

At least for GLDv2, this is a reader-writer lock.  The transmit and
receive paths take a read lock on the device's macinfo (like ifnet)
struct, and the detach code takes a write lock.  The Solaris driver
model does not serialize transmits (or receives), as one might think
from reading the above.
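
The pattern can be sketched roughly as follows.  This is not the GLDv2
API: `ReadWriteLock` and `Device` are hypothetical names, and the lock is
a deliberately simple one that lets writers starve under constant read
load.  It only shows that transmits share the lock while detach excludes
them.

```python
import threading


class ReadWriteLock:
    """Minimal reader-writer lock: many readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class Device:
    """Hypothetical stand-in for a driver instance guarded by the lock."""

    def __init__(self):
        self.lock = ReadWriteLock()
        self.attached = True
        self.sent = []
        self._sent_lock = threading.Lock()  # protects only the list above

    def transmit(self, packet):
        # Read side: any number of transmits may hold the lock together.
        self.lock.acquire_read()
        try:
            if not self.attached:
                return False
            with self._sent_lock:
                self.sent.append(packet)
            return True
        finally:
            self.lock.release_read()

    def detach(self):
        # Write side: waits until no transmit is inside the driver.
        self.lock.acquire_write()
        try:
            self.attached = False
        finally:
            self.lock.release_write()
```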


Drew

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 12:30:49 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id BB52F16A4E0;
	Tue,  1 Aug 2006 12:30:49 +0000 (UTC)
	(envelope-from gallatin@cs.duke.edu)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3AC9C43D73;
	Tue,  1 Aug 2006 12:30:43 +0000 (GMT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71CUgBl026750
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Tue, 1 Aug 2006 08:30:42 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71CUXn6061010; 
	Tue, 1 Aug 2006 08:30:33 -0400 (EDT) (envelope-from gallatin)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17615.18793.700752.342809@grasshopper.cs.duke.edu>
Date: Tue, 1 Aug 2006 08:30:33 -0400 (EDT)
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 12:30:49 -0000


Robert Watson writes:
 > 
 > One of the ideas that I, Scott Long, and a few others have been bouncing 
 > around for some time is a restructuring of the network interface packet 
 > transmission API to reduce the number of locking operations and allow network 
 > device drivers increased control of the queueing behavior.  Right now, it 

<....>

 > - The ifnet send queue is a separately locked object from the device driver,
 >    meaning that for a single enqueue/dequeue pair, we pay an extra four lock
 >    operations (two for insert, two for remove) per packet.
 > 

Going forward, especially now that we support sun4v CoolThreads
hardware, we're going to want to rethink the "single lock" per
transmit routine model that most drivers have.  The most expensive
operation in transmit routines is bus_dmamap_load_mbuf_sg(),
especially when there is an IOMMU involved (like on CoolThreads
machines) and there is no reason why this needs to be called with a
driver's transmit lock held.  I have hard data (from Solaris) about
how much fine grained locking in a 10GbE driver's transmit routine
helps.
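
The idea can be sketched as below.  This is an illustration, not driver
code: `expensive_map` is a pure-Python stand-in for the mapping step
(it is not bus_dmamap_load_mbuf_sg()), `TxRing` and `transmit` are
made-up names, and the sketch ignores how the mapping resource itself
is obtained.

```python
import threading


class TxRing:
    """Toy transmit ring whose lock covers only the ring insertion."""

    def __init__(self, size):
        self.size = size
        self.slots = []
        self.lock = threading.Lock()

    def enqueue(self, mapped):
        with self.lock:
            if len(self.slots) >= self.size:
                return False          # ring full
            self.slots.append(mapped)
            return True


def expensive_map(packet, segment_size=4096):
    """Stand-in for the costly mapping step; touches no shared state."""
    return [(offset, min(segment_size, len(packet) - offset))
            for offset in range(0, len(packet), segment_size)]


def transmit(ring, packet):
    segments = expensive_map(packet)  # done without holding ring.lock
    return ring.enqueue(segments)     # lock held only for the short insert
```

Several threads calling `transmit` would then contend only for the brief
ring insertion, not for the mapping work.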

Drew


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 12:41:07 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3A14F16A4DD;
	Tue,  1 Aug 2006 12:41:07 +0000 (UTC)
	(envelope-from gallatin@cs.duke.edu)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id AE87943D46;
	Tue,  1 Aug 2006 12:41:06 +0000 (GMT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71Cf5ll028998
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Tue, 1 Aug 2006 08:41:05 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71Cf0NH061024; 
	Tue, 1 Aug 2006 08:41:00 -0400 (EDT) (envelope-from gallatin)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17615.19420.172545.986872@grasshopper.cs.duke.edu>
Date: Tue, 1 Aug 2006 08:41:00 -0400 (EDT)
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 12:41:07 -0000


Robert Watson writes:

 >   The immediate practical benefit is 
 > clear: if the queueing at the ifnet layer is unnecessary, it is entirely 
 > avoided, skipping enqueue, dequeue, and four mutex operations.  

This is indeed nice, but for TCP I think the benefit would be far
greater if somebody would PLEASE, PLEASE, PLEASE implement TSO (aka
LSO).

Consider a 1460 byte mss and 64KB of data that is ready to be sent.
With the current model, that is 45 separate calls to if_output(),
and 45*4 (queuing) + 45 (tx routine) == 225 mutex operations.

Using your model, we're down to 45 mutex operations.

Using TSO, we have 4 + 1 == 5 mutex operations with the old
model, and 1 with your model.

This is not even considering all the other overhead involved
in 45 transmits vs TSO...
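
The counts above can be reproduced with a short sketch.  It assumes, as
the message does, four mutex operations per packet for ifnet queueing,
one for the driver's transmit routine, and that TSO hands the whole
64KB down as a single unit; `mutex_ops` is an illustrative name.

```python
import math


def mutex_ops(data_bytes, mss, queue_ops_per_packet, tx_ops_per_packet=1, tso=False):
    """Mutex operations needed to hand data_bytes to the driver."""
    packets = 1 if tso else math.ceil(data_bytes / mss)
    return packets * (queue_ops_per_packet + tx_ops_per_packet)


DATA = 64 * 1024   # 64KB ready to be sent
MSS = 1460

print(mutex_ops(DATA, MSS, queue_ops_per_packet=4))            # 225: current model
print(mutex_ops(DATA, MSS, queue_ops_per_packet=0))            # 45: no ifnet queue
print(mutex_ops(DATA, MSS, queue_ops_per_packet=4, tso=True))  # 5: TSO, current model
print(mutex_ops(DATA, MSS, queue_ops_per_packet=0, tso=True))  # 1: TSO, no ifnet queue
```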

Just something to think about..

Drew

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 13:25:30 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E16EA16A4E2;
	Tue,  1 Aug 2006 13:25:30 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8B85A43D64;
	Tue,  1 Aug 2006 13:25:30 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E087846C7B;
	Tue,  1 Aug 2006 09:25:30 -0400 (EDT)
Date: Tue, 1 Aug 2006 14:25:30 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Andrew Gallatin <gallatin@cs.duke.edu>
In-Reply-To: <17615.19420.172545.986872@grasshopper.cs.duke.edu>
Message-ID: <20060801142056.C64452@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<17615.19420.172545.986872@grasshopper.cs.duke.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 13:25:31 -0000


On Tue, 1 Aug 2006, Andrew Gallatin wrote:

> Robert Watson writes:
>
> >   The immediate practical benefit is clear: if the queueing at the ifnet 
> > layer is unnecessary, it is entirely avoided, skipping enqueue, dequeue, 
> > and four mutex operations.
>
> This is indeed nice, but for TCP I think the benefit would be far greater if 
> somebody would PLEASE, PLEASE, PLEASE implement TSO (aka LSO).
>
> Consider a 1460 byte mss and 64KB of data that is ready to be sent. With the 
> current model, that is 45 separate calls to if_output(), and 45*4 (queuing) 
> + 45 (tx routine) == 225 mutex operations.
>
> Using your model, we're down to 45 mutex operations.
>
> Using TSO, we have 4 + 1 == 5 mutex operations with the old model, and 1 
> with your model.
>
> This is not even considering all the other overhead involved in 45 transmits 
> vs TSO...

Jack Vogel at Intel has previously talked about having TSO patches for FreeBSD 
to use with if_em, but was running into stability/correctness problems on 7.x. 
I e-mailed him a few minutes ago to ask to take a look at the patches.  Since 
I've not yet seen them, I don't know the details of how they work -- I assume 
that some amount of tweaking is required not just to TCP so that it passes 
down larger segments, but so that larger segments are used in the right ways, 
and in keeping with the facilities offered by the underlying interface, 
especially before a routing decision has been made.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 13:27:56 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A138716A4DD;
	Tue,  1 Aug 2006 13:27:56 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4522243D8C;
	Tue,  1 Aug 2006 13:27:49 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id CE2D246BB0;
	Tue,  1 Aug 2006 09:27:49 -0400 (EDT)
Date: Tue, 1 Aug 2006 14:27:49 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Andrew Gallatin <gallatin@cs.duke.edu>
In-Reply-To: <17615.18793.700752.342809@grasshopper.cs.duke.edu>
Message-ID: <20060801142558.M64452@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<17615.18793.700752.342809@grasshopper.cs.duke.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 13:27:56 -0000


On Tue, 1 Aug 2006, Andrew Gallatin wrote:

> > - The ifnet send queue is a separately locked object from the device driver,
> >    meaning that for a single enqueue/dequeue pair, we pay an extra four lock
> >    operations (two for insert, two for remove) per packet.
>
> Going forward, especially now that we support sun4v CoolThreads hardware, 
> we're going to want to rethink the "single lock" per transmit routine model 
> that most drivers have.  The most expensive operation in transmit routines 
> is bus_dmamap_load_mbuf_sg(), especially when there is an IOMMU involved 
> (like on CoolThreads machines) and there is no reason why this needs to be 
> called with a driver's transmit lock held.  I have hard data (from Solaris) 
> about how much fine grained locking in a 10GbE driver's transmit routine 
> helps.

Right now, with the exception of locking for the ifnet dispatch queue, I 
believe our ifnet API pretty much leaves decisions about the nature and 
granularity of synchronization to the device driver author.  The ifnet queue 
is high on my list to address (hence this thread) -- are there any other parts 
of our device driver framework that stand in the way of a device driver 
being modified to support greater parallelism in sending?

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 13:37:24 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 526AE16A504;
	Tue,  1 Aug 2006 13:37:24 +0000 (UTC)
	(envelope-from gallatin@cs.duke.edu)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id D976243D4C;
	Tue,  1 Aug 2006 13:37:23 +0000 (GMT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71DbNX7011165
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Tue, 1 Aug 2006 09:37:23 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71DbI1c061075; 
	Tue, 1 Aug 2006 09:37:18 -0400 (EDT) (envelope-from gallatin)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17615.22798.24602.160771@grasshopper.cs.duke.edu>
Date: Tue, 1 Aug 2006 09:37:18 -0400 (EDT)
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060801142056.C64452@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<17615.19420.172545.986872@grasshopper.cs.duke.edu>
	<20060801142056.C64452@fledge.watson.org>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 13:37:24 -0000


Robert Watson writes:
 > 
 > Jack Vogel at Intel has previously talked about having TSO patches for FreeBSD 
 > to use with if_em, but was running into stability/correctness problems on 7.x. 
 > I e-mailed him a few minutes ago to ask to take a look at the patches.  Since 
 > I've not yet seen them, I don't know the details of how they work -- I assume 
 > that some amount of tweaking is required not just to TCP so that it passes 
 > down larger segments, but so that larger segments are used in the right ways, 
 > and in keeping with the facilities offered by the underlying interface, 
 > especially before a routing decision has been made.

I'm not sure about what stack changes are required for TSO.
I do know that NetBSD has had TSO for roughly 2 years, so they
might be a good place to look..

Drew

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug  1 13:48:18 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A811016A4E2;
	Tue,  1 Aug 2006 13:48:18 +0000 (UTC)
	(envelope-from gallatin@cs.duke.edu)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2D11143DA2;
	Tue,  1 Aug 2006 13:47:59 +0000 (GMT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71Dlxbp012757
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Tue, 1 Aug 2006 09:47:59 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71Dlr6g061084; 
	Tue, 1 Aug 2006 09:47:53 -0400 (EDT) (envelope-from gallatin)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17615.23433.918293.466584@grasshopper.cs.duke.edu>
Date: Tue, 1 Aug 2006 09:47:53 -0400 (EDT)
To: Robert Watson <rwatson@FreeBSD.org>
In-Reply-To: <20060801142558.M64452@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
	<17615.18793.700752.342809@grasshopper.cs.duke.edu>
	<20060801142558.M64452@fledge.watson.org>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Cc: arch@FreeBSD.org, net@FreeBSD.org
Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Aug 2006 13:48:18 -0000


Robert Watson writes:
 > 
 > On Tue, 1 Aug 2006, Andrew Gallatin wrote:
 > 
 > > > - The ifnet send queue is a separately locked object from the device driver,
 > > >    meaning that for a single enqueue/dequeue pair, we pay an extra four lock
 > > >    operations (two for insert, two for remove) per packet.
 > >
 > > Going forward, especially now that we support sun4v CoolThreads hardware, 
 > > we're going to want to rethink the "single lock" per transmit routine model 
 > > that most drivers have.  The most expensive operation in transmit routines 
 > > is bus_dmamap_load_mbuf_sg(), especially when there is an IOMMU involved 
 > > (like on CoolThreads machines) and there is no reason why this needs to be 
 > > called with a driver's transmit lock held.  I have hard data (from Solaris) 
 > > about how much fine grained locking in a 10GbE driver's transmit routine 
 > > helps.
 > 
 > Right now, with the exception of locking for the ifnet dispatch queue, I 
 > believe our ifnet API pretty much leaves decisions about the nature and 
 > granularity of synchronization to the device driver author.  The ifnet queue 
 > is high on my list to address (hence this thread) -- are there any other parts 
 > of our device driver framework that stand in the way from a device driver 
 > being modified to support greater parallelism in sending?

No, nothing that is directly related to Ethernet drivers.

However, busdma is a pain.  Specifically, I hate that
bus_dmamap_load_mbuf_sg() requires a bus_dmamap_t.  That means that
any fine-grained driver will need to "allocate" a bus_dmamap_t either
via bus_dmamap_create(), or by pulling a pre-allocated bus_dmamap_t
from a pre-allocated pool.  Either will require a lock.  Solaris has a
similar problem, and I use the pool approach in my Solaris driver.
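
A minimal userland sketch of the pool approach, to show where the lock
ends up.  Everything here is a stand-in: struct fake_dmamap is not
bus_dmamap_t, POOL_SIZE and the map_pool_* names are hypothetical, and
a pthread mutex takes the place of a kernel mutex.  The point is only
that both get and put must serialize on the pool, even when the rest
of the transmit path is fine-grained.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 4	/* hypothetical; a driver would size this to its tx ring */

/* Stand-in for a DMA map handle; this is NOT the FreeBSD bus_dmamap_t. */
struct fake_dmamap {
	int id;
};

struct map_pool {
	pthread_mutex_t lock;			/* protects free_list and nfree */
	struct fake_dmamap *free_list[POOL_SIZE];
	size_t nfree;
	struct fake_dmamap maps[POOL_SIZE];	/* pre-allocated backing store */
};

/* Pre-allocate every handle up front so the fast path never allocates. */
static void
map_pool_init(struct map_pool *p)
{
	pthread_mutex_init(&p->lock, NULL);
	for (size_t i = 0; i < POOL_SIZE; i++) {
		p->maps[i].id = (int)i;
		p->free_list[i] = &p->maps[i];
	}
	p->nfree = POOL_SIZE;
}

/* Take a handle from the pool; returns NULL when the pool is exhausted. */
static struct fake_dmamap *
map_pool_get(struct map_pool *p)
{
	struct fake_dmamap *m = NULL;

	pthread_mutex_lock(&p->lock);
	if (p->nfree > 0)
		m = p->free_list[--p->nfree];
	pthread_mutex_unlock(&p->lock);
	return (m);
}

/* Return a handle to the pool once the transmit has completed. */
static void
map_pool_put(struct map_pool *p, struct fake_dmamap *m)
{
	pthread_mutex_lock(&p->lock);
	assert(p->nfree < POOL_SIZE);
	p->free_list[p->nfree++] = m;
	pthread_mutex_unlock(&p->lock);
}
```

In this sketch the pool lock is held only for a pointer pop or push, so
the expensive mapping step would run outside it.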

Linux's pci_map_single()/pci_unmap_addr_set()/pci_unmap_len_set()
is just so much nicer to use...

Drew

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug  2 10:01:26 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8421E16A4E2;
	Wed,  2 Aug 2006 10:01:26 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6F66243D5A;
	Wed,  2 Aug 2006 10:01:20 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au
	[61.8.2.162])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 31436328233;
	Wed,  2 Aug 2006 20:01:19 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP
	id k72A1GmC004061; Wed, 2 Aug 2006 20:01:17 +1000
Date: Wed, 2 Aug 2006 20:01:16 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: John Polstra <jdp@polstra.com>
In-Reply-To: <XFMail.20060731100533.jdp@polstra.com>
Message-ID: <20060802184349.K90387@delplex.bde.org>
References: <XFMail.20060731100533.jdp@polstra.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Robert Watson <rwatson@FreeBSD.org>, net@FreeBSD.org
Subject: RE: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Aug 2006 10:01:26 -0000

On Mon, 31 Jul 2006, John Polstra wrote:

> I question whether you need a fallback software if_snd queue at all
> for modern devices such as the Intel and Broadcom gigabit chips.  The
> hardware transmit descriptor rings typically have sizes of the order
> of 256 descriptors.  I think if the ring fills up, you could simply
> drop the packet with ENOBUFS.  That's what happens if the if_snd queue
> fills up, and its maximum size is comparable to the sizes of modern
> descriptor rings.  It would simplify things quite a bit to eliminate
> the if_snd queue entirely for such devices.

I use an if_snd queue length of about 5000 in my version of the sk
driver to work around suckage in ENOBUFS handling.  The hardware (*)
tx ring size is 512, and tiny packets can be sent in 4 usec, so the
hardware queue provides only 2 msec worth of buffering.  select(2)
for output on sockets doesn't work right, so there is no good way (**)
for applications to proceed when a syscall returns ENOBUFS.  An extra
queue length of 5000 provides an extra 20 msec worth of buffering which
is usually enough when HZ = 100.
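
The arithmetic above can be checked with a small C sketch.  The
function names are hypothetical; the inputs are the figures quoted in
this message (512 tx ring slots, about 5000 if_snd slots, 4 usec per
tiny packet, HZ = 100).

```c
#include <assert.h>

/*
 * Time in microseconds for a full queue of nslots packets to drain,
 * assuming each packet takes usec_per_packet to transmit.
 */
static unsigned long
queue_drain_usec(unsigned long nslots, unsigned long usec_per_packet)
{
	return (nslots * usec_per_packet);
}

/* Length of one clock tick in microseconds for a given HZ. */
static unsigned long
tick_usec(unsigned long hz)
{
	assert(hz > 0);
	return (1000000UL / hz);
}
```

With these inputs, queue_drain_usec(512, 4) is 2048 usec (about 2 msec)
and queue_drain_usec(5000, 4) is 20000 usec (20 msec), while
tick_usec(100) is 10000 usec.  So the tx ring alone drains in well under
one tick, and the larger software queue covers two ticks.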

(*) I think the sk tx ring is not really in hardware, so it can be
much larger than 512, but a length of > 5000 for it seems excessive
and caused panics when I tried it.

(**) Various bad ways can be found in various versions of ttcp and
tools/netrate.  They involve either backing off by sleeping, which
doesn't keep the tx active unless the sleep granularity is small (and
under FreeBSD that only happens if HZ is too large), or never
backing off, which gives busy-waiting.  Instead, select() on the
output socket should actually work -- it should succeed if the tx
queue length is below a low watermark.  Apparently, select() on
output sockets normally doesn't work, since no version of ttcp that
I've looked at (not many) even tries this.
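
A sketch of what an application would do if select() for output
behaved as described.  The helper name send_wait_writable is
hypothetical, and the sketch assumes select() blocks until the tx
queue has drained below a low watermark.  If select() instead reports
the socket writable immediately, this loop degenerates into the
busy-waiting case above.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send one datagram on a connected socket.  On ENOBUFS, wait in
 * select() until the socket is reported writable, then retry.
 * Returns the number of bytes sent, or -1 with errno set.
 */
static ssize_t
send_wait_writable(int fd, const void *buf, size_t len)
{
	assert(fd >= 0 && fd < FD_SETSIZE);

	for (;;) {
		ssize_t n = send(fd, buf, len, 0);

		/* Success, or an error other than a full queue: report it. */
		if (n >= 0 || errno != ENOBUFS)
			return (n);

		/* Queue full: block until select() says we may write again. */
		fd_set wfds;
		FD_ZERO(&wfds);
		FD_SET(fd, &wfds);
		if (select(fd + 1, NULL, &wfds, NULL, NULL) < 0 &&
		    errno != EINTR)
			return (-1);
	}
}
```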

Bruce