From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 14:04:49 2006
Return-Path:
X-Original-To: arch@FreeBSD.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4A19616A4E2; Sun, 30 Jul 2006 14:04:49 +0000 (UTC) (envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id B771B43D5C; Sun, 30 Jul 2006 14:04:48 +0000 (GMT) (envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 45F3A46BBD; Sun, 30 Jul 2006 10:04:48 -0400 (EDT)
Date: Sun, 30 Jul 2006 15:04:48 +0100 (BST)
From: Robert Watson
X-X-Sender: robert@fledge.watson.org
To: net@FreeBSD.org, arch@FreeBSD.org
Message-ID: <20060730141642.D16341@fledge.watson.org>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-1695162780-1154268288=:16341"
Cc:
Subject: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 30 Jul 2006 14:04:49 -0000

This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools.

--0-1695162780-1154268288=:16341
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

One of the ideas that I, Scott Long, and a few others have been bouncing around for some time is a restructuring of the network interface packet transmission API to reduce the number of locking operations and allow network device drivers increased control of the queueing behavior.
Right now, it works something like the following:

- When a network protocol wants to transmit, it calls the ifnet's link layer output routine via ifp->if_output() with the ifnet pointer, packet, destination address information, and route information.

- The link layer (e.g., ether_output() + ether_output_frame()) encapsulates the packet as necessary, performs a link layer address translation (such as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which accepts the ifnet pointer and packet.

- The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs to "start" output by the driver. If the driver is already active, it doesn't; otherwise, it does.

- The driver dequeues the packet from ifp->if_snd, performs any driver encapsulation and wrapping, and notifies the hardware. In modern hardware, this consists of hooking the data of the packet up to the descriptor ring and notifying the hardware to pick it up via DMA. In older hardware, the driver would perform a series of I/O operations to send the entire packet directly to the card via a system bus.

Why change this? A few reasons:

- The ifnet layer send queue is becoming decreasingly useful over time. Most modern hardware has a significant number of slots in its transmit descriptor ring, tuned for the performance of the hardware, etc, which is the effective transmit queue in practice. The additional queue depth doesn't increase throughput substantially (if at all) but does consume memory.

- On extremely fast hardware (with respect to CPU speed), the queue remains essentially empty, so we pay the cost of enqueueing and dequeuing a packet from an empty queue.

- The ifnet send queue is a separately locked object from the device driver, meaning that for a single enqueue/dequeue pair, we pay an extra four lock operations (two for insert, two for remove) per packet.

- For synthetic link layer drivers, such as if_vlan, which have no need for queueing at all, the cost of queueing is eliminated.

- IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the driver, which helps eliminate a latent race condition involving use of the flag.

The proposed change is simple: right now one or more enqueue operations occur, then a call to ifp->if_start() is made to notify the driver that it may need to do something (if the IFF_DRV_OACTIVE flag isn't set). In the new world order, the driver is directly passed the mbuf, and may then choose to queue it or otherwise handle it as it sees fit. The immediate practical benefit is clear: if the queueing at the ifnet layer is unnecessary, it is entirely avoided, skipping enqueue, dequeue, and four mutex operations. This applies immediately for VLAN processing, but also means that for modern gigabit cards, the hardware queue (which will be used anyway) is the only queue necessary.

There are a few downsides, of course:

- For older hardware without its own queueing, the queue is still required -- not only that, but we've now introduced an unconditional function pointer invocation, which, on older hardware, has a more significant relative cost than it has on more recent CPUs.

- If drivers still require or use a queue, they must now synchronize access to the queue. The obvious choices are to use the ifq lock (and restore the above four lock operations), or to use the driver mutex (and risk higher contention). Right now, if the driver is busy (driver mutex held) then an enqueue is still possible, but with this change and a single mutex protecting the send queue and driver, that is no longer possible.

Attached is a patch that maintains the current if_start, but adds if_startmbuf. If a device driver implements if_startmbuf and the global sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the driver will be used. Otherwise, if_start is used.
I have modified the if_em driver to implement if_startmbuf also. If there is no packet backlog in the if_snd queue, it directly places the packet in the transmit descriptor ring. If there is a backlog, it uses the if_snd queue protected by the driver mutex, rather than a separate ifq mutex.

In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte payload PPS on UP, and a 10% improvement on SMP. I saw a 1.7% performance improvement in the bulk serving of 1k files over HTTP. These are only micro-benchmarks, and reflect a configuration in which the CPU is unable to keep up with the output rate of the 1Gbps ethernet card in the device, so reductions in host CPU usage are immediately visible in increased output as the CPU is able to better keep up with the network hardware. Other configurations are also of interest, especially ones in which the network device is unable to keep up with the CPU, resulting in more queueing.

Conceptual review as well as benchmarking, etc, would be most welcome.
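
As a rough illustration of the lock accounting described above, here is a toy user-space model in C. It is only a sketch under stated assumptions: every name in it (toy_ifnet, toy_handoff_queued, toy_handoff_direct, and so on) is hypothetical and is not the FreeBSD ifnet API, locks are modelled as plain counters, and the bulk-dequeue and IFF_DRV_OACTIVE handling of the real code are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy user-space model. Every name here is hypothetical; this is not the
 * FreeBSD ifnet API. Locks are modelled as counters so the cost is visible.
 */
#define TOY_SLOTS 8

struct toy_pkt {
	int id;
};

struct toy_ifnet {
	struct toy_pkt *snd[TOY_SLOTS];		/* stand-in for ifp->if_snd */
	int snd_len;
	struct toy_pkt *ring[TOY_SLOTS];	/* stand-in for the TX descriptor ring */
	int ring_len;
	int ifq_lock_ops;	/* lock + unlock calls on the queue mutex */
	int drv_lock_ops;	/* lock + unlock calls on the driver mutex */
};

/* Put a packet on the ring; the caller "holds" the driver lock. */
static int
toy_ring_put(struct toy_ifnet *ifp, struct toy_pkt *p)
{
	if (ifp->ring_len == TOY_SLOTS)
		return (-1);
	ifp->ring[ifp->ring_len++] = p;
	return (0);
}

/* Current model: enqueue on the send queue, then the driver dequeues. */
static int
toy_handoff_queued(struct toy_ifnet *ifp, struct toy_pkt *p)
{
	struct toy_pkt *head;
	int i;

	assert(p != NULL);
	ifp->ifq_lock_ops++;			/* queue lock for the insert */
	if (ifp->snd_len == TOY_SLOTS) {
		ifp->ifq_lock_ops++;		/* queue unlock */
		return (-1);			/* queue full: drop */
	}
	ifp->snd[ifp->snd_len++] = p;
	ifp->ifq_lock_ops++;			/* queue unlock */

	ifp->drv_lock_ops++;			/* driver lock (start routine) */
	if (ifp->ring_len < TOY_SLOTS) {
		ifp->ifq_lock_ops++;		/* queue lock for the remove */
		head = ifp->snd[0];
		for (i = 1; i < ifp->snd_len; i++)
			ifp->snd[i - 1] = ifp->snd[i];
		ifp->snd_len--;
		ifp->ifq_lock_ops++;		/* queue unlock */
		(void)toy_ring_put(ifp, head);
	}
	ifp->drv_lock_ops++;			/* driver unlock */
	return (0);
}

/* Proposed model: the driver gets the packet and uses only its own lock. */
static int
toy_handoff_direct(struct toy_ifnet *ifp, struct toy_pkt *p)
{
	int error;

	assert(p != NULL);
	ifp->drv_lock_ops++;			/* driver lock */
	error = toy_ring_put(ifp, p);
	ifp->drv_lock_ops++;			/* driver unlock */
	return (error);
}
```

Pushing one packet through each path on a zero-initialized toy_ifnet, the queued path records four queue-lock operations and the direct path records none, while both take and release the driver lock once.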
Robert N M Watson Computer Laboratory University of Cambridge --0-1695162780-1154268288=:16341 Content-Type: TEXT/plain; charset=US-ASCII; name=20060730-if_startmbuf.diff Content-Transfer-Encoding: BASE64 Content-ID: <20060730150448.G16341@fledge.watson.org> Content-Description: Content-Disposition: attachment; filename=20060730-if_startmbuf.diff LS0tIC8vZGVwb3QvdmVuZG9yL2ZyZWVic2Qvc3JjL3N5cy9kZXYvZW0vaWZf ZW0uYwkyMDA2LzA3LzI3IDAwOjQ2OjI0DQorKysgLy9kZXBvdC91c2VyL3J3 YXRzb24vaWZuZXQvc3JjL3N5cy9kZXYvZW0vaWZfZW0uYwkyMDA2LzA3LzI5 IDE4OjQzOjE0DQpAQCAtNzM1LDYgKzczNSw5NSBAQA0KIAlFTV9VTkxPQ0so c2MpOw0KIH0NCiANCitzdGF0aWMgaW50DQorZW1fc3RhcnRtYnVmKHN0cnVj dCBpZm5ldCAqaWZwLCBzdHJ1Y3QgbWJ1ZiAqbSkNCit7DQorICAgICAgICBz dHJ1Y3QgbWJ1ZiAgICAqbV9oZWFkOw0KKyAgICAgICAgc3RydWN0IGVtX3Nv ZnRjICpzYyA9IGlmcC0+aWZfc29mdGM7DQorCXN0cnVjdCBpZnF1ZXVlICpp ZnEgPSAoc3RydWN0IGlmcXVldWUgKikmaWZwLT5pZl9zbmQ7DQorDQorCS8q DQorCSAqIFRocmVlIGNhc2VzOg0KKwkgKg0KKwkgKiAoMSkgSW50ZXJmYWNl IGlzbid0IHJ1bm5pbmcsIGxpbmsgaXMgZG93biwgb3IgaXMgYWxyZWFkeSBh Y3RpdmUsDQorCSAqICAgICBldGMsIHNpbXBseSBlbnF1ZXVlLg0KKwkgKg0K KwkgKiAoMikgVGhlIGludGVyZmFjZSBpcyBydW5uaW5nLCBub3QgdG9vIGJ1 c3ksIGFuZCB3ZSBoYXZlIG5vIG1idWZzDQorCSAqICAgICBpbiB0aGUgaWZu ZXQgc2VuZCBxdWV1ZSwgc28gdHJ5IHRvIGhhbmQgZGlyZWN0bHkgdG8gaGFy ZHdhcmUuDQorCSAqDQorCSAqICgzKSBUaGUgaW50ZXJmYWNlIGlzIHJ1bm5p bmcsIGJ1dCB3ZSBoYXZlIGEgYmFja2xvZy4gIEluc2VydCB0aGUNCisJICog ICAgIGN1cnJlbnQgbWJ1ZiBpbnRvIHRoZSBxdWV1ZSBhbmQgcHJvY2VzcyBp bi1vcmRlciwgaWYgcG9zc2libGUuDQorCSAqLw0KKwlFTV9MT0NLKHNjKTsN CisJaWYgKCgoaWZwLT5pZl9kcnZfZmxhZ3MgJiAoSUZGX0RSVl9SVU5OSU5H fElGRl9EUlZfT0FDVElWRSkpICE9DQorCSAgICBJRkZfRFJWX1JVTk5JTkcp IHx8ICFzYy0+bGlua19hY3RpdmUpIHsNCisJCWlmIChfSUZfUUZVTEwoaWZx KSkgew0KKwkJCV9JRl9EUk9QKGlmcSk7DQorCQkJRU1fVU5MT0NLKHNjKTsN CisJCQltX2ZyZWVtKG0pOw0KKwkJCXJldHVybiAoRU5PQlVGUyk7DQorCQl9 DQorCQlfSUZfRU5RVUVVRShpZnEsIG0pOw0KKwkJRU1fVU5MT0NLKHNjKTsN CisJCXJldHVybiAoMCk7DQorCX0NCisNCisJLyoNCisJICogWFhYUlc6IFZh cmlvdXMgY2FzZXMgaGVyZSBoYXZlIGhpc3RvcmljYWxseSBjb3VudGVkIGFz 
IHN1Y2Nlc3NlcywNCisJICogYnV0IHBlcmhhcHMgdGhleSBzaG91bGQgcmV0 dXJuIEVOT0JVRlM/DQorCSAqLw0KKwlpZiAoX0lGX1FMRU4oaWZxKSA9PSAw KSB7DQorCSAJLyoNCisJCSAqIGVtX2VuY2FwKCkgY2FuIG1vZGlmeSBvdXIg cG9pbnRlciwgYW5kIG9yIG1ha2UgaXQgTlVMTCBvbg0KKwkJICogZmFpbHVy ZS4gIEluIHRoYXQgZXZlbnQsIHdlIGNhbid0IGVucXVldWUuDQorCQkgKi8N CisJCWlmIChlbV9lbmNhcChzYywgJm0pKSB7DQorCQkJaWYgKG0gPT0gTlVM TCkgew0KKwkJCQlFTV9VTkxPQ0soc2MpOw0KKwkJCQlyZXR1cm4gKDApOw0K KwkJCX0NCisJCQlpZnAtPmlmX2ZsYWdzIHw9IElGRl9EUlZfT0FDVElWRTsN CisJCQlfSUZfUFJFUEVORChpZnEsIG0pOw0KKwkJCUVNX1VOTE9DSyhzYyk7 DQorCQkJcmV0dXJuICgwKTsNCisJCX0NCisJCUJQRl9NVEFQKGlmcCwgbSk7 DQorCQlpZnAtPmlmX3RpbWVyID0gRU1fVFhfVElNRU9VVDsNCisJCUVNX1VO TE9DSyhzYyk7DQorCQlyZXR1cm4gKDApOw0KKwl9DQorDQorCWlmIChfSUZf UUZVTEwoaWZxKSkgew0KKwkJX0lGX0RST1AoaWZxKTsNCisJCUVNX1VOTE9D SyhzYyk7DQorCQltX2ZyZWVtKG0pOw0KKwkJcmV0dXJuIChFTk9CVUZTKTsN CisJfQ0KKwlfSUZfRU5RVUVVRShpZnEsIG0pOw0KKw0KKwl3aGlsZSAoIUlG UV9EUlZfSVNfRU1QVFkoJmlmcC0+aWZfc25kKSkgew0KKwkJSUZRX0RSVl9E RVFVRVVFKCZpZnAtPmlmX3NuZCwgbV9oZWFkKTsNCisJCWlmIChtX2hlYWQg PT0gTlVMTCkNCisJCQlicmVhazsNCisJIAkvKg0KKwkJICogZW1fZW5jYXAo KSBjYW4gbW9kaWZ5IG91ciBwb2ludGVyLCBhbmQgb3IgbWFrZSBpdCBOVUxM IG9uDQorCQkgKiBmYWlsdXJlLiAgSW4gdGhhdCBldmVudCwgd2UgY2FuJ3Qg cmVxdWV1ZS4NCisJCSAqLw0KKwkJaWYgKGVtX2VuY2FwKHNjLCAmbV9oZWFk KSkgew0KKwkJCWlmIChtX2hlYWQgPT0gTlVMTCkNCisJCQkJYnJlYWs7DQor CQkJaWZwLT5pZl9kcnZfZmxhZ3MgfD0gSUZGX0RSVl9PQUNUSVZFOw0KKwkJ CUlGUV9EUlZfUFJFUEVORCgmaWZwLT5pZl9zbmQsIG1faGVhZCk7DQorCQkJ YnJlYWs7DQorCQl9DQorCQlCUEZfTVRBUChpZnAsIG1faGVhZCk7DQorCQlp ZnAtPmlmX3RpbWVyID0gRU1fVFhfVElNRU9VVDsNCisJfQ0KKw0KKwlFTV9V TkxPQ0soc2MpOw0KKwlyZXR1cm4gKDApOw0KK30NCisNCiAvKioqKioqKioq KioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioq KioqKioqKioqKioqKioqDQogICogIElvY3RsIGVudHJ5IHBvaW50DQogICoN CkBAIC0yMTU0LDYgKzIyNDMsNyBAQA0KIAlpZnAtPmlmX2ZsYWdzID0gSUZG X0JST0FEQ0FTVCB8IElGRl9TSU1QTEVYIHwgSUZGX01VTFRJQ0FTVDsNCiAJ aWZwLT5pZl9pb2N0bCA9IGVtX2lvY3RsOw0KIAlpZnAtPmlmX3N0YXJ0ID0g 
ZW1fc3RhcnQ7DQorCWlmcC0+aWZfc3RhcnRtYnVmID0gZW1fc3RhcnRtYnVm Ow0KIAlpZnAtPmlmX3dhdGNoZG9nID0gZW1fd2F0Y2hkb2c7DQogCUlGUV9T RVRfTUFYTEVOKCZpZnAtPmlmX3NuZCwgc2MtPm51bV90eF9kZXNjIC0gMSk7 DQogCWlmcC0+aWZfc25kLmlmcV9kcnZfbWF4bGVuID0gc2MtPm51bV90eF9k ZXNjIC0gMTsNCi0tLSAvL2RlcG90L3ZlbmRvci9mcmVlYnNkL3NyYy9zeXMv bmV0L2lmLmMJMjAwNi8wNy8wOSAwNjowNjoyNQ0KKysrIC8vZGVwb3QvdXNl ci9yd2F0c29uL2lmbmV0L3NyYy9zeXMvbmV0L2lmLmMJMjAwNi8wNy8yNiAx NzozMjo1MA0KQEAgLTI0ODYsMjggKzI0ODYsMTExIEBADQogCShpZnAtPmlm X3N0YXJ0KShpZnApOw0KIH0NCiANCitzdGF0aWMgaW50CXN0YXJ0bWJ1Zl9l bmFibGVkOw0KK1NZU0NUTF9JTlQoX25ldCwgT0lEX0FVVE8sIHN0YXJ0bWJ1 Zl9lbmFibGVkLCBDVExGTEFHX1JXLCAmc3RhcnRtYnVmX2VuYWJsZWQsDQor ICAgIDAsICIiKTsNCisNCisvKg0KKyAqIFhYWFJXOg0KKyAqDQorICogaWZf dmFyLmggYW5kIHRoZSBpbnRlcmZhY2UgaGFuZG9mZiBhcmUgc29tZSBvZiB0 aGUgbmFzdGllc3QgcGllY2VzIG9mIHRoZQ0KKyAqIEJTRCBuZXR3b3JrIHN0 YWNrLiAgR2VuZXJhdGlvbnMgb2YgaGFja3MsIHZhcmlhbnRzLCBpbmNvbnNp c3RlbmN5LCBhbmQNCisgKiBmb29saXNobmVzcyBoYXZlIHJlc3VsdGVkIGlu IGVzc2VudGlhbGx5IHVucmVhZGFibGUgY29kZS4gIEZvciBleGFtcGxlLA0K KyAqIHdoeSBhcmUgdGhlIGlmcV8qIGludGVyZmFjZXMgdGhlIG9uZXMgdGhh dCB1c2UgdGhlIGRlZmF1bHQgaWZuZXQgc2VuZA0KKyAqIHF1ZXVlLCBhbmQg dGhlIGlmXyogaW50ZXJmYWNlcyB0aGUgb25lcyB0aGF0IHVzZSBhbHRlcm5h dGl2ZSBxdWV1ZXMsDQorICogcG9zc2libHkgd2l0aCBubyBpZm5ldCBhdCBh bGw/ICBBbmQgd2h5IGRvIHNvbWUgaW50ZXJmYWNlcyByZXR1cm4gZXJybm8N CisgKiB2YWx1ZXMsIGJ1dCBvdGhlcnMgYm9vbGVhbnM/DQorICovDQorDQor LyoNCisgKiBIYW5kb2ZmIGZ1bmN0aW9uIGZvciBzaW1wbGUgaWZuZXQgc3Ry dWN0dXJlcy4gIFJldHVybnMgYW4gZXJybm8gdmFsdWUuDQorICovDQogaW50 DQotaWZfaGFuZG9mZihzdHJ1Y3QgaWZxdWV1ZSAqaWZxLCBzdHJ1Y3QgbWJ1 ZiAqbSwgc3RydWN0IGlmbmV0ICppZnAsIGludCBhZGp1c3QpDQoraWZxX2hh bmRvZmYoc3RydWN0IGlmbmV0ICppZnAsIHN0cnVjdCBtYnVmICptLCBpbnQg YWRqdXN0KQ0KK3sNCisJaW50IGVycm9yLCBsZW4sIHN0YXJ0bWJ1ZjsNCisJ c2hvcnQgbWZsYWdzOw0KKw0KKwlsZW4gPSBtLT5tX3BrdGhkci5sZW47DQor CW1mbGFncyA9IG0tPm1fZmxhZ3M7DQorDQorCWlmIChzdGFydG1idWZfZW5h YmxlZCAmJiBpZnAtPmlmX3N0YXJ0bWJ1ZiAhPSBOVUxMKQ0KKwkJc3RhcnRt 
YnVmID0gMTsNCisJZWxzZQ0KKwkJc3RhcnRtYnVmID0gMDsNCisNCisJaWYg KHN0YXJ0bWJ1ZikNCisJCWVycm9yID0gaWZwLT5pZl9zdGFydG1idWYoaWZw LCBtKTsNCisJZWxzZQ0KKwkJSUZRX0VOUVVFVUUoJmlmcC0+aWZfc25kLCBt LCBlcnJvcik7DQorCWlmIChlcnJvciA9PSAwKSB7DQorCQlpZnAtPmlmX29i eXRlcyArPSBsZW4gKyBhZGp1c3Q7DQorCQlpZiAobWZsYWdzICYgKE1fQkNB U1R8TV9NQ0FTVCkpDQorCQkJaWZwLT5pZl9vbWNhc3RzKys7DQorCX0NCisJ aWYgKCFzdGFydG1idWYgJiYgKGlmcC0+aWZfZHJ2X2ZsYWdzICYgSUZGX0RS Vl9PQUNUSVZFKSA9PSAwKQ0KKwkJaWZfc3RhcnQoaWZwKTsNCisJcmV0dXJu IChlcnJvcik7DQorfQ0KKw0KKy8qDQorICogSGFuZG9mZiBmdW5jdGlvbiBm b3IgYW4gaWZxdWV1ZSB3aXRoIGFuIG9wdGlvbmFsbHkgYWZmaWxpdGlhdGVk IGlmbmV0Lg0KKyAqIFJldHVybnMgYSBib29sZWFuLg0KKyAqLw0KK2ludA0K K2lmX2hhbmRvZmYoc3RydWN0IGlmcXVldWUgKmlmcSwgc3RydWN0IG1idWYg Km0sIHN0cnVjdCBpZm5ldCAqaWZwLA0KKyAgICBpbnQgYWRqdXN0KQ0KK3sN CisJaW50IGxlbiwgYWN0aXZlLCBzdGFydG1idWYsIHN1Y2Nlc3M7DQorCXNo b3J0IG1mbGFnczsNCisNCisJYWN0aXZlID0gMDsNCisJbGVuID0gbS0+bV9w a3RoZHIubGVuOw0KKwltZmxhZ3MgPSBtLT5tX2ZsYWdzOw0KKw0KKwlpZiAo c3RhcnRtYnVmX2VuYWJsZWQgJiYgaWZwICE9IE5VTEwgJiYgaWZwLT5pZl9z dGFydG1idWYgIT0gTlVMTCkNCisJCXN0YXJ0bWJ1ZiA9IDE7DQorCWVsc2UN CisJCXN0YXJ0bWJ1ZiA9IDA7DQorDQorCWlmIChzdGFydG1idWYpDQorCQlz dWNjZXNzID0gKGlmcC0+aWZfc3RhcnRtYnVmKGlmcCwgbSkgPT0gMCk7DQor CWVsc2Ugew0KKwkJSUZfTE9DSyhpZnEpOw0KKwkJaWYgKF9JRl9RRlVMTChp ZnEpKSB7DQorCQkJX0lGX0RST1AoaWZxKTsNCisJCQltX2ZyZWVtKG0pOw0K KwkJCXN1Y2Nlc3MgPSAwOw0KKwkJfSBlbHNlIHsNCisJCQlfSUZfRU5RVUVV RShpZnEsIG0pOw0KKwkJCXN1Y2Nlc3MgPSAxOw0KKwkJfQ0KKwkJSUZfVU5M T0NLKGlmcSk7DQorCQlpZiAoaWZwICE9IE5VTEwgJiYgIShpZnAtPmlmX2Ry dl9mbGFncyAmIElGRl9EUlZfT0FDVElWRSkpDQorCQkJaWZfc3RhcnQoaWZw KTsNCisJfQ0KKwlpZiAoc3VjY2VzcyAmJiBpZnAgIT0gTlVMTCkgew0KKwkJ aWZwLT5pZl9vYnl0ZXMgKz0gbGVuICsgYWRqdXN0Ow0KKwkJaWYgKG0tPm1f ZmxhZ3MgJiAoTV9CQ0FTVHxNX01DQVNUKSkNCisJCQlpZnAtPmlmX29tY2Fz dHMrKzsNCisJfQ0KKwlyZXR1cm4gKHN1Y2Nlc3MpOw0KK30NCisNCisvKg0K KyAqIFV0aWxpdHkgZnVuY3Rpb24gdG8gYmUgdXNlZCBieSBkZXZpY2UgZHJp dmVycyB3aGVuIHRoZXkgbmVlZCB0byBlbnF1ZXVlIGENCisgKiBwYWNrZXQg 
dG8gYW4gaW50ZXJmYWNlLXJlbGF0ZWQgcXVldWUgcmF0aGVyIHRoYW4gaW1t ZWRpYXRlbHkgZGVsaXZlcmluZy4NCisgKi8NCitpbnQNCitpZl9zdGFydG1i dWZfZW5xdWV1ZShzdHJ1Y3QgaWZxdWV1ZSAqaWZxLCBzdHJ1Y3QgbWJ1ZiAq bSkNCiB7DQotCWludCBhY3RpdmUgPSAwOw0KIA0KLQlJRl9MT0NLKGlmcSk7 DQogCWlmIChfSUZfUUZVTEwoaWZxKSkgew0KIAkJX0lGX0RST1AoaWZxKTsN Ci0JCUlGX1VOTE9DSyhpZnEpOw0KIAkJbV9mcmVlbShtKTsNCiAJCXJldHVy biAoMCk7DQogCX0NCi0JaWYgKGlmcCAhPSBOVUxMKSB7DQotCQlpZnAtPmlm X29ieXRlcyArPSBtLT5tX3BrdGhkci5sZW4gKyBhZGp1c3Q7DQotCQlpZiAo bS0+bV9mbGFncyAmIChNX0JDQVNUfE1fTUNBU1QpKQ0KLQkJCWlmcC0+aWZf b21jYXN0cysrOw0KLQkJYWN0aXZlID0gaWZwLT5pZl9kcnZfZmxhZ3MgJiBJ RkZfRFJWX09BQ1RJVkU7DQotCX0NCiAJX0lGX0VOUVVFVUUoaWZxLCBtKTsN Ci0JSUZfVU5MT0NLKGlmcSk7DQotCWlmIChpZnAgIT0gTlVMTCAmJiAhYWN0 aXZlKQ0KLQkJaWZfc3RhcnQoaWZwKTsNCiAJcmV0dXJuICgxKTsNCiB9DQog DQotLS0gLy9kZXBvdC92ZW5kb3IvZnJlZWJzZC9zcmMvc3lzL25ldC9pZl92 YXIuaAkyMDA2LzA2LzE5IDIyOjIxOjIyDQorKysgLy9kZXBvdC91c2VyL3J3 YXRzb24vaWZuZXQvc3JjL3N5cy9uZXQvaWZfdmFyLmgJMjAwNi8wNy8zMCAx MDoxMTo1NA0KQEAgLTE2Miw3ICsxNjIsOCBAQA0KIAkJKHN0cnVjdCBpZm5l dCAqLCBzdHJ1Y3Qgc29ja2FkZHIgKiosIHN0cnVjdCBzb2NrYWRkciAqKTsN CiAJc3RydWN0CWlmYWRkcgkqaWZfYWRkcjsJLyogcG9pbnRlciB0byBsaW5r LWxldmVsIGFkZHJlc3MgKi8NCiAJdm9pZAkqaWZfc3BhcmUyOwkJLyogc3Bh cmUgcG9pbnRlciAyICovDQotCXZvaWQJKmlmX3NwYXJlMzsJCS8qIHNwYXJl IHBvaW50ZXIgMyAqLw0KKwlpbnQJKCppZl9zdGFydG1idWYpCQkvKiBlbnF1 ZXVlIGFuZCBzdGFydCBvdXRwdXQgKi8NCisJCShzdHJ1Y3QgaWZuZXQgKiwg c3RydWN0IG1idWYgKik7DQogCWludAlpZl9kcnZfZmxhZ3M7CQkvKiBkcml2 ZXItbWFuYWdlZCBzdGF0dXMgZmxhZ3MgKi8NCiAJdV9pbnQJaWZfc3BhcmVf ZmxhZ3MyOwkvKiBzcGFyZSBmbGFncyAyICovDQogCXN0cnVjdCAgaWZhbHRx IGlmX3NuZDsJCS8qIG91dHB1dCBxdWV1ZSAoaW5jbHVkZXMgYWx0cSkgKi8N CkBAIC0zNzAsMTIgKzM3MSwxNSBAQA0KIAkJbXR4X3VubG9jaygmR2lhbnQp OwkJCQkJXA0KIH0gd2hpbGUgKDApDQogDQoraW50CWlmcV9oYW5kb2ZmKHN0 cnVjdCBpZm5ldCAqaWZwLCBzdHJ1Y3QgbWJ1ZiAqbSwgaW50IGFkanVzdCk7 DQogaW50CWlmX2hhbmRvZmYoc3RydWN0IGlmcXVldWUgKmlmcSwgc3RydWN0 IG1idWYgKm0sIHN0cnVjdCBpZm5ldCAqaWZwLA0KIAkgICAgaW50IGFkanVz 
dCk7DQoraW50CWlmX3N0YXJ0bWJ1Zl9lbnF1ZXVlKHN0cnVjdCBpZnF1ZXVl ICppZnEsIHN0cnVjdCBtYnVmICptKTsNCisNCisjZGVmaW5lCUlGX0hBTkRP RkZfQURKKGlmcSwgbSwgaWZwLCBhZGopCVwNCisJaWZfaGFuZG9mZigoc3Ry dWN0IGlmcXVldWUgKilpZnEsIG0sIGlmcCwgYWRqKQ0KICNkZWZpbmUJSUZf SEFORE9GRihpZnEsIG0sIGlmcCkJCQlcDQogCWlmX2hhbmRvZmYoKHN0cnVj dCBpZnF1ZXVlICopaWZxLCBtLCBpZnAsIDApDQotI2RlZmluZQlJRl9IQU5E T0ZGX0FESihpZnEsIG0sIGlmcCwgYWRqKQlcDQotCWlmX2hhbmRvZmYoKHN0 cnVjdCBpZnF1ZXVlICopaWZxLCBtLCBpZnAsIGFkaikNCiANCiB2b2lkCWlm X3N0YXJ0KHN0cnVjdCBpZm5ldCAqKTsNCiANCkBAIC00NTksMjUgKzQ2Myw4 IEBADQogI2RlZmluZQlJRlFfSU5DX0RST1BTKGlmcSkJCSgoaWZxKS0+aWZx X2Ryb3BzKyspDQogI2RlZmluZQlJRlFfU0VUX01BWExFTihpZnEsIGxlbikJ KChpZnEpLT5pZnFfbWF4bGVuID0gKGxlbikpDQogDQotLyoNCi0gKiBUaGUg SUZGX0RSVl9PQUNUSVZFIHRlc3Qgc2hvdWxkIHJlYWxseSBvY2N1ciBpbiB0 aGUgZGV2aWNlIGRyaXZlciwgbm90IGluDQotICogdGhlIGhhbmRvZmYgbG9n aWMsIGFzIHRoYXQgZmxhZyBpcyBsb2NrZWQgYnkgdGhlIGRldmljZSBkcml2 ZXIuDQotICovDQotI2RlZmluZQlJRlFfSEFORE9GRl9BREooaWZwLCBtLCBh ZGosIGVycikJCQkJXA0KLWRvIHsJCQkJCQkJCQlcDQotCWludCBsZW47CQkJ CQkJCVwNCi0Jc2hvcnQgbWZsYWdzOwkJCQkJCQlcDQotCQkJCQkJCQkJXA0K LQlsZW4gPSAobSktPm1fcGt0aGRyLmxlbjsJCQkJCVwNCi0JbWZsYWdzID0g KG0pLT5tX2ZsYWdzOwkJCQkJCVwNCi0JSUZRX0VOUVVFVUUoJihpZnApLT5p Zl9zbmQsIG0sIGVycik7CQkJCVwNCi0JaWYgKChlcnIpID09IDApIHsJCQkJ CQlcDQotCQkoaWZwKS0+aWZfb2J5dGVzICs9IGxlbiArIChhZGopOwkJCVwN Ci0JCWlmIChtZmxhZ3MgJiBNX01DQVNUKQkJCQkJXA0KLQkJCShpZnApLT5p Zl9vbWNhc3RzKys7CQkJCVwNCi0JCWlmICgoKGlmcCktPmlmX2Rydl9mbGFn cyAmIElGRl9EUlZfT0FDVElWRSkgPT0gMCkJXA0KLQkJCWlmX3N0YXJ0KGlm cCk7CQkJCQlcDQotCX0JCQkJCQkJCVwNCisjZGVmaW5lCUlGUV9IQU5ET0ZG X0FESihpZnAsIG0sIGFkaiwgZXJyKSBkbyB7CQkJCVwNCisJZXJyID0gaWZx X2hhbmRvZmYoaWZwLCBtLCBhZGopOwkJCQkJXA0KIH0gd2hpbGUgKDApDQog DQogI2RlZmluZQlJRlFfSEFORE9GRihpZnAsIG0sIGVycikJCQkJCVwNCg== --0-1695162780-1154268288=:16341-- From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 14:59:19 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org 
(mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 39CA816A4DD; Sun, 30 Jul 2006 14:59:19 +0000 (UTC) (envelope-from max@love2party.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.186]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7A51143D46; Sun, 30 Jul 2006 14:59:18 +0000 (GMT) (envelope-from max@love2party.net)
Received: from [88.64.179.108] (helo=amd64.laiers.local) by mrelayeu.kundenserver.de (node=mrelayeu0) with ESMTP (Nemesis), id 0MKwh2-1G7ClB1MeV-0001wZ; Sun, 30 Jul 2006 16:59:17 +0200
From: Max Laier
Organization: FreeBSD
To: freebsd-arch@freebsd.org
Date: Sun, 30 Jul 2006 16:59:10 +0200
User-Agent: KMail/1.9.3
References: <20060730141642.D16341@fledge.watson.org>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="nextPart2120229.eJNeJPqOEV"; protocol="application/pgp-signature"; micalg=pgp-sha1
Content-Transfer-Encoding: 7bit
Message-Id: <200607301659.16323.max@love2party.net>
X-Provags-ID: kundenserver.de abuse@kundenserver.de login:61c499deaeeba3ba5be80f48ecc83056
Cc: Robert Watson , freeebsd-net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-List-Received-Date: Sun, 30 Jul 2006 14:59:19 -0000

--nextPart2120229.eJNeJPqOEV
Content-Type: text/plain; charset="iso-8859-6"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

On Sunday 30 July 2006 16:04, Robert Watson wrote:
> One of the ideas that I, Scott Long, and a few others have
been bouncing
> around for some time is a restructuring of the network interface packet
> transmission API to reduce the number of locking operations and allow
> network device drivers increased control of the queueing behavior.
>
> [...]
>
> Conceptual review as well as benchmarking, etc, would be most welcome.

This begs the question: What about ALTQ?

If we maintain the fallback mechanism in _handoff, we can just add ALTQ_IS_ENABLED() to the test. Otherwise every driver's startmbuf function would have to take care of ALTQ itself, which is not preferable.

I strongly agree with your comment about how messed up ifq_*/if_* in if_var.h are - and I'm afraid that's partly my fault for bringing in ALTQ.
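
To sketch what adding ALTQ_IS_ENABLED() to the test might look like, here is a small self-contained C model of the decision only. It is an illustration under assumptions, not the patch: struct toy_ifnet, toy_use_startmbuf() and the altq_enabled field are hypothetical stand-ins (the field models the result of ALTQ_IS_ENABLED() on the interface send queue), and startmbuf_enabled models the net.startmbuf_enabled sysctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real struct ifnet or the ALTQ macros. */
struct toy_mbuf;

struct toy_ifnet {
	int (*if_startmbuf)(struct toy_ifnet *, struct toy_mbuf *);
	int altq_enabled;	/* models ALTQ_IS_ENABLED() on the send queue */
};

/* Models the net.startmbuf_enabled sysctl from the patch. */
static int startmbuf_enabled = 1;

/* A do-nothing driver entry point, only so the pointer is non-NULL. */
static int
toy_em_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m)
{
	(void)ifp;
	(void)m;
	return (0);
}

/*
 * Returns 1 if the handoff should call if_startmbuf directly, or 0 if it
 * should fall back to the enqueue + if_start path (where ALTQ still runs).
 */
static int
toy_use_startmbuf(const struct toy_ifnet *ifp)
{
	return (startmbuf_enabled && ifp != NULL &&
	    ifp->if_startmbuf != NULL && !ifp->altq_enabled);
}
```

With this shape, an interface that has ALTQ enabled keeps using the existing queueing path, and drivers that implement if_startmbuf need no ALTQ awareness of their own.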

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

--nextPart2120229.eJNeJPqOEV
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.4 (FreeBSD)

iD8DBQBEzMlEXyyEoT62BG0RAvsrAJ4v2m/yc+PHoUM+kPE0ZZUVknJbTgCfeJYN
uQVwRejml24OusLMlSIJV5A=
=OUxd
-----END PGP SIGNATURE-----

--nextPart2120229.eJNeJPqOEV--

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 15:25:48 2006
Return-Path:
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AF52A16A4DD for ; Sun, 30 Jul 2006 15:25:48 +0000 (UTC) (envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 590E843D45; Sun, 30 Jul 2006 15:25:48 +0000 (GMT) (envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id F078146BE7; Sun, 30 Jul 2006 11:25:47 -0400 (EDT)
Date: Sun, 30 Jul 2006 16:25:47 +0100 (BST)
From: Robert Watson
X-X-Sender: robert@fledge.watson.org
To: Max Laier
In-Reply-To: <200607301659.16323.max@love2party.net>
Message-ID: <20060730160933.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org> <200607301659.16323.max@love2party.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freeebsd-net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-List-Received-Date: Sun, 30 Jul 2006 15:25:48 -0000

On Sun, 30 Jul 2006, Max Laier wrote:

>> 
Conceptual review as well as benchmarking, etc, would be most welcome.
>
> This begs the question: What about ALTQ?
>
> If we maintain the fallback mechanism in _handoff, we can just add
> ALTQ_IS_ENABLED() to the test. Otherwise every driver's startmbuf function
> would have to take care of ALTQ itself, which is not preferable.

Maxime just asked me the same question, and I realized that I had, of course, forgotten to mention ALTQ. A few observations/questions:

- An underlying assumption of ALTQ is that queueing occurs in software. This turns out to be decreasingly true with modern network hardware, where the queueing occurs in a combination of software and hardware thanks to the descriptor ring. Has anyone compared the effectiveness of ALTQ on gig-e systems with the effectiveness on 10/100mbps systems? I would anticipate that it would have an effect on the latter, but little or no effect on the former.

- ALTQ actually also already does a bit of what I describe -- IFQ_DRV_DEQUEUE() and friends actually manage two ifnet queues, if_snd (the public queue), and if_drv_head, a second queue protected by the device driver lock. Because it bulk dequeues from one queue to the other, it already works to amortize if_snd locking, at the cost of maintaining two queues and substantially more complicated logic.

- One of the side effects of this change is that it does complicate device drivers, especially if they are going to rely on software queueing. In the example changes to if_em in my patch, the start path goes from a simple loop around the send queue to considering three cases: needing to queue, handling the optimized (empty) queue case, and needing to dequeue. Quite a bit of this logic will be common across device drivers and might be something we can partly abstract out. There are two things going on in my proposed change: ownership (and hence locking and interface) of a queue moves, and code moves. It could be that we transfer the locking (etc) and move less code.
  Thoughts on this would be welcome.

Notice that in the patch I do leave backwards-compatible support for if_start, if only because rewriting all network device drivers is error-prone and arduous. I assume this will be a temporary condition, but it could be that we leave it as a permanent one to support older devices that won't be updated (ISA ethernet cards, etc), where the existing queueing model works reasonably well. In an earlier iteration of this patch, I had em_startmbuf() call a utility routine in if.c to handle enqueueing followed by calling em_start(). This meant no optimized fast path, but less code change.

> I strongly agree with your comment about how messed up ifq_*/if_* in
> if_var.h are - and I'm afraid that's partly my fault for bringing in ALTQ.

Heh. I actually meant to remove that comment before posting the patch, so it's a bit more blunt than perhaps entirely wise. :-) A couple of times in the past I've attempted to work on a rename to basically swap the two naming schemes and clean things up, but usually I've stalled when it occurs to me just how big the task is. At this point, I think the right thing to do is make a decision about the semantics of the interface, and then walk the tree and clean up after we've simplified things.
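
For readers who want the three cases of the if_em start path mentioned earlier in this message in concrete form, here is a toy user-space sketch in C. It is not the patch: struct toy_softc, toy_encap(), toy_enqueue() and toy_startmbuf() are hypothetical names, packets are plain integers, the queue and ring sizes are arbitrary, and the driver mutex is omitted. It only aims to show the control flow of "needing to queue", the empty-queue fast path, and draining a backlog in order.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_QMAX 4	/* software queue capacity (arbitrary) */
#define TOY_RING 2	/* TX descriptor slots (arbitrary) */

/* Hypothetical driver state; packets are just integer ids here. */
struct toy_softc {
	int running;		/* models IFF_DRV_RUNNING */
	int oactive;		/* models IFF_DRV_OACTIVE */
	int queue[TOY_QMAX];	/* models the if_snd backlog */
	int qlen;
	int ring[TOY_RING];	/* models the TX descriptor ring */
	int ring_len;
};

/* Hand one packet to the "hardware"; nonzero means the ring is full. */
static int
toy_encap(struct toy_softc *sc, int pkt)
{
	if (sc->ring_len == TOY_RING)
		return (1);
	sc->ring[sc->ring_len++] = pkt;
	return (0);
}

/* Append to the software queue, or report ENOBUFS if it is full. */
static int
toy_enqueue(struct toy_softc *sc, int pkt)
{
	if (sc->qlen == TOY_QMAX)
		return (ENOBUFS);
	sc->queue[sc->qlen++] = pkt;
	return (0);
}

static int
toy_startmbuf(struct toy_softc *sc, int pkt)
{
	int error, i;

	/* Case 1: interface not running or already busy: just queue. */
	if (!sc->running || sc->oactive)
		return (toy_enqueue(sc, pkt));

	/* Case 2: no backlog: try to hand straight to the hardware. */
	if (sc->qlen == 0) {
		if (toy_encap(sc, pkt) != 0) {
			sc->oactive = 1;	/* ring full: queue instead */
			return (toy_enqueue(sc, pkt));
		}
		return (0);
	}

	/* Case 3: backlog: queue behind it, then drain in order. */
	error = toy_enqueue(sc, pkt);
	if (error != 0)
		return (error);
	while (sc->qlen > 0) {
		if (toy_encap(sc, sc->queue[0]) != 0) {
			sc->oactive = 1;	/* ring full: head stays queued */
			break;
		}
		for (i = 1; i < sc->qlen; i++)
			sc->queue[i - 1] = sc->queue[i];
		sc->qlen--;
	}
	return (0);
}
```

Cases 1 and 3 are the parts most likely to be shared between drivers, which is the kind of logic a common helper could absorb, leaving only the encapsulation step driver-specific.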

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 15:36:21 2006
Return-Path:
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id C589116A4DD; Sun, 30 Jul 2006 15:36:21 +0000 (UTC) (envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from a50.ironport.com (a50.ironport.com [63.251.108.112]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6FE0143D6E; Sun, 30 Jul 2006 15:36:21 +0000 (GMT) (envelope-from prvs=julian=3594eb8d2@elischer.org)
Received: from unknown (HELO [192.168.2.4]) ([10.251.60.32]) by a50.ironport.com with ESMTP; 30 Jul 2006 08:36:21 -0700
Message-ID: <44CCD1F4.1090902@elischer.org>
Date: Sun, 30 Jul 2006 08:36:20 -0700
From: Julian Elischer
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060414
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Robert Watson
References: <20060730141642.D16341@fledge.watson.org> <200607301659.16323.max@love2party.net> <20060730160933.D16341@fledge.watson.org>
In-Reply-To: <20060730160933.D16341@fledge.watson.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Max Laier , freeebsd-net@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
X-List-Received-Date: Sun, 30 Jul 2006 15:36:21 -0000

Robert Watson wrote:

> On Sun, 30 Jul 2006, Max Laier wrote:
>
>
>> I strongly agree with your comment about how messed up ifq_*/if_* in
>> if_var.h are - and I'm afraid that's partly my fault for bringing in
>> ALTQ.
>

If it becomes standard you'll have to think of another name..
it won't be "alt" any more.. :-) From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 18:36:18 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7902B16A4DA; Sun, 30 Jul 2006 18:36:18 +0000 (UTC) (envelope-from sam@errno.com) Received: from ebb.errno.com (ebb.errno.com [69.12.149.25]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0AAF143D46; Sun, 30 Jul 2006 18:36:17 +0000 (GMT) (envelope-from sam@errno.com) Received: from [10.0.0.199] ([10.0.0.199]) (authenticated bits=0) by ebb.errno.com (8.13.6/8.12.6) with ESMTP id k6UIaH7v011192 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 30 Jul 2006 11:36:17 -0700 (PDT) (envelope-from sam@errno.com) Message-ID: <44CCFC2C.20402@errno.com> Date: Sun, 30 Jul 2006 11:36:28 -0700 From: Sam Leffler Organization: Errno Consulting User-Agent: Thunderbird 1.5.0.5 (Macintosh/20060719) MIME-Version: 1.0 To: Robert Watson References: <20060730141642.D16341@fledge.watson.org> In-Reply-To: <20060730141642.D16341@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, net@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Jul 2006 18:36:18 -0000 Robert Watson wrote: > > 5BOne of the ideas that I, Scott Long, and a few others have been > bouncing around for some time is a restructuring of the network > interface packet transmission API to reduce the number of locking > operations and allow network device drivers increased control of the > queueing behavior. 
> Right now, it works something like the following:
>
> - When a network protocol wants to transmit, it calls the ifnet's link layer output routine via ifp->if_output() with the ifnet pointer, packet, destination address information, and route information.
>
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulates the packet as necessary, performs a link layer address translation (such as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which accepts the ifnet pointer and packet.
>
> - The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs to "start" output by the driver. If the driver is already active, it doesn't, and otherwise, it does.
>
> - The driver dequeues the packet from ifp->if_snd, performs any driver encapsulation and wrapping, and notifies the hardware. In modern hardware, this consists of hooking the data of the packet up to the descriptor ring and notifying the hardware to pick it up via DMA. In older hardware, the driver would perform a series of I/O operations to send the entire packet directly to the card via a system bus.
>
> Why change this? A few reasons:
>
> - The ifnet layer send queue is becoming decreasingly useful over time. Most modern hardware has a significant number of slots in its transmit descriptor ring, tuned for the performance of the hardware, etc, which is the effective transmit queue in practice. The additional queue depth doesn't increase throughput substantially (if at all) but does consume memory.
>
> - On extremely fast hardware (with respect to CPU speed), the queue remains essentially empty, so we pay the cost of enqueueing and dequeuing a packet from an empty queue.
>
> - The ifnet send queue is a separately locked object from the device driver, meaning that for a single enqueue/dequeue pair, we pay an extra four lock operations (two for insert, two for remove) per packet.
>
> - For synthetic link layer drivers, such as if_vlan, which have no need for queueing at all, the cost of queueing is eliminated.
>
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the driver, which helps eliminate a latent race condition involving use of the flag.
>
> The proposed change is simple: right now one or more enqueue operations occur, then a call to ifp->if_start() is made to notify the driver that it may need to do something (if the ACTIVE flag isn't set). In the new world order, the driver is directly passed the mbuf, and may then choose to queue it or otherwise handle it as it sees fit. The immediate practical benefit is clear: if the queueing at the ifnet layer is unnecessary, it is entirely avoided, skipping enqueue, dequeue, and four mutex operations. This applies immediately for VLAN processing, but also means that for modern gigabit cards, the hardware queue (which will be used anyway) is the only queue necessary.
>
> There are a few downsides, of course:
>
> - For older hardware without its own queueing, the queue is still required -- not only that, but we've now introduced an unconditional function pointer invocation, which on older hardware has a more significant relative cost than it has on more recent CPUs.
>
> - If drivers still require or use a queue, they must now synchronize access to the queue. The obvious choices are to use the ifq lock (and restore the above four lock operations), or to use the driver mutex (and risk higher contention). Right now, if the driver is busy (driver mutex held) then an enqueue is still possible, but with this change and a single mutex protecting the send queue and driver, that is no longer possible.
>

You're headed in the direction of linux where the handoff goes through a packet scheduling function before it hits the driver. This is equivalent to altq which, as Max pointed out, you didn't mention in this note. But it would be very good to move altq out of the compile-time macros with this.

I have a fair amount of experience with the linux model and it works ok. The main complication I've seen is that when a driver needs to process multiple queues of packets, things get more involved. This is seen in 802.11 drivers where there are two q's, one for data frames and one for management frames. With the current scheme you have two separate queues and the start method handles prioritization by polling the mgt q before the data q. If instead the packet is passed to the start method then it needs to be tagged in some way so that it's prioritized properly. Otherwise you end up with multiple start methods; one per type of packet. I suspect this will be ok but the end result will be that we'll need to add a priority field to mbufs (unless we pass it as an arg to the start method).

All this is certainly doable but I think just replacing one mechanism with the other (as you specified) is insufficient.

> Attached is a patch that maintains the current if_start, but adds if_startmbuf. If a device driver implements if_startmbuf and the global sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the driver will be used. Otherwise, if_start is used. I have modified the if_em driver to implement if_startmbuf also. If there is no packet backlog in the if_snd queue, it directly places the packet in the transmit descriptor ring. If there is a backlog, it uses the if_snd queue protected by driver mutex, rather than a separate ifq mutex.
>
> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte payload PPS on UP, and a 10% improvement on SMP. I saw a 1.7% performance improvement in the bulk serving of 1k files over HTTP.
> These are only micro-benchmarks, and reflect a configuration in which the CPU is unable to keep up with the output rate of the 1gbps ethernet card in the device, so reductions in host CPU usage are immediately visible in increased output as the CPU is able to better keep up with the network hardware. Other configurations are also of interest, especially ones in which the network device is unable to keep up with the CPU, resulting in more queueing.
>
> Conceptual review as well as benchmarking, etc, would be most welcome.

Why is the startmbuf knob global and not per-interface? Seems like you want to convert drivers one at a time?

FWIW the original model was driven by the expectation that you could raise the spl so the tx path was entirely synchronized from above. With the SMPng work we're synchronizing transfer through each control layer. If the driver softc lock (or similar) were exposed to upper layers we could possibly return to the "lock the tx path" model we had before and eliminate all the locking your changes target. But that would be a big layering violation and would add significant contention in the SMP case.

I think the key observation is that most network hardware today takes packets directly from private queues so the fast path needs to push things down to those queues w/ minimal overhead. This includes devices that implement QoS in h/w w/ multiple queues.
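
For reference, the two-queue arrangement described earlier in this message (a start method that polls the management queue before the data queue) can be modelled in a few lines of user-space C. This is only an illustrative sketch: struct toy_softc, pktq_put(), pktq_get() and toy_wlan_start() are made-up names, packets are reduced to integers, and no real 802.11 driver is being quoted.

```c
/* User-space model of "poll the mgt q before the data q".
 * All names here are hypothetical; packets are just ints. */
#include <assert.h>
#include <stddef.h>

#define TOY_QLEN 8

struct pktq {
	int pkts[TOY_QLEN];
	int head, count;
};

/* Append a packet; returns -1 if the queue is full. */
int
pktq_put(struct pktq *q, int pkt)
{
	if (q->count == TOY_QLEN)
		return (-1);
	q->pkts[(q->head + q->count) % TOY_QLEN] = pkt;
	q->count++;
	return (0);
}

/* Remove the oldest packet; returns -1 if the queue is empty. */
int
pktq_get(struct pktq *q, int *pkt)
{
	if (q->count == 0)
		return (-1);
	*pkt = q->pkts[q->head];
	q->head = (q->head + 1) % TOY_QLEN;
	q->count--;
	return (0);
}

struct toy_softc {
	struct pktq mgtq;		/* management frames */
	struct pktq dataq;		/* data frames */
	int txlog[2 * TOY_QLEN];	/* order in which frames were sent */
	int ntx;
};

/* Start method: drain both queues, always preferring management frames. */
void
toy_wlan_start(struct toy_softc *sc)
{
	int pkt;

	for (;;) {
		/* Only look at the data queue when the mgt queue is empty. */
		if (pktq_get(&sc->mgtq, &pkt) != 0 &&
		    pktq_get(&sc->dataq, &pkt) != 0)
			break;
		assert(sc->ntx < 2 * TOY_QLEN);
		sc->txlog[sc->ntx++] = pkt;
	}
}
```

In this model the priority decision lives entirely in the start method, which is why passing a single mbuf to the driver needs some other way (a tag, a header field, or an argument) to say which queue it belongs to.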
Sam From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 19:23:14 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3D6F516A4DA; Sun, 30 Jul 2006 19:23:14 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id D333043D45; Sun, 30 Jul 2006 19:23:13 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 783F646CCD; Sun, 30 Jul 2006 15:23:13 -0400 (EDT) Date: Sun, 30 Jul 2006 20:23:13 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Sam Leffler In-Reply-To: <44CCFC2C.20402@errno.com> Message-ID: <20060730200929.J16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, net@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Jul 2006 19:23:14 -0000 On Sun, 30 Jul 2006, Sam Leffler wrote: > I have a fair amount of experience with the linux model and it works ok. > The main complication I've seen is when a driver needs to process multiple > queues of packets things get more involved. This is seen in 802.11 drivers > where there are two q's, one for data frames and one for management frames. > With the current scheme you have two separate queues and the start method > handles prioritization by polling the mgt q before the data q. 
> If instead the packet is passed to the start method then it needs to be tagged in some way so that it's prioritized properly. Otherwise you end up with multiple start methods; one per type of packet. I suspect this will be ok but the end result will be that we'll need to add a priority field to mbufs (unless we pass it as an arg to the start method).
>
> All this is certainly doable but I think just replacing one mechanism with the other (as you specified) is insufficient.

Hmm. This is something that I had overlooked. I was loosely aware that the if_sl code made use of multiple queues, but was under the impression that the classification to queues occurred purely in the SLIP code. Indeed, it does, but structurally, SLIP is split over the link layer (if_output) and driver layer (if_start), which I had forgotten. I take it from your comments that 802.11 also does this, which I was not aware of.

I'm a little uncomfortable with our current m_tag model, as it requires significant numbers of additional allocations and frees for each packet, as well as walking linked lists. It's fine for occasional discretionary use (e.g., MAC labels), but I worry about cases where it is used with every packet, and we start seeing moderately non-zero numbers of tags on every packet. I think I would be more comfortable with an explicit queue identifier argument to if_start, where the link layer and driver layer agree on how to identify queues.

As a straw man, how would the following strike you:

    int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);

where for most link layers, the value would be zero, but for some link layer/driver combinations, it would identify a specific queue which the link layer believes the mbuf should be assigned, if implemented?

>> Attached is a patch that maintains the current if_start, but adds if_startmbuf.
If a device driver implements if_startmbuf and the global >> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the >> driver will be used. Otherwise, if_start is used. I have modified the >> if_em driver to implement if_startmbuf also. If there is no packet backlog >> in the if_snd queue, it directly places the packet in the transmit >> descriptor ring. If there is a backlog, it uses the if_snd queue protected >> by driver mutex, rather than a separate ifq mutex. >> >> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte >> paylod PPS on UP, and a 10% improvement on SMP. I saw a 1.7% performance >> improvement in the bulk serving of 1k files over HTTP. These are only >> micro-benchmarks, and reflect a configuration in which the CPU is unable to >> keep up with the output rate of the 1gbps ethernet card in the device, so >> reductions in host CPU usage are immediately visible in increased output as >> the CPU is able to better keep up with the network hardware. Other >> configurations are also of interest of interesting, especially ones in >> which the network device is unable to keep up with the CPU, resulting in >> more queueing. >> >> Conceptual review as well as banchmarking, etc, would be most welcome. > > Why is the startmbuf knob global and not per-interface? Seems like you want > to convert drivers one at a time? I may have under-described what I have implemented. The decision is currently made based on two factors: a global frob, and per-interface definition of if_startmbuf being non-zero. The global frob is intended to make it easy to benchmark the difference. I should modify the patch so that the global frob doesn't override the driver back to if_start in the event that if_startmbuf is defined and if_start isn't. 
The global frob is intended to be removed in the long run, and I intend for us to continue to support both the old and new start methods for the foreseeable future, since I don't intend to update every device driver we have to the new method, at least not personally :-).

> FWIW the original model was driven by the expectation that you could raise the spl so the tx path was entirely synchronized from above. With the SMPng work we're synchronizing transfer through each control layer. If the driver softc lock (or similar) were exposed to upper layers we could possibly return to the "lock the tx path" model we had before and eliminate all the locking your changes target. But that would be a big layering violation and would add significant contention in the SMP case.

In some ways, what I propose comes to much the same thing: the change I propose basically delegates the queueing and synchronization decisions to the device driver, which might choose either to use the lock already in the ifq, to use its own lock, or to use some other synchronization strategy. In the case of if_em, I've implemented bypass of software queueing entirely in the common case, but in the event that the hardware ring backs up, then we still fall back to the if_snd queue, only we lock it using the device driver's transmit path mutex. Delegating the synchronization down the stack comes with risks, as device driver writers will inevitably take liberties: on the other hand, it appears that devices are quite diverse, and those liberties have advantages.

> I think the key observation is that most network hardware today takes packets directly from private queues so the fast path needs to push things down to those queues w/ minimal overhead. This includes devices that implement QoS in h/w w/ multiple queues.

Yes -- however, you're right that the link layer needs to be able to pass more information down.
I'd like it to be able to do so without an m_tag allocation, though, which suggests (as you point out) an explicit argument to if_startmbuf. Thanks, Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Sun Jul 30 20:40:11 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9008316A4DF; Sun, 30 Jul 2006 20:40:11 +0000 (UTC) (envelope-from prvs=julian=3594eb8d2@elischer.org) Received: from a50.ironport.com (a50.ironport.com [63.251.108.112]) by mx1.FreeBSD.org (Postfix) with ESMTP id B4A9C43D55; Sun, 30 Jul 2006 20:40:10 +0000 (GMT) (envelope-from prvs=julian=3594eb8d2@elischer.org) Received: from unknown (HELO [192.168.2.4]) ([10.251.60.32]) by a50.ironport.com with ESMTP; 30 Jul 2006 13:40:09 -0700 Message-ID: <44CD1928.6000004@elischer.org> Date: Sun, 30 Jul 2006 13:40:08 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060414 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Robert Watson References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> <20060730200929.J16341@fledge.watson.org> In-Reply-To: <20060730200929.J16341@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: net@freebsd.org, arch@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Jul 2006 20:40:11 -0000 Robert Watson wrote: > On Sun, 30 Jul 2006, Sam Leffler wrote: > >> I have a fair amount of experience with the linux model and it works >> ok. 
>> The main complication I've seen is that when a driver needs to process multiple queues of packets, things get more involved. This is seen in 802.11 drivers where there are two q's, one for data frames and one for management frames. With the current scheme you have two separate queues and the start method handles prioritization by polling the mgt q before the data q. If instead the packet is passed to the start method then it needs to be tagged in some way so that it's prioritized properly. Otherwise you end up with multiple start methods; one per type of packet. I suspect this will be ok but the end result will be that we'll need to add a priority field to mbufs (unless we pass it as an arg to the start method).
>>

We have a priority tag in netgraph that we use to keep management frames on time in the frame relay code; it seems to work ok.

>> All this is certainly doable but I think just replacing one mechanism with the other (as you specified) is insufficient.
>

Linux did a big analysis of what was needed at the time they did most of their networking, and their buffer scheme (last I looked) had all sorts of fields for this and that. I wonder how it has held up over time?

>
> Hmm. This is something that I had overlooked. I was loosely aware that the if_sl code made use of multiple queues, but was under the impression that the classification to queues occurred purely in the SLIP code. Indeed, it does, but structurally, SLIP is split over the link layer (if_output) and driver layer (if_start), which I had forgotten. I take it from your comments that 802.11 also does this, which I was not aware of.
>
> I'm a little uncomfortable with our current m_tag model, as it requires significant numbers of additional allocations and frees for each packet, as well as walking linked lists.
It's fine for occasional > discretionary use (i.e., MAC labels), but I worry about cases where it > is used with every packet, and we start seeing moderately non-zero > numbers of tags on every packet. I think I would be more comfortable > with an explicit queue identifier argument to if_start, where the link > layer and driver layer agree on how to identify queues. It would certainly be possible to (for example) have 2 tags preallocated on each mbuf or something but it is hard to know in advance what will be needed. From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 00:59:59 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 470AB16A4E2; Mon, 31 Jul 2006 00:59:59 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mrout1-b.corp.dcn.yahoo.com (mrout1-b.corp.dcn.yahoo.com [216.109.112.27]) by mx1.FreeBSD.org (Postfix) with ESMTP id C898143D46; Mon, 31 Jul 2006 00:59:58 +0000 (GMT) (envelope-from gnn@neville-neil.com) Received: from minion.local.neville-neil.com (proxy8.corp.yahoo.com [216.145.48.13]) by mrout1-b.corp.dcn.yahoo.com (8.13.6/8.13.6/y.out) with ESMTP id k6V0xnto001240; Sun, 30 Jul 2006 17:59:50 -0700 (PDT) Date: Mon, 31 Jul 2006 09:59:47 +0900 Message-ID: From: gnn@FreeBSD.org To: Robert Watson In-Reply-To: <20060730141642.D16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.0.50 (i386-apple-darwin8.6.1) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 00:59:59 -0000 At Sun, 30 Jul 2006 15:04:48 +0100 (BST), rwatson wrote: > Conceptual review as well as banchmarking, etc, would be most welcome. > I remember talking about this at BSDCan and certainly for high end hardware it seems that it's the right way to go. Later, George From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 01:02:41 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7DB2316A4DA; Mon, 31 Jul 2006 01:02:41 +0000 (UTC) (envelope-from sam@errno.com) Received: from ebb.errno.com (ebb.errno.com [69.12.149.25]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8115843D49; Mon, 31 Jul 2006 01:02:40 +0000 (GMT) (envelope-from sam@errno.com) Received: from [10.0.0.199] ([10.0.0.199]) (authenticated bits=0) by ebb.errno.com (8.13.6/8.12.6) with ESMTP id k6V12dCh012518 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 30 Jul 2006 18:02:39 -0700 (PDT) (envelope-from sam@errno.com) Message-ID: <44CD56BB.6080405@errno.com> Date: Sun, 30 Jul 2006 18:02:51 -0700 From: Sam Leffler Organization: Errno Consulting User-Agent: Thunderbird 1.5.0.5 (Macintosh/20060719) MIME-Version: 1.0 To: Robert Watson References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> <20060730200929.J16341@fledge.watson.org> In-Reply-To: <20060730200929.J16341@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 
31 Jul 2006 01:02:41 -0000

Robert Watson wrote:
> On Sun, 30 Jul 2006, Sam Leffler wrote:
>
>> I have a fair amount of experience with the linux model and it works ok. The main complication I've seen is that when a driver needs to process multiple queues of packets, things get more involved. This is seen in 802.11 drivers where there are two q's, one for data frames and one for management frames. With the current scheme you have two separate queues and the start method handles prioritization by polling the mgt q before the data q. If instead the packet is passed to the start method then it needs to be tagged in some way so that it's prioritized properly. Otherwise you end up with multiple start methods; one per type of packet. I suspect this will be ok but the end result will be that we'll need to add a priority field to mbufs (unless we pass it as an arg to the start method).
>>
>> All this is certainly doable but I think just replacing one mechanism with the other (as you specified) is insufficient.
>
> Hmm. This is something that I had overlooked. I was loosely aware that the if_sl code made use of multiple queues, but was under the impression that the classification to queues occurred purely in the SLIP code. Indeed, it does, but structurally, SLIP is split over the link layer (if_output) and driver layer (if_start), which I had forgotten. I take it from your comments that 802.11 also does this, which I was not aware of.

There are several issues here but the basic one is, I believe, that we need to provide a per-packet notion of priority or TOS handling. The distinction between mgt frame and data in 802.11 drivers is a kludge; the right thing is to just use priority to get the desired effect. But separately 802.11 is aware of priority for WME, so independent of mgt frame priority we still need a way to pass down an AC (access category). For 802.11 I was able to do this by encoding the value in the mbuf flags.
If there were a field in the mbuf header this kludge could be removed. For other devices we still want a way to pass around the DiffServ bits or similar so things like vlan priority can be set w/o resorting to tagging each frame. Ideally prioritization work like what's done inside slip should be pulled out.

Note that just slapping a field in the mbuf is a start but we also need to think about how to handle it up+down the stack so layers can honor existing priority and/or fill in priority for packets that aren't already classified.

>
> I'm a little uncomfortable with our current m_tag model, as it requires significant numbers of additional allocations and frees for each packet, as well as walking linked lists. It's fine for occasional discretionary use (e.g., MAC labels), but I worry about cases where it is used with every packet, and we start seeing moderately non-zero numbers of tags on every packet. I think I would be more comfortable with an explicit queue identifier argument to if_start, where the link layer and driver layer agree on how to identify queues.
>
> As a straw man, how would the following strike you:
>
>     int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);
>
> where for most link layers, the value would be zero, but for some link layer/driver combinations, it would identify a specific queue which the link layer believes the mbuf should be assigned, if implemented?

mbuf tags are not a solution; too expensive. I think we need something in the mbuf header.

>
>>> Attached is a patch that maintains the current if_start, but adds if_startmbuf. If a device driver implements if_startmbuf and the global sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the driver will be used. Otherwise, if_start is used. I have modified the if_em driver to implement if_startmbuf also. If there is no packet backlog in the if_snd queue, it directly places the packet in the transmit descriptor ring.
If there is a >>> backlog, it uses the if_snd queue protected by driver mutex, rather >>> than a separate ifq mutex. >>> >>> In some basic local micro-benchmarks, I saw a 5% improvement in UDP >>> 0-byte paylod PPS on UP, and a 10% improvement on SMP. I saw a 1.7% >>> performance improvement in the bulk serving of 1k files over HTTP. >>> These are only micro-benchmarks, and reflect a configuration in which >>> the CPU is unable to keep up with the output rate of the 1gbps >>> ethernet card in the device, so reductions in host CPU usage are >>> immediately visible in increased output as the CPU is able to better >>> keep up with the network hardware. Other configurations are also of >>> interest of interesting, especially ones in which the network device >>> is unable to keep up with the CPU, resulting in more queueing. >>> >>> Conceptual review as well as banchmarking, etc, would be most welcome. >> >> Why is the startmbuf knob global and not per-interface? Seems like >> you want to convert drivers one at a time? > > I may have under-described what I have implemented. The decision is > currently made based on two factors: a global frob, and per-interface > definition of if_startmbuf being non-zero. The global frob is intended > to make it easy to benchmark the difference. I should modify the patch > so that the global frob doesn't override the driver back to if_start in > the event that if_startmbuf is defined and if_start isn't. The global > frob is intended to be removed in the long run, and I intend for us to > continue to support both the old and new start methods for the > forseeable future, since I don't intend to update every device driver we > have to the new method, at least not personally :-). > >> FWIW the original model was driven by the expectation that you could >> raise the spl so the tx path was entirely synchronized from above. >> With the SMPng work we're synchronizing transfer through each control >> layer. 
If the driver softc lock (or similar) were exposed to upper >> layers we could possibly return the "lock the tx path" model we had >> before and eliminate all the locking your changes target. But that >> would be a big layering violation and would add significant contention >> in the SMP case. > > In some ways, what I propose comes to much the same thing: the change I > propose basically delegates the queueing and synchronization decisions > to the device driver, which might choose either to use the lock already > in the ifq, to use its own lock, or to use some other synchronization > strategy. In the case of if_em, I've implemented bypass of software > queueing entirely in the common case, but in the event that the hardware > ring backs up, then we still fall back to the if_snd queue, only we lock > it using the device driver's transmit path mutex. Delegating the > synchronization down the stack comes with risks, as device driver > writers will inevitably take liberties: on the other hand, it appears > that devices are quite diverse, and those liberties have advantages. > >> I think the key observation is that most network hardware today takes >> packets directly from private queues so the fast path needs to push >> things down to those queues w/ minimal overhead. This includes >> devices that implement QoS in h/w w/ multiple queues. > > Yes -- however, you're right that the link layer needs to be able to > pass more information down. I'd like it to be able to do so without an > m_tag allocation, though, which suggests (as you point out) an explicit > argument to if_startmbuf. Or an addition to the mbuf header. 
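
To make the two ideas in this exchange concrete (the straw-man ifqid argument, and the if_em-style bypass of the software queue when there is no backlog), here is a small user-space model. It is a sketch under assumptions, not the if_em patch: every toy_* name, the ring and queue sizes, and the convention that a higher ifqid means higher priority are invented for illustration, and the driver transmit mutex is only indicated by comments.

```c
/* User-space model of a startmbuf entry point with a queue id argument
 * and a direct-to-ring fast path. All names and sizes are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define TOY_NQUEUES 2	/* assumed: 0 = default, 1 = higher priority */
#define TOY_RING    4	/* transmit descriptor ring slots */
#define TOY_QLEN    8	/* software queue depth */
#define TOY_LOG     32

struct toy_mbuf { int id; };

struct toy_swq {
	struct toy_mbuf *slot[TOY_QLEN];
	int head, count;
};

struct toy_softc {
	int ring_used;			/* descriptors currently in flight */
	struct toy_swq swq[TOY_NQUEUES];	/* fallback software queues */
	int hwlog[TOY_LOG];		/* ids in the order given to hardware */
	int nhw;
};

static int
toy_backlog(const struct toy_softc *sc)
{
	int i, n = 0;

	for (i = 0; i < TOY_NQUEUES; i++)
		n += sc->swq[i].count;
	return (n);
}

static void
toy_ring_put(struct toy_softc *sc, struct toy_mbuf *m)
{
	assert(sc->ring_used < TOY_RING && sc->nhw < TOY_LOG);
	sc->ring_used++;
	sc->hwlog[sc->nhw++] = m->id;
}

/* Returns 0 if the packet was accepted, -1 if it had to be dropped. */
int
toy_startmbuf(struct toy_softc *sc, struct toy_mbuf *m, int ifqid)
{
	struct toy_swq *q;

	assert(ifqid >= 0 && ifqid < TOY_NQUEUES);
	/* A real driver would hold its transmit mutex from here on. */
	if (toy_backlog(sc) == 0 && sc->ring_used < TOY_RING) {
		toy_ring_put(sc, m);	/* fast path: no software queue */
		return (0);
	}
	q = &sc->swq[ifqid];		/* slow path: queue chosen by ifqid */
	if (q->count == TOY_QLEN)
		return (-1);
	q->slot[(q->head + q->count) % TOY_QLEN] = m;
	q->count++;
	return (0);
}

/* Transmit completion: free ndone descriptors, refill from the backlog,
 * highest queue id first. */
void
toy_txeof(struct toy_softc *sc, int ndone)
{
	int i;

	sc->ring_used -= (ndone < sc->ring_used) ? ndone : sc->ring_used;
	for (i = TOY_NQUEUES - 1; i >= 0; i--) {
		struct toy_swq *q = &sc->swq[i];

		while (q->count > 0 && sc->ring_used < TOY_RING) {
			toy_ring_put(sc, q->slot[q->head]);
			q->head = (q->head + 1) % TOY_QLEN;
			q->count--;
		}
	}
}
```

An mbuf header field would carry the same information as ifqid; the model only shows that whichever mechanism is chosen, the driver consults it on the slow path alone, so the fast path stays free of queueing.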
Sam From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 08:24:46 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7A69916A4DA; Mon, 31 Jul 2006 08:24:46 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe02.swip.net [212.247.154.33]) by mx1.FreeBSD.org (Postfix) with ESMTP id 54F5643D45; Mon, 31 Jul 2006 08:24:44 +0000 (GMT) (envelope-from hselasky@c2i.net) X-T2-Posting-ID: gvlK0tOCzrqh9CPROFOFPw== X-Cloudmark-Score: 0.000000 [] Received: from [193.217.133.87] (HELO [10.0.0.249]) by mailfe02.swip.net (CommuniGate Pro SMTP 5.0.8) with ESMTP id 247590042; Mon, 31 Jul 2006 10:24:40 +0200 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Mon, 31 Jul 2006 10:24:48 +0200 User-Agent: KMail/1.7 References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> <20060730200929.J16341@fledge.watson.org> In-Reply-To: <20060730200929.J16341@fledge.watson.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200607311024.50537.hselasky@c2i.net> Cc: net@freebsd.org, Robert Watson , arch@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 08:24:46 -0000 On Sunday 30 July 2006 21:23, Robert Watson wrote: > On Sun, 30 Jul 2006, Sam Leffler wrote: Just a comment while the iron is hot: Maybe you can make the network model safe against detach. Currently I see that the processor can be stuck in routines like "if_start" after that "if_free()" has been called. This can be critical for USB ethernet devices. 
--HPS From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 09:53:51 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A4AAA16A4DE; Mon, 31 Jul 2006 09:53:51 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4E48B43D45; Mon, 31 Jul 2006 09:53:51 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 2B16A46B0F; Mon, 31 Jul 2006 05:53:39 -0400 (EDT) Date: Mon, 31 Jul 2006 10:53:39 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Hans Petter Selasky In-Reply-To: <200607311024.50537.hselasky@c2i.net> Message-ID: <20060731105045.X16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> <20060730200929.J16341@fledge.watson.org> <200607311024.50537.hselasky@c2i.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, net@freebsd.org, freebsd-arch@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 09:53:51 -0000 On Mon, 31 Jul 2006, Hans Petter Selasky wrote: > On Sunday 30 July 2006 21:23, Robert Watson wrote: >> On Sun, 30 Jul 2006, Sam Leffler wrote: > > Just a comment while the iron is hot: > > Maybe you can make the network model safe against detach. Currently I see > that the processor can be stuck in routines like "if_start" after that > "if_free()" has been called. This can be critical for USB ethernet devices. 
This is something to fix in the short or long term, but I think we should not try to fix everything at once as there is an awful lot to fix. There are really two stages to fixing the ifnet life cycle, which Brooks has been working on for some time. The first is to make it generally make sense -- he moved ifnet out of the softc, has been working to normalize things generally, (add dead ifnets), etc. The second is to add new types of reference and tear-down magic. What Solaris does here, FYI, is basically add a lock around entering the device driver via their mac layer in order to prevent it from "disappearing" while in use via the ifnet interface. I'm not sure if we want the same solution there or not, but it's worth thinking carefully about. We had (and have) similar problems in a number of other places, where races between consumers of an API and a detach of the provider cause problems.
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 10:00:29 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 02BBC16A4ED; Mon, 31 Jul 2006 10:00:29 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 921A843D5E; Mon, 31 Jul 2006 10:00:27 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 2083D46B0D; Mon, 31 Jul 2006 06:00:27 -0400 (EDT) Date: Mon, 31 Jul 2006 11:00:26 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Sam Leffler In-Reply-To: <44CD56BB.6080405@errno.com> Message-ID: <20060731105438.D16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> <20060730200929.J16341@fledge.watson.org> <44CD56BB.6080405@errno.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 10:00:29 -0000 On Sun, 30 Jul 2006, Sam Leffler wrote: >> I'm a little uncomfortable with our current m_tag model, as it requires >> significant numbers of additional allocations and frees for each packet, as >> well as walking link lists. 
It's fine for occasional discretionary use >> (i.e., MAC labels), but I worry about cases where it is used with every >> packet, and we start seeing moderately non-zero numbers of tags on every >> packet. I think I would be more comfortable with an explicit queue >> identifier argument to if_start, where the link layer and driver layer >> agree on how to identify queues. >> >> As a straw man, how would the following strike you: >> >> int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid); >> >> where for most link layers, the value would be zero, but for some link >> layer/driver combinations, it would identify a specific queue which the >> link layer believes the mbuf should be assigned, if implemented? > > mbuf tags are not a solution; too expensive. I think we need something in > the mbuf header. Agreed. I'm also quite unhappy that we have to use m_tags for VLAN tagging for identical reasons: it basically guarantees at least one extra memory allocation and free, possibly two, for each frame with encapsulation. This is one of the reasons I have been interested in reworking the ethernet link layer parts to increase integration of VLANs into the normal ethernet code, in order to avoid having to unnecessarily use expensive mbuf meta-data. What size field is needed in the mbuf pkthdr to capture all the necessary priority information between driver and link layer? An int? 
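
[Illustrative sketch, not part of the posted patch.] The straw-man if_startmbuf() prototype quoted above could be modelled in userland roughly as follows. Everything prefixed toy_ or TOY_ is a hypothetical stand-in rather than a kernel structure, and the choice to fall back to queue 0 for an unknown queue id is an assumption made only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NQUEUES 4		/* hypothetical number of hardware tx queues */

struct toy_mbuf {		/* toy stand-in for struct mbuf */
	struct toy_mbuf *m_next;
	int m_len;
};

struct toy_ifnet {		/* toy stand-in for struct ifnet */
	/* Straw-man method: packet plus a queue id agreed with the link layer. */
	int (*if_startmbuf)(struct toy_ifnet *, struct toy_mbuf *, int);
	int txq_depth[TOY_NQUEUES];	/* packets handed to each queue */
};

/*
 * Hypothetical driver method.  Most link layers would pass ifqid 0; an
 * id the driver does not implement falls back to queue 0 (assumption).
 */
static int
toy_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m, int ifqid)
{
	assert(ifp != NULL);
	if (m == NULL)
		return (EINVAL);
	if (ifqid < 0 || ifqid >= TOY_NQUEUES)
		ifqid = 0;
	ifp->txq_depth[ifqid]++;	/* "hand off" to the chosen queue */
	return (0);
}
```

The queue id travels as a plain argument here, so no per-packet tag allocation is involved; carrying it in the packet header instead, as suggested in the reply, would change only where the driver reads the value from.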
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 17:05:34 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 6B42816A4DA; Mon, 31 Jul 2006 17:05:34 +0000 (UTC) (envelope-from jdp@polstra.com) Received: from blake.polstra.com (blake.polstra.com [64.81.189.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id E801843D4C; Mon, 31 Jul 2006 17:05:33 +0000 (GMT) (envelope-from jdp@polstra.com) Received: from strings.polstra.com (strings.polstra.com [64.81.189.67]) by blake.polstra.com (8.13.6/8.13.6) with ESMTP id k6VH5XKG038776; Mon, 31 Jul 2006 10:05:33 -0700 (PDT) (envelope-from jdp@polstra.com) Message-ID: X-Mailer: XFMail 1.5.5 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <20060730141642.D16341@fledge.watson.org> Date: Mon, 31 Jul 2006 10:05:33 -0700 (PDT) From: John Polstra To: Robert Watson Cc: arch@freebsd.org, net@freebsd.org Subject: RE: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 17:05:34 -0000 > Attached is a patch that maintains the current if_start, but adds > if_startmbuf. If a device driver implements if_startmbuf and the global > sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the > driver will be used. Otherwise, if_start is used. I have modified the if_em > driver to implement if_startmbuf also. If there is no packet backlog in the > if_snd queue, it directly places the packet in the transmit descriptor ring. 
> If there is a backlog, it uses the if_snd queue protected by driver mutex, > rather than a separate ifq mutex. I question whether you need a fallback software if_snd queue at all for modern devices such as the Intel and Broadcom gigabit chips. The hardware transmit descriptor rings typically have sizes of the order of 256 descriptors. I think if the ring fills up, you could simply drop the packet with ENOBUFS. That's what happens if the if_snd queue fills up, and its maximum size is comparable to the sizes of modern descriptor rings. It would simplify things quite a bit to eliminate the if_snd queue entirely for such devices. In any case, I'm glad you're looking at making this change. I think it's the right thing to do. John From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 17:08:30 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9482316A4DE; Mon, 31 Jul 2006 17:08:30 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id B4FA943D5C; Mon, 31 Jul 2006 17:08:27 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id A8C8346B9B; Mon, 31 Jul 2006 13:08:24 -0400 (EDT) Date: Mon, 31 Jul 2006 18:08:24 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: John Polstra In-Reply-To: Message-ID: <20060731180643.E71432@fledge.watson.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, net@freebsd.org Subject: RE: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 17:08:30 -0000 On Mon, 31 Jul 2006, John Polstra wrote: >> Attached is a patch that maintains the current if_start, but adds >> if_startmbuf. If a device driver implements if_startmbuf and the global >> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the >> driver will be used. Otherwise, if_start is used. I have modified the >> if_em driver to implement if_startmbuf also. If there is no packet backlog >> in the if_snd queue, it directly places the packet in the transmit >> descriptor ring. If there is a backlog, it uses the if_snd queue protected >> by driver mutex, rather than a separate ifq mutex. > > I question whether you need a fallback software if_snd queue at all for > modern devices such as the Intel and Broadcom gigabit chips. The hardware > transmit descriptor rings typically have sizes of the order of 256 > descriptors. I think if the ring fills up, you could simply drop the packet > with ENOBUFS. That's what happens if the if_snd queue fills up, and its > maximum size is comparable to the sizes of modern descriptor rings. It > would simplify things quite a bit to eliminate the if_snd queue entirely for > such devices. I tend to agree, but implemented full queueing support for if_em to make sure I understood the complexity implications of completely removing queueing from the ifnet side dispatch. I guess an interesting question for us is how we decide what the right threshold is to implement software queuing. Do any if_em cards need software queueing, or do they all have adequate in-hardware queues as is? Entirely cutting the queue code would significantly simplify em_startmbuf.
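
[Illustrative sketch, not the if_em patch.] The behaviour described above (bypass the software queue when there is no backlog, fall back to a backlog guarded by the driver's own mutex when the ring is full) can be modelled in userland roughly as below. The toy_ names, the tiny ring and backlog sizes, and the use of a pthread mutex in place of a kernel mutex are all assumptions for the example; packets are represented only by counters.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define TOY_RING_SLOTS	4	/* hypothetical; real rings hold hundreds */
#define TOY_SNDQ_MAX	2	/* hypothetical software backlog limit */

struct toy_softc {
	pthread_mutex_t tx_mtx;	/* the single driver transmit lock */
	int ring_used;		/* descriptors currently owned by hardware */
	int sndq_len;		/* packets waiting in the software backlog */
};

static void
toy_attach(struct toy_softc *sc)
{
	pthread_mutex_init(&sc->tx_mtx, NULL);
	sc->ring_used = 0;
	sc->sndq_len = 0;
}

/* Transmit one packet: 0 on success, ENOBUFS if it had to be dropped. */
static int
toy_startmbuf(struct toy_softc *sc)
{
	int error = 0;

	pthread_mutex_lock(&sc->tx_mtx);
	if (sc->sndq_len == 0 && sc->ring_used < TOY_RING_SLOTS)
		sc->ring_used++;	/* common case: straight to the ring */
	else if (sc->sndq_len < TOY_SNDQ_MAX)
		sc->sndq_len++;		/* backlog, under the same lock */
	else
		error = ENOBUFS;	/* ring and backlog both full */
	pthread_mutex_unlock(&sc->tx_mtx);
	return (error);
}

/* Transmit-complete: hardware released n descriptors; refill from backlog. */
static void
toy_txeof(struct toy_softc *sc, int n)
{
	pthread_mutex_lock(&sc->tx_mtx);
	assert(n >= 0 && n <= sc->ring_used);
	sc->ring_used -= n;
	while (sc->sndq_len > 0 && sc->ring_used < TOY_RING_SLOTS) {
		sc->sndq_len--;
		sc->ring_used++;
	}
	pthread_mutex_unlock(&sc->tx_mtx);
}
```

While a backlog exists, new packets join it even if a ring slot has just become free, which keeps packets in order. Dropping the software queue entirely, as suggested in the quoted reply, would amount to deleting the middle branch and the refill loop.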
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 17:22:03 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 14FD316A4DD; Mon, 31 Jul 2006 17:22:03 +0000 (UTC) (envelope-from pete@he.iki.fi) Received: from silver.he.iki.fi (helenius.fi [193.64.42.241]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8BB0343D68; Mon, 31 Jul 2006 17:21:57 +0000 (GMT) (envelope-from pete@he.iki.fi) Received: from localhost (localhost [127.0.0.1]) by silver.he.iki.fi (Postfix) with ESMTP id 8F85DBBFB; Mon, 31 Jul 2006 20:21:53 +0300 (EEST) Received: from silver.he.iki.fi ([127.0.0.1]) by localhost (silver.he.iki.fi [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 1ZGb9qUh8C8P; Mon, 31 Jul 2006 20:21:50 +0300 (EEST) Received: from [IPv6:2001:670:84:0:2410:b116:d67f:84b] (unknown [IPv6:2001:670:84:0:2410:b116:d67f:84b]) by silver.he.iki.fi (Postfix) with ESMTP; Mon, 31 Jul 2006 20:21:50 +0300 (EEST) Message-ID: <44CE3C2E.80007@he.iki.fi> Date: Mon, 31 Jul 2006 20:21:50 +0300 From: Petri Helenius User-Agent: Thunderbird 1.5.0.5 (Windows/20060719) MIME-Version: 1.0 To: Robert Watson References: <20060731180643.E71432@fledge.watson.org> In-Reply-To: <20060731180643.E71432@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, net@freebsd.org, John Polstra Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 17:22:03 -0000 Robert Watson wrote: > > I tend to agree, but implemented full queueing support for if_em to > make sure I understood to complexity 
implications of completely > removing queueing from the ifnet side dispatch. I guess an > interesting question for us is how we decide what the right threshold > is to implement software queuing. Do any if_em cards need software > queueing, or do they all have adequate in-hardware queues as is? > Entirely cutting the queue code would significantly simplify > em_startmbuf. Actually most em cards support 4096 descriptors each way. Pete From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 17:34:55 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5914116A4DD; Mon, 31 Jul 2006 17:34:55 +0000 (UTC) (envelope-from jdp@polstra.com) Received: from blake.polstra.com (blake.polstra.com [64.81.189.66]) by mx1.FreeBSD.org (Postfix) with ESMTP id 70D6643D46; Mon, 31 Jul 2006 17:34:54 +0000 (GMT) (envelope-from jdp@polstra.com) Received: from strings.polstra.com (strings.polstra.com [64.81.189.67]) by blake.polstra.com (8.13.6/8.13.6) with ESMTP id k6VHYrdi039191; Mon, 31 Jul 2006 10:34:53 -0700 (PDT) (envelope-from jdp@polstra.com) Message-ID: X-Mailer: XFMail 1.5.5 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <44CE3C2E.80007@he.iki.fi> Date: Mon, 31 Jul 2006 10:34:53 -0700 (PDT) From: John Polstra To: Petri Helenius Cc: arch@freebsd.org, net@freebsd.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 17:34:55 -0000 On 31-Jul-2006 Petri Helenius wrote: > Robert Watson wrote: >> >> I tend to agree, but implemented full queueing support for if_em to >> make sure I understood to 
complexity implications of completely >> removing queueing from the ifnet side dispatch. I guess an >> interesting question for us is how we decide what the right threshold >> is to implement software queuing. Do any if_em cards need software >> queueing, or do they all have adequate in-hardware queues as is? >> Entirely cutting the queue code would significantly simplify >> em_startmbuf. > Actually most em cards support 4096 descriptors each way. Yes, even the earliest ones supported 4096 descriptors on paper. In practice, the early chips had bugs that required the entire descriptor ring to fit in a single page of memory. That limited them to 4096/16 = 256 transmit descriptors on x86 hardware at the time. That chip bug was fixed a long time ago, though, and in any case 256 transmit descriptors is a lot for most applications. John From owner-freebsd-arch@FreeBSD.ORG Mon Jul 31 19:09:25 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 344DC16A4E2; Mon, 31 Jul 2006 19:09:25 +0000 (UTC) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (gate.funkthat.com [69.17.45.168]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8A27443D60; Mon, 31 Jul 2006 19:09:24 +0000 (GMT) (envelope-from jmg@hydrogen.funkthat.com) Received: from hydrogen.funkthat.com (pq0v1uefe3wdhse6@localhost.funkthat.com [127.0.0.1]) by hydrogen.funkthat.com (8.13.6/8.13.3) with ESMTP id k6VJ9OGn098109; Mon, 31 Jul 2006 12:09:24 -0700 (PDT) (envelope-from jmg@hydrogen.funkthat.com) Received: (from jmg@localhost) by hydrogen.funkthat.com (8.13.6/8.13.3/Submit) id k6VJ9MbM098108; Mon, 31 Jul 2006 12:09:22 -0700 (PDT) (envelope-from jmg) Date: Mon, 31 Jul 2006 12:09:22 -0700 From: John-Mark Gurney To: Robert Watson Message-ID: <20060731190922.GJ96589@funkthat.com> Mail-Followup-To: Robert Watson , John Polstra , arch@freebsd.org, 
net@freebsd.org References: <20060731180643.E71432@fledge.watson.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20060731180643.E71432@fledge.watson.org> User-Agent: Mutt/1.4.2.1i X-Operating-System: FreeBSD 5.4-RELEASE-p6 i386 X-PGP-Fingerprint: B7 EC EF F8 AE ED A7 31 96 7A 22 B3 D8 56 36 F4 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html Cc: arch@FreeBSD.org, net@FreeBSD.org, John Polstra Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: John-Mark Gurney List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Jul 2006 19:09:25 -0000 Robert Watson wrote this message on Mon, Jul 31, 2006 at 18:08 +0100: > >I question whether you need a fallback software if_snd queue at all for > >modern devices such as the Intel and Broadcom gigabit chips. The hardware > >transmit descriptor rings typically have sizes of the order of 256 > >descriptors. I think if the ring fills up, you could simply drop the > >packet with ENOBUFS. That's what happens if the if_snd queue fills up, > >and its maximum size is comparable to the sizes of modern descriptor > >rings. It would simplify things quite a bit to eliminate the if_snd queue > >entirely for such devices. > > I tend to agree, but implemented full queueing support for if_em to make > sure I understood to complexity implications of completely removing > queueing from the ifnet side dispatch. I guess an interesting question for > us is how we decide what the right threshold is to implement software > queuing. Do any if_em cards need software queueing, or do they all have > adequate in-hardware queues as is? Entirely cutting the queue code would > significantly simplify em_startmbuf. 
This work tends to lead to a generic ethernet card framework that I've been thinking about.. where instead of cards doing all the handling of a ring buffer, the card registers a few functions to manipulate a ring buffer (if it has one), and does the necessary work... Though encoding all the different style of ring buffers may be interesting, per packet instead of per segment (if_re)... The other part is to digest the current monolithic lock structure that the ethernet cards have, into three (or four) different locks, tx head, tx tail, rx head & tail... -- John-Mark Gurney Voice: +1 415 225 5579 "All that I will do, has been done, All that I have, has not." From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 12:21:55 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5D4B716A4DD; Tue, 1 Aug 2006 12:21:55 +0000 (UTC) (envelope-from gallatin@cs.duke.edu) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id DCE3C43D69; Tue, 1 Aug 2006 12:21:54 +0000 (GMT) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71CLnCd024908 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 1 Aug 2006 08:21:49 -0400 (EDT) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71CLixZ060991; Tue, 1 Aug 2006 08:21:44 -0400 (EDT) (envelope-from gallatin) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17615.18264.172863.892776@grasshopper.cs.duke.edu> Date: Tue, 1 Aug 2006 08:21:44 -0400 (EDT) To: Robert Watson In-Reply-To: <20060731105045.X16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <44CCFC2C.20402@errno.com> 
<20060730200929.J16341@fledge.watson.org> <200607311024.50537.hselasky@c2i.net> <20060731105045.X16341@fledge.watson.org> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Cc: arch@FreeBSD.org, net@FreeBSD.org, freebsd-arch@FreeBSD.org, Hans Petter Selasky Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 12:21:55 -0000 Robert Watson writes: > tear-down magic. What Solaris does here, FYI, is basically add a lock around > entering the device driver via their mac layer in order to prevent it from > "disappearing" while in use via the ifnet interface. I'm not sure if we want At least for GLDv2, this is a reader-writer lock. The transmit and receive paths take a read lock on the device's macinfo (like ifnet) struct, and the detach code takes a write lock. The Solaris driver model does not serialize transmits (or receives), as one might think from reading the above. 
Drew From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 12:30:49 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id BB52F16A4E0; Tue, 1 Aug 2006 12:30:49 +0000 (UTC) (envelope-from gallatin@cs.duke.edu) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3AC9C43D73; Tue, 1 Aug 2006 12:30:43 +0000 (GMT) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71CUgBl026750 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 1 Aug 2006 08:30:42 -0400 (EDT) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71CUXn6061010; Tue, 1 Aug 2006 08:30:33 -0400 (EDT) (envelope-from gallatin) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17615.18793.700752.342809@grasshopper.cs.duke.edu> Date: Tue, 1 Aug 2006 08:30:33 -0400 (EDT) To: Robert Watson In-Reply-To: <20060730141642.D16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 12:30:49 -0000 Robert Watson writes: > > One of the ideas that I, Scott Long, and a few others have been bouncing > around for some time is a restructuring of the network interface packet > transmission API to reduce the number of locking operations and allow network > device drivers increased control of the queueing behavior. Right now, it <....> > - The ifnet send queue is a separately locked object from the device driver, > meaning that for a single enqueue/dequeue pair, we pay an extra four lock > operations (two for insert, two for remove) per packet. > Going forward, especially now that we support sun4v CoolThreads hardware, we're going to want to rethink the "single lock" per transmit routine model that most drivers have. The most expensive operation in transmit routines is bus_dmamap_load_mbuf_sg(), especially when there is an IOMMU involved (like on CoolThreads machines) and there is no reason why this needs to be called with a driver's transmit lock held. I have hard data (from Solaris) about how much fine grained locking in a 10GbE driver's transmit routine helps. 
Drew From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 12:41:07 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3A14F16A4DD; Tue, 1 Aug 2006 12:41:07 +0000 (UTC) (envelope-from gallatin@cs.duke.edu) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id AE87943D46; Tue, 1 Aug 2006 12:41:06 +0000 (GMT) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71Cf5ll028998 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 1 Aug 2006 08:41:05 -0400 (EDT) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71Cf0NH061024; Tue, 1 Aug 2006 08:41:00 -0400 (EDT) (envelope-from gallatin) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17615.19420.172545.986872@grasshopper.cs.duke.edu> Date: Tue, 1 Aug 2006 08:41:00 -0400 (EDT) To: Robert Watson In-Reply-To: <20060730141642.D16341@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 12:41:07 -0000 Robert Watson writes: > The immediate practical benefit is > clear: if the queueing at the ifnet layer is unnecessary, it is entirely > avoided, skipping enqueue, dequeue, and four mutex operations. 
This is indeed nice, but for TCP I think the benefit would be far greater if somebody would PLEASE, PLEASE, PLEASE implement TSO (aka LSO). Consider a 1460 byte mss and 64KB of data that is ready to be sent. With the current model, that is 45 separate calls to if_output(), and 45*4 (queuing) + 45 (tx routine) == 225 mutex operations. Using your model, we're down to 45 mutex operations. Using TSO, we have 4 + 1 == 5 mutex operations with the old model, and 1 with your model. This is not even considering all the other overhead involved in 45 transmits vs TSO... Just something to think about.. Drew From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 13:25:30 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E16EA16A4E2; Tue, 1 Aug 2006 13:25:30 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8B85A43D64; Tue, 1 Aug 2006 13:25:30 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id E087846C7B; Tue, 1 Aug 2006 09:25:30 -0400 (EDT) Date: Tue, 1 Aug 2006 14:25:30 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Andrew Gallatin In-Reply-To: <17615.19420.172545.986872@grasshopper.cs.duke.edu> Message-ID: <20060801142056.C64452@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <17615.19420.172545.986872@grasshopper.cs.duke.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 13:25:31 -0000 On Tue, 1 Aug 2006, Andrew Gallatin wrote: > Robert Watson writes: > > > The immediate practical benefit is clear: if the queueing at the ifnet > > layer is unnecessary, it is entirely avoided, skipping enqueue, dequeue, > > and four mutex operations. > > This is indeed nice, but for TCP I think the benefit would be far greater if > somebody would PLEASE, PLEASE, PLEASE implement TSO (aka LSO). > > Consider a 1460 byte mss and 64KB of data that is ready to be sent. With the > current model, that is 45 separate calls to if_output(), and 45*4 (queuing) > + 45 (tx routine) == 225 mutex operations. > > Using your model, we're down to 45 mutex operations. > > Using TSO, we have 4 + 1 == 5 mutex operations with the old model, and 1 > with the your model. > > This is not even considering all the other overhead involved in 45 transmits > vs TSO... Jack Vogel at Intel has previously talked about having TSO patches for FreeBSD to use with if_em, but was running into stability/correctness problems on 7.x. I e-mailed him a few minutes ago to ask to take a look at the patches. Since I've not yet seen them, I don't know the details of how they work -- I assume that some amount of tweaking is required not just to TCP so that it passes down larger segments, but so that larger segments are used in the right ways, and in keeping with the facilities offered by the underlying interface, especially before a routing decision has been made. 
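The mutex-operation counts quoted in this exchange can be reproduced with a few lines of arithmetic. The following is only a sketch under the assumptions stated in the thread: 4 lock operations per packet for the ifnet queue (enqueue plus dequeue) and 1 for the driver's transmit routine. The function names are made up for illustration and are not part of any FreeBSD API.

```python
def segments(payload_bytes, mss):
    """Number of MSS-sized segments needed to carry payload_bytes (ceiling division)."""
    return (payload_bytes + mss - 1) // mss


def mutex_ops(sends, queue_ops_per_send, driver_ops_per_send=1):
    """Total lock operations for a given number of handoffs to the driver."""
    return sends * (queue_ops_per_send + driver_ops_per_send)


payload = 64 * 1024   # 64KB of data ready to be sent
mss = 1460            # bytes per TCP segment

n = segments(payload, mss)             # 45 separate transmits without TSO

current_model = mutex_ops(n, 4)        # ifnet queue + driver lock: 225
direct_handoff = mutex_ops(n, 0)       # no ifnet queue: 45
tso_current = mutex_ops(1, 4)          # one large send, queue kept: 5
tso_direct = mutex_ops(1, 0)           # one large send, no queue: 1

print(n, current_model, direct_handoff, tso_current, tso_direct)
```

The two savings multiply: dropping the ifnet queue removes the per-send factor of 4, while TSO removes the factor of 45.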
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 13:27:56 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A138716A4DD; Tue, 1 Aug 2006 13:27:56 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4522243D8C; Tue, 1 Aug 2006 13:27:49 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id CE2D246BB0; Tue, 1 Aug 2006 09:27:49 -0400 (EDT) Date: Tue, 1 Aug 2006 14:27:49 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Andrew Gallatin In-Reply-To: <17615.18793.700752.342809@grasshopper.cs.duke.edu> Message-ID: <20060801142558.M64452@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <17615.18793.700752.342809@grasshopper.cs.duke.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 13:27:56 -0000 On Tue, 1 Aug 2006, Andrew Gallatin wrote: > > - The ifnet send queue is a separately locked object from the device driver, > > meaning that for a single enqueue/dequeue pair, we pay an extra four lock > > operations (two for insert, two for remove) per packet. > > Going forward, especially now that we support sun4v CoolThreads hardware, > we're going to want to rethink the "single lock" per transmit routine model > that most drivers have. 
The most expensive operation in transmit routines > is bus_dmamap_load_mbuf_sg(), especially when there is an IOMMU involved > (like on CoolThreads machines) and there is no reason why this needs to be > called with a driver's transmit lock held. I have hard data (from Solaris) > about how much fine grained locking in a 10GbE driver's transmit routine > helps. Right now, with the exception of locking for the ifnet dispatch queue, I believe our ifnet API pretty much leaves decisions about the nature and granularity of synchronization to the device driver author. The ifnet queue is high on my list to address (hence this thread) -- are there any other parts of our device driver framework that stand in the way from a device driver being modified to support greater parallelism in sending? Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 13:37:24 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 526AE16A504; Tue, 1 Aug 2006 13:37:24 +0000 (UTC) (envelope-from gallatin@cs.duke.edu) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id D976243D4C; Tue, 1 Aug 2006 13:37:23 +0000 (GMT) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71DbNX7011165 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 1 Aug 2006 09:37:23 -0400 (EDT) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71DbI1c061075; Tue, 1 Aug 2006 09:37:18 -0400 (EDT) (envelope-from gallatin) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17615.22798.24602.160771@grasshopper.cs.duke.edu> Date: 
Tue, 1 Aug 2006 09:37:18 -0400 (EDT) To: Robert Watson In-Reply-To: <20060801142056.C64452@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <17615.19420.172545.986872@grasshopper.cs.duke.edu> <20060801142056.C64452@fledge.watson.org> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 13:37:24 -0000 Robert Watson writes: > > Jack Vogel at Intel has previously talked about having TSO patches for FreeBSD > to use with if_em, but was running into stability/correctness problems on 7.x. > I e-mailed him a few minutes ago to ask to take a look at the patches. Since > I've not yet seen them, I don't know the details of how they work -- I assume > that some amount of tweaking is required not just to TCP so that it passes > down larger segments, but so that larger segments are used in the right ways, > and in keeping with the facilities offered by the underlying interface, > especially before a routing decision has been made. I'm not sure about what stack changes are required for TSO. I do know that NetBSD has had TSO for roughly 2 years, so they might be a good place to look.. 
Drew From owner-freebsd-arch@FreeBSD.ORG Tue Aug 1 13:48:18 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A811016A4E2; Tue, 1 Aug 2006 13:48:18 +0000 (UTC) (envelope-from gallatin@cs.duke.edu) Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2D11143DA2; Tue, 1 Aug 2006 13:47:59 +0000 (GMT) (envelope-from gallatin@cs.duke.edu) Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30]) by duke.cs.duke.edu (8.13.6/8.13.6) with ESMTP id k71Dlxbp012757 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 1 Aug 2006 09:47:59 -0400 (EDT) Received: (from gallatin@localhost) by grasshopper.cs.duke.edu (8.12.9p2/8.12.9/Submit) id k71Dlr6g061084; Tue, 1 Aug 2006 09:47:53 -0400 (EDT) (envelope-from gallatin) From: Andrew Gallatin MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17615.23433.918293.466584@grasshopper.cs.duke.edu> Date: Tue, 1 Aug 2006 09:47:53 -0400 (EDT) To: Robert Watson In-Reply-To: <20060801142558.M64452@fledge.watson.org> References: <20060730141642.D16341@fledge.watson.org> <17615.18793.700752.342809@grasshopper.cs.duke.edu> <20060801142558.M64452@fledge.watson.org> X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Cc: arch@FreeBSD.org, net@FreeBSD.org Subject: Re: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Aug 2006 13:48:18 -0000 Robert Watson writes: > > On Tue, 1 Aug 2006, Andrew Gallatin wrote: > > > > - The ifnet send queue is a separately locked object from the device driver, > > 
> meaning that for a single enqueue/dequeue pair, we pay an extra four lock > > > operations (two for insert, two for remove) per packet. > > > > Going forward, especially now that we support sun4v CoolThreads hardware, > > we're going to want to rethink the "single lock" per transmit routine model > > that most drivers have. The most expensive operation in transmit routines > > is bus_dmamap_load_mbuf_sg(), especially when there is an IOMMU involved > > (like on CoolThreads machines) and there is no reason why this needs to be > > called with a driver's transmit lock held. I have hard data (from Solaris) > > about how much fine grained locking in a 10GbE driver's transmit routine > > helps. > > Right now, with the exception of locking for the ifnet dispatch queue, I > believe our ifnet API pretty much leaves decisions about the nature and > granularity of synchronization to the device driver author. The ifnet queue > is high on my list to address (hence this thread) -- are there any other parts > of our device driver framework that stand in the way from a device driver > being modified to support greater parallelism in sending? No, nothing that is directly related to Ethernet drivers. However, busdma is a pain. Specifically, I hate that bus_dmamap_load_mbuf_sg() requires a bus_dmamap_t. That means that any fine-grained driver will need to "allocate" a bus_dmamap_t either via bus_dmamap_create(), or by pulling a pre-allocated bus_dmamap_t from a pre-allocated pool. Either will require a lock. Solaris has a similar problem, and I use the pool approach in my Solaris driver. Linux's pci_map_single()/pci_unmap_addr_set()/pci_unmap_len_set() is just so much nicer to use... 
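The pool approach mentioned above can be modelled in user space. This is a conceptual sketch only, written in Python rather than kernel C: MapPool, TxRing and their methods are hypothetical names, the integer handles merely stand in for pre-allocated bus_dmamap_t objects, and the bytes() copy stands in for the expensive mapping step. The point it illustrates is that the pool needs its own short-lived lock, while the costly mapping work runs without the ring lock held.

```python
import threading
from collections import deque


class MapPool:
    """Hypothetical pool of pre-allocated map handles, guarded by its own lock."""

    def __init__(self, count):
        self._free = deque(range(count))
        self._lock = threading.Lock()

    def get(self):
        # Short critical section: only the free list is touched.
        with self._lock:
            return self._free.popleft() if self._free else None

    def put(self, handle):
        with self._lock:
            self._free.append(handle)


class TxRing:
    """Hypothetical transmit ring; the ring lock covers only ring updates."""

    def __init__(self, slots, pool):
        self._lock = threading.Lock()
        self._slots = slots
        self._ring = deque()
        self._pool = pool

    def transmit(self, packet):
        handle = self._pool.get()
        if handle is None:
            return False                      # no map available, caller must drop or retry
        mapped = (handle, bytes(packet))      # stand-in for the expensive mapping step,
                                              # performed WITHOUT the ring lock held
        with self._lock:
            if len(self._ring) < self._slots:
                self._ring.append(mapped)
                return True
        self._pool.put(handle)                # ring full: return the handle to the pool
        return False

    def complete_one(self):
        """Model a transmit completion: free the slot and recycle the handle."""
        with self._lock:
            if not self._ring:
                return None
            handle, data = self._ring.popleft()
        self._pool.put(handle)
        return data


pool = MapPool(2)
ring = TxRing(4, pool)
sent = [ring.transmit(b"a"), ring.transmit(b"b"), ring.transmit(b"c")]
```

In this model the third transmit fails because the two-entry pool is exhausted, not because the ring is full; a completion returns a handle and lets a later transmit succeed.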
Drew From owner-freebsd-arch@FreeBSD.ORG Wed Aug 2 10:01:26 2006 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8421E16A4E2; Wed, 2 Aug 2006 10:01:26 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6F66243D5A; Wed, 2 Aug 2006 10:01:20 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout1.pacific.net.au (Postfix) with ESMTP id 31436328233; Wed, 2 Aug 2006 20:01:19 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k72A1GmC004061; Wed, 2 Aug 2006 20:01:17 +1000 Date: Wed, 2 Aug 2006 20:01:16 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: John Polstra In-Reply-To: Message-ID: <20060802184349.K90387@delplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Robert Watson , net@FreeBSD.org Subject: RE: Changes in the network interface queueing handoff model X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Aug 2006 10:01:26 -0000 On Mon, 31 Jul 2006, John Polstra wrote: > I question whether you need a fallback software if_snd queue at all > for modern devices such as the Intel and Broadcom gigabit chips. The > hardware transmit descriptor rings typically have sizes of the order > of 256 descriptors. I think if the ring fills up, you could simply > drop the packet with ENOBUFS. 
That's what happens if the if_snd queue > fills up, and its maximum size is comparable to the sizes of modern > descriptor rings. It would simplify things quite a bit to eliminate > the if_snd queue entirely for such devices. I use an if_snd queue length of about 5000 in my version of the sk driver to work around suckage in ENOBUFS handling. The hardware (*) tx ring size is 512, and tiny packets can be sent in 4 usec, so the hardware queue provides only 2 msec worth of buffering. select(2) for output on sockets doesn't work right, so there is no good way (**) for applications to proceed when a syscall returns ENOBUFS. An extra queue length of 5000 provides an extra 20 msec worth of buffering which is usually enough when HZ = 100. (*) I think the sk tx ring is not really in hardware, so it can be much larger than 512, but a length of > 5000 for it seems excessive and caused panics when I tried it. (**) Various bad ways can be found in various versions of ttcp and tools/netrate. They involve either backing off by sleeping (which doesn't keep the tx active unless the sleep granularity is small (which only happens under FreeBSD if HZ is too large)), or by never backing off (which gives busy-waiting). Instead, select() on the output socket should actually work -- it should succeed if the tx queue length is below a low watermark. Apparently, select() on output sockets normally doesn't work, since no version of ttcp that I've looked at (not many) even tries this. Bruce
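The buffering figures in this message follow from simple arithmetic, sketched below. The sketch takes the numbers given in the message (a 512-entry tx ring, roughly 4 usec per tiny packet, an if_snd queue length of about 5000, HZ = 100); the helper names are invented for illustration.

```python
def buffering_msec(slots, usec_per_packet):
    """Time a full queue of tiny packets takes to drain, in milliseconds."""
    return slots * usec_per_packet / 1000.0


def tick_msec(hz):
    """Length of one scheduler tick in milliseconds."""
    return 1000.0 / hz


ring_ms = buffering_msec(512, 4)     # tx ring alone: about 2 msec
queue_ms = buffering_msec(5000, 4)   # if_snd queue of about 5000: 20 msec
tick_ms = tick_msec(100)             # HZ = 100: 10 msec per tick

print(ring_ms, queue_ms, tick_ms)
```

With a 10 msec tick, the ring alone drains long before a sleeping sender can be woken, whereas about 20 msec of software queue covers two ticks.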