Date: Sun, 30 Jul 2006 16:59:10 +0200
From: Max Laier <max@love2party.net>
To: freebsd-arch@freebsd.org
Cc: Robert Watson <rwatson@freebsd.org>, freebsd-net@freebsd.org
Subject: Re: Changes in the network interface queueing handoff model
Message-ID: <200607301659.16323.max@love2party.net>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
References: <20060730141642.D16341@fledge.watson.org>
On Sunday 30 July 2006 16:04, Robert Watson wrote:
> One of the ideas that I, Scott Long, and a few others have been bouncing
> around for some time is a restructuring of the network interface packet
> transmission API to reduce the number of locking operations and allow
> network device drivers increased control of the queueing behavior.  Right
> now, it works something like the following:
>
> - When a network protocol wants to transmit, it calls the ifnet's link
>   layer output routine via ifp->if_output() with the ifnet pointer, packet,
>   destination address information, and route information.
>
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
>   the packet as necessary, performs a link layer address translation (such
>   as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(),
>   which accepts the ifnet pointer and packet.
>
> - The ifnet layer enqueues the packet in the ifnet send queue
>   (ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to
>   determine whether it needs to "start" output by the driver.  If the
>   driver is already active, it doesn't; otherwise, it does.
>
> - The driver dequeues the packet from ifp->if_snd, performs any driver
>   encapsulation and wrapping, and notifies the hardware.  In modern
>   hardware, this consists of hooking the packet's data up to the
>   descriptor ring and notifying the hardware to pick it up via DMA.  In
>   older hardware, the driver would perform a series of I/O operations to
>   send the entire packet directly to the card via a system bus.
>
> Why change this?  A few reasons:
>
> - The ifnet layer send queue is becoming less useful over time.  Most
>   modern hardware has a significant number of slots in its transmit
>   descriptor ring, tuned for the performance of the hardware, etc., which
>   is the effective transmit queue in practice.  The additional queue depth
>   doesn't increase throughput substantially (if at all) but does consume
>   memory.
>
> - On extremely fast hardware (with respect to CPU speed), the queue remains
>   essentially empty, so we pay the cost of enqueueing and dequeueing a
>   packet from an empty queue.
>
> - The ifnet send queue is a separately locked object from the device
>   driver, meaning that for a single enqueue/dequeue pair, we pay an extra
>   four lock operations (two for insert, two for remove) per packet.
>
> - For synthetic link layer drivers, such as if_vlan, which have no need for
>   queueing at all, the cost of queueing can be eliminated entirely.
>
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
>   driver, which helps eliminate a latent race condition involving use of
>   the flag.
>
> The proposed change is simple: right now one or more enqueue operations
> occur, then a call to ifp->if_start() is made to notify the driver that it
> may need to do something (if the OACTIVE flag isn't set).  In the new world
> order, the driver is directly passed the mbuf, and may then choose to queue
> it or otherwise handle it as it sees fit.  The immediate practical benefit
> is clear: if the queueing at the ifnet layer is unnecessary, it is entirely
> avoided, skipping enqueue, dequeue, and four mutex operations.  This
> applies immediately to VLAN processing, but also means that for modern
> gigabit cards, the hardware queue (which will be used anyway) is the only
> queue necessary.
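For anyone reading along without the patch in front of them, here is roughly
what a driver-side startmbuf method might look like as I read the description
above -- just a sketch, not Robert's patch.  The xx_* names, the softc
layout, and the (ifp, m) signature of if_startmbuf are assumptions for
illustration only:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/* Hypothetical driver softc; only the fields used below are shown. */
struct xx_softc {
	struct mtx	sc_mtx;		/* driver lock */
	int		sc_tx_free;	/* free TX descriptor slots */
};

/* Provided elsewhere by the (hypothetical) driver: program the packet
 * onto the hardware transmit descriptor ring. */
static int	xx_encap(struct xx_softc *, struct mbuf *);

/*
 * The link layer hands us the mbuf directly; we either place it on the
 * descriptor ring or keep a software backlog protected by the driver
 * mutex -- no separate ifq lock involved.
 */
static int
xx_startmbuf(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	int error = 0;

	mtx_lock(&sc->sc_mtx);
	if (ifp->if_snd.ifq_len == 0 && sc->sc_tx_free > 0) {
		/* Fast path: no backlog and room on the ring. */
		error = xx_encap(sc, m);
	} else {
		/* Slow path: queue under the driver mutex we already
		 * hold; the backlog is drained from the TX completion
		 * interrupt as usual. */
		_IF_ENQUEUE(&ifp->if_snd, m);
	}
	mtx_unlock(&sc->sc_mtx);
	return (error);
}
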
> There are a few downsides, of course:
>
> - For older hardware without its own queueing, the queue is still required
>   -- not only that, but we've now introduced an unconditional function
>   pointer invocation, which on older hardware has a more significant
>   relative cost than it does on more recent CPUs.
>
> - If drivers still require or use a queue, they must now synchronize access
>   to the queue.  The obvious choices are to use the ifq lock (and restore
>   the above four lock operations), or to use the driver mutex (and risk
>   higher contention).  Right now, if the driver is busy (driver mutex held)
>   then an enqueue is still possible, but with this change and a single
>   mutex protecting the send queue and driver, that is no longer possible.
>
> Attached is a patch that maintains the current if_start, but adds
> if_startmbuf.  If a device driver implements if_startmbuf and the global
> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the
> driver will be used.  Otherwise, if_start is used.  I have modified the
> if_em driver to implement if_startmbuf also.  If there is no packet backlog
> in the if_snd queue, it places the packet directly in the transmit
> descriptor ring.  If there is a backlog, it uses the if_snd queue protected
> by the driver mutex, rather than a separate ifq mutex.
>
> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte
> payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance
> improvement in the bulk serving of 1k files over HTTP.  These are only
> micro-benchmarks, and reflect a configuration in which the CPU is unable to
> keep up with the output rate of the 1gbps ethernet card, so reductions in
> host CPU usage are immediately visible as increased output, since the CPU
> is better able to keep up with the network hardware.  Other configurations
> are also of interest, especially ones in which the network device is unable
> to keep up with the CPU, resulting in more queueing.
>
> Conceptual review as well as benchmarking, etc., would be most welcome.

This raises the question: what about ALTQ?

If we maintain the fallback mechanism in _handoff, we can just add
ALTQ_IS_ENABLED() to the test (a rough sketch of what I mean is at the end of
this mail).  Otherwise every driver's startmbuf function would have to take
care of ALTQ itself, which is undesirable.

I strongly agree with your comment about how messed up the ifq_*/if_* macros
in if_var.h are - and I'm afraid that's partly my fault for bringing in ALTQ.

-- 
/"\  Best regards,                      |  mlaier@freebsd.org
\ /  Max Laier                          |  ICQ #67774661
 X   http://pf4freebsd.love2party.net/  |  mlaier@EFnet
/ \  ASCII Ribbon Campaign              |  Against HTML Mail and News
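P.S.: To make the ALTQ_IS_ENABLED() point concrete, the fallback test in the
handoff could look roughly like the sketch below.  It assumes the
if_startmbuf member and a startmbuf_enabled variable backing the
net.startmbuf_enabled sysctl from Robert's patch, neither of which is in the
current tree; this is untested and only meant to show where the ALTQ check
would go.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/* Backs the net.startmbuf_enabled sysctl in the proposed patch
 * (name assumed here). */
extern int startmbuf_enabled;

/*
 * ALTQ-aware handoff: take the direct if_startmbuf() path only when the
 * driver provides it, the sysctl enables it, and ALTQ is not active on
 * the interface.  Otherwise fall back to the classic IFQ_HANDOFF()
 * enqueue + if_start() path, so that ALTQ still sees and classifies
 * every packet.
 */
static int
if_handoff_mbuf(struct ifnet *ifp, struct mbuf *m)
{
	int error = 0;

	if (startmbuf_enabled && ifp->if_startmbuf != NULL &&
	    !ALTQ_IS_ENABLED(&ifp->if_snd))
		return ((*ifp->if_startmbuf)(ifp, m));

	/* Classic path: ifq lock, enqueue, unlock, then if_start() if
	 * the driver is not already marked OACTIVE. */
	IFQ_HANDOFF(ifp, m, error);
	return (error);
}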