Date:      Sun, 30 Jul 2006 16:59:10 +0200
From:      Max Laier <max@love2party.net>
To:        freebsd-arch@freebsd.org
Cc:        Robert Watson <rwatson@freebsd.org>, freebsd-net@freebsd.org
Subject:   Re: Changes in the network interface queueing handoff model
Message-ID:  <200607301659.16323.max@love2party.net>
In-Reply-To: <20060730141642.D16341@fledge.watson.org>
References:  <20060730141642.D16341@fledge.watson.org>

On Sunday 30 July 2006 16:04, Robert Watson wrote:
> One of the ideas that I, Scott Long, and a few others have been bouncing
> around for some time is a restructuring of the network interface packet
> transmission API to reduce the number of locking operations and allow
> network device drivers increased control of the queueing behavior.  Right
> now, it works something like the following:
>
> - When a network protocol wants to transmit, it calls the ifnet's link
> layer output routine via ifp->if_output() with the ifnet pointer, packet,
> destination address information, and route information.
>
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
>    the packet as necessary, performs a link layer address translation (such
> as ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(),
> which accepts the ifnet pointer and packet.
>
> - The ifnet layer enqueues the packet in the ifnet send queue
> (ifp->if_snd), and then looks at the driver's IFF_DRV_OACTIVE flag to
> determine if it needs to "start" output by the driver.  If the driver is
> already active, it doesn't, and otherwise, it does.
>
> - The driver dequeues the packet from ifp->if_snd, performs any driver
>    encapsulation and wrapping, and notifies the hardware.  In modern
> hardware, this consists of hooking the data of the packet up to the
> descriptor ring and notifying the hardware to pick it up via DMA.  In older
> hardware, the driver would perform a series of I/O operations to send the
> entire packet directly to the card via a system bus.
>
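The path described above amounts to roughly the following simplified
sketch (not the literal IFQ_HANDOFF() expansion from if_var.h; ALTQ
branches and drop accounting are left out, and the helper name is made
up):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * Simplified sketch of today's handoff: enqueue on ifp->if_snd, then
 * kick the driver unless it is already active.  Illustrative only.
 */
static int
classic_handoff_sketch(struct ifnet *ifp, struct mbuf *m)
{
        int error = 0;

        /* Enqueue on the ifnet send queue; this takes and drops the
         * ifq mutex once for the insert. */
        IFQ_ENQUEUE(&ifp->if_snd, m, error);
        if (error != 0)
                return (error);

        /* Start the driver only if it is not already transmitting;
         * if_start will dequeue from if_snd under its own locking. */
        if ((ifp->if_drv_flags & IFF_DRV_OACTIVE) == 0)
                (*ifp->if_start)(ifp);
        return (0);
}

The matching dequeue in the driver's if_start routine costs another
lock/unlock of the ifq mutex, which is where the four lock operations
per packet mentioned below come from.
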
> Why change this?  A few reasons:
>
> - The ifnet layer send queue is becoming decreasingly useful over time.
> Most modern hardware has a significant number of slots in its transmit
> descriptor ring, tuned for the performance of the hardware, etc, which is
> the effective transmit queue in practice.  The additional queue depth
> doesn't increase throughput substantially (if at all) but does consume
> memory.
>
> - On extremely fast hardware (with respect to CPU speed), the queue remains
>    essentially empty, so we pay the cost of enqueueing and dequeuing a
> packet from an empty queue.
>
> - The ifnet send queue is a separately locked object from the device
> driver, meaning that for a single enqueue/dequeue pair, we pay an extra
> four lock operations (two for insert, two for remove) per packet.
>
> - For synthetic link layer drivers, such as if_vlan, which have no need for
>    queueing at all, the cost of queueing is eliminated.
>
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
>    driver, which helps eliminate a latent race condition involving use of
> the flag.
>
> The proposed change is simple: right now, one or more enqueue operations
> occur, and then a call to ifp->if_start() is made to notify the driver that
> it may need to do something (if the ACTIVE flag isn't set).  In the new world
> order, the driver is directly passed the mbuf, and may then choose to queue
> it or otherwise handle it as it sees fit.  The immediate practical benefit
> is clear: if the queueing at the ifnet layer is unnecessary, it is entirely
> avoided, skipping enqueue, dequeue, and four mutex operations.  This
> applies immediately for VLAN processing, but also means that for modern
> gigabit cards, the hardware queue (which will be used anyway) is the only
> queue necessary.
>
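A minimal sketch of the link-layer side of that handoff, assuming the
patch gives if_startmbuf an (ifnet, mbuf) signature that returns an
errno value (the helper name and exact fallback are illustrative, not
taken from the patch; headers as in the earlier sketch):

static int
direct_handoff_sketch(struct ifnet *ifp, struct mbuf *m)
{
        int error = 0;

        if (ifp->if_startmbuf != NULL) {
                /* New path: hand the mbuf straight to the driver, which
                 * decides whether to queue it, encapsulate it, or drop it. */
                return ((*ifp->if_startmbuf)(ifp, m));
        }

        /* Unconverted driver: classic enqueue plus conditional if_start. */
        IFQ_HANDOFF(ifp, m, error);
        return (error);
}
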
> There are a few downsides, of course:
>
> - For older hardware without its own queueing, the queue is still required
> -- not only that, but we've now introduced an unconditional function
> pointer invocation, which, on older hardware, has a more significant
> relative cost than it does on more recent CPUs.
>
> - If drivers still require or use a queue, they must now synchronize access
> to the queue.  The obvious choices are to use the ifq lock (and restore the
> above four lock operations), or to use the driver mutex (and risk higher
> contention).  Right now, if the driver is busy (driver mutex held) then an
> enqueue is still possible, but with this change and a single mutex
> protecting the send queue and driver, that is no longer possible.
>
> Attached is a patch that maintains the current if_start, but adds
> if_startmbuf.  If a device driver implements if_startmbuf and the global
> sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the
> driver will be used.  Otherwise, if_start is used.  I have modified the
> if_em driver to implement if_startmbuf also.  If there is no packet backlog
> in the if_snd queue, it directly places the packet in the transmit
> descriptor ring. If there is a backlog, it uses the if_snd queue protected
> by driver mutex, rather than a separate ifq mutex.
>
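To picture the driver side, an if_startmbuf implementation along the
lines described above might be structured roughly like this; struct
foo_softc, foo_encap() and the locking layout are hypothetical
stand-ins, not code from the patch or from if_em, and ring-full
handling is omitted:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_softc {
        struct ifnet    *ifp;
        struct mtx       mtx;          /* single driver mutex */
        /* ... transmit descriptor ring state ... */
};

static int foo_encap(struct foo_softc *, struct mbuf *);  /* ring insert */

static int
foo_startmbuf(struct ifnet *ifp, struct mbuf *m)
{
        struct foo_softc *sc = ifp->if_softc;
        int error = 0;

        mtx_lock(&sc->mtx);
        if (IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
                /* No backlog: place the packet on the TX ring directly. */
                error = foo_encap(sc, m);
        } else {
                /* Backlog: fall back to if_snd, but protected by the
                 * driver mutex instead of a separate ifq mutex (the
                 * backlog is drained from the TX completion path,
                 * not shown). */
                _IF_ENQUEUE(&ifp->if_snd, m);
        }
        mtx_unlock(&sc->mtx);
        return (error);
}
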
> In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte
> payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance
> improvement in the bulk serving of 1k files over HTTP.  These are only
> micro-benchmarks, and reflect a configuration in which the CPU is unable to
> keep up with the output rate of the 1gbps ethernet card in the device, so
> reductions in host CPU usage are immediately visible in increased output as
> the CPU is able to better keep up with the network hardware.  Other
> configurations are also of interest, especially ones in
> which the network device is unable to keep up with the CPU, resulting in
> more queueing.
>
> Conceptual review as well as benchmarking, etc., would be most welcome.

This begs the question: What about ALTQ?

If we maintain the fallback mechanism in _handoff, we can just add
ALTQ_IS_ENABLED() to the test.  Otherwise every driver's startmbuf function
would have to take care of ALTQ itself, which is not preferable.
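
Concretely, the test could read something like the fragment below,
where startmbuf_enabled stands in for however the net.startmbuf_enabled
sysctl is exposed; the ALTQ_IS_ENABLED() clause is the only part I am
suggesting to add:

        /* Inside the link-layer handoff, before taking the direct path: */
        if (ifp->if_startmbuf != NULL && startmbuf_enabled &&
            !ALTQ_IS_ENABLED(&ifp->if_snd)) {
                /* ALTQ not active on this interface: hand the mbuf
                 * straight to the driver. */
                error = (*ifp->if_startmbuf)(ifp, m);
        } else {
                /* ALTQ enabled (or driver not converted): keep the
                 * classic if_snd enqueue so packet scheduling works. */
                IFQ_HANDOFF(ifp, m, error);
        }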

I strongly agree with your comment about how messed up ifq_*/if_* in if_var.h
are - and I'm afraid that's partly my fault for bringing in ALTQ.

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News
