Date:      Mon, 06 Apr 2009 00:47:49 +0200
From:      Ivan Voras <ivoras@freebsd.org>
To:        freebsd-net@freebsd.org
Subject:   Re: Advice on a multithreaded netisr  patch?
Message-ID:  <grbcfg$poe$1@ger.gmane.org>
In-Reply-To: <alpine.BSF.2.00.0904052243250.34905@fledge.watson.org>
References:  <gra7mq$ei8$1@ger.gmane.org>	<alpine.BSF.2.00.0904051422280.12639@fledge.watson.org>	<grac1s$p56$1@ger.gmane.org>	<alpine.BSF.2.00.0904051440460.12639@fledge.watson.org>	<grappq$tsg$1@ger.gmane.org> <alpine.BSF.2.00.0904052243250.34905@fledge.watson.org>


Thanks for the ideas, I will try some of them. But I'd also like a few
more clarifications:

Robert Watson wrote:
> On Sun, 5 Apr 2009, Ivan Voras wrote:

>> I'd like to understand more. If (in netisr) I have an mbuf with
>> headers, is this data already transferred from the card or is it
>> magically "not here yet"?
>
> A lot depends on the details of the card and driver.  The driver will
> take cache misses on the descriptor ring entry, if it's not already in
> cache, and the link layer will take a cache miss on the front of the
> ethernet frame in the cluster pointed to by the mbuf header as part of
> its demux.  What happens next depends on your dispatch model and cache
> line size.  Let's make a few simplifying assumptions that are mostly true:

So, an mbuf can reference data not yet copied from the NIC hardware? I'm
specifically trying to understand what m_pullup() does.
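
(For context, my current reading, as a sketch: m_pullup() doesn't pull
anything from the hardware - the DMA has completed by the time the
driver hands the chain up - it only rearranges the mbuf chain so the
first len bytes are contiguous in the first mbuf. handle_pkt() below is
just a stand-in name:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    static void
    handle_pkt(struct mbuf *m)
    {
            struct ip *ip;

            /* Ensure the IP header is contiguous in the first mbuf. */
            if (m->m_len < sizeof(struct ip)) {
                    m = m_pullup(m, sizeof(struct ip));
                    if (m == NULL)
                            return;         /* m_pullup() freed the chain */
            }
            ip = mtod(m, struct ip *);
            (void)ip;                       /* ... demux on ip->ip_p ... */
    }

Is that right?)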

>> As the card and the OS can already process many packets per second for
>> something fairly complex like routing
>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100% of
>> a core, isn't this an indication that there's certainly more room for
>> improvement even with single-queue, old-fashioned NICs?
>
> Maybe.  It depends on the relative costs of local processing vs
> redistributing the work, which involves schedulers, IPIs, additional
> cache misses, lock contention, and so on.  This means there's a period
> where it can't possibly be a win, and then at some point it's a win as
> long as the stack scales.  This is essentially the usual trade-off in
> using threads and parallelism: does the benefit of multiple parallel
> execution units make up for the overheads of synchronization and data
> migration?

Do you have any idea at all why I'm seeing this weird difference between
netstat packets per second (250,000) and my application's TCP
performance (< 1,000 pps)? Summary: each packet is guaranteed to be a
whole message causing a transaction in the application - without the
changes I see pps almost identical to tps. Even if the source of the
netstat statistics somehow manages to count packets multiple times (I
don't see how that could happen), no relationship can explain a
difference this huge. It almost looks like something in the upper layers
is discarding packets (also not likely: TCP timeouts would occur and the
application wouldn't be able to push 250,000 pps) - but what? Where
should I look?
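
(For the record, the places I know to look so far - standard netstat
invocations, nothing exotic:

    netstat -w 1 -I em0    # per-second packet and error rates on the NIC
    netstat -s -p tcp      # retransmits, drops, out-of-order segments
    netstat -s -p ip       # fragments and "packets dropped" counters

Pointers to anything more specific are welcome.)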

> FYI, the localhost case is a bit weird -- I think we have some
> scheduling issues that are causing loopback netisr stuff to be
> pessimally scheduled. Here are some suggestions for things to try and
> see if they help, though:
>
> - Comment out all ifnet, IP, and TCP global statistics in your local
>   stack -- especially look for things like tcpstat.whatever++;.

You mean in the general stack code? I purposely don't lock my own
statistics variables because I'm not interested in exact numbers (only
orders of magnitude are relevant). As far as I understand, an unlocked
"x++" should be trivially fast in this case?
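
(Though if I understand the concern, it's not locking but cache-line
ping-pong: even an unlocked increment of a shared counter pulls the
line to each writing core in turn. A sketch with made-up names:

    #include <sys/param.h>      /* MAXCPU */
    #include <sys/pcpu.h>       /* curcpu */

    static unsigned long stat_shared;           /* one line, all cores write */
    static unsigned long stat_percpu[MAXCPU];   /* per-core slots; pad each to
                                                   a cache line in real code to
                                                   avoid false sharing */

    static void
    count_packet(void)
    {
            stat_shared++;              /* no lock, but the line bounces */
            stat_percpu[curcpu]++;      /* stays in this core's cache */
    }

which would explain why the global tcpstat counters hurt.)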

> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific
>   cores so that they don't migrate, and if your system uses HTT,
>   experiment with pinning the ithread and the netisr on different
>   threads on the same core, or at least, different cores on the same die.

I'm using em hardware; I still think there's a possibility I'm fighting
the driver in some cases, but that's priority #2.
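
(The pinning itself looks simple enough - something like the following,
with thread ids read off procstat -t on the intr process; the ids below
are made up:

    cpuset -l 0 -t 100045    # pin the em0 ithread to core 0
    cpuset -l 1 -t 100052    # pin the netisr thread to core 1

I'll experiment with the HTT placements you suggest.)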

> - Experiment with using just the source IP, the source + destination IP,
>   and both IPs plus TCP ports in your hash.

Ok. Currently I'm using ip1+ip2+port1+port2.
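
(Concretely, something like this - flow_hash() is my sketch here, not
the actual patch code:

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>

    static u_int
    flow_hash(const struct ip *ip, const struct tcphdr *th, u_int nqueues)
    {
            u_int h;

            h = ip->ip_src.s_addr ^ ip->ip_dst.s_addr ^
                th->th_sport ^ th->th_dport;
            h ^= h >> 16;               /* fold the high bits down */
            return (h % nqueues);
    }

I'll compare that against the src-only and src+dst variants.)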

> - If your card supports RSS, pass the flowid up the stack in the mbuf
>   packet header flowid field, and use that instead of the hash for work
>   placement.

Don't know about em. Don't really want to touch it if I don't have to :)
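
(If it comes to that, my understanding is it would amount to roughly
this in the placement path, assuming -CURRENT's M_FLOWID flag and
m_pkthdr.flowid field:

    /* Prefer the NIC-computed RSS hash when the driver set one. */
    if (m->m_flags & M_FLOWID)
            h = m->m_pkthdr.flowid % nqueues;
    else
            h = flow_hash(ip, th, nqueues);

with the driver, not the stack, responsible for filling in flowid.)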



