Date:      Sat, 23 Jan 2016 18:22:13 -0200
From:      Eduardo Meyer <dudu.meyer@gmail.com>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        Marcus Cenzatti <cenzatti@hush.com>, Adrian Chadd <adrian.chadd@gmail.com>, Pavel Odintsov <pavel.odintsov@gmail.com>,  "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: netmap design question - accessing netmap:X-n individual queues on FreeBSD
Message-ID:  <CAEqdE_5xxez4ggTry3J8zZU=jiHYWkcb9Ck+3QEBMRYT48ZbNw@mail.gmail.com>
In-Reply-To: <CA+hQ2+hR80nAn3mGW_zMniBM8Z=YPpvQsd=sO-Zt9PgyowJ8iQ@mail.gmail.com>
References:  <CAEqdE_4ANVrGP2hKA4nT=AJqJ5M80A+Hy2srjoe8wfugvmbypg@mail.gmail.com> <CALgsdbe7W2fPEWMnobXWebn63J9kYhupE-C=JM2xfQKBfnQwaw@mail.gmail.com> <CAJ-VmonCizOxfsV7kH_7GTHYXW8FgvjEV-8zt7qxE6b5tUadxg@mail.gmail.com> <CALgsdbdZOkpX-hzLVkLxz=1kGOjnNCNSZz4B1HfJy2hART4+0w@mail.gmail.com> <CAJ-VmokP2f5Uv5ubn2Cico_iXN20jL9tPuYHE2gomGLWkPLsUA@mail.gmail.com> <20160123184320.CA903A0126@smtp.hushmail.com> <CA+hQ2+hR80nAn3mGW_zMniBM8Z=YPpvQsd=sO-Zt9PgyowJ8iQ@mail.gmail.com>

On Sat, Jan 23, 2016 at 5:49 PM, Luigi Rizzo <rizzo@iet.unipi.it> wrote:

> On Sat, Jan 23, 2016 at 10:43 AM, Marcus Cenzatti <cenzatti@hush.com>
> wrote:
> >
> >
> > On 1/23/2016 at 1:31 PM, "Adrian Chadd" <adrian.chadd@gmail.com> wrote:
> >>
> >>For random src/dst ports and IPs and on the chelsio t5 40gig
> >>hardware,
> >>I was getting what, uhm, 40mil tx pps and around 25ish mil rx pps?
> >>
> >>The chelsio rx path really wants to be coalescing rx buffers, which
> >>the netmap API currently doesn't support. I've no idea if luigi has
> >>plans to add that. So, it has the hilarious side effect of "adding
> >>more RX queues" translates to "drops in RX performance." :(
> >>
> >>Thanks,
> >
> > hello,
> >
> > I am sorry, are you saying Intel and Chelsio distribute RX packet load
> > differently? If I am not mistaken, Intel distributes traffic among
> > queues based on IP address flows/hashes; does Chelsio do it per
> > packet or something else?
> >
>
> I think there are several orthogonal issues here:
> - traffic distribution has been discussed by Adrian
>   so please look at the email he just sent;
>
> - when you use netmap on a single queue, i.e. netmap:ix0-X,
>   the software side is as efficient as it can be, since it only needs
>   to check the status of a single queue on poll() or ioctl(..RXSYNC..).
>   On the contrary, when you access netmap:if0 (i.e. all
>   queues on a single file descriptor), every system call
>   has to check all the queues, so you are better off with
>   a smaller number of queues.
>
> - on the hardware side, distributing traffic to multiple RX queues
>   also has a cost that increases with the number of queues, as the
>   NIC needs to update the ring pointers and fetch buffers for
>   multiple queues, and you can easily run out of PCIe bandwidth for
>   these transactions. This affects all NICs.
>   Some (ix ?) have parameters to configure how often to update the rings
>   and fetch descriptors, mitigating the problem. Some (ixl) don't.
>
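[Editor's note: Luigi's point about per-queue vs. whole-interface file descriptors can be sketched with the netmap helper API. This is a minimal, hedged illustration assuming a FreeBSD box with netmap support; the interface name ix0 is just an example, and it will only build where <net/netmap_user.h> is available.]

```c
/* Sketch: opening one hw queue vs. all queues with nm_open().
 * Assumes FreeBSD with netmap; "ix0" is an illustrative NIC name. */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

int main(void)
{
    /* Bound to hw ring 0 only: each poll()/RXSYNC checks one queue. */
    struct nm_desc *one = nm_open("netmap:ix0-0", NULL, 0, NULL);

    /* Bound to all rings: every system call must scan every queue,
     * which is why fewer queues means less per-call overhead. */
    struct nm_desc *all = nm_open("netmap:ix0", NULL, 0, NULL);
    if (one == NULL || all == NULL)
        return 1;

    struct pollfd pfd = { .fd = NETMAP_FD(one), .events = POLLIN };
    poll(&pfd, 1, 1000);    /* wakes only when ring 0 has traffic */

    nm_close(one);
    nm_close(all);
    return 0;
}
```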
> My opinion is that you should use multiple queues only if you want
> to rely on hw-based traffic steering, and/or your workload is
> bottlenecked by the CPU rather than bus I/O bandwidth. Even so,
> use as few queues as possible.
>
> Sometimes people use multiple queues to increase the number of
> receive buffers and tolerate more latency in the software side, but
> this really depends on the traffic distribution, so in the worst case
> you are still dealing with a single ring.
>
> Often you are better off using a single hw queue and have a
> process read from it using netmap and demultiplex to different
> netmap pipes (zero copy). That reduces bus transactions.
>
> Another option, which I am experimenting with these days, is to forget
> about individual packets once you are off the wire, and connect the
> various processes in your pipeline with a stream (TCP or similar) where
> packets and descriptors are laid out back to back. CPUs and OSes are
> very efficient at dealing with streams of data.
>
> cheers
> luigi
>
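[Editor's note: the single-queue-plus-pipes layout Luigi describes above can be sketched as below. This is a hedged sketch, not his implementation: the pipe name "netmap:ix0{1" and single-pipe loop are illustrative (a real demultiplexer would hash each packet to pick among several pipes), and it only builds where netmap's <net/netmap_user.h> is available. The zero-copy part is the buffer-index swap, which avoids copying payloads across the bus.]

```c
/* Sketch: read one hw queue, forward to a netmap pipe by swapping
 * buffer indices (zero copy). Assumes FreeBSD with netmap. */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <sys/ioctl.h>
#include <poll.h>

/* Exchange the buffers of an RX and a TX slot instead of copying. */
static void swap_slots(struct netmap_slot *rx, struct netmap_slot *tx)
{
    uint32_t idx = tx->buf_idx;
    tx->buf_idx = rx->buf_idx;
    tx->len = rx->len;
    rx->buf_idx = idx;
    /* Tell netmap both rings now point at different buffers. */
    tx->flags |= NS_BUF_CHANGED;
    rx->flags |= NS_BUF_CHANGED;
}

int main(void)
{
    struct nm_desc *nic = nm_open("netmap:ix0-0", NULL, 0, NULL);
    /* NM_OPEN_NO_MMAP reuses the NIC's memory region, so slots can be
     * swapped between the NIC ring and the pipe without copies. */
    struct nm_desc *pipe = nm_open("netmap:ix0{1", NULL,
                                   NM_OPEN_NO_MMAP, nic);
    if (nic == NULL || pipe == NULL)
        return 1;

    for (;;) {
        struct pollfd pfd = { .fd = NETMAP_FD(nic), .events = POLLIN };
        poll(&pfd, 1, 1000);
        struct netmap_ring *rx = NETMAP_RXRING(nic->nifp, 0);
        struct netmap_ring *tx = NETMAP_TXRING(pipe->nifp, 0);
        while (!nm_ring_empty(rx) && !nm_ring_empty(tx)) {
            swap_slots(&rx->slot[rx->cur], &tx->slot[tx->cur]);
            rx->head = rx->cur = nm_ring_next(rx, rx->cur);
            tx->head = tx->cur = nm_ring_next(tx, tx->cur);
        }
        ioctl(NETMAP_FD(pipe), NIOCTXSYNC, NULL);
    }
}
```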


Thanks for the explanation.

What I was trying to achieve is more performance by using more than one CPU
to actually bridge at line rate: I have plenty of cores (16) but a low CPU
clock, and scaling horizontally is easier and cheaper than getting faster
CPUs. I thought that using 2 queues with two bridge instances, or two
threads (Adrian's bridge), running on one of the other 15 idle cores, would
let me grow from 9 Mpps to 14 Mpps. It looks like it's not that simple.

If I were a developer writing a multithreaded netmap application to increase
pps rates, is there any other/better strategy than using multiple queues?
Should I distribute the load inside the application, with the threads
reading and writing a single queue, or is there something better? I mean, a
multithreaded bridge that checks how many packets are in the queue and hands
a fixed-size batch of packets to each thread: is that possible/efficient?
And if the batch count is not constant, I should be able to see TAIL via
netmap to know how many packets are pending, right?

Thank you all for all the details on how things actually work.



-- 
===========
Eduardo Meyer
pessoal: dudu.meyer@gmail.com
profissional: ddm.farmaciap@saude.gov.br


