From owner-freebsd-net@freebsd.org Fri Dec 9 09:02:45 2016
From: Xiaoye Sun
Date: Fri, 9 Dec 2016 03:02:44 -0600
Subject: Re: Can netmap be more efficient when it just does bridging between NIC and Linux kernel?
To: Vincenzo Maffione
Cc: FreeBSD Net <freebsd-net@freebsd.org>

Hi Vincenzo,

Thank you for your suggestion. I think attaching only a subset of the
NIC queues to netmap is a brilliant idea! I am going through the
instructions in the blog post you sent me:
https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/

Now I can use the "ethtool -N eth3" command (Configure Rx network flow
classification) to set up filters on the receiver side, so that type 1
data goes to the netmap NIC queues and type 2 data goes to the other
queues.
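Concretely, this is the shape of the setup I am trying. Below is only a
sketch using the classic nm_open() API: the TCP port in the ethtool
filter and the ring number 2 are placeholders from my experiments, and
error handling is mostly omitted.

    /*
     * Steer the type 1 flows to hardware ring 2 with an ntuple
     * receive filter, e.g. something like:
     *
     *   ethtool -K eth3 ntuple on
     *   ethtool -N eth3 flow-type tcp4 dst-port 9000 action 2
     *
     * (the real filter depends on how type 1 traffic is identified),
     * then attach netmap to that ring pair only.
     */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <stdlib.h>

    struct nm_desc *nic;

    /* The "-2" suffix registers only hardware ring pair 2; the other
     * rings are left to the kernel driver, as in the blog post. */
    nic = nm_open("netmap:eth3-2", NULL, 0, NULL);
    if (nic == NULL) {
        D("failed to open netmap:eth3-2");
        exit(1);
    }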
However, it seems that my NIC (Intel 10G IXGBE) does not support an
indirection table: when I use the command "ethtool -X eth3 weight 0 1 1 1",
I get an error message like

  Cannot get RX flow hash indirection table size: Operation not supported

so the kernel cannot be kept off the queues given to netmap. In that
case, outgoing packets from the kernel stack get stuck and are never
sent out, since (I guess) these packets may be hashed to TX NIC queues
that have been given to netmap. I am wondering whether there is a way
to work around this issue.

Best,
Xiaoye

On Thu, Dec 8, 2016 at 5:39 AM, Vincenzo Maffione wrote:

> Hi,
>
> 2016-12-07 2:36 GMT+01:00 Xiaoye Sun:
>
>> Hi,
>>
>> I am wondering if there is a way to reduce the CPU usage of a netmap
>> program similar to the bridge.c example.
>>
>> In my use case, I have a distributed application/framework (e.g.
>> Spark or Hadoop) running on a cluster of machines (each machine runs
>> Linux and has an Intel 10Gbps NIC). The application is both
>> computation and network intensive, so there are a lot of data
>> transfers between machines. I divide the data into two types (type 1
>> and type 2). Packets of type 1 data are sent through netmap (these
>> packets do not go through the Linux network stack). Packets of type 2
>> data are sent through the Linux network stack. Both type 1 and type 2
>> data can be small or large.
>>
>> My netmap program runs on all the machines in the cluster. It
>> processes the packets of type 1 data (creates, sends, and receives
>> the packets) and forwards packets of type 2 data between the NIC and
>> the kernel by swapping the pointer to the NIC slot with the pointer
>> to the kernel stack slot (similar to the bridge.c example in the
>> netmap repository).
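For context, the forwarding core of my program is essentially the
zero-copy buffer swap from bridge.c. A simplified sketch (my real code
also handles multiple rings and the type 1 packets):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    /*
     * Move packets from the rx ring to the tx ring without copying
     * payloads: only the buffer indices of the two slots are swapped,
     * and NS_BUF_CHANGED tells the kernel about the new buffers.
     */
    static void
    swap_ring(struct netmap_ring *rx, struct netmap_ring *tx)
    {
        u_int j = rx->cur, k = tx->cur;
        u_int n = nm_ring_space(rx);

        if (n > nm_ring_space(tx))
            n = nm_ring_space(tx);
        while (n-- > 0) {
            struct netmap_slot *rs = &rx->slot[j];
            struct netmap_slot *ts = &tx->slot[k];
            uint32_t idx = ts->buf_idx;

            ts->buf_idx = rs->buf_idx;   /* swap, don't copy */
            rs->buf_idx = idx;
            ts->len = rs->len;
            ts->flags |= NS_BUF_CHANGED;
            rs->flags |= NS_BUF_CHANGED;
            j = nm_ring_next(rx, j);
            k = nm_ring_next(tx, k);
        }
        rx->head = rx->cur = j;
        tx->head = tx->cur = k;
    }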
>>
>> With my netmap program running on the machines, for an application
>> having no type 1 data (so the netmap program behaves like a bridge
>> that only does slot pointer swapping), the total running time of the
>> application is longer than in the case where no netmap program runs
>> on the machines.
>
> Yes, but this is not surprising. If the only thing your netmap
> application does is forward all the traffic between the network stack
> and the NIC, then it is a process doing a useless job: netmap
> intercepts packets from the network stack and reinjects them back
> into it (where they go on as if they had never been intercepted).
> It's just wasting resources. Netmap is designed to let netmap
> applications use the NICs efficiently and/or talk efficiently to
> each other (e.g. using the VALE switch or the virtualization
> extensions).
> The "host rings" are instead useful in some use cases. For example:
> (1) you want to implement a high-performance input packet filter for
> your network stack, one that can withstand DDoS attacks: your netmap
> application would receive something like 10 Mpps from the NIC, drop
> 99% of it (having recognized it as illegitimate traffic) and forward
> the remaining packets to the network stack; (2) you want to manage
> (forward, drop, modify, etc.) most of the traffic in your netmap
> application, but there are some low-bandwidth protocols (e.g. SSH)
> that you want to handle with standard socket applications.
>
>> It seems to me that the netmap program either slows down the network
>> transfer of type 2 data, or it eats up too many CPU cycles and
>> competes with the application process. However, with my netmap
>> program running, iperf can reach 10Gbps bandwidth with 40-50% CPU
>> usage on the netmap program (the netmap program is doing the pointer
>> swapping for the iperf packets). I also found that after each poll()
>> returns, most of the time the program might swap just one pointer,
>> so there is a lot of system call overhead.
>
> This is also not surprising, since iperf is probably generating large
> packets (1500 bytes or more). As a consequence, the packet rate is
> something like 800 Kpps, which is not extremely high (netmap
> applications can work with workloads of 5, 10, 20 or more Mpps).
> Since the packet rate is not high, the interval between two packet
> arrivals is greater than the time needed to do a poll()/ioctl()
> syscall and process the packet, and so the batches don't get formed.
>
>> Can anybody help me diagnose the source of the problem, or is there
>> a better way to write such a program? I am wondering if there is a
>> way to tune the configuration so that the netmap program won't take
>> up too much extra CPU when it runs like the bridge.c program.
>
> The point is that when you have only type 2 data you shouldn't use
> netmap, as it does not make sense. Unfortunately, whether packet
> batches (with more than one packet) get formed or not depends on the
> external traffic input patterns: it's basically a producer/consumer
> problem, and there are no tunables for this. One thing you could do
> is rate-limit the calls to poll()/ioctl() in order to artificially
> create the batches; in this way you would trade off a bit of latency
> for the sake of energy efficiency.
>
> Another approach that you may be interested in is using NIC hardware
> features like "flow director" or receive flow steering to classify
> input packets and steer different classes into specific NIC queues.
> In this way you could open with netmap just a subset of the NIC
> queues (the type 1 data), and let the network stack directly process
> the traffic on the other queues (the type 2 data). There are some
> blog posts about this kind of setup; here is one:
> https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/
>
> Cheers,
> Vincenzo
>
>> Best,
>> Xiaoye
>
> --
> Vincenzo Maffione
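P.S. About your rate-limiting suggestion above: I am thinking of
something as simple as sleeping before each poll() so that a batch of
packets can accumulate in the rings. A rough sketch (here "nic" and
"host" would be the nm_desc handles from nm_open("netmap:eth3-2") and
nm_open("netmap:eth3^"), and the 100 us interval is an arbitrary value
I would still have to tune):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <unistd.h>

    /*
     * Rate-limited forwarding loop: sleeping a little before poll()
     * lets packets accumulate in the rings, so each syscall moves a
     * batch instead of a single packet (a bit more latency, far
     * fewer syscalls).
     */
    static void
    forward_loop(struct nm_desc *nic, struct nm_desc *host)
    {
        struct pollfd pfd[2];

        pfd[0].fd = nic->fd;
        pfd[0].events = POLLIN;
        pfd[1].fd = host->fd;
        pfd[1].events = POLLIN;

        for (;;) {
            usleep(100);            /* let a batch build up */
            if (poll(pfd, 2, -1) < 0)
                continue;
            /* swap slots between the NIC ring and the host ring,
             * e.g. with a swap_ring() like the sketch above */
        }
    }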