From: Navdeep Parhar <nparhar@gmail.com>
Date: Tue, 29 Oct 2013 15:03:10 -0700
To: Andre Oppermann, Luigi Rizzo
Cc: Randall Stewart, freebsd-net@freebsd.org
Subject: Re: MQ Patch.

On 10/29/13 14:25, Andre Oppermann wrote:
> On 29.10.2013 22:03, Navdeep Parhar wrote:
>> On 10/29/13 13:41, Andre Oppermann wrote:
>>> Let me jump in here and explain roughly the ideas/path I'm
>>> exploring in creating and eventually implementing a big picture
>>> for drivers, queues, queue management, various QoS, and so on.
>>>
>>> Situation: we're still mostly based on the old 4.4BSD IFQ model;
>>> the couple of work-arounds we have (sndring, drbr) and the
>>> bit-rotten ALTQ in tree aren't helpful at all.
>>>
>>> Steps:
>>>
>>> 1. Take the soft-queuing method out of the ifnet layer and make it
>>>    a property of the driver, so that the upper stack (or actually
>>>    the protocol L3/L2 mapping/encapsulation layer) calls
>>>    (*if_transmit) without any queuing at that point.  It is then
>>>    up to the driver to decide how it multiplexes multi-core access
>>>    to its queue(s) and how they are configured.
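
To make step 1 concrete before going on: with no queuing left at the
ifnet layer, all policy sits behind the driver's (*if_transmit).  A
minimal single-queue sketch of that shape; the xx_* names, the softc
layout, and xx_kick() are invented for illustration, not taken from
any real driver:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_softc {
	struct mtx	 tx_lock;	/* protects the DMA ring */
	struct buf_ring	*tx_br;		/* driver-private staging queue */
};

static void	xx_kick(struct xx_softc *);	/* drains tx_br to hardware */

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	int error;

	/* No ifnet-layer IFQ; the driver owns all queuing policy. */
	mtx_lock(&sc->tx_lock);
	error = buf_ring_enqueue(sc->tx_br, m);
	if (error == 0)
		xx_kick(sc);	/* move staged mbufs onto the DMA ring */
	mtx_unlock(&sc->tx_lock);
	if (error != 0)
		m_freem(m);	/* staging queue full; caller sees ENOBUFS */
	return (error);
}
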
>>
>> It would work out much better if the kernel were aware of the number
>> of tx queues of a multiq driver and explicitly selected one in
>> if_transmit.  The driver has no information on the CPU affinity etc.
>> of the applications generating the traffic; the kernel does.  In
>> general, the kernel has a much better "global view" of the system,
>> and some of the stuff currently in the drivers really should move up
>> into the stack.
>
> I've been thinking a lot about this and have come to the preliminary
> conclusion that the upper stack should not tell the driver which
> queue to use.  There are way too many possible approaches, and
> depending on the use case they perform better or worse.  Also, we
> have a big problem with cores vs. queues mismatches either way (more
> cores than queues or more queues than cores, though the latter is
> much less of a problem).
>
> For now I see these primary multi-hardware-queue approaches to be
> implemented first:
>
> a) The driver's (*if_transmit) takes the flowid from the mbuf header
>    and selects one of the N hardware DMA rings based on it.  Each of
>    the DMA rings is protected by a lock.  The assumption here is
>    that with enough DMA rings the contention on each of them will be
>    relatively low, and ideally a flow and its ring more or less
>    stick to the core that sends lots of packets into that flow.  Of
>    course it is a statistical certainty that some bouncing will be
>    going on.
>
> b) The driver assigns the DMA rings to particular cores, which can
>    then drive them lockless through a critnest++ (a critical
>    section, so the thread cannot migrate mid-transmit).  The
>    driver's (*if_transmit) will look up the core it got called on
>    and push the traffic out on that DMA ring.  The problem is the
>    upper stack's affinity, which is not guaranteed.  This has two
>    consequences: packets of the same flow may be reordered because
>    the protocol's send function happens to be called from a
>    different core the second time, or the driver's (*if_transmit)
>    has to switch to the right core to complete the transmit for this
>    flow if the upper stack migrated/bounced around.  It is rather
>    difficult to assure full affinity from userspace down through the
>    upper stack and then to the driver.
>
> c) Non-multi-queue-capable hardware uses a kernel-provided set of
>    functions to manage the contention for the single resource of a
>    DMA ring.
>
> The point here is that the driver is the right place to make these
> decisions, because the upper stack lacks (and shouldn't care about)
> knowledge of the actual available hardware and its capabilities.
> All the necessary information is available to the driver as well,
> through the appropriate mbuf header fields and the core it is called
> on.

I mildly disagree with most of this, specifically with the part that
the driver is the right place to make these decisions.  But you did
say this was a "preliminary conclusion", so there's hope yet ;-)

Let's wait till you have an early implementation and we are all able
to experiment with it.  (A rough sketch of how I read approach (a) is
below my sig, for reference.)

To be continued...

Regards,
Navdeep
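
P.S. For anyone who wants to see what (a) boils down to in code, a
rough sketch follows.  The xx_* names and fields are invented, and
the no-flowid fallback is only one of several possibilities; this
illustrates the idea, it is not a real driver:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/pcpu.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_txq {
	struct mtx	lock;		/* protects this DMA ring */
	/* ... descriptor ring state ... */
};

struct xx_softc {
	struct xx_txq	*txq;		/* array of ntxq tx queues */
	int		 ntxq;
};

static int	xx_encap(struct xx_txq *, struct mbuf *);  /* ring write */

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	struct xx_txq *txq;
	int error;

	/*
	 * Same flowid -> same ring keeps a flow's packets in order;
	 * distinct flows spread across the rings, so with enough
	 * rings each per-ring lock sees little contention.
	 */
	if (m->m_flags & M_FLOWID)
		txq = &sc->txq[m->m_pkthdr.flowid % sc->ntxq];
	else
		txq = &sc->txq[curcpu % sc->ntxq];  /* no flowid: by core */

	mtx_lock(&txq->lock);
	error = xx_encap(txq, m);
	mtx_unlock(&txq->lock);
	if (error != 0)
		m_freem(m);
	return (error);
}

The whole disagreement above is really about who computes that ring
index: the driver (as in this sketch) or the stack.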