From: Navdeep Parhar <nparhar@gmail.com>
Date: Tue, 29 Oct 2013 15:03:10 -0700
To: Andre Oppermann, Luigi Rizzo
Cc: Randall Stewart, freebsd-net@freebsd.org
Subject: Re: MQ Patch.

On 10/29/13 14:25, Andre Oppermann wrote:
> On 29.10.2013 22:03, Navdeep Parhar wrote:
>> On 10/29/13 13:41, Andre Oppermann wrote:
>>> Let me jump in here and explain roughly the ideas/path I'm
>>> exploring in creating and eventually implementing a big picture
>>> for drivers, queues, queue management, various QoS, and so on.
>>>
>>> Situation: we're still mostly based on the old 4.4BSD IFQ model;
>>> the couple of work-arounds we have (sndring, drbr) and the
>>> bit-rotten ALTQ in tree aren't helpful at all.
>>>
>>> Steps:
>>>
>>> 1. Take the soft-queuing method out of the ifnet layer and make it
>>>    a property of the driver, so that the upper stack (or actually
>>>    the protocol L3/L2 mapping/encapsulation layer) calls
>>>    (*if_transmit) without any queuing at that point.  It is then
>>>    up to the driver to decide how it multiplexes multi-core access
>>>    to its queue(s) and how they are configured.
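
To make step 1 concrete before going on: with no queuing left at the
ifnet layer, all policy sits behind the driver's (*if_transmit).  A
minimal single-queue sketch of that shape; the xx_* names, the softc
layout, and xx_kick() are invented for illustration, not taken from
any real driver:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_softc {
	struct mtx	 tx_lock;	/* protects the DMA ring */
	struct buf_ring	*tx_br;		/* driver-private staging queue */
};

static void	xx_kick(struct xx_softc *);	/* drains tx_br to hardware */

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	int error;

	/* No ifnet-layer IFQ; the driver owns all queuing policy. */
	mtx_lock(&sc->tx_lock);
	error = buf_ring_enqueue(sc->tx_br, m);
	if (error == 0)
		xx_kick(sc);	/* move staged mbufs onto the DMA ring */
	mtx_unlock(&sc->tx_lock);
	if (error != 0)
		m_freem(m);	/* staging queue full; caller sees ENOBUFS */
	return (error);
}
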
>>
>> It would work out much better if the kernel were aware of the number
>> of tx queues of a multiq driver and explicitly selected one in
>> if_transmit.  The driver has no information on the CPU affinity etc.
>> of the applications generating the traffic; the kernel does.  In
>> general, the kernel has a much better "global view" of the system,
>> and some of the stuff currently in the drivers really should move up
>> into the stack.
>
> I've been thinking a lot about this and have come to the preliminary
> conclusion that the upper stack should not tell the driver which
> queue to use.  There are way too many possible approaches, and
> depending on the use case they perform better or worse.  Also, we
> have a big problem with cores vs. queues mismatches either way (more
> cores than queues or more queues than cores, though the latter is
> much less of a problem).
>
> For now I see these primary multi-hardware-queue approaches to be
> implemented first:
>
> a) The driver's (*if_transmit) takes the flowid from the mbuf header
>    and selects one of the N hardware DMA rings based on it.  Each of
>    the DMA rings is protected by a lock.  The assumption here is
>    that with enough DMA rings the contention on each of them will be
>    relatively low, and ideally a flow and its ring more or less
>    stick to the core that sends lots of packets into that flow.  Of
>    course it is a statistical certainty that some bouncing will be
>    going on.
>
> b) The driver assigns the DMA rings to particular cores, which can
>    then drive them lockless through a critnest++ (a critical
>    section, so the thread cannot migrate mid-transmit).  The
>    driver's (*if_transmit) will look up the core it got called on
>    and push the traffic out on that DMA ring.  The problem is the
>    upper stack's affinity, which is not guaranteed.  This has two
>    consequences: packets of the same flow may be reordered because
>    the protocol's send function happens to be called from a
>    different core the second time, or the driver's (*if_transmit)
>    has to switch to the right core to complete the transmit for this
>    flow if the upper stack migrated/bounced around.  It is rather
>    difficult to assure full affinity from userspace down through the
>    upper stack and then to the driver.
>
> c) Non-multi-queue-capable hardware uses a kernel-provided set of
>    functions to manage the contention for the single resource of a
>    DMA ring.
>
> The point here is that the driver is the right place to make these
> decisions, because the upper stack lacks (and shouldn't care about)
> knowledge of the actual available hardware and its capabilities.
> All the necessary information is available to the driver as well,
> through the appropriate mbuf header fields and the core it is called
> on.

I mildly disagree with most of this, specifically with the part that
the driver is the right place to make these decisions.  But you did
say this was a "preliminary conclusion", so there's hope yet ;-)

Let's wait till you have an early implementation and we are all able
to experiment with it.  (A rough sketch of how I read approach (a) is
below my sig, for reference.)

To be continued...

Regards,
Navdeep
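
P.S. For anyone who wants to see what (a) boils down to in code, a
rough sketch follows.  The xx_* names and fields are invented, and
the no-flowid fallback is only one of several possibilities; this
illustrates the idea, it is not a real driver:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/pcpu.h>
#include <net/if.h>
#include <net/if_var.h>

struct xx_txq {
	struct mtx	lock;		/* protects this DMA ring */
	/* ... descriptor ring state ... */
};

struct xx_softc {
	struct xx_txq	*txq;		/* array of ntxq tx queues */
	int		 ntxq;
};

static int	xx_encap(struct xx_txq *, struct mbuf *);  /* ring write */

static int
xx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct xx_softc *sc = ifp->if_softc;
	struct xx_txq *txq;
	int error;

	/*
	 * Same flowid -> same ring keeps a flow's packets in order;
	 * distinct flows spread across the rings, so with enough
	 * rings each per-ring lock sees little contention.
	 */
	if (m->m_flags & M_FLOWID)
		txq = &sc->txq[m->m_pkthdr.flowid % sc->ntxq];
	else
		txq = &sc->txq[curcpu % sc->ntxq];  /* no flowid: by core */

	mtx_lock(&txq->lock);
	error = xx_encap(txq, m);
	mtx_unlock(&txq->lock);
	if (error != 0)
		m_freem(m);
	return (error);
}

The whole disagreement above is really about who computes that ring
index: the driver (as in this sketch) or the stack.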