From owner-freebsd-net@FreeBSD.ORG  Wed Apr 11 13:00:03 2012
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E8A311065670;
	Wed, 11 Apr 2012 13:00:02 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 981C98FC0C;
	Wed, 11 Apr 2012 13:00:02 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 5E1F17300A; Wed, 11 Apr 2012 15:19:20 +0200 (CEST)
Date: Wed, 11 Apr 2012 15:19:20 +0200
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Andre Oppermann <andre@freebsd.org>
Message-ID: <20120411131920.GA61027@onelab2.iet.unipi.it>
References: <20120410225257.GB53350@onelab2.iet.unipi.it>
	<4F84B6DB.5040904@freebsd.org>
	<20120410230500.GA22829@pit.databus.com>
	<20120410233211.GA53829@onelab2.iet.unipi.it>
	<4F855E5E.5000107@freebsd.org>
	<20120411110036.GA60031@onelab2.iet.unipi.it>
	<4F857631.8040309@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4F857631.8040309@freebsd.org>
User-Agent: Mutt/1.4.2.3i
Cc: Barney Wolff <barney@databus.com>, current@freebsd.org, net@freebsd.org
Subject: Re: strange ping response times...
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 11 Apr 2012 13:00:03 -0000

On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote:
> On 11.04.2012 13:00, Luigi Rizzo wrote:
> >On Wed, Apr 11, 2012 at 12:35:10PM +0200, Andre Oppermann wrote:
> >>On 11.04.2012 01:32, Luigi Rizzo wrote:
> >>>On Tue, Apr 10, 2012 at 07:05:00PM -0400, Barney Wolff wrote:
> >>>>CPU cache?
> >>>>Cx states?
> >>>>powerd?
> >>>
> >>>powerd is disabled, and i am going down to C1 at most
> >>>	>   sysctl -a | grep cx
> >>>	hw.acpi.cpu.cx_lowest: C1
> >>>	dev.cpu.0.cx_supported: C1/1 C2/80 C3/104
> >>>
> >>>which shouldn't take so much. Sure, cache matters, but the
> >>>fact is, icmp processing on loopback should occur inline.
> >>>
> >>>unless there is a forced descheduling on a select with timeout>   0
> >>>which would explain the extra few microseconds (and makes me worry
> >>>on how expensive is a scheduling decision...)
> >>
> >>Things going through loopback go through a NETISR and may
> >>end up queued to avoid LOR situations.  In addition per-cpu
> >>queues with hash-distribution for affinity may cause your
> >>packet to be processed by a different core.  Hence the additional
> >>delay.
> >
> >so you suggest that the (de)scheduling is costing several microseconds ?
> 
> Not directly.  I'm just trying to explain what's going on to
> get a better idea where it may go wrong.
> 
> There may be a poor ISR/scheduler interaction that prevents that
> causes the packet to be processed only on the next tick or something
> like that.  I don't have a better explanation for this.

ok, some final remarks just for archival purposes
(still related to the loopback ping)

ping takes a timestamp in userspace before trying to transmit
the packet, and then the timestamp for the received packet
is recorded in the kernel (in the interrupt or netisr thread
i believe -- anyways, not in userspace). A thing that seems to make
the difference is the choice of idle states:
	machdep.idle = {acpi, hlt, mwait, spin}
	hw.acpi.cpu.cx_lowest={C1, C2, C3}
results in 10-15% of samples with significantly different values
(see table below, all values in microseconds)

	config		normal		top 10-15%
	-----------------------------------------
	acpi+c3		13		85
	acpi+c2		10		40
	acpi+c1		7		10

	spin		7		8
	hlt,mwait		6..10 (no clear distinction)
	ping -f			5-6

This suggests that sometimes the responding thread
may be started on a sleeping core and so takes
longer to wake up.

The high values are not periodic but the ratio is close to
the 128/1000 which relates profhz and tick, so that might be
something to look at.

cheers
luigi