Date:      Thu, 20 Mar 2014 18:03:43 +0800 (CST)
From:      mstian88 <mstian88@163.com>
To:        net@freebsd.org
Subject:   some problem about netmap
Message-ID:  <59d70d80.2b3cc.144def24445.Coremail.mstian88@163.com>

./pkt-gen -i eth1 -f rx -X

The printed info shows slot->len is 2048. Why?
The packets were sent with tcpreplay.
When multiple packets of length 1514 arrive and pkt-gen sees avail > 1, slot->len is 2048.

Like this:
[2213]:slotlen[1514],curidx[189],bufidx[24767],avail[3],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2214]:slotlen[2048],curidx[190],bufidx[24768],avail[2],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2215]:slotlen[926],curidx[191],bufidx[24769],avail[1],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2216]:slotlen[1514],curidx[192],bufidx[24770],avail[1],iwhile[476],ifor[16],buf_ofs[5603328],rings[32]

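
To spot such slots in a longer capture, the debug lines can be filtered mechanically. The following is a minimal sketch, assuming the `[seq]:name[value],...` layout shown in this report (that layout comes from the poster's own debug print, not from stock pkt-gen); `parse_line` and `oversized` are hypothetical helper names, and the 1514-byte threshold is simply the frame length mentioned in the message.

```python
import re

# Matches fields such as "slotlen[2048]" or "buf_ofs[5603328]".
FIELD = re.compile(r"(\w+)\[(\d+)\]")

def parse_line(line):
    """Parse one debug line into a dict of integers (hypothetical helper)."""
    seq, _, rest = line.partition(":")
    fields = {name: int(value) for name, value in FIELD.findall(rest)}
    fields["seq"] = int(seq.strip("[] "))
    return fields

def oversized(lines, max_len=1514):
    """Return the parsed records whose slotlen exceeds max_len."""
    records = (parse_line(l) for l in lines if l.strip())
    return [r for r in records if r["slotlen"] > max_len]

# Sample taken from the four lines quoted in the report.
SAMPLE = """\
[2213]:slotlen[1514],curidx[189],bufidx[24767],avail[3],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2214]:slotlen[2048],curidx[190],bufidx[24768],avail[2],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2215]:slotlen[926],curidx[191],bufidx[24769],avail[1],iwhile[475],ifor[16],buf_ofs[5603328],rings[32]
[2216]:slotlen[1514],curidx[192],bufidx[24770],avail[1],iwhile[476],ifor[16],buf_ofs[5603328],rings[32]
"""

for rec in oversized(SAMPLE.splitlines()):
    print(rec["seq"], rec["slotlen"], rec["curidx"], rec["avail"])
```

On the sample, only entry 2214 is reported. The sketch only locates the suspicious slots; it does not explain why the length is 2048.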

Date:      Thu, 20 Mar 2014 11:40:18 +0100
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        Christopher Forgeron <csforgeron@gmail.com>
Cc:        freebsd-net@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, Jack Vogel <jfvogel@gmail.com>
Subject:   Re: 9.2 ixgbe tx queue hang
Message-ID:  <FA262955-B3A9-48EC-828B-FF0D4D5D0498@hostpoint.ch>
In-Reply-To: <CAB2_NwDG=gB1WCJ7JKTHpkJCrvPuAhipkn+vPyT+xXzOBrTGkg@mail.gmail.com>


On 19.03.2014, at 20:17, Christopher Forgeron <csforgeron@gmail.com> wrote:

> Hello,
>
> I can report this problem as well on 10.0-RELEASE.
>
> I think it's the same as kern/183390?

Possible. We still see this on NFS clients only, but I'm not convinced that NFS is the only trigger.


> I have two physically identical machines, one running 9.2-STABLE, and one
> on 10.0-RELEASE.
>
> My 10.0 machine used to be running 9.0-STABLE for over a year without any
> problems.
>
> I'm not having the problems with 9.2-STABLE as far as I can tell, but it
> does seem to be a load-based issue more than anything. Since my 9.2 system
> is in production, I'm unable to load it to see if the problem exists there.
> I have a ping_logger.py running on it now to see if it's experiencing
> problems briefly or not.

In our case, when it happens, the problem persists for quite some time (minutes or hours) unless we intervene (ifconfig or reboot).


> I am able to reproduce it fairly reliably within 15 min of a reboot by
> loading the server via NFS with iometer and some large NFS file copies at
> the same time. I seem to need to sustain ~2 Gbps for a few minutes.

That's probably why we can't reproduce it reliably here. Although our blade servers have 10gig cards, the affected ones are connected to a 1gig switch.


> It will happen with just ix0 (no lagg) or with lagg enabled across ix0 and
> ix1.

Same here.


> I've been load-testing new FreeBSD-10.0-RELEASE SAN's for production use
> here, so I'm quite willing to put time into this to help find out where
> it's coming from.  It took me a day to track down my iometer issues as
> being network related, and another day to isolate and write scripts to
> reproduce.
>
> The symptom I notice is:
>
> - A running flood ping (ping -f 172.16.0.31) to the same hardware
>   (running 9.2) will come back with "ping: sendto: File too large" when the
>   problem occurs
>
> - Network connectivity is very spotty during these incidents
>
> - It can run with sporadic ping errors, or it can run a straight
>   set of errors for minutes at a time
>
> - After a long run of ping errors, ESXi will show a disconnect
>   from the hosted NFS stores on this machine.
>
> - I've yet to see it happen right after boot. Fastest is around 5
>   min, normally it's within 15 min.

Can you try this when the problem occurs?

for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto; done

This pins ping to each CPU in turn to test the different tx queues of your ix interface. If the pings reliably fail only on some queues, then your problem is more likely to be the same as ours.

Also, if you have dtrace available:

kldload dtraceall
dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'

while you run pings over the affected interface. This will give you hints about where the EFBIG error comes from.
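
The predicate matches on EFBIG because "File too large" in ping's output is simply the standard message text for that errno value. A small illustration, using only the Python standard library (the exact message string is platform-dependent, so none is asserted here):

```python
import errno
import os

code = errno.EFBIG
# Print the symbolic name, the numeric value, and the system's message text.
print(errno.errorcode[code], code, os.strerror(code))
```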

> […]


Markus




