From owner-freebsd-net@FreeBSD.ORG  Mon May 11 08:37:09 2015
In-Reply-To: <1107864458-32391@kerio.tuxis.nl>
References: <bug-199174-2472-LonL56obUY@https.bugs.freebsd.org/bugzilla/>
 <1107864458-32391@kerio.tuxis.nl>
Date: Mon, 11 May 2015 05:37:07 -0300
Message-ID: <CAB2_NwBKZhwWe-cfb0GRiEX1iUbzte9=jhYpvV-ALwURJHcJUg@mail.gmail.com>
Subject: Re: [Bug 199174] em tx and rx hang
From: Christopher Forgeron <csforgeron@gmail.com>
To: Mark Schouten <mark@tuxis.nl>
Cc: FreeBSD Net <freebsd-net@freebsd.org>

I'd go a step further and say it's the _exact_ same problem.

If you're using anything other than 4k clusters on a heavily loaded system,
you'll probably have issues.

It's not just the MTU. Case in point: I set my MTU to 4000, but since my
iSCSI block size is 8k, I noticed that I still had plenty of 9k Jumbo
Clusters in use. I still crash within half a day to a day and a half of
uptime. Usually it's 'ix0 is flapping', or perhaps a kernel panic, or just
dead ix interfaces that won't transmit.
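
For anyone who wants to check whether 9k clusters are in use on their own
box, here is a minimal sketch. It assumes `netstat -m` prints lines of the
form "N/N/N/N 9k jumbo clusters in use (current/cache/total/max)"; the
function name jumbo_usage is made up for illustration, and the exact output
format may differ between FreeBSD releases.

```shell
#!/bin/sh
# Summarise jumbo cluster usage from `netstat -m` output read on stdin.
# Assumed input line format (may vary by release):
#   0/0/0/150252 9k jumbo clusters in use (current/cache/total/max)
jumbo_usage() {
    awk '/jumbo clusters in use/ {
        # $1 is "current/cache/total/max"; $2 is the zone size (4k, 9k, 16k)
        split($1, n, "/")
        printf "%s current=%s max=%s\n", $2, n[1], n[4]
    }'
}

# On a FreeBSD host this would be run as:
#   netstat -m | jumbo_usage
```

A non-zero "current" value on the 9k line means something on the system is
still allocating 9k clusters, whatever the interface MTU is set to.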

I patched my ixgbe.c to only use 4k clusters, and now I can use an MTU of
9000 again without issue.
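
To illustrate the idea (this is not the actual patch, just a sketch of the
arithmetic): a driver typically picks the smallest receive cluster size that
holds a whole frame, and capping that choice at 4k means a 9000-byte frame
gets spread over several 4k clusters instead of one 9k cluster. The function
name rx_cluster_size and the cap argument are hypothetical; the sizes are the
standard FreeBSD cluster sizes on a 4 KB-page platform.

```shell
#!/bin/sh
# Hypothetical sketch: choose a receive cluster size for a given maximum
# frame size, optionally capped. Cluster sizes on 4 KB-page platforms:
#   2048 (MCLBYTES), 4096 (MJUMPAGESIZE), 9216 (MJUM9BYTES),
#   16384 (MJUM16BYTES)
rx_cluster_size() {
    frame=$1
    cap=${2:-16384}     # largest cluster size allowed (assumed parameter)
    for size in 2048 4096 9216 16384; do
        # Take the first size that fits the frame, or stop at the cap.
        if [ "$frame" -le "$size" ] || [ "$size" -ge "$cap" ]; then
            echo "$size"
            return 0
        fi
    done
    echo 16384          # frame larger than every size and no lower cap
}

# MTU 9000 plus 18 bytes of Ethernet header and CRC gives a 9018-byte frame.
rx_cluster_size 9018         # uncapped: picks the 9k zone
rx_cluster_size 9018 4096    # capped: stays in the 4k zone
```

With the cap in place the 9k zone is never touched by the receive path, which
is the effect I was after.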

I want to take the time to dig up more of my info on this to present to the
list, but I've lost a lot of time tracking this down, and I'm still cleaning
up as we speak.

The worst thing about the Jumbo Clusters bug is that it's very specific to a
particular load. My systems were fine until I took on a new Exchange 2013
load that started popping all the FreeBSD SANs, and these were load-tested
production machines that had been in service for months without issues.

In one of these threads, Garrett Wollman points out his ideas for a fix. I
second the idea of a large ring buffer being created at boot for the
network cards to use. Like him, I regretfully have no time to spare to
help. Well, perhaps I can get some time for this, but I can only help, not
lead.

Here's one of the last machines popping on me tonight before I could get to
it with a patched kernel. This is an unusual error; 'ix0 flapping' is the
most common.

May 11 04:04:06 aa_fast_b kernel: panic: solaris assert: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c, line: 830
May 11 04:04:06 aa_fast_b kernel: cpuid = 1
May 11 04:04:06 aa_fast_b kernel: KDB: stack backtrace:
May 11 04:04:06 aa_fast_b kernel: #0 0xffffffff80962fd0 at kdb_backtrace+0x60
May 11 04:04:06 aa_fast_b kernel: #1 0xffffffff809280f5 at panic+0x155
May 11 04:04:06 aa_fast_b kernel: #2 0xffffffff81bbe1fd at assfail+0x1d
May 11 04:04:06 aa_fast_b kernel: #3 0xffffffff81983388 at dmu_write+0x98
May 11 04:04:06 aa_fast_b kernel: #4 0xffffffff819c8ec5 at space_map_write+0x3c5
May 11 04:04:06 aa_fast_b kernel: #5 0xffffffff819afb30 at metaslab_sync+0x4e0
May 11 04:04:06 aa_fast_b kernel: #6 0xffffffff819cf69b at vdev_sync+0xcb
May 11 04:04:06 aa_fast_b kernel: #7 0xffffffff819c0fdb at spa_sync+0x5db
May 11 04:04:06 aa_fast_b kernel: #8 0xffffffff819ca3f6 at txg_sync_thread+0x3a6
May 11 04:04:06 aa_fast_b kernel: #9 0xffffffff808f8b3a at fork_exit+0x9a
May 11 04:04:06 aa_fast_b kernel: #10 0xffffffff80d0ac8e at fork_trampoline+0xe
May 11 04:04:06 aa_fast_b kernel: Uptime: 1d12h7m45s
May 11 04:04:06 aa_fast_b kernel: (da1:iscsi7:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da3:iscsi5:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da4:iscsi11:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da7:iscsi4:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da8:iscsi6:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da9:iscsi10:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da10:iscsi1:0:0:0): Synchronize cache failed

It's lots of fun.. it really is.  I'm glad I have a lot of redundancy and
backups.

On Mon, May 11, 2015 at 5:13 AM, Mark Schouten <mark@tuxis.nl> wrote:

> Please note that these issues look very much like the issues I had, before
> I switched from an MTU of 9000 to 1500 ...
>
>
> Kind regards,
>
> --
> Kerio Operator in the Cloud? https://www.kerioindecloud.nl/
> Mark Schouten  | Tuxis Internet Engineering
> KvK: 61527076 | http://www.tuxis.nl/
> T: 0318 200208 | info@tuxis.nl
>
>
>
>  From:      <bugzilla-noreply@freebsd.org>
>  To:        <freebsd-net@FreeBSD.org>
>  Sent:      8-5-2015 19:42
>  Subject:   [Bug 199174] em tx and rx hang
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174
>
> --- Comment #15 from Sean Bruno <sbruno@FreeBSD.org> ---
> (In reply to david.keller from comment #14)
> Nothing fancy here.
>
> Server runs "iperf -p 8000 -s"  (8-core AMD box)
> Client under test runs this forever:
>
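> #!/bin/sh
> # Usage: pass the number of parallel iperf client streams as $1.
>
> FILE=test.out
>
> # Start each run with a fresh log file.
> rm -f "$FILE"
>
> # Run 600-second iperf tests back to back until interrupted.
> while true; do
>     date
>     iperf -p 8000 -c 192.168.100.1 -t 600 -P "$1" >> "$FILE"
> done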
>