From owner-freebsd-net@FreeBSD.ORG  Thu Feb 27 21:55:03 2014
In-Reply-To: <76EBC5F0-DA4E-4A60-A10E-093F4E1BD1EF@hostpoint.ch>
References: <532475749.13937791.1393462831884.JavaMail.root@uoguelph.ca>
 <76EBC5F0-DA4E-4A60-A10E-093F4E1BD1EF@hostpoint.ch>
Date: Thu, 27 Feb 2014 22:54:56 +0100
Message-ID: <CAHvs-HUpG9deHHekTdsQxNcZ63=VKHVm4miVLjxw=VzD-wgmrQ@mail.gmail.com>
Subject: Re: Network loss
From: Johan Kooijman <mail@johankooijman.com>
To: Markus Gebert <markus.gebert@hostpoint.ch>
Content-Type: text/plain; charset=ISO-8859-1
Cc: freebsd-net@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>,
 Jack Vogel <jfvogel@gmail.com>, John Baldwin <jhb@freebsd.org>

OK, so 9.1 is 100% OK then? Do you have any idea about 10.0?


On Thu, Feb 27, 2014 at 5:05 PM, Markus Gebert
<markus.gebert@hostpoint.ch> wrote:

>
> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> > John Baldwin wrote:
> >> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
> >>> Hi all,
> >>>
> >>> I have a weird situation here that I can't get my head around.
> >>>
> >>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
> >>> a while
> >>> the Linux clients lose their NFS connection:
> >>>
> >>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
> >>> timed out
> >>>
> >>> Not all boxes, just one out of the cluster. The weird part is that
> >>> when I
> >>> try to ping a Linux client from the FreeBSD box, I have between 10
> >>> and 30%
> >>> packetloss - all day long, no specific timeframe. If I ping the
> >>> Linux
> >>> clients - no loss. If I ping back from the Linux clients to FBSD
> >>> box - no
> >>> loss.
> >>>
> >>> The error I get when pinging a Linux client is this one:
> >>> ping: sendto: File too large
>
> We were facing similar problems when upgrading to 9.2 and have stayed with
> 9.1 on affected systems for now. We've seen this on HP G8 blades with
> 82599EB controllers:
>
> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>     class      = network
>     subclass   = ethernet
>
> We didn't find a way to trigger the problem reliably. But when it occurs,
> it usually affects only one interface. Symptoms include:
>
> - socket functions return the 'File too large' error mentioned by Johan
> - socket functions return 'No buffer space available'
> - heavy to full packet loss on the affected interface
> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
> have timed out stick around forever (the socket on the other side could
> have been closed hours ago)
> - userland programs using the corresponding sockets usually got stuck too
> (can't find kernel traces right now, but always in network related syscalls)
>
> Network is only lightly loaded on the affected systems (usually 5-20 Mbit/s,
> capped at 200 Mbit/s, per server), and netstat never showed any indication
> of resource shortage (like mbufs).
>
> What made the problem go away temporarily was to ifconfig down/up the
> affected interface.
>
> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
> when the problem started. Unfortunately, the ixgbe driver turned out to be
> mostly unstable on our systems between these releases, worse than on 9.2.
> The instability was introduced shortly after 9.1 and fixed only very
> shortly before the 9.2 release. So no luck there. We ended up using 9.1 with
> backports of 9.2 features we really need.
>
> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe driver
> or a combination of both that causes these problems. Unfortunately we ran
> out of time (and ideas).
>
>
> >> EFBIG is sometimes used for drivers when a packet takes too many
> >> scatter/gather entries.  Since you mentioned NFS, one thing you can
> >> try is to
> >> disable TSO on the interface you are using for NFS to see if that
> >> "fixes" it.
> >>
> > And please email if you try it and let us know if it helps.
> >
> > I think I've figured out how 64K NFS read replies can do this,
> > but I'll admit "ping" is a mystery. (Doesn't it just send a single
> > packet that would be in a single mbuf?)
> >
> > I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
> > don't know if it can happen for an mbuf chain with < 32 entries?
>
> We don't use the nfs server on our systems, but they're (new)nfsclients.
> So I don't think our problem is nfs related, unless the default rsize/wsize
> for client mounts is not 8K, which I thought it was. Can you confirm this,
> Rick?
>
> IIRC, disabling TSO did not make any difference in our case.
>
>
> Markus
>
>
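
For reference, here is a rough user-space sketch (Python, not driver code) of
the EFBIG explanation quoted above. It is only a model under stated
assumptions: one DMA segment per mbuf, data held in 2K clusters (MCLBYTES),
one extra mbuf for headers, and the 32-segment limit Rick mentions. The
function names segments_needed() and load_for_dma() are made up for
illustration and do not exist in the kernel.

```python
import errno

MAX_SEGS = 32      # assumed per-packet segment limit (the "32 entries" above)
MCLBYTES = 2048    # size of a standard FreeBSD mbuf cluster


def segments_needed(payload_bytes, header_mbufs=1, cluster_size=MCLBYTES):
    """Count mbufs in the chain; this model assumes one DMA segment per mbuf."""
    data_mbufs = -(-payload_bytes // cluster_size)  # ceiling division
    return header_mbufs + data_mbufs


def load_for_dma(payload_bytes, max_segs=MAX_SEGS):
    """Hypothetical stand-in for the DMA load step.

    Returns 0 on success, or errno.EFBIG ("File too large") when the chain
    needs more segments than the assumed hardware limit allows.
    """
    if segments_needed(payload_bytes) > max_segs:
        return errno.EFBIG
    return 0


if __name__ == "__main__":
    for label, size in (("ping", 56), ("8K NFS", 8192), ("64K NFS/TSO", 65536)):
        print(label, segments_needed(size), load_for_dma(size))
```

In this model a 64K send needs 33 segments and fails with EFBIG, while an 8K
NFS transfer or a 56-byte ping stays far below the limit. That matches Rick's
point that the ping failures are not explained by segment count alone.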


-- 
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail@johankooijman.com