Date:      Thu, 27 Feb 2014 17:05:30 +0100
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Cc:        Johan Kooijman <mail@johankooijman.com>, freebsd-net@freebsd.org, Jack Vogel <jfvogel@gmail.com>
Subject:   Re: Network loss
Message-ID:  <76EBC5F0-DA4E-4A60-A10E-093F4E1BD1EF@hostpoint.ch>
In-Reply-To: <532475749.13937791.1393462831884.JavaMail.root@uoguelph.ca>
References:  <532475749.13937791.1393462831884.JavaMail.root@uoguelph.ca>


On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> John Baldwin wrote:
>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>> Hi all,
>>>
>>> I have a weird situation here where I can't get my head around.
>>>
>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>> a while
>>> the Linux clients lose their NFS connection:
>>>
>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>> timed out
>>>
>>> Not all boxes, just one out of the cluster. The weird part is that
>>> when I
>>> try to ping a Linux client from the FreeBSD box, I have between 10
>>> and 30%
>>> packetloss - all day long, no specific timeframe. If I ping the
>>> Linux
>>> clients - no loss. If I ping back from the Linux clients to FBSD
>>> box - no
>>> loss.
>>>
>>> The errors I get when pinging a Linux client is this one:
>>> ping: sendto: File too large

We were facing similar problems when upgrading to 9.2 and have stayed
with 9.1 on affected systems for now. We've seen this on HP G8 blades
with 82599EB controllers:

ix0@pci0:4:0:0:	class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
    class      = network
    subclass   = ethernet

We didn't find a way to trigger the problem reliably. But when it
occurs, it usually affects only one interface. Symptoms include:

- socket functions return the 'File too large' error mentioned by Johan
- socket functions return 'No buffer space available'
- heavy to full packet loss on the affected interface
- "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
should have timed out stick around forever (the socket on the other side
could have been closed hours ago)
- userland programs using the corresponding sockets usually got stuck
too (can't find kernel traces right now, but always in network-related
syscalls)
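
For reference, the two error strings in the list above are the standard messages for the errno values EFBIG and ENOBUFS. A minimal Python sketch, assuming a POSIX system, that prints the mapping; the helper name describe() is made up for illustration, and the exact message text comes from the local libc:

```python
import errno
import os

# errno values behind the two socket error messages listed above.
SYMPTOM_ERRNOS = (errno.EFBIG, errno.ENOBUFS)


def describe(err: int) -> str:
    """Return 'NAME: message' for an errno value (hypothetical helper)."""
    return f"{errno.errorcode[err]}: {os.strerror(err)}"


for err in SYMPTOM_ERRNOS:
    print(describe(err))
```

This only helps to match a message seen in a log or in ping output to the errno the kernel returned; it says nothing about which layer produced it.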

Network is only lightly loaded on the affected systems (usually 5-20
Mbit/s, capped at 200 Mbit/s, per server), and netstat never showed any
indication of resource shortage (like mbufs).

What made the problem go away temporarily was to ifconfig down/up the
affected interface.

We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
when the problem started. Unfortunately, the ixgbe driver turned out to
be mostly unstable on our systems between these releases, worse than on
9.2. The instability was introduced shortly after 9.1 and fixed only
very shortly before the 9.2 release. So no luck there. We ended up using
9.1 with backports of the 9.2 features we really need.

What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
driver or a combination of both that causes these problems.
Unfortunately we ran out of time (and ideas).


>> EFBIG is sometimes used for drivers when a packet takes too many
>> scatter/gather entries.  Since you mentioned NFS, one thing you can
>> try is to
>> disable TSO on the interface you are using for NFS to see if that
>> "fixes" it.
>>
> And please email if you try it and let us know if it helps.
>
> I think I've figured out how 64K NFS read replies can do this,
> but I'll admit "ping" is a mystery? (Doesn't it just send a single
> packet that would be in a single mbuf?)
>
> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
> don't know if it can happen for an mbuf chain with < 32 entries?

We don't use the NFS server on our systems, but they're (new) NFS
clients. So I don't think our problem is NFS related, unless the
default rsize/wsize for client mounts is not 8K, which I thought it
was. Can you confirm this, Rick?

IIRC, disabling TSO did not make any difference in our case.


Markus



