Date:      Fri, 28 Feb 2014 18:15:11 +0100
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        Johan Kooijman <mail@johankooijman.com>
Cc:        freebsd-net@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, Jack Vogel <jfvogel@gmail.com>, John Baldwin <jhb@freebsd.org>
Subject:   Re: Network loss
Message-ID:  <29112316-5FC9-4DA1-BD0C-BCA61D3997E3@hostpoint.ch>
In-Reply-To: <CAHvs-HUpG9deHHekTdsQxNcZ63=VKHVm4miVLjxw=VzD-wgmrQ@mail.gmail.com>
References:  <532475749.13937791.1393462831884.JavaMail.root@uoguelph.ca> <76EBC5F0-DA4E-4A60-A10E-093F4E1BD1EF@hostpoint.ch> <CAHvs-HUpG9deHHekTdsQxNcZ63=VKHVm4miVLjxw=VzD-wgmrQ@mail.gmail.com>

On 27.02.2014, at 22:54, Johan Kooijman <mail@johankooijman.com> wrote:

> Ok, so 9.1 is 100% OK then? Do you have any idea about 10.0 ?

It's at least good enough and way better than 9.2. But you'll have to
test yourself, I don't think we're running the same hardware.

The problem is that 9.1 will go EOL at some point, and we do not know
whether this will be fixed by then.


> On Thu, Feb 27, 2014 at 5:05 PM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>>
>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> John Baldwin wrote:
>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a weird situation here that I can't get my head around.
>>>>>
>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>>>> a while the Linux clients lose their NFS connection:
>>>>>
>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding, timed out
>>>>>
>>>>> Not all boxes, just one out of the cluster. The weird part is that
>>>>> when I try to ping a Linux client from the FreeBSD box, I have
>>>>> between 10 and 30% packet loss - all day long, no specific
>>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>>> from the Linux clients to the FreeBSD box - no loss.
>>>>>
>>>>> The error I get when pinging a Linux client is this one:
>>>>> ping: sendto: File too large
>>
>> We were facing similar problems when upgrading to 9.2 and have stayed
>> with 9.1 on affected systems for now. We've seen this on HP G8 blades
>> with 82599EB controllers:
>>
>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>    vendor     = 'Intel Corporation'
>>    device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>    class      = network
>>    subclass   = ethernet
>>
>> We didn't find a way to trigger the problem reliably. But when it
>> occurs, it usually affects only one interface. Symptoms include:
>>
>> - socket functions return the 'File too large' error mentioned by Johan
>> - socket functions return 'No buffer space available'
>> - heavy to full packet loss on the affected interface
>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
>>   should have timed out stick around forever (the socket on the other
>>   side could have been closed hours ago)
>> - userland programs using the corresponding sockets usually got stuck
>>   too (can't find kernel traces right now, but always in network
>>   related syscalls)
>>
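For anyone matching these messages to error codes: 'File too large' and
'No buffer space available' are the usual strerror() texts for EFBIG and
ENOBUFS. A minimal sketch in Python (not from the original thread;
describe() is just an illustrative helper) that prints both as the local
libc reports them:

```python
import errno
import os


def describe(err):
    """Return (symbolic name, message text) for an errno value."""
    return errno.errorcode.get(err, "UNKNOWN"), os.strerror(err)


# The two errors named in the symptom list above.
for code in (errno.EFBIG, errno.ENOBUFS):
    name, text = describe(code)
    print(f"{name} ({code}): {text}")
```

The numeric values differ between operating systems, so compare the
symbolic names rather than the numbers when reading logs from both
FreeBSD and Linux hosts.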
>> Network is only lightly loaded on the affected systems (usually 5-20
>> Mbit/s, capped at 200 Mbit/s, per server), and netstat never showed any
>> indication of resource shortage (like mbufs).
>>
>> What made the problem go away temporarily was to ifconfig down/up the
>> affected interface.
>>
>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
>> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
>> when the problem started. Unfortunately, the ixgbe driver turned out to
>> be mostly unstable on our systems between these releases, worse than on
>> 9.2. The instability was introduced shortly after 9.1 and fixed only
>> very shortly before the 9.2 release. So no luck there. We ended up
>> using 9.1 with backports of the 9.2 features we really need.
>>
>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
>> driver or a combination of both that causes these problems.
>> Unfortunately we ran out of time (and ideas).
>>
>>
>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>> scatter/gather entries.  Since you mentioned NFS, one thing you can
>>>> try is to disable TSO on the interface you are using for NFS to see
>>>> if that "fixes" it.
>>>>
>>> And please email if you try it and let us know if it helps.
>>>
>>> I think I've figured out how 64K NFS read replies can do this,
>>> but I'll admit "ping" is a mystery. (Doesn't it just send a single
>>> packet that would be in a single mbuf?)
>>>
>>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
>>> don't know if it can happen for an mbuf chain with < 32 entries.
>>
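To put rough numbers on the scatter/gather theory, here is a
back-of-the-envelope sketch in Python. It is not driver code. It assumes
one DMA segment per mbuf, 2 KB mbuf clusters, one extra mbuf for headers,
and a limit of 32 segments taken from the "32 entries" remark above; all
names are made up for illustration.

```python
import errno
import math

MAX_SEGMENTS = 32     # assumed limit, from the "32 entries" remark
CLUSTER_BYTES = 2048  # assumed mbuf cluster size


def chain_segments(payload_bytes, header_mbufs=1):
    """Estimate segments: header mbufs plus one cluster per 2 KB of payload."""
    return header_mbufs + math.ceil(payload_bytes / CLUSTER_BYTES)


def load_result(payload_bytes, header_mbufs=1):
    """Return 0 if the chain fits the assumed limit, else errno.EFBIG."""
    if chain_segments(payload_bytes, header_mbufs) > MAX_SEGMENTS:
        return errno.EFBIG
    return 0


# 64K NFS read reply, 8K transfer, and a small ping-sized payload.
for size in (65536, 8192, 56):
    print(size, chain_segments(size), load_result(size))
```

Under these assumptions a 64K reply needs 33 segments and would be
rejected with EFBIG, while an 8K transfer or a small ping packet stays far
below the limit. That is consistent with ping being the puzzling case.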
>> We don't use the nfs server on our systems, but they're
>> (new)nfsclients. So I don't think our problem is nfs related, unless
>> the default rsize/wsize for client mounts is not 8K, which I thought it
>> was. Can you confirm this, Rick?
>>
>> IIRC, disabling TSO did not make any difference in our case.
>>
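For reference, the two workarounds discussed in this thread (disabling
TSO, and taking the interface down and up) correspond to the ifconfig
invocations below. This is a small Python sketch that only builds the
command lines and prints them by default. ix0 is the interface name from
the pciconf output above, and the helper names are hypothetical. Running
the commands for real needs root on the FreeBSD box.

```python
import subprocess


def tso_off_cmd(ifname):
    """ifconfig command line that disables TSO on an interface."""
    return ["ifconfig", ifname, "-tso"]


def bounce_cmds(ifname):
    """ifconfig command lines that take an interface down, then up again."""
    return [["ifconfig", ifname, "down"], ["ifconfig", ifname, "up"]]


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run_all([tso_off_cmd("ix0")] + bounce_cmds("ix0")) prints the
three commands without touching the interface; pass dry_run=False to
execute them.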
>>
>> Markus
>>
>>
>
>
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
>
> T +31(0) 6 43 44 45 27
> F +31(0) 162 82 00 01
> E mail@johankooijman.com
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>



