Date: Fri, 28 Feb 2014 18:15:11 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Johan Kooijman <mail@johankooijman.com>
Cc: freebsd-net@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, Jack Vogel <jfvogel@gmail.com>, John Baldwin <jhb@freebsd.org>
Subject: Re: Network loss
Message-ID: <29112316-5FC9-4DA1-BD0C-BCA61D3997E3@hostpoint.ch>
In-Reply-To: <CAHvs-HUpG9deHHekTdsQxNcZ63=VKHVm4miVLjxw=VzD-wgmrQ@mail.gmail.com>
References: <532475749.13937791.1393462831884.JavaMail.root@uoguelph.ca> <76EBC5F0-DA4E-4A60-A10E-093F4E1BD1EF@hostpoint.ch> <CAHvs-HUpG9deHHekTdsQxNcZ63=VKHVm4miVLjxw=VzD-wgmrQ@mail.gmail.com>

On 27.02.2014, at 22:54, Johan Kooijman <mail@johankooijman.com> wrote:

> Ok, so 9.1 is 100% OK then? Do you have any idea about 10.0 ?

It's at least good enough and way better than 9.2. But you'll have to
test it yourself; I don't think we're running the same hardware.

The problem is that 9.1 will go EOL at some point, and we do not know
whether this will be fixed by then.

> On Thu, Feb 27, 2014 at 5:05 PM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>>
>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> John Baldwin wrote:
>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a weird situation here that I can't get my head around.
>>>>>
>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>>>> a while the Linux clients lose their NFS connection:
>>>>>
>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding, timed out
>>>>>
>>>>> Not all boxes, just one out of the cluster. The weird part is that
>>>>> when I try to ping a Linux client from the FreeBSD box, I have
>>>>> between 10 and 30% packet loss - all day long, no specific
>>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>>> from the Linux clients to the FBSD box - no loss.
>>>>>
>>>>> The error I get when pinging a Linux client is this one:
>>>>> ping: sendto: File too large
>>
>> We were facing similar problems when upgrading to 9.2 and have stayed
>> with 9.1 on affected systems for now. We've seen this on HP G8 blades
>> with 82599EB controllers:
>>
>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>     vendor     = 'Intel Corporation'
>>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>     class      = network
>>     subclass   = ethernet
>>
>> We didn't find a way to trigger the problem reliably.
>> But when it occurs, it usually affects only one interface. Symptoms
>> include:
>>
>> - socket functions return the 'File too large' error mentioned by Johan
>> - socket functions return 'No buffer space available'
>> - heavy to full packet loss on the affected interface
>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
>>   have timed out stick around forever (the socket on the other side
>>   could have been closed hours ago)
>> - userland programs using the corresponding sockets usually got stuck
>>   too (can't find kernel traces right now, but always in network-related
>>   syscalls)
>>
>> Network is only lightly loaded on the affected systems (usually 5-20
>> mbit, capped at 200 mbit, per server), and netstat never showed any
>> indication of resource shortage (like mbufs).
>>
>> What made the problem go away temporarily was to ifconfig down/up the
>> affected interface.
>>
>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
>> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
>> when the problem started. Unfortunately, the ixgbe driver turned out to
>> be mostly unstable on our systems between these releases, worse than on
>> 9.2. The instability was introduced shortly after 9.1 and fixed only
>> very shortly before the 9.2 release. So no luck there. We ended up using
>> 9.1 with backports of 9.2 features we really need.
>>
>> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
>> driver or a combination of both that causes these problems.
>> Unfortunately we ran out of time (and ideas).
>>
>>
>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>> scatter/gather entries. Since you mentioned NFS, one thing you can
>>>> try is to disable TSO on the interface you are using for NFS to see
>>>> if that "fixes" it.
>>>>
>>> And please email if you try it and let us know if it helps.
>>>
>>> I think I've figured out how 64K NFS read replies can do this,
>>> but I'll admit "ping" is a mystery? (Doesn't it just send a single
>>> packet that would be in a single mbuf?)
>>>
>>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
>>> don't know if it can happen for an mbuf chain with < 32 entries?
>>
>> We don't use the nfs server on our systems, but they're (new)nfsclients.
>> So I don't think our problem is nfs related, unless the default
>> rsize/wsize for client mounts is not 8K, which I thought it was. Can you
>> confirm this, Rick?
>>
>> IIRC, disabling TSO did not make any difference in our case.
>>
>>
>> Markus
>>
>>
>
>
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
>
> T +31(0) 6 43 44 45 27
> F +31(0) 162 82 00 01
> E mail@johankooijman.com
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
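
For reference, the two error strings reported in this thread are the standard system messages for errno values: 'File too large' is EFBIG (as John Baldwin notes) and 'No buffer space available' is presumably ENOBUFS. A minimal sketch, assuming a POSIX system with Python available, that looks the messages up; the REPORTED table and the describe() helper are illustrative names, not part of any tool mentioned above:

```python
import errno
import os

# Error messages quoted in the thread, keyed by the errno name assumed
# to be behind each one (the ENOBUFS mapping is an assumption).
REPORTED = {
    "EFBIG": "File too large",
    "ENOBUFS": "No buffer space available",
}


def describe(name):
    """Return (numeric code, system message) for an errno name such as 'EFBIG'."""
    code = getattr(errno, name)
    return code, os.strerror(code)


for name, quoted in REPORTED.items():
    code, message = describe(name)
    print(f"{name} ({code}): {message!r}; matches thread: {message == quoted}")
```

The numeric codes can differ between operating systems, so only the symbolic names and messages are worth comparing across the FreeBSD server and the Linux clients.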