Date:      Thu, 6 Mar 2014 11:24:37 +0100
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        Jack Vogel <jfvogel@gmail.com>, freebsd-net@freebsd.org
Cc:        Johan Kooijman <mail@johankooijman.com>, Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject:   9.2 ixgbe tx queue hang (was: Network loss)
Message-ID:  <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch>

(creating a new thread, because I'm no longer sure this is related to
Johan's thread that I originally used to discuss this)

On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:

> I would make SURE that you have enough mbuf resources of whatever size pool
> that you are
> using (2, 4, 9K), and I would try the code in HEAD if you had not.
>
> Jack

Jack, we've upgraded some other systems that give me more time to
debug (no customer impact). Although those systems use the nfsclient
too, I no longer think that NFS is the source of the problem (hence
the new thread). I think it's the ixgbe driver and/or the card. When
our problem occurs, it looks like a single tx queue gets stuck somehow
(its buf_ring remains full).

I tracked ping using dtrace to determine the source of the ENOBUFS it
returns every few packets when things get weird:

# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25476 probes
CPU     ID                    FUNCTION:NAME
 26   7730            ixgbe_mq_start:return
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667



The only way ixgbe_mq_start() could return ENOBUFS is when
drbr_enqueue() encounters a full tx buf_ring. Since a new ping packet
probably has no flow id, it should be assigned to a queue based on
curcpu.
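
For reference, the queue selection looks roughly like this (a
simplified sketch from my reading of the stable/9 driver, not verbatim
source; field names and the deferred-start details may differ between
driver versions):

static int
ixgbe_mq_start(struct ifnet *ifp, struct mbuf *m)
{
    struct adapter *adapter = ifp->if_softc;
    struct tx_ring *txr;
    int i, err;

    /*
     * Pick a tx queue: use the flowid if the stack set one,
     * otherwise fall back to the current cpu.
     */
    if ((m->m_flags & M_FLOWID) != 0)
        i = m->m_pkthdr.flowid % adapter->num_queues;
    else
        i = curcpu % adapter->num_queues;

    txr = &adapter->tx_rings[i];

    /* drbr_enqueue() fails with ENOBUFS when txr->br is full. */
    err = drbr_enqueue(ifp, txr->br, m);
    if (err)
        return (err);

    /*
     * The packet is then sent either directly (if the tx lock can be
     * taken) or by the queue's taskqueue; details omitted. If the
     * queue is stuck and nothing ever drains txr->br, every later
     * enqueue from the cpus mapped to this queue gets ENOBUFS.
     */
    return (0);
}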
That made me pin ping to single cpus to check whether it's always the
same tx buf_ring that reports being full. This turned out to be true:

# cpuset -l 0 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms

# cpuset -l 1 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms

# cpuset -l 2 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available

# cpuset -l 3 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
[...snip...]

The system has 32 cores. If ping runs on cpu 2, 10, 18 or 26, all of
which use the third tx buf_ring, it reliably returns ENOBUFS. If ping
runs on any other cpu, and thus uses any other tx queue, there is no
packet loss at all.
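
To make the mapping explicit (assuming 8 tx queues on this box, which
matches the observation that cpus 8 apart hit the same ring), curcpu %
num_queues puts cpus 2, 10, 18 and 26 all on queue index 2, i.e. the
third buf_ring. A trivial userland illustration:

#include <stdio.h>

int
main(void)
{
    int num_queues = 8;    /* assumed for this box, not read from the driver */
    int cpu;

    for (cpu = 0; cpu < 32; cpu++)
        printf("cpu %2d -> tx queue %d\n", cpu, cpu % num_queues);
    return (0);
}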

So, when ENOBUFS is returned, this is not due to an mbuf shortage;
it's because that buf_ring is full. Not surprisingly, netstat -m looks
pretty normal:

# netstat -m
38622/11823/50445 mbufs in use (current/cache/total)
32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
75431K/41863K/117295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

In the meantime I've checked the commit log of the ixgbe driver in
HEAD, and besides the fact that there are only small differences
between HEAD and 9.2, I don't see a commit that fixes anything related
to what we're seeing...

So, what's the conclusion here? Firmware bug that's only triggered
under 9.2? Driver bug introduced between 9.1 and 9.2 when new
multiqueue stuff was added? Jack, how should we proceed?


Markus



On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
<markus.gebert@hostpoint.ch> wrote:

>
> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
>> John Baldwin wrote:
>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>> Hi all,
>>>>
>>>> I have a weird situation here that I can't get my head around.
>>>>
>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>>> a while
>>>> the Linux clients lose their NFS connection:
>>>>
>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>>> timed out
>>>>
>>>> Not all boxes, just one out of the cluster. The weird part is that
>>>> when I
>>>> try to ping a Linux client from the FreeBSD box, I have between 10
>>>> and 30%
>>>> packet loss - all day long, no specific timeframe. If I ping the
>>>> Linux
>>>> clients - no loss. If I ping back from the Linux clients to FBSD
>>>> box - no
>>>> loss.
>>>>
>>>> The error I get when pinging a Linux client is this one:
>>>> ping: sendto: File too large
>
> We were facing similar problems when upgrading to 9.2 and have stayed with
> 9.1 on affected systems for now. We've seen this on HP G8 blades with
> 82599EB controllers:
>
> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01
> hdr=0x00
>    vendor     = 'Intel Corporation'
>    device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>    class      = network
>    subclass   = ethernet
>
> We didn't find a way to trigger the problem reliably. But when it occurs,
> it usually affects only one interface. Symptoms include:
>
> - socket functions return the 'File too large' error mentioned by Johan
> - socket functions return 'No buffer space available'
> - heavy to full packet loss on the affected interface
> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
> have timed out stick around forever (the socket on the other side could have
> been closed hours ago)
> - userland programs using the corresponding sockets usually got stuck too
> (can't find kernel traces right now, but always in network-related syscalls)
>
> Network is only lightly loaded on the affected systems (usually 5-20 mbit,
> capped at 200 mbit, per server), and netstat never showed any indication of
> a resource shortage (like mbufs).
>
> What made the problem go away temporarily was to ifconfig down/up the
> affected interface.
>
> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
> when the problem started. Unfortunately, the ixgbe driver turned out to be
> mostly unstable on our systems between these releases, worse than on 9.2.
> The instability was introduced shortly after 9.1 and fixed only very
> shortly before the 9.2 release. So no luck there. We ended up using 9.1 with
> backports of the 9.2 features we really need.
>
> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe driver
> or a combination of both that causes these problems. Unfortunately we ran
> out of time (and ideas).
>
>
>>> EFBIG is sometimes used for drivers when a packet takes too many
>>> scatter/gather entries.  Since you mentioned NFS, one thing you can
>>> try is to
>>> disable TSO on the interface you are using for NFS to see if that
>>> "fixes" it.
>>>
>> And please email if you try it and let us know if it helps.
>>
>> I think I've figured out how 64K NFS read replies can do this,
>> but I'll admit "ping" is a mystery? (Doesn't it just send a single
>> packet that would be in a single mbuf?)
>>
>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
>> don't know if it can happen for an mbuf chain with < 32 entries?
>
> We don't use the nfs server on our systems, but they're (new)nfsclients.
> So I don't think our problem is nfs related, unless the default rsize/wsize
> for client mounts is not 8K, which I thought it was. Can you confirm this,
> Rick?
>
> IIRC, disabling TSO did not make any difference in our case.
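>
> (For context on the EFBIG/TSO theory: the EFBIG John mentions typically
> comes from the descriptor-mapping step in the transmit path. A rough
> sketch of the common driver pattern, not verbatim ixgbe source, with
> placeholder names like 'txtag' and 'map':
>
> error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>     segs, &nsegs, BUS_DMA_NOWAIT);
> if (error == EFBIG) {
>         /* Too many scatter/gather segments: compact the chain. */
>         struct mbuf *m = m_defrag(m_head, M_NOWAIT);
>         if (m == NULL) {
>                 m_freem(m_head);
>                 return (ENOBUFS);
>         }
>         m_head = m;
>         error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>             segs, &nsegs, BUS_DMA_NOWAIT);
>         /* If it still doesn't fit, EFBIG bubbles up to the caller. */
> }
>
> A TSO'd 64K NFS reply produces a long mbuf chain, which is where such a
> limit could plausibly be hit.)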
>
>
> Markus
>
>





