Date: Fri, 7 Mar 2014 14:55:44 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Johan Kooijman <mail@johankooijman.com>, FreeBSD Net <freebsd-net@freebsd.org>, Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject: Re: 9.2 ixgbe tx queue hang (was: Network loss)
Message-ID: <BB22EF79-ECCF-42C7-8FD7-236565D5B540@hostpoint.ch>
In-Reply-To: <CAFOYbcny_dn2sh7QuQX33rzAFn+_y-848dqCUR+ncKeSe-D3DA@mail.gmail.com>
References: <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch> <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com> <02AD7510-C862-4C29-9420-25ABF1A6E744@hostpoint.ch> <CAFOYbcny_dn2sh7QuQX33rzAFn+_y-848dqCUR+ncKeSe-D3DA@mail.gmail.com>
On 06.03.2014, at 22:38, Jack Vogel <jfvogel@gmail.com> wrote:

> I suppose to be sure where the issue really occurs it would be best if
> both pseudo drivers were out of the picture. Once we see if it still
> occurs we can take the next step.
>
> Regards,
>
> Jack

I removed lagg and vlan interfaces on four servers shortly after your
email, and one of these is already showing the symptoms. I think this
time we caught it earlier, and saw EFBIG instead of ENOBUFS. Still
coming from only two cpus (or two tx queues):

# cpuset -l 0 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=1.150 ms
[…]

# cpuset -l 1 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]

# cpuset -l 2 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]

# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]

# cpuset -l 4 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]

# cpuset -l 5 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.316 ms
[… (other cpus work)]

So the fact that only a few cpus/tx queues are affected remains true
even without lagg/vlans. But we've got a new stack trace:

# dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
[…]
  3  28502      _bus_dmamap_load_buffer:return
              kernel`_bus_dmamap_load_mbuf_sg+0x5f
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667
[…]

But after a while, everything went back to how it was yesterday (ENOBUFS):

# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: No buffer space available
[… (cpu 4 the same, other cpus work)]

and the already known stack trace:

# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
CPU     ID                    FUNCTION:NAME
  3   7730            ixgbe_mq_start:return
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667

So in my opinion, the different situation in the first few minutes was
not because lagg and vlans were removed; it's just that we caught it
early and the buf_ring was not full yet. But at least now we've got a
stack trace of what happens before, which hopefully helps debugging
this.
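In case it helps to see why pinning ping with cpuset maps so cleanly
onto tx queues: below is roughly how I read the queue selection in the
9.2 ixgbe_mq_start() path. This is a paraphrased sketch from memory,
not verbatim driver code, so the exact field names and surrounding
locking should be checked against the source:

	/*
	 * Sketch: pick the tx ring for an outgoing mbuf.
	 * A locally generated ping carries no flowid, so the
	 * choice falls back to the cpu the sender runs on.
	 */
	if ((m->m_flags & M_FLOWID) && m->m_pkthdr.flowid != 0)
		i = m->m_pkthdr.flowid % adapter->num_queues;
	else
		i = curcpu % adapter->num_queues;

	txr = &adapter->tx_rings[i];
	/* ENOBUFS surfaces here once the chosen ring stays full. */
	err = drbr_enqueue(ifp, txr->br, m);

Assuming 8 tx queues, that mapping would also explain the earlier
observation on the 32-core box where cpus 2, 10, 18 and 26 all hit the
same (third) buf_ring.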
I also noticed that the queue txd sysctls barely change, or don't
change at all, on affected queues:

# while true; do sysctl dev.ix.1.queue3 dev.ix.1.queue4 | grep txd_; echo ----; sleep 1; done
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517

But I guess that's expected when the buf_ring is full… Maybe it's
important to you that I might have seen these values rise just once,
after the problem had already started.

And netstat -m:

# netstat -m
37872/10263/48135 mbufs in use (current/cache/total)
33142/11680/44822/132096 mbuf clusters in use (current/cache/total/max)
33044/6508 mbuf+clusters out of packet secondary zone in use (current/cache)
94/2995/3089/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
76128K/37905K/114033K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines


Markus


> On Thu, Mar 6, 2014 at 12:58 PM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>>
>> On 06.03.2014, at 19:33, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>>> You did not make it explicit before, but I noticed in your dtrace
>>> info that you are using lagg, its been the source of lots of
>>> problems, so take it out of the setup and see if this queue problem
>>> still happens please.
>>>
>>> Jack
>>
>> Well, last year when upgrading another batch of servers (same
>> hardware) to 9.2, we tried to find a solution to this network
>> problem, and we eliminated lagg where we had used it before, which
>> did not help at all. That's why I didn't mention it explicitly.
>>
>> My point is, I can confirm that 9.2 has network problems on this
>> same hardware with or without lagg, so it's unlikely that removing
>> it will bring immediate success. OTOH, I didn't have this tx queue
>> theory back then, so I cannot be sure that what we saw then without
>> lagg, and what we see now with lagg, really are the same problem.
>>
>> I guess, for the sake of simplicity, I will remove lagg on these new
>> systems. But before I do that, to save time, I wanted to ask whether
>> I should remove the vlan interfaces too? While that didn't help
>> either last year, my guess is that I should take them out of the
>> picture, unless you say otherwise.
>>
>> Thanks for looking into this.
>>
>>
>> Markus
>>
>>
>>
>>> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert
>>> <markus.gebert@hostpoint.ch> wrote:
>>>
>>>> (creating a new thread, because I'm no longer sure this is related
>>>> to Johan's thread that I originally used to discuss this)
>>>>
>>>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:
>>>>
>>>>> I would make SURE that you have enough mbuf resources of whatever
>>>>> size pool that you are using (2, 4, 9K), and I would try the code
>>>>> in HEAD if you had not.
>>>>>
>>>>> Jack
>>>>
>>>> Jack, we've upgraded some other systems on which I get more time
>>>> to debug (no impact for customers). Although those systems use the
>>>> nfsclient too, I no longer think that NFS is the source of the
>>>> problem (hence the new thread). I think it's the ixgbe driver
>>>> and/or card. When our problem occurs, it looks like it's a single
>>>> tx queue that gets stuck somehow (its buf_ring remains full).
>>>>
>>>> I tracked ping using dtrace to determine the source of the ENOBUFS
>>>> it returns every few packets when things get weird:
>>>>
>>>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
>>>> dtrace: description 'fbt:::return ' matched 25476 probes
>>>> CPU     ID                    FUNCTION:NAME
>>>>  26   7730            ixgbe_mq_start:return
>>>>              if_lagg.ko`lagg_transmit+0xc4
>>>>              kernel`ether_output_frame+0x33
>>>>              kernel`ether_output+0x4fe
>>>>              kernel`ip_output+0xd74
>>>>              kernel`rip_output+0x229
>>>>              kernel`sosend_generic+0x3f6
>>>>              kernel`kern_sendit+0x1a3
>>>>              kernel`sendit+0xdc
>>>>              kernel`sys_sendto+0x4d
>>>>              kernel`amd64_syscall+0x5ea
>>>>              kernel`0xffffffff80d35667
>>>>
>>>>
>>>> The only way ixgbe_mq_start could return ENOBUFS would be when
>>>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping
>>>> packet probably has no flow id, it should be assigned to a queue
>>>> based on curcpu, which made me try to pin ping to single cpus to
>>>> check whether it's always the same tx buf_ring that reports being
>>>> full. This turned out to be true:
>>>>
>>>> # cpuset -l 0 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>>>>
>>>> # cpuset -l 1 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>>>>
>>>> # cpuset -l 2 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>>
>>>> # cpuset -l 3 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>>>> [...snip...]
>>>>
>>>> The system has 32 cores; if ping runs on cpu 2, 10, 18 or 26,
>>>> which use the third tx buf_ring, ping reliably returns ENOBUFS. If
>>>> ping is run on any other cpu, using any other tx queue, it runs
>>>> without any packet loss.
>>>>
>>>> So, when ENOBUFS is returned, this is not due to an mbuf shortage,
>>>> it's because the buf_ring is full.
>>>> Not surprisingly, netstat -m looks pretty normal:
>>>>
>>>> # netstat -m
>>>> 38622/11823/50445 mbufs in use (current/cache/total)
>>>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>>>> 32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
>>>> 16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>>>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>>>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>>>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>>>> 0/0/0 sfbufs in use (current/peak/max)
>>>> 0 requests for sfbufs denied
>>>> 0 requests for sfbufs delayed
>>>> 0 requests for I/O initiated by sendfile
>>>> 0 calls to protocol drain routines
>>>>
>>>> In the meantime I've checked the commit log of the ixgbe driver in
>>>> HEAD, and besides the fact that there are few differences between
>>>> HEAD and 9.2, I don't see a commit that fixes anything related to
>>>> what we're seeing...
>>>>
>>>> So, what's the conclusion here? Firmware bug that's only triggered
>>>> under 9.2? Driver bug introduced between 9.1 and 9.2 when the new
>>>> multiqueue stuff was added? Jack, how should we proceed?
>>>>
>>>>
>>>> Markus
>>>>
>>>>
>>>>
>>>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
>>>> <markus.gebert@hostpoint.ch> wrote:
>>>>
>>>>>
>>>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>>>
>>>>>> John Baldwin wrote:
>>>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a weird situation here that I can't get my head around.
>>>>>>>>
>>>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients.
>>>>>>>> Once in a while the Linux clients lose their NFS connection:
>>>>>>>>
>>>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not
>>>>>>>> responding, timed out
>>>>>>>>
>>>>>>>> Not all boxes, just one out of the cluster. The weird part is
>>>>>>>> that when I try to ping a Linux client from the FreeBSD box, I
>>>>>>>> have between 10 and 30% packet loss - all day long, no
>>>>>>>> specific timeframe. If I ping the Linux clients - no loss. If
>>>>>>>> I ping back from the Linux clients to the FBSD box - no loss.
>>>>>>>>
>>>>>>>> The error I get when pinging a Linux client is this one:
>>>>>>>> ping: sendto: File too large
>>>>>
>>>>> We were facing similar problems when upgrading to 9.2 and have
>>>>> stayed with 9.1 on affected systems for now. We've seen this on
>>>>> HP G8 blades with 82599EB controllers:
>>>>>
>>>>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>>>>     vendor     = 'Intel Corporation'
>>>>>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>>>     class      = network
>>>>>     subclass   = ethernet
>>>>>
>>>>> We didn't find a way to trigger the problem reliably. But when it
>>>>> occurs, it usually affects only one interface.
>>>>> Symptoms include:
>>>>>
>>>>> - socket functions return the 'File too large' error mentioned by Johan
>>>>> - socket functions return 'No buffer space available'
>>>>> - heavy to full packet loss on the affected interface
>>>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
>>>>>   should have timed out stick around forever (the socket on the
>>>>>   other side could have been closed hours ago)
>>>>> - userland programs using the corresponding sockets usually got
>>>>>   stuck too (can't find kernel traces right now, but always in
>>>>>   network related syscalls)
>>>>>
>>>>> The network is only lightly loaded on the affected systems
>>>>> (usually 5-20 mbit, capped at 200 mbit, per server), and netstat
>>>>> never showed any indication of resource shortage (like mbufs).
>>>>>
>>>>> What made the problem go away temporarily was to ifconfig down/up
>>>>> the affected interface.
>>>>>
>>>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not
>>>>> really stable. Also, we tested a few revisions between 9.1 and
>>>>> 9.2 to find out when the problem started. Unfortunately, the
>>>>> ixgbe driver turned out to be mostly unstable on our systems
>>>>> between these releases, worse than on 9.2. The instability was
>>>>> introduced shortly after 9.1 and fixed only very shortly before
>>>>> the 9.2 release. So no luck there. We ended up using 9.1 with
>>>>> backports of the 9.2 features we really need.
>>>>>
>>>>> What we can't tell is whether it's the 9.2 kernel or the 9.2
>>>>> ixgbe driver or a combination of both that causes these problems.
>>>>> Unfortunately we ran out of time (and ideas).
>>>>>
>>>>>
>>>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>>>> scatter/gather entries. Since you mentioned NFS, one thing you
>>>>>>> can try is to disable TSO on the interface you are using for
>>>>>>> NFS to see if that "fixes" it.
>>>>>>>
>>>>>> And please email if you try it and let us know if it helps.
>>>>>>
>>>>>> I think I've figured out how 64K NFS read replies can do this,
>>>>>> but I'll admit "ping" is a mystery? (Doesn't it just send a
>>>>>> single packet that would be in a single mbuf?)
>>>>>>
>>>>>> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
>>>>>> don't know if it can happen for an mbuf chain with < 32 entries?
>>>>>
>>>>> We don't use the nfs server on our systems, but they're
>>>>> (new)nfsclients. So I don't think our problem is nfs related,
>>>>> unless the default rsize/wsize for client mounts is not 8K, which
>>>>> I thought it was. Can you confirm this, Rick?
>>>>>
>>>>> IIRC, disabling TSO did not make any difference in our case.
>>>>>
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> freebsd-net@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>>
>>
>>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>