Date: Fri, 7 Mar 2014 14:55:44 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Johan Kooijman <mail@johankooijman.com>, FreeBSD Net <freebsd-net@freebsd.org>, Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject: Re: 9.2 ixgbe tx queue hang (was: Network loss)
Message-ID: <BB22EF79-ECCF-42C7-8FD7-236565D5B540@hostpoint.ch>
In-Reply-To: <CAFOYbcny_dn2sh7QuQX33rzAFn+_y-848dqCUR+ncKeSe-D3DA@mail.gmail.com>
References: <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch> <CAFOYbcmrVms7VJmPCZHCTMDvBfsV775aDFkHhMrGAEAtPx8-Mw@mail.gmail.com> <02AD7510-C862-4C29-9420-25ABF1A6E744@hostpoint.ch> <CAFOYbcny_dn2sh7QuQX33rzAFn+_y-848dqCUR+ncKeSe-D3DA@mail.gmail.com>
On 06.03.2014, at 22:38, Jack Vogel <jfvogel@gmail.com> wrote:

> I suppose to be sure where the issue really occurs it would be best if
> both pseudo drivers were out of the picture. Once we see if it still
> occurs we can take the next step.
>
> Regards,
>
> Jack

I removed lagg and vlan interfaces on four servers shortly after your
email, and one of these is already showing the symptoms. I think this
time we caught it earlier, and saw EFBIG instead of ENOBUFS. Still
coming from only two cpus (or two tx queues):

# cpuset -l 0 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=1.150 ms
[…]

# cpuset -l 1 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]

# cpuset -l 2 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.385 ms
[…]

# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]

# cpuset -l 4 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: File too large
ping: sendto: File too large
ping: sendto: File too large
[…]

# cpuset -l 5 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=0.316 ms
[… (other cpus work)]

So the fact that only a few cpus/tx queues are affected remains true
even without lagg/vlans. But we've got a new stack trace:

# dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
[…]
  3  28502      _bus_dmamap_load_buffer:return
              kernel`_bus_dmamap_load_mbuf_sg+0x5f
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667
[…]

But after a while, everything went back to how it was yesterday (ENOBUFS):

# cpuset -l 3 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
ping: sendto: No buffer space available
[… (cpu 4 the same, other cpus work)]

and the already known stack trace:

# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25400 probes
CPU     ID                    FUNCTION:NAME
  3   7730            ixgbe_mq_start:return
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667

So in my opinion, the different situation in the first few minutes was
not because lagg and vlans were removed; it's just that we caught it
early and the buf_ring was not full yet. But at least now we've got a
stack trace of what happens before, which hopefully helps debugging
this.
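In case it helps to see why pinning ping with cpuset maps so cleanly
onto tx queues: below is roughly how I read the queue selection in the
9.2 ixgbe_mq_start() path. This is a paraphrased sketch from memory,
not verbatim driver code, so the exact field names and surrounding
locking should be checked against the source:

	/*
	 * Sketch: pick the tx ring for an outgoing mbuf.
	 * A locally generated ping carries no flowid, so the
	 * choice falls back to the cpu the sender runs on.
	 */
	if ((m->m_flags & M_FLOWID) && m->m_pkthdr.flowid != 0)
		i = m->m_pkthdr.flowid % adapter->num_queues;
	else
		i = curcpu % adapter->num_queues;

	txr = &adapter->tx_rings[i];
	/* ENOBUFS surfaces here once the chosen ring stays full. */
	err = drbr_enqueue(ifp, txr->br, m);

Assuming 8 tx queues, that mapping would also explain the earlier
observation on the 32-core box where cpus 2, 10, 18 and 26 all hit the
same (third) buf_ring.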
I also noticed that the queue txd sysctls barely change, or don't
change at all, on affected queues:

# while true; do sysctl dev.ix.1.queue3 dev.ix.1.queue4 | grep txd_; echo ----; sleep 1; done
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517
----
dev.ix.1.queue3.txd_head: 717
dev.ix.1.queue3.txd_tail: 717
dev.ix.1.queue4.txd_head: 1517
dev.ix.1.queue4.txd_tail: 1517

But I guess that's expected when the buf_ring is full… Maybe it's
important to you that I might have seen these values rise just once,
after the problem had already started.

And netstat -m:

# netstat -m
37872/10263/48135 mbufs in use (current/cache/total)
33142/11680/44822/132096 mbuf clusters in use (current/cache/total/max)
33044/6508 mbuf+clusters out of packet secondary zone in use (current/cache)
94/2995/3089/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
76128K/37905K/114033K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines


Markus


> On Thu, Mar 6, 2014 at 12:58 PM, Markus Gebert
> <markus.gebert@hostpoint.ch> wrote:
>
>>
>> On 06.03.2014, at 19:33, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>>> You did not make it explicit before, but I noticed in your dtrace
>>> info that you are using lagg, its been the source of lots of
>>> problems, so take it out of the setup and see if this queue problem
>>> still happens please.
>>>
>>> Jack
>>
>> Well, last year when upgrading another batch of servers (same
>> hardware) to 9.2, we tried to find a solution to this network
>> problem, and we eliminated lagg where we had used it before, which
>> did not help at all. That's why I didn't mention it explicitly.
>>
>> My point is, I can confirm that 9.2 has network problems on this
>> same hardware with or without lagg, so it's unlikely that removing
>> it will bring immediate success. OTOH, I didn't have this tx queue
>> theory back then, so I cannot be sure that what we saw then without
>> lagg, and what we see now with lagg, really are the same problem.
>>
>> I guess, for the sake of simplicity, I will remove lagg on these new
>> systems. But before I do that, to save time, I wanted to ask whether
>> I should remove the vlan interfaces too? While that didn't help
>> either last year, my guess is that I should take them out of the
>> picture, unless you say otherwise.
>>
>> Thanks for looking into this.
>>
>>
>> Markus
>>
>>
>>
>>> On Thu, Mar 6, 2014 at 2:24 AM, Markus Gebert
>>> <markus.gebert@hostpoint.ch> wrote:
>>>
>>>> (creating a new thread, because I'm no longer sure this is related
>>>> to Johan's thread that I originally used to discuss this)
>>>>
>>>> On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:
>>>>
>>>>> I would make SURE that you have enough mbuf resources of whatever
>>>>> size pool that you are using (2, 4, 9K), and I would try the code
>>>>> in HEAD if you had not.
>>>>>
>>>>> Jack
>>>>
>>>> Jack, we've upgraded some other systems on which I get more time
>>>> to debug (no impact for customers). Although those systems use the
>>>> nfsclient too, I no longer think that NFS is the source of the
>>>> problem (hence the new thread). I think it's the ixgbe driver
>>>> and/or card. When our problem occurs, it looks like it's a single
>>>> tx queue that gets stuck somehow (its buf_ring remains full).
>>>>
>>>> I tracked ping using dtrace to determine the source of the ENOBUFS
>>>> it returns every few packets when things get weird:
>>>>
>>>> # dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
>>>> dtrace: description 'fbt:::return ' matched 25476 probes
>>>> CPU     ID                    FUNCTION:NAME
>>>>  26   7730            ixgbe_mq_start:return
>>>>              if_lagg.ko`lagg_transmit+0xc4
>>>>              kernel`ether_output_frame+0x33
>>>>              kernel`ether_output+0x4fe
>>>>              kernel`ip_output+0xd74
>>>>              kernel`rip_output+0x229
>>>>              kernel`sosend_generic+0x3f6
>>>>              kernel`kern_sendit+0x1a3
>>>>              kernel`sendit+0xdc
>>>>              kernel`sys_sendto+0x4d
>>>>              kernel`amd64_syscall+0x5ea
>>>>              kernel`0xffffffff80d35667
>>>>
>>>>
>>>> The only way ixgbe_mq_start could return ENOBUFS would be when
>>>> drbr_enqueue() encounters a full tx buf_ring. Since a new ping
>>>> packet probably has no flow id, it should be assigned to a queue
>>>> based on curcpu, which made me try to pin ping to single cpus to
>>>> check whether it's always the same tx buf_ring that reports being
>>>> full. This turned out to be true:
>>>>
>>>> # cpuset -l 0 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms
>>>>
>>>> # cpuset -l 1 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms
>>>>
>>>> # cpuset -l 2 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>> ping: sendto: No buffer space available
>>>>
>>>> # cpuset -l 3 ping 10.0.4.5
>>>> PING 10.0.4.5 (10.0.4.5): 56 data bytes
>>>> 64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
>>>> 64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
>>>> [...snip...]
>>>>
>>>> The system has 32 cores; if ping runs on cpu 2, 10, 18 or 26,
>>>> which use the third tx buf_ring, ping reliably returns ENOBUFS. If
>>>> ping is run on any other cpu, using any other tx queue, it runs
>>>> without any packet loss.
>>>>
>>>> So, when ENOBUFS is returned, this is not due to an mbuf shortage,
>>>> it's because the buf_ring is full.
>>>> Not surprisingly, netstat -m looks pretty normal:
>>>>
>>>> # netstat -m
>>>> 38622/11823/50445 mbufs in use (current/cache/total)
>>>> 32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
>>>> 32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
>>>> 16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
>>>> 75431K/41863K/117295K bytes allocated to network (current/cache/total)
>>>> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
>>>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>>>> 0/0/0 sfbufs in use (current/peak/max)
>>>> 0 requests for sfbufs denied
>>>> 0 requests for sfbufs delayed
>>>> 0 requests for I/O initiated by sendfile
>>>> 0 calls to protocol drain routines
>>>>
>>>> In the meantime I've checked the commit log of the ixgbe driver in
>>>> HEAD, and besides the fact that there are few differences between
>>>> HEAD and 9.2, I don't see a commit that fixes anything related to
>>>> what we're seeing...
>>>>
>>>> So, what's the conclusion here? Firmware bug that's only triggered
>>>> under 9.2? Driver bug introduced between 9.1 and 9.2 when the new
>>>> multiqueue stuff was added? Jack, how should we proceed?
>>>>
>>>>
>>>> Markus
>>>>
>>>>
>>>>
>>>> On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
>>>> <markus.gebert@hostpoint.ch> wrote:
>>>>
>>>>>
>>>>> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>>>
>>>>>> John Baldwin wrote:
>>>>>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a weird situation here that I can't get my head around.
>>>>>>>>
>>>>>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients.
>>>>>>>> Once in a while the Linux clients lose their NFS connection:
>>>>>>>>
>>>>>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not
>>>>>>>> responding, timed out
>>>>>>>>
>>>>>>>> Not all boxes, just one out of the cluster. The weird part is
>>>>>>>> that when I try to ping a Linux client from the FreeBSD box, I
>>>>>>>> have between 10 and 30% packet loss - all day long, no
>>>>>>>> specific timeframe. If I ping the Linux clients - no loss. If
>>>>>>>> I ping back from the Linux clients to the FBSD box - no loss.
>>>>>>>>
>>>>>>>> The error I get when pinging a Linux client is this one:
>>>>>>>> ping: sendto: File too large
>>>>>
>>>>> We were facing similar problems when upgrading to 9.2 and have
>>>>> stayed with 9.1 on affected systems for now. We've seen this on
>>>>> HP G8 blades with 82599EB controllers:
>>>>>
>>>>> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>>>>>     vendor     = 'Intel Corporation'
>>>>>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>>>>>     class      = network
>>>>>     subclass   = ethernet
>>>>>
>>>>> We didn't find a way to trigger the problem reliably. But when it
>>>>> occurs, it usually affects only one interface.
>>>>> Symptoms include:
>>>>>
>>>>> - socket functions return the 'File too large' error mentioned by Johan
>>>>> - socket functions return 'No buffer space available'
>>>>> - heavy to full packet loss on the affected interface
>>>>> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that
>>>>>   should have timed out stick around forever (the socket on the
>>>>>   other side could have been closed hours ago)
>>>>> - userland programs using the corresponding sockets usually got
>>>>>   stuck too (can't find kernel traces right now, but always in
>>>>>   network related syscalls)
>>>>>
>>>>> The network is only lightly loaded on the affected systems
>>>>> (usually 5-20 mbit, capped at 200 mbit, per server), and netstat
>>>>> never showed any indication of resource shortage (like mbufs).
>>>>>
>>>>> What made the problem go away temporarily was to ifconfig down/up
>>>>> the affected interface.
>>>>>
>>>>> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not
>>>>> really stable. Also, we tested a few revisions between 9.1 and
>>>>> 9.2 to find out when the problem started. Unfortunately, the
>>>>> ixgbe driver turned out to be mostly unstable on our systems
>>>>> between these releases, worse than on 9.2. The instability was
>>>>> introduced shortly after 9.1 and fixed only very shortly before
>>>>> the 9.2 release. So no luck there. We ended up using 9.1 with
>>>>> backports of the 9.2 features we really need.
>>>>>
>>>>> What we can't tell is whether it's the 9.2 kernel or the 9.2
>>>>> ixgbe driver or a combination of both that causes these problems.
>>>>> Unfortunately we ran out of time (and ideas).
>>>>>
>>>>>
>>>>>>> EFBIG is sometimes used by drivers when a packet takes too many
>>>>>>> scatter/gather entries. Since you mentioned NFS, one thing you
>>>>>>> can try is to disable TSO on the interface you are using for
>>>>>>> NFS to see if that "fixes" it.
>>>>>>>
>>>>>> And please email if you try it and let us know if it helps.
>>>>>>
>>>>>> I think I've figured out how 64K NFS read replies can do this,
>>>>>> but I'll admit "ping" is a mystery? (Doesn't it just send a
>>>>>> single packet that would be in a single mbuf?)
>>>>>>
>>>>>> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
>>>>>> don't know if it can happen for an mbuf chain with < 32 entries?
>>>>>
>>>>> We don't use the nfs server on our systems, but they're
>>>>> (new)nfsclients. So I don't think our problem is nfs related,
>>>>> unless the default rsize/wsize for client mounts is not 8K, which
>>>>> I thought it was. Can you confirm this, Rick?
>>>>>
>>>>> IIRC, disabling TSO did not make any difference in our case.
>>>>>
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> freebsd-net@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>>
>>
>>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>