Date: Thu, 6 Mar 2014 11:24:37 +0100
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Jack Vogel <jfvogel@gmail.com>, freebsd-net@freebsd.org
Cc: Johan Kooijman <mail@johankooijman.com>, Rick Macklem <rmacklem@uoguelph.ca>, John Baldwin <jhb@freebsd.org>
Subject: 9.2 ixgbe tx queue hang (was: Network loss)
Message-ID: <9C5B43BD-9D80-49EA-8EDC-C7EF53D79C8D@hostpoint.ch>
(creating a new thread, because I'm no longer sure this is related to Johan's thread that I originally used to discuss this)

On 27.02.2014, at 18:02, Jack Vogel <jfvogel@gmail.com> wrote:

> I would make SURE that you have enough mbuf resources of whatever size
> pool that you are using (2, 4, 9K), and I would try the code in HEAD if
> you had not.
>
> Jack

Jack, we've upgraded some other systems on which I get more time to debug (no impact for customers). Although those systems use the nfsclient too, I no longer think that NFS is the source of the problem (hence the new thread). I think it's the ixgbe driver and/or the card. When our problem occurs, it looks like a single tx queue gets stuck somehow (its buf_ring remains full).

I tracked ping with dtrace to determine the source of the ENOBUFS it returns every few packets when things get weird:

# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25476 probes
CPU     ID                    FUNCTION:NAME
 26   7730          ixgbe_mq_start:return
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667

The only way ixgbe_mq_start() can return ENOBUFS is when drbr_enqueue() encounters a full tx buf_ring. Since a new ping packet probably has no flow id, it should be assigned to a queue based on curcpu, which made me pin ping to single cpus to check whether it's always the same tx buf_ring that reports being full. This turned out to be true:

# cpuset -l 0 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms

# cpuset -l 1 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms

# cpuset -l 2 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available

# cpuset -l 3 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms

[...snip...]

The system has 32 cores. If ping runs on cpu 2, 10, 18 or 26, which all use the third tx buf_ring, it reliably returns ENOBUFS. If ping is run on any other cpu, using any other tx queue, it runs without any packet loss.

So, when ENOBUFS is returned, this is not due to an mbuf shortage, it's because the buf_ring is full.
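For context, this is how I understand the queue selection (a simplified sketch, not the literal ixgbe_mq_start() from the driver; the direct-transmit path taken when the tx lock is free is left out, and field names like num_queues/tx_rings are only illustrative):

/*
 * Simplified sketch of how a packet gets assigned to a tx queue, as I
 * understand the 9.2 driver -- not the literal ixgbe code.
 */
static int
mq_start_sketch(struct ifnet *ifp, struct mbuf *m)
{
        struct adapter *adapter = ifp->if_softc;   /* driver softc */
        struct tx_ring *txr;
        int i;

        if (m->m_flags & M_FLOWID)
                /* packets carrying a flow id are hashed onto a queue */
                i = m->m_pkthdr.flowid % adapter->num_queues;
        else
                /* everything else (like a fresh ping) goes by curcpu */
                i = curcpu % adapter->num_queues;

        txr = &adapter->tx_rings[i];

        /*
         * If the chosen queue's buf_ring is full, drbr_enqueue() fails
         * with ENOBUFS, which ping reports as "No buffer space available".
         */
        return (drbr_enqueue(ifp, txr->br, m));
}

With (presumably) 8 tx queues on this box, cpus 2, 10, 18 and 26 all map to ring index 2, which matches the cpuset results above.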
Not surprisingly, netstat -m looks pretty normal:

# netstat -m
38622/11823/50445 mbufs in use (current/cache/total)
32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
75431K/41863K/117295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

In the meantime I've checked the commit log of the ixgbe driver in HEAD, and apart from the few small differences between HEAD and 9.2, I don't see a commit that fixes anything related to what we're seeing...

So, what's the conclusion here? A firmware bug that's only triggered under 9.2? A driver bug introduced between 9.1 and 9.2 when the new multiqueue stuff was added? Jack, how should we proceed?

Markus


On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert <markus.gebert@hostpoint.ch> wrote:

>
> On 27.02.2014, at 02:00, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
>> John Baldwin wrote:
>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>> Hi all,
>>>>
>>>> I have a weird situation here that I can't get my head around.
>>>>
>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
>>>> a while the Linux clients lose their NFS connection:
>>>>
>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>>> timed out
>>>>
>>>> Not all boxes, just one out of the cluster. The weird part is that
>>>> when I try to ping a Linux client from the FreeBSD box, I have
>>>> between 10 and 30% packet loss - all day long, no specific
>>>> timeframe. If I ping the Linux clients - no loss. If I ping back
>>>> from the Linux clients to the FreeBSD box - no loss.
>>>>
>>>> The error I get when pinging a Linux client is this one:
>>>> ping: sendto: File too large
>
> We were facing similar problems when upgrading to 9.2 and have stayed
> with 9.1 on affected systems for now. We've seen this on HP G8 blades
> with 82599EB controllers:
>
> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>     class      = network
>     subclass   = ethernet
>
> We didn't find a way to trigger the problem reliably. But when it occurs,
> it usually affects only one interface. Symptoms include:
>
> - socket functions return the 'File too large' error mentioned by Johan
> - socket functions return 'No buffer space available'
> - heavy to full packet loss on the affected interface
> - "stuck" TCP connections, i.e.
>   ESTABLISHED TCP connections that should have timed out stick around
>   forever (the socket on the other side could have been closed hours ago)
> - userland programs using the corresponding sockets usually got stuck too
>   (can't find kernel traces right now, but always in network related
>   syscalls)
>
> Network is only lightly loaded on the affected systems (usually 5-20 mbit,
> capped at 200 mbit, per server), and netstat never showed any indication
> of resource shortage (like mbufs).
>
> What made the problem go away temporarily was to ifconfig down/up the
> affected interface.
>
> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
> when the problem started. Unfortunately, the ixgbe driver turned out to
> be mostly unstable on our systems between these releases, worse than on
> 9.2. The instability was introduced shortly after 9.1 and fixed only very
> shortly before the 9.2 release. So no luck there. We ended up using 9.1
> with backports of the 9.2 features we really need.
>
> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe driver
> or a combination of both that causes these problems. Unfortunately we ran
> out of time (and ideas).
>
>
>>> EFBIG is sometimes used by drivers when a packet takes too many
>>> scatter/gather entries. Since you mentioned NFS, one thing you can
>>> try is to disable TSO on the interface you are using for NFS to see
>>> if that "fixes" it.
>>>
>> And please email if you try it and let us know if it helps.
>>
>> I think I've figured out how 64K NFS read replies can do this,
>> but I'll admit "ping" is a mystery? (Doesn't it just send a single
>> packet that would be in a single mbuf?)
>>
>> I think the EFBIG is replied by bus_dmamap_load_mbuf_sg(), but I
>> don't know if it can happen for an mbuf chain with < 32 entries?
>
> We don't use the nfs server on our systems, but they're (new)nfsclients.
> So I don't think our problem is nfs related, unless the default
> rsize/wsize for client mounts is not 8K, which I thought it was. Can you
> confirm this, Rick?
>
> IIRC, disabling TSO did not make any difference in our case.
>
>
> Markus
>
>