Date: Thu, 20 Mar 2014 12:34:49 -0300
From: Christopher Forgeron <csforgeron@gmail.com>
To: Markus Gebert <markus.gebert@hostpoint.ch>
Cc: freebsd-net@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, Jack Vogel <jfvogel@gmail.com>
Subject: Re: 9.2 ixgbe tx queue hang
Message-ID: <CAB2_NwDGb=NS8ghWfcuB7mrmr9_VzRnZ_yg9M-qAGESCShB4VQ@mail.gmail.com>
In-Reply-To: <FA262955-B3A9-48EC-828B-FF0D4D5D0498@hostpoint.ch>
References: <CAB2_NwDG=gB1WCJ7JKTHpkJCrvPuAhipkn+vPyT+xXzOBrTGkg@mail.gmail.com> <FA262955-B3A9-48EC-828B-FF0D4D5D0498@hostpoint.ch>
On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert <markus.gebert@hostpoint.ch> wrote:

> Possible. We still see this on nfsclients only, but I'm not convinced that
> nfs is the only trigger.

Just to clarify, I'm experiencing this error with NFS, but also with iSCSI.
I turned off my NFS server in rc.conf and rebooted, and I'm still able to
create the error, so this is not just an NFS issue on my machine.

> In our case, when it happens, the problem persists for quite some time
> (minutes or hours) if we don't interact (ifconfig or reboot).

The first few times that I ran into it, I had similar issues, because I was
keeping my system up and treating it like a temporary problem. The worst
case resulted in reboots to reset the NIC. Then again, I find the ix's to
be cranky if you ifconfig them too much.

Now I'm trying to find a root cause, so as soon as I start seeing any
errors, I abort and reboot the machine to test the next theory.

Additionally, I'm often able to create the problem with just 1 VM running
iometer on the SAN storage. When the problem occurs, that connection is
broken temporarily, taking network load off the SAN - that may improve my
chances of keeping it running.

>> I am able to reproduce it fairly reliably within 15 min of a reboot by
>> loading the server via NFS with iometer and some large NFS file copies at
>> the same time. I seem to need to sustain ~2 Gbps for a few minutes.
>
> That's probably why we can't reproduce it reliably here. Although we have
> 10gig cards in our blade servers, the ones affected are connected to a
> 1gig switch.

It seems that it needs a lot of traffic. I have a 10 gig backbone between
my SANs and my ESXi machines, so I can saturate quite quickly (just now I
hit a record: the error occurred within ~5 min of reboot and testing). In
your case, I recommend firing up multiple VMs running iometer on different
1 gig connections and seeing if you can make it pop.

I also often turn off ix1 to drive all traffic through ix0 - I've noticed
it happens faster this way, but once again I'm not taking enough
observations to make decent time predictions.

> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of your
> ix interface. If the pings reliably fail only on some queues, then your
> problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
>
> while you run pings over the interface affected. This will give you hints
> about where the EFBIG error comes from.
>
> [...]
>
> Markus

Will do. I'm not sure what shell the first script was written for - it's
not working in csh. Here's a rewrite that does work in csh, in case others
are using the default shell (a plain /bin/sh sketch follows in the P.S.
below):

#!/bin/csh
foreach CPU (`seq 0 23`)
    echo "CPU$CPU"
    cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
end

Thanks for your input. I should have results to post to the list shortly.
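P.S. For anyone who would rather run Markus's test from plain /bin/sh, here
is a rough sketch on my part (untested here; it assumes sysctl hw.ncpu is a
reasonable way to get the CPU count and reuses the 10.0.0.1 target from the
example above - substitute an address that is actually reachable over the
ix interface):

#!/bin/sh
# Pin a short ping to each CPU in turn. Per Markus, the tx queue used by
# ix follows the CPU, so sendto errors on only some CPUs suggest a stuck
# tx queue rather than a general link problem.
NCPU=$(sysctl -n hw.ncpu)
CPU=0
while [ "$CPU" -lt "$NCPU" ]; do
    echo "CPU${CPU}"
    cpuset -l "$CPU" ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
    CPU=$((CPU + 1))
done

Only CPUs that print a sendto error line are suspect; a clean run shows
just the CPU numbers.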