Date: Fri, 20 Jun 2014 17:11:01 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Daniel Mayfield <dan@3geeks.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: Debugging newnfs
Message-ID: <373087919.2114818.1403298661172.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAE=e2zwZqPoCs17rkKCXt2B4aj4SG7tCEe29Khjf_kV%2BLrM%2BsQ@mail.gmail.com>

Daniel Mayfield wrote:
>
> The server side is a set of vlans on a lagg of 4 igbs.
I think igb net interfaces have a limit of 64 transmit segments
(IGB_MAX_SCATTER), so they should be ok with TSO enabled.

> The Xen side
> is the same setup, with the VMs in question attached to two
> different vlans.
>
Well, from what I know, using lagg on top of a Xen/netfront net device
will definitely be a problem, unless you have r265290 and r265412.
(Without these patches, the setting of if_hw_tsomax done by Xen's
netfront is not propagated up to tcp_output(). The same statements
apply to if_vlan.c, with the patch r265291.)

I know nothing about Xen, so I have no idea if you are using the
Xen/netfront virtual net driver, but using lagg and/or vlan on top of
it is definitely broken without the recent patches.

If you can disable TSO, that will be a workaround for this.

>
> Many different mounts, but the mount options all look like this:
>
> nfsv3,tcp,resvport,hard,cto,lockd,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=4048762,timeout=120,retrans=2
>
> The permissions do not change, but repeat operations succeed and fail
> randomly.
>
> There aren't any clients concurrently accessing the same mount.
>
> On Fri, Jun 20, 2014 at 9:16 AM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
>
> Daniel Mayfield wrote:
> > I have a very strange problem between an NFS server running FreeBSD
> > 10 w/ ZFS and a number of FreeBSD 10 VMs running on a XenServer 6.2
> > SP1 host. The problem manifests as seemingly random permissions
> > issues and/or IO errors on the clients when the ZFS pool is busy.
> > There are no entries in dmesg on either side, and no errors logged
> > in nfsstat either. If I keep the traffic down, the errors subside,
> > but not completely. Other than tcpdump, how can I go about
> > debugging this?
> >
> Well, you didn't mention what mount options you are using or what
> network interfaces that you are using, but here's a few things that
> might be worth looking at...
>
> The TSO max transmit segments issue:
> - Without going into all the details (there have been some recent
>   commits like r264630 to try and alleviate this), if a net device
>   driver cannot handle 35 mbufs in a transmit TSO segment, things
>   will get broken.
> - Xen/netfront is a weird exception, which I think is ok so long
>   as lagg or a vlan isn't layered on top of it.
> --> If you can disable TSO on both server and clients or reduce
>     rsize,wsize to 32K on all client mounts and see if the problem
>     persists, that is probably the best way to check this. (Since
>     Xen/netfront is such a weird case, I am not 100% sure if doing
>     the above will fix this problem, if it is being used)
>
> I also don't know if it is possible to have corrupted packets due to
> a hardware problem (bad memory or...) where the Xen/netfront world
> doesn't catch it.
>
> If you use the "soft" mount option, you could easily get this when
> the server is slow to respond. I'd strongly recommend using "tcp"
> and not "soft" for your mounts. ("nfsstat -m" on the client will
> show you what the actual mount options in use are. This can be
> somewhat different than what is specified on the command line, since
> servers limit rsize/wsize, as an example.)
>
> When you get a "permissions failure" case, check on the server to
> see if the permissions for the file appear correct on ZFS. If they
> are (or the problem disappears when you retry a command without
> changing permissions), you could have a caching issue. Other than
> capturing the packets and looking at them in wireshark (which knows
> NFS, unlike tcpdump) all you can do is try fiddling with the mount
> options related to caching and see if that helps.
> (Note that NFS does not have a cache coherency protocol, so if files
> are concurrently shared among multiple clients, all bets are off
> w.r.t. what the behaviour is. jhb@ is much better at this than I,
> since he seems to find lots of these weird cases at his workplace.)
>
> Good luck with it, rick
>
> > Dan
> > _______________________________________________
> > freebsd-fs@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
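
The two mount-option checks suggested in this thread (keep rsize/wsize at 32K or below while testing for the TSO problem, and use "tcp" rather than "soft") can be applied mechanically to an option string like the one quoted above. What follows is only a rough Python sketch: the function names and the MAX_TEST_IO constant are made up for illustration, and it assumes the options are a plain comma-separated list in the same form as the one posted, which may not match the exact layout that "nfsstat -m" prints.

```python
# Hypothetical helpers, not part of any FreeBSD tool.
# The 32K limit and the tcp/soft advice are taken from the message above.

MAX_TEST_IO = 32 * 1024  # rsize/wsize ceiling suggested for testing


def parse_mount_options(optstring):
    """Split "a,b=1,c" into {"a": True, "b": "1", "c": True}."""
    opts = {}
    for item in optstring.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Options without "=value" (e.g. "tcp", "hard") are stored as True.
        opts[key] = value if sep else True
    return opts


def check_options(optstring):
    """Return a list of warnings for options the thread advises against."""
    opts = parse_mount_options(optstring)
    warnings = []
    for key in ("rsize", "wsize"):
        value = opts.get(key)
        if isinstance(value, str) and value.isdigit() and int(value) > MAX_TEST_IO:
            warnings.append(f"{key}={value} is larger than {MAX_TEST_IO}")
    if "soft" in opts:
        warnings.append('"soft" is set; "hard" mounts are recommended')
    if "tcp" not in opts:
        warnings.append('"tcp" is not listed in the options')
    return warnings


if __name__ == "__main__":
    # Shortened form of the option string posted earlier in the thread.
    posted = "nfsv3,tcp,resvport,hard,rsize=65536,wsize=65536,timeout=120,retrans=2"
    for line in check_options(posted):
        print(line)
```

For the options posted in this thread, only the rsize and wsize values are flagged, which fits the suggestion to retry with 32K on all client mounts (or with TSO disabled) and see whether the errors persist.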