Date: Mon, 22 Mar 2010 12:44:04 -0400
From: Steve Polyack <korvus@comcast.net>
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-fs@freebsd.org, User Questions <freebsd-questions@freebsd.org>, bseklecki@noc.cfi.pgh.pa.us
Subject: Re: FreeBSD NFS client goes into infinite retry loop
Message-ID: <4BA79E54.5030504@comcast.net>
In-Reply-To: <201003221200.41607.jhb@freebsd.org>
References: <4BA3613F.4070606@comcast.net> <4BA78444.4040707@comcast.net> <4BA7911F.5060905@comcast.net> <201003221200.41607.jhb@freebsd.org>
On 03/22/10 12:00, John Baldwin wrote:
> On Monday 22 March 2010 11:47:43 am Steve Polyack wrote:
>
>> On 03/22/10 10:52, Steve Polyack wrote:
>>
>>> On 3/19/2010 11:27 PM, Rick Macklem wrote:
>>>
>>>> On Fri, 19 Mar 2010, Steve Polyack wrote:
>>>>
>>>> [good stuff snipped]
>>>>
>>>>> This makes sense. According to wireshark, the server is indeed
>>>>> transmitting "Status: NFS3ERR_IO (5)". Perhaps this should be STALE
>>>>> instead; it sounds more correct than marking it a general IO error.
>>>>> Also, the NFS server is serving its share off of a ZFS filesystem,
>>>>> if it makes any difference. I suppose ZFS could be talking to the
>>>>> NFS server threads with some mismatched language, but I doubt it.
>>>>>
>>>> Ok, now I think we're making progress. If VFS_FHTOVP() doesn't return
>>>> ESTALE when the file no longer exists, the NFS server returns whatever
>>>> error it has returned.
>>>>
>>>> So, either VFS_FHTOVP() succeeds after the file has been deleted, which
>>>> would be a problem that needs to be fixed within ZFS
>>>> OR
>>>> ZFS returns an error other than ESTALE when it doesn't exist.
>>>>
>>>> Try the following patch on the server (which just makes any error
>>>> returned by VFS_FHTOVP() into ESTALE) and see if that helps.
>>>>
>>>> --- nfsserver/nfs_srvsubs.c.sav 2010-03-19 22:06:43.000000000 -0400
>>>> +++ nfsserver/nfs_srvsubs.c 2010-03-19 22:07:22.000000000 -0400
>>>> @@ -1127,6 +1127,8 @@
>>>> }
>>>> }
>>>> error = VFS_FHTOVP(mp, &fhp->fh_fid, vpp);
>>>> + if (error != 0)
>>>> + error = ESTALE;
>>>> vfs_unbusy(mp);
>>>> if (error)
>>>> goto out;
>>>>
>>>> Please let me know if the patch helps, rick
>>>>
>>> The patch seems to fix the bad behavior.
>>> Running with the patch, I
>>> see the following output from my patch (return code of nfs_doio from
>>> within nfsiod):
>>> nfssvc_iod: iod 0 nfs_doio returned errno: 70
>>>
>>> Furthermore, when inspecting the transaction with Wireshark, after
>>> deleting the file on the NFS server it looks like there is only a
>>> single error. This time it is a reply to a V3 Lookup call that
>>> contains a status of "NFS3ERR_NOENT (2)" coming from the NFS server.
>>> The client also does not repeatedly try to complete the failed request.
>>>
>>> Any suggestions on the next step here? Based on what you said it
>>> looks like ZFS is falsely reporting an IO error to VFS instead of
>>> ESTALE / NOENT. I tried looking around zfs_fhtovp() and only saw
>>> returns of EINVAL, but I'm not even sure I'm looking in the right place.
>>>
>> Further on down the rabbit hole... here's the piece in zfs_fhtovp()
>> where it's kicking out EINVAL instead of ESTALE. The following patch
>> corrects the behavior, but of course also suggests further digging
>> within the zfs_zget() function to ensure that _it_ is returning the
>> correct thing and whether or not it needs to be handled there or within
>> zfs_fhtovp().
>>
>> --- src-orig/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c 2010-03-22 11:41:21.000000000 -0400
>> +++ src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c 2010-03-22 16:25:21.000000000 -0400
>> @@ -1246,7 +1246,7 @@
>> dprintf("getting %llu [%u mask %llx]\n", object, fid_gen, gen_mask);
>> if (err = zfs_zget(zfsvfs, object, &zp)) {
>> ZFS_EXIT(zfsvfs);
>> - return (err);
>> + return (ESTALE);
>> }
>> zp_gen = zp->z_phys->zp_gen & gen_mask;
>> if (zp_gen == 0)
>>
> So the odd thing here is that ffs_fhtovp() doesn't return ESTALE if VFS_VGET()
> (which calls ffs_vget()) fails, it only returns ESTALE if the generation count
> doesn't match.
>

It looks like it also returns ESTALE when the inode number is invalid (< ROOTINO || > max inodes?). Would an unlinked file in FFS that is referenced later report an invalid inode?

But back to your point: zfs_zget() seems to be failing and returning EINVAL before zfs_fhtovp() even has a chance to set and check zp_gen. I'm trying to get more details by adding dprintf() calls, but their output doesn't make it to any log or the console, even with vfs.zfs.debug=1 set. Any pointers on how to get these dprintf() calls working?

Thanks again.
Archive link: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4BA79E54.5030504>