Date: Fri, 21 Jul 95 11:39:04 MDT
From: terry@cs.weber.edu (Terry Lambert)
To: dfr@render.com (Doug Rabson)
Cc: peter@haywire.dialix.com, freebsd-current@freebsd.org
Subject: Re: what's going on here? (NFSv3 problem?)
Message-ID: <9507211739.AA06208@cs.weber.edu>
In-Reply-To: <Pine.BSF.3.91.950721091549.3226B-100000@minnow.render.com> from "Doug Rabson" at Jul 21, 95 09:22:06 am
> No, the bug is that nfs_readdir is making larger protocol requests than
> it used to and the server is puking.  NFSv3 allows the server to hint at
> a size to use.  I don't have the rfc for NFSv2 handy so I can't check.
> It is possible that we are violating the protocol here.  It is also
> possible that the SVR4 server is returning crap data.

The NFS READDIR is a special case of the getdents call interface.  The
getdents call interface is not guaranteed to work on objects smaller than
a full (512b) directory block, since this is the directory compaction
boundary in UFS.  This is actually the one remaining file system
dependency in the directory read code.

The typical behaviour is to use the file system block size, or the system
page size, whichever is larger, since the directory block is guaranteed
by the file system interface to be some power-of-two value smaller than
or equal to the page size.

The problem is *bound* to occur when the VOP uses entry-at-a-time
retrieval, or odd-entry retrieval, over the NFS link with the current
code.

The complication in the NFSv3 protocol is the extension that we (well, it
was me, when I used to work at Novell) added to the directory entry
retrieval interface to return blocks of entries and stat information
simultaneously, and which was added to NFSv3 by Sun right after we
demonstrated the code doubling the speed of ls in a NUC/NWU (NetWare UNIX
Client/NetWare for UNIX) environment (several years ago).  The fact is
that neither Novell nor I originated this idea: it's been present in
AppleTalk and several Intel protocols from day one... a rather old idea,
actually.

The code hacks for the per-entry-at-a-time retrieval in the NFSv2 code
*do not work* for buffer sizes larger than the page size, a fact I
pointed out when the changes were rolled in (knowing full well that I
wanted to do NetWare work on FreeBSD and knowing that NFSv3 was on its
way).
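To make the sizing rule concrete, here is a minimal sketch of how a safe
transfer size could be chosen.  The function name and the rounding helper
are hypothetical, not the actual kernel code; the only facts assumed from
the discussion above are the 512-byte UFS directory compaction boundary
and the "larger of fs block size and page size" rule:

```c
#include <assert.h>
#include <stddef.h>

/* UFS directory compaction boundary, per the discussion above. */
#define DIRBLKSIZ	512

/*
 * Hypothetical sketch: pick a directory-read transfer size that always
 * covers whole directory blocks.  Use the larger of the file system
 * block size and the page size; since the directory block is guaranteed
 * to be a power of two no larger than the page size, the result is
 * always a multiple of DIRBLKSIZ.
 */
static size_t
safe_readdir_size(size_t fs_block_size, size_t page_size)
{
	size_t sz;

	sz = fs_block_size > page_size ? fs_block_size : page_size;
	/* Round down to a whole number of directory blocks. */
	return (sz - (sz % DIRBLKSIZ));
}
```

A request sized this way can never split a compaction unit, which is the
property the entry-at-a-time retrieval above fails to preserve.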
This isn't even considering the potential race conditions which are
caused by the stat operation in the FS itself being separate from the
readdir operation, or by directory compaction occurring between
non-block requests.

The first race condition can only be resolved by changing the interface;
this is probably something that wants to be done anyway, since file
operations should probably have stat information associated at all
times.  The potential error here is that another caller could delete the
file before the stat information was obtained, and (in the case of only
one entry in the return buffer) the directory must be locally
retraversed on the server from the last offset.  Even then, you are
relatively screwed if what is happening is a copy/unlink/rename
operation.

The second race condition, above, can be handled internally only with an
interface other than readdir, or with a substantial change to the
operation of readdir, at the very least.  The way you do a
resynchronization over a (potential) directory compaction is you find
the block that the next entry offset is in, then read entries forward
until the offset equals or exceeds the offset requested, saving
one-behind.  If the offset is equal, you return that entry; otherwise
you return the entry from the previous offset (assuming that the entry
was compacted back).  This can result in duplicate entries, which the
client must filter out, since it has the state information, and it is
unacceptable in the search for an exact match to omit the file being
searched for.

The buffer crap that got done to avoid a file system top-end user
presentation layer is totally bogus, and remains the cause of the
problem.  If no one is interested in fixing it, I suggest reducing the
transfer size to the page size or smaller -- and, of course, at the same
time eating the increased and otherwise unnecessary overhead in the
read/write path transfers that will result from doing this "fix".
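The resynchronization walk described above can be sketched in a few
lines.  The entry structure and function name here are illustrative, not
the real kernel types; the logic is just the "scan forward, save
one-behind, back up one on overshoot" rule from the paragraph above:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative directory entry: its current offset after any
 * compaction, plus an id standing in for the name. */
struct dent {
	long	off;
	int	id;
};

/*
 * Resynchronize after a possible compaction: scan forward from the
 * start of the block, keeping one entry behind.  If we land exactly on
 * the requested offset, resume there; if we overshoot, resume at the
 * previous entry (it may have been compacted back).  This is what can
 * produce a duplicate entry for the client to filter out.
 */
static const struct dent *
resync(const struct dent *ents, size_t n, long want)
{
	const struct dent *prev = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ents[i].off == want)
			return (&ents[i]);	/* exact match */
		if (ents[i].off > want)		/* overshot: entry was */
			return (prev ? prev : &ents[i]); /* compacted back */
		prev = &ents[i];
	}
	return (NULL);	/* ran off the end of the directory */
}
```

Returning the one-behind entry on overshoot errs toward duplication
rather than omission, which is the right trade-off since omitting an
entry would break an exact-match search.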
					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.