Date:      Fri, 21 Jul 95 11:39:04 MDT
From:      terry@cs.weber.edu (Terry Lambert)
To:        dfr@render.com (Doug Rabson)
Cc:        peter@haywire.dialix.com, freebsd-current@freebsd.org
Subject:   Re: what's going on here? (NFSv3 problem?)
Message-ID:  <9507211739.AA06208@cs.weber.edu>
In-Reply-To: <Pine.BSF.3.91.950721091549.3226B-100000@minnow.render.com> from "Doug Rabson" at Jul 21, 95 09:22:06 am

> No, the bug is that nfs_readdir is making larger protocol requests than 
> it used to and the server is puking.  NFSv3 allows the server to hint at 
> a size to use.  I don't have the rfc for NFSv2 handy so I can't check.  
> It is possible that we are violating the protocol here.  It is also 
> possible that the SVR4 server is returning crap data.

The NFS READDIR is a special case of the getdents call interface.

The getdents call interface is not guaranteed to work on objects
smaller than a full (512b) directory block, since this is the
directory compaction boundary in UFS.  This is actually the one
remaining file system dependency in the directory read code.

The typical behaviour is to use the file system block size or
the system page size, whichever is larger, since the directory
block is guaranteed by the file system interface to be some
power-of-two value smaller than or equal to the page size.
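That size selection can be sketched as follows (a toy illustration;
the names DIRBLKSIZ and readdir_buf_size are mine, not identifiers
from any FreeBSD source):

```python
# Hypothetical sketch of choosing a getdents/READDIR transfer size.
# DIRBLKSIZ stands in for the real UFS directory block constant.

DIRBLKSIZ = 512  # UFS directory block: the compaction boundary

def readdir_buf_size(fs_block_size: int, page_size: int) -> int:
    """Use the larger of the fs block size and the page size.

    Since the fs interface guarantees the directory block is a
    power of two no larger than the page size, either choice is a
    whole number of directory blocks.
    """
    size = max(fs_block_size, page_size)
    assert size % DIRBLKSIZ == 0, "must cover whole directory blocks"
    return size

print(readdir_buf_size(8192, 4096))   # 8192
print(readdir_buf_size(4096, 4096))   # 4096
```

Either result is an exact multiple of the 512-byte directory block,
so a transfer never splits a compaction unit.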

The problem is *bound* to occur when the VOP uses entry-at-a-time
retrieval, or odd-entry retrieval, over the NFS link with the
current code.

The complication in the NFSv3 protocol is the extension that we
(well, it was me, when I used to work at Novell) added to the
directory entry retrieval interface to return blocks of entries
and stat information simultaneously, and which was added to NFSv3
by Sun right after we demonstrated the code doubling the speed
of ls in a NUC/NWU (NetWare UNIX Client/NetWare for UNIX)
environment (several years ago).  The fact is that neither Novell
nor I originated this idea: it's been present in AppleTalk and
several Intel protocols from day one... a rather old idea, actually.

The code hacks for entry-at-a-time retrieval in the NFSv2
code *do not work* for buffer sizes larger than the page size, a
fact I pointed out when the changes were rolled in (knowing full
well that I wanted to do NetWare work on FreeBSD and knowing that
NFSv3 was on its way).

This isn't even considering the potential race conditions which
are caused by the stat operation in the FS itself being separate
from the readdir operation, or by directory compaction occurring
between non-block requests.

The first race condition can only be resolved by changing the
interface; this is probably something that wants to be done
anyway, since file operations should probably have stat information
associated at all times.  The potential error here is that another
caller could delete the file before the stat information was
obtained, and (in the case of only one entry in the return buffer)
the directory must be locally retraversed on the server from the
last offset.  Even then, you are relatively screwed if what is
happening is a copy/unlink/rename operation.

The second race condition, above, can be handled internally only
with an interface other than readdir, or with a substantial change
to the operation of readdir, at the very least.  The way you do
resynchronization over a (potential) directory compaction is
you find the block that the next entry offset is in, then read
entries forward until the offset equals or exceeds the offset
requested, saving one-behind.  If the offset is equal, you return
the entry, otherwise you return the entry from the previous offset
(assuming that the entry was compacted back).  This can result in
duplicate entries, which the client must filter out, since it has
state information, and it is unacceptable in the search for an
exact match to omit the file being searched for.
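The scan-forward-saving-one-behind step could be modeled roughly
like this (a toy sketch, not server code; the entry offsets and the
compaction are simulated, and all names are hypothetical):

```python
# Toy model of resynchronizing a readdir position after directory
# compaction.  Entries are (offset, name) pairs sorted by offset.

def resync(entries, requested_offset):
    """Scan forward from the start of the block, saving one-behind.

    Return the entry whose offset equals the request, or the
    one-behind entry if compaction moved entries back past it.
    This may return an entry the client has already seen; the
    client must filter such duplicates, since only it has the
    state to know what it has received so far.
    """
    prev = None
    for entry in entries:
        offset, _name = entry
        if offset == requested_offset:
            return entry                    # exact match: no compaction here
        if offset > requested_offset:
            return prev if prev else entry  # compacted back: repeat one-behind
        prev = entry
    return None                             # ran off the end of the directory

# The client stopped at offset 32; compaction then moved that entry
# back to offset 24, so the server repeats it rather than skip it.
after_compaction = [(0, "a"), (12, "b"), (24, "c"), (40, "d")]
print(resync(after_compaction, 32))   # (24, 'c') -- duplicate, client filters it
print(resync(after_compaction, 24))   # (24, 'c') -- exact match
```

Repeating the one-behind entry is the conservative choice: a
duplicate can be filtered by the client, but a silently skipped
entry cannot be recovered.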

The buffer crap that got done to avoid a file system top end user
presentation layer is totally bogus, and remains the cause of the
problem.  If no one is interested in fixing it, I suggest reducing
the transfer size to the page size or smaller.

And, of course, at the same time eat the increased and otherwise
unnecessary overhead in the read/write path transfers that will
result from doing this "fix".


					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.


