Date: Sat, 25 Apr 2015 09:04:32 -0400 (EDT) From: Rick Macklem <rmacklem@uoguelph.ca> To: Julian Elischer <julian@freebsd.org> Cc: freebsd-current@freebsd.org, John Baldwin <jhb@freebsd.org> Subject: Re: readdir/telldir/seekdir problem (i think) Message-ID: <1934232494.25671027.1429967072794.JavaMail.root@uoguelph.ca> In-Reply-To: <553AFFD9.9000405@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
Julian Elischer wrote: > On 4/25/15 4:28 AM, John Baldwin wrote: > > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote: > >> On 4/25/15 1:30 AM, Julian Elischer wrote: > >>> On 4/24/15 10:59 PM, John Baldwin wrote: > >>>> On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote: > >>>>> On 4/23/15 9:54 PM, John Baldwin wrote: > >>>>>> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote: > >>>>>>> On 4/23/15 11:20 AM, Julian Elischer wrote: > >>>>>>>> I'm debugging a problem being seen with samba 3.6. > >>>>>>>> > >>>>>>>> basically telldir/seekdir/readdir don't seem to work as > >>>>>>>> advertised.. > >>>>>>> ok so it looks like readdir() (and friends) is totally broken > >>>>>>> in > >>>>>>> the face > >>>>>>> of deletes unless you read the entire directory at once or > >>>>>>> reset > >>>>>>> to the > >>>>>>> the first file before the deletes, or earlier. > >>>>>> I'm not sure that Samba isn't assuming non-portable behavior. > >>>>>> For example: > >>>>>> > >>>>>> >From > >>>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html > >>>>>> > >>>>>> > >>>>>> If a file is removed from or added to the directory after the > >>>>>> most recent call > >>>>>> to opendir() or rewinddir(), whether a subsequent call to > >>>>>> readdir() returns an > >>>>>> entry for that file is unspecified. > >>>>>> > >>>>>> While this doesn't speak directly to your case, it does note > >>>>>> that > >>>>>> you will > >>>>>> get inconsistencies if you scan a directory concurrent with > >>>>>> add > >>>>>> and remove. > >>>>>> > >>>>>> UFS might kind of work actually since deletes do not compact > >>>>>> the > >>>>>> backing > >>>>>> directory, but I suspect NFS and ZFS would not work. In > >>>>>> addition, our > >>>>>> current NFS support for seekdir is pretty flaky and can't be > >>>>>> fixed without > >>>>>> changes to return the seek offset for each directory entry (I > >>>>>> believe that > >>>>>> the projects/ino64 patches include this since they are > >>>>>> breaking > >>>>>> the ABI of > >>>>>> the relevant structures already). The ABI breakage makes this > >>>>>> a > >>>>>> very > >>>>>> non-trivial task. However, even if you have that per-item > >>>>>> cookie, it is > >>>>>> likely meaningless in the face of filesystems that use any > >>>>>> sort > >>>>>> of more > >>>>>> advanced structure than an array (such as trees, etc.) to > >>>>>> store > >>>>>> directory > >>>>>> entries. POSIX specifically mentions this in the rationale > >>>>>> for > >>>>>> seekdir: > >>>>>> > >>>>>> One of the perceived problems of implementation is that > >>>>>> returning > >>>>>> to a given point in a directory is quite difficult to describe > >>>>>> formally, in spite of its intuitive appeal, when systems that > >>>>>> use > >>>>>> B-trees, hashing functions, or other similar mechanisms to > >>>>>> order > >>>>>> their directories are considered. The definition of seekdir() > >>>>>> and > >>>>>> telldir() does not specify whether, when using these > >>>>>> interfaces, > >>>>>> a given directory entry will be seen at all, or more than > >>>>>> once. > >>>>>> > >>>>>> In fact, given that quote, I would argue that what Samba is > >>>>>> doing is > >>>>>> non-portable. This would seem to indicate that a conforming > >>>>>> seekdir could > >>>>>> just change readdir to immediately return EOF until you call > >>>>>> rewinddir. > >>>>> how does readdir know that the directory has been changed? > >>>>> fstat? > >>>> Undefined. FreeBSD's libc doesn't cache the entire directory > >>>> (unless you > >>>> are using a union mount), instead it just caches the most recent > >>>> call to > >>>> getdirentries(). It then serves up entries from that block > >>>> until > >>>> you hit > >>>> the end. When you hit the end it reads the next block, etc. > >>>> When you > >>>> call telldir(), the kernel saves the seek offset from the start > >>>> of the > >>>> current block and the offset within that block and returns a > >>>> cookie to > >>>> you. seekdir() looks up the cookie to find the (seek offset, > >>>> block > >>>> offset) > >>>> tuple. If the location matches the directory's current location > >>>> it > >>>> doesn't > >>>> do anything, otherwise it seeks to the seek offset and reads a > >>>> new > >>>> block via > >>>> getdirentries(). There is no check for seeing if a directory is > >>>> changed. Instead, you can only be stale by one "block". When > >>>> you > >>>> read > >>>> a new block it is relative to the directory's state at that > >>>> time. > >>>> > >>>> Rick's suggestion of reusing the block for a seek within a block > >>>> would be > >>>> fairly easy to implement, I think it would just be: > >>>> > >>>> Index: head/lib/libc/gen/telldir.c > >>>> =================================================================== > >>>> --- head/lib/libc/gen/telldir.c (revision 281929) > >>>> +++ head/lib/libc/gen/telldir.c (working copy) > >>>> @@ -101,8 +101,10 @@ > >>>> return; > >>>> if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == > >>>> dirp->dd_seek) > >>>> return; > >>>> - (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, > >>>> SEEK_SET); > >>>> - dirp->dd_seek = lp->loc_seek; > >>>> + if (lp->loc_seek != dirp->dd_seek) { > >>>> + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, > >>>> SEEK_SET); > >>>> + dirp->dd_seek = lp->loc_seek; > >>>> + } > >>> yes I did that yesterday but it still fails when you transition > >>> blocks.. (badly). > >>> > >>> I also tried bigger blocks.. also fails (eventually) > >>> > >>> I did find a way to make it work... you had to seek back > >>> to the first block you deleted on each set.. > >>> then work forward from there again.. unfortunately since > >>> I'm trying to make a microsoft program not fail (via samba) > >>> I have no control over how it does things and seekdir doesn't > >>> know what was deleted anyway... (so the fix is fine for the > >>> test program but not for real life) > >>> > >>> I think I can make the BSD one act like the linux one by changing > >>> the lseek being done to use the offset (loc) plus the buffer seek > >>> address of the target, instead of just going for the buffer base > >>> and > >>> stepping forward through the entries.. > >>> > >>> maybe tomorrow. > >>> > >> The following conditional code makes ours behave the same as the > >> linux > >> one. > >> it breaks several 'rules' but works where ours is clean but > >> fails.. > >> as Rick said.. "maybe that's what we should do too." > >> > >> > >> this is at the end of seekdir() > >> > >> > >> The new code does what linux does.. and shouldn't work.. but does > >> // at least in the limited conditions I need it to. > >> // We'll probably need to do this at work...: > >> > >> > >> The original code is what we have now, but gets mightily confused > >> sometimes. > >> // This is clean(er) but fails in specific > >> situations(when > >> doing commands > >> // from Microft windows, via samba). > >> > >> > >> root@vps1:/tmp # diff -u dir.c.orig dir.c > >> --- dir.c.orig 2015-04-24 11:29:36.855317000 -0700 > >> +++ dir.c 2015-04-24 11:15:49.058500000 -0700 > >> @@ -1105,6 +1105,13 @@ > >> dirp->dd_loc = lp->loc_loc; > >> return; > >> } > >> +#ifdef GLIBC_SEEK > >> + (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc, > >> SEEK_SET); > >> + dirp->dd_seek = lp->loc_seek + lp->loc_loc; > >> + dirp->dd_loc = 0; > >> + lp->loc_seek = dirp->dd_seek; > >> + lp->loc_loc = 0; > >> +#else > >> (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET); > >> dirp->dd_seek = lp->loc_seek; > >> dirp->dd_loc = 0; > >> @@ -1114,6 +1121,7 @@ > >> if (dp == NULL) > >> break; > >> } > >> +#endif > >> } > > Yes, this isn't at all safe. There's no guarantee whatsoever that > > the offset on the directory fd that isn't something returned by > > getdirentries has any meaning. > the value returned by getdirentries is the original seek for the > block, and the offset is not sent obviosly.. but we could > do somethign where getdirentries 'remembers' all teh offsets it sent > for the last set of entries, of we could make it return a a seekable > cookie for every entry... > The problem is that there are two separate sets of offsets. There are the byte offsets in the "logical UFS-like directory blocks" generated by the file systems for VOP_READDIR() and then there are the "physical offset cookies" which are opaque bits for anything but the underlying file system. The latter is a location for the directory entry within the directory structure it maintains. My thinking was that d_off would be this latter physical offset cookie and that a new getdirenties(2) syscall and a new VOP_READDIR() would allow passage of this to the file system, so that it could return directory entries starting at that place in its directory. I have never gotten around to looking closely at the libc stuff, to try and determine if this can all be made to work correctly. rick ps: This is all "future" stuff and probably doesn't help come up with a better set of libc functions for what is in head/current. (Didn't want to hijack the discussion, but it is in a sense relevant.) > > In particular, the size of the > > directory entry in a random filesystem might be a different size > > than the structure returned by getdirentries (since it converts > > things into a FS-independent format). > > yes I understand all that.. :-) > > > > This might work for UFS by accident, but this is probably why ZFS > > doesn't work. > actually ZFS seems to work too which stunned me.. > > > However, this might be properly fixed by the thing that ino64 is > > doing where each directory entry returned by getdirentries gives > > you a seek offset that you _can_ directly seek to (as opposed to > > seeking to the start of the block and then walking forward N > > entries until you get an inter-block entry that is the same). > > yeah that might be nice. > > > > > _______________________________________________ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to > "freebsd-current-unsubscribe@freebsd.org" >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1934232494.25671027.1429967072794.JavaMail.root>