Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 25 Apr 2015 15:07:41 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Julian Elischer <julian@freebsd.org>
Cc:        freebsd-current@freebsd.org, Jilles Tjoelker <jilles@stack.nl>,  John Baldwin <jhb@freebsd.org>
Subject:   Re: readdir/telldir/seekdir problem (i think)
Message-ID:  <1316859246.25755743.1429988861067.JavaMail.root@uoguelph.ca>
In-Reply-To: <553B0100.4070307@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Julian Elischer wrote:
> On 4/25/15 5:52 AM, Jilles Tjoelker wrote:
> > On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >> Yes, this isn't at all safe.  There's no guarantee whatsoever that
> >> the offset on the directory fd that isn't something returned by
> >> getdirentries has any meaning.  In particular, the size of the
> >> directory entry in a random filesystem might be a different size
> >> than the structure returned by getdirentries (since it converts
> >> things into a FS-independent format).
> >> This might work for UFS by accident, but this is probably why ZFS
> >> doesn't work.
I took a quick look at UFS, ZFS and NFS:
- UFS appears to use the offset as a block + offset within
  the block, so it is understandable that it works for the simple
  "seek to byte offset" case.
  - One oddity here is that the loop that copies out directory entries
    (about line#2188 in ufs_vnops.c) skips over d_ino == 0. Then uiomove()
    copies the data out and offset is updated.
    --> Seems to me that uio_offset "skips" entries with d_ino == 0, so
        it isn't the actual byte offset in the on-disk directory.
        *** Somehow, this seems it would break seekdir/telldir at some
            point? (This handling of uio_offset seems just plain broken to me?)
- ZFS appears to number directory entries 1, 2, 3, 4, ...
  But, except for 1, 2, 3 (which are somewhat special, indicating ".",
  ".." and one other I've already forgotten, maybe ".zfs"), the rest
  are "serialized", which seems to mean that the byte offset is kept in
  a list along with some hash function used to find them.
  --> I didn't really follow the code, but it appears to "offset >> 28",
      so I don't think the low order 28bits are used to get the
      directory block.
  As such, I'm not too surprised that adding in the byte position within
  the block when doing lseek() doesn't break it.
  (I didn't really follow the stuff in zap_micro.c, so I could be all
   wrong here.)
  - I don't know if removal of an entry changes these indices 3, 4, ...?
- NFS gets the block and mostly ignores the byte position within the
  block. (ie. It does "uio_offset / NFS_DIRBLKSIZ" and then uses that
  to get a buffer cache block. This will, in turn, look up this value
  to find a cached cookie.)
  --> The only effect of "uio_offset % NFS_DIRBLKSIZ" != 0 is that it
      will return data from this logical position within a block for
      the readdir.
  --> If a Readdir RPC is done for the cached cookie after a Remove,
      it is up to the NFS server's implementation w.r.t. what you'll
      get back.

Not sure if the above is actually useful, at least until all file
systems are evaluated, rick.

> >> However, this might be properly fixed by the thing that ino64 is
> >> doing where each directory entry returned by getdirentries gives
> >> you a seek offset that you _can_ directly seek to (as opposed to
> >> seeking to the start of the block and then walking forward N
> >> entries until you get an inter-block entry that is the same).
> > The ino64 branch only reserves space for d_off and does not use it
> > in
> > any way. This is appropriate since actually using d_off is a major
> > feature addition.
> >
> > A proper d_off would still be useful even if UFS's readdir keeps
> > masking
> > off the offset so a directory read always starts at the beginning
> > of a
> > 512-byte directory block, since this allows more distinct offset
> > values
> > than safely using getdirentries()'s *basep. With d_off, one outer
> > loop
> > must read at least one directory block to avoid spinning
> > indefinitely,
> > while using getdirentries()'s *basep requires reading the whole
> > getdirentries() buffer.
> >
> > Some Linux filesystems go further and provide a unique d_off for
> > each
> > entry.
> >
> > Another idea would be to store the last d_ino instead of dd_loc
> > into the
> > struct ddloc. On seekdir(), this would seek to loc_seek as before
> > and
> > skip entries until that d_ino is found, or to the start of the
> > buffer if
> > not found (and possibly return some entries again that should not
> > be
> > returned, but Samba copes with that).
> 
> yes.. though maybe a hash of hte inode number and the name may be a
> better thing to search on..
> 
> I need to check on your statement about samba to see if its true for
> 3.6..
> 
> 
> >
> 
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscribe@freebsd.org"
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1316859246.25755743.1429988861067.JavaMail.root>