Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 30 Apr 2015 14:56:25 +0800
From:      Julian Elischer <julian@freebsd.org>
To:        Jilles Tjoelker <jilles@stack.nl>, John Baldwin <jhb@freebsd.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: readdir/telldir/seekdir problem (i think)
Message-ID:  <5541D219.6060900@freebsd.org>
In-Reply-To: <20150424215249.GA96554@stack.nl>
References:  <55386505.70708@freebsd.org> <553A7DB0.8080308@freebsd.org> <553A8D28.7090901@freebsd.org> <4718551.Y2ZnMk6NSM@ralph.baldwin.cx> <20150424215249.GA96554@stack.nl>

next in thread | previous in thread | raw e-mail | index | archive | help
On 4/25/15 5:52 AM, Jilles Tjoelker wrote:
> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
>> Yes, this isn't at all safe.  There's no guarantee whatsoever that
>> the offset on the directory fd that isn't something returned by
>> getdirentries has any meaning.  In particular, the size of the
>> directory entry in a random filesystem might be a different size
>> than the structure returned by getdirentries (since it converts
>> things into a FS-independent format).
>> This might work for UFS by accident, but this is probably why ZFS
>> doesn't work.
Turns out they are all broken in the face of deletes.

I managed to make a patch that fixes things in the simple case of "backing
up by one (and only one time) record", which is what samba needs to do.

Samba reads until the latest entry won't fit (buffer is full)  and then
effectively pushes the last one back by doing a seekdir back one record.
It then sends the full buffer back to the client and continus reading 
at the
next record, which is the one it pushed back.

It turs out that because the buffer holding the last dirent presented 
is always
still resident in memory, it is ALWAYS possible to slip the 'current 
entry' pointer
back ONE entry, (but no more).

This unbreaks samba, even in the case of deletes. (at least with UFS 
and ZFS).

I'll make the patch available when I've tested it more and cleaned it up.



>> However, this might be properly fixed by the thing that ino64 is
>> doing where each directory entry returned by getdirentries gives
>> you a seek offset that you _can_ directly seek to (as opposed to
>> seeking to the start of the block and then walking forward N
>> entries until you get an inter-block entry that is the same).
> The ino64 branch only reserves space for d_off and does not use it in
> any way. This is appropriate since actually using d_off is a major
> feature addition.
d_cookie please.. :-)
>
> A proper d_off would still be useful even if UFS's readdir keeps masking
> off the offset so a directory read always starts at the beginning of a
> 512-byte directory block, since this allows more distinct offset values
> than safely using getdirentries()'s *basep. With d_off, one outer loop
> must read at least one directory block to avoid spinning indefinitely,
> while using getdirentries()'s *basep requires reading the whole
> getdirentries() buffer.
not sure I follow what you mean by that.
>
> Some Linux filesystems go further and provide a unique d_off for each
> entry.

yes ZFS provides that.
we currently don't export it beyond teh VFS, but there is code that is 
supposed to
supply  a cookie for each entry. It could be put into the dirent 
structure should
we want to do that. Currently it is put into a separate memory array 
in parallel
to the dirents, except that we give it a NULL pointer, which means "we 
don't want
cookies thanks" .

We really need to do something because the current system is really 
broken.
And the fact that dirent has  *32 bit* inode number in it was a 
shock.. I'd presumed
that had gone the way of the dinosaurs and dodo.
I think 11 needs to have a new dirent structure given out by a new 
syscall.
(old one still present for compat reasons). Whether we need a 
readdir64() and friends
I have not yet decided. Maybe it's time to bump libc's number again :-)



>
> Another idea would be to store the last d_ino instead of dd_loc into the
> struct ddloc. On seekdir(), this would seek to loc_seek as before and
> skip entries until that d_ino is found, or to the start of the buffer if
> not found (and possibly return some entries again that should not be
> returned, but Samba copes with that).

I didn't see it coping with that..  an earlier version of my patch 
suffered from duplicate entries in samba.
it is also dangerous because multiple files linked to the same inode 
would get confused.

it may be possible however to link the telldir cookies to a HASH of 
the filename
and inode number.
(assuming of course we don't have cookies directly from the fiesystem 
itself.)
the current scheme of seeking to a buffer boundary and seeking forward in
the buffer looking for a matching entry would become quite robust 
using this scheme,
assuming that the lseek actually worked on that file system.
>




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5541D219.6060900>