Date: Wed, 06 May 2015 11:54:38 +0800 From: Julian Elischer <julian@freebsd.org> To: Rick Macklem <rmacklem@uoguelph.ca> Cc: "current@freebsd.org" <current@FreeBSD.ORG>, Jilles Tjoelker <jilles@stack.nl>, Konstantin Belousov <kostikbel@gmail.com> Subject: Re: seekdir/readdir patch .. Call for Review. Message-ID: <5549907E.1020501@freebsd.org> In-Reply-To: <1054739779.32254571.1430868807663.JavaMail.root@uoguelph.ca> References: <1054739779.32254571.1430868807663.JavaMail.root@uoguelph.ca>
next in thread | previous in thread | raw e-mail | index | archive | help
On 5/6/15 7:33 AM, Rick Macklem wrote: > Julian Elischer wrote: >> On 5/3/15 10:33 PM, Jilles Tjoelker wrote: >>> On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov >>> wrote: >>>> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: >>>>> if you are interested in readdir(3), seekdir(3) and telldir(3) >>>>> then >>>>> you should look at >>>>> https://reviews.freebsd.org/D2410 >>>>> this patches around a problem in seekdir() that breaks Samba. >>>>> Seekdir(3) will not work as expected when files prior to the >>>>> point of >>>>> interest in directory have been deleted since the directory was >>>>> opened. >>>>> Windows clients using Samba cause both these things to happen, >>>>> causing >>>>> the next readdir(3) after the bad seekdir(3) to skip some entries >>>>> and >>>>> return the wrong file. >>>>> Samba only needs to step back a single directory entry in the >>>>> case >>>>> where it reads an entry and then discovers it can't fit it into >>>>> the >>>>> buffer it is sending to the windows client. It turns out we can >>>>> reliably cater to Samba's requirement because the "last returned >>>>> element" is always still in memory, so with a little care, we can >>>>> set our filepointer back to it safely. (once) >>>>> seekdir and readdir (and telldir()) need a complete rewrite along >>>>> with >>>>> getdirentries() but that is more than a small edit like this. >>>> Can you explain your expectations from the whole readdir() vs. >>>> parallel >>>> directory modifications interaction ? From what I understood so >>>> far, >>>> there is unlocked modification of the container and parallel >>>> iterator >>>> over the same container. IMO, in such situation, whatever tweaks >>>> you >>>> apply to the iterator, it is still cannot be made reliable. >>>> Before making single-purpose changes to the libc readdir and >>>> seekdir >>>> code, or to the kernel code, it would be useful to state exact >>>> behaviour >>>> of the dirent machinery we want to see. No, 'make samba works in >>>> my >>>> situation' does not sound good enough. >>> Consider the subsequence of entries that existed at opendir() time >>> and >>> were not removed until now. This subsequence is clearly defined and >>> does >>> not have concurrency problems. The order of this subsequence must >>> remain >>> unchanged and seekdir() must be correct with respect to this >>> subsequence. >>> >>> Additionally, two other kinds of entries may be returned. New >>> entries >>> may be inserted anywhere in between the entries of the subsequence, >>> and >>> removed entries may be returned as if they were still part of the >>> subsequence (so that not every readdir() needs a system call). >>> >>> A simple implementation for UFS-style directories is to store the >>> offset >>> in the directory (all bits of it, not masking off the lower 9 >>> bits). >>> This needs d_off or similar in struct dirent. The kernel >>> getdirentries() >>> then needs a similar loop as the old libc seekdir() to go from the >>> start >>> of the 512-byte directory block to the desired entry (since an >>> entry may >>> not exist at the stored offset within the directory block). >> At least it needs some more information in struct dirent than there >> is >> now.. >> A cookie is the current fashion.. that assumes however that the >> filesystems >> are capable of converting backwards from cookie to 'location'. ZFS >> claims to >> be able to do so.. > My current plan for a patch is... > - d_cookie would be the "physical" file system position of the next > directory entry > - a ngetdirentries() would take a "physical" cookie as a value/result > argument. It would indicate where to start and would return the cookie > for the next entry after what is returned. (It could probably be stuffed > in uio_offset, but I think it might be clearer to make it a separate arg.) > --> It would pass this physical cookie down to the file system's > VOP_READDIR(). (For UFS it would be the byte offset in the on-disk > directory. For ZFS, I believe it is an index for the entry. For > NFS, it is the cookie that is sent to the server. For others, I > don't yet know.) > - dd_seek, dd_loc, loc_seek and loc_loc would be replaced by dd_cookie and > loc_cookie. (For arches where sizeof(long) == 8, I think telldir() could > just return the cookie and forget about the loc_XX structures?) > This would get rid of the loc_seek, loc_loc bogosity that no longer makes > much sense, since the byte offset in what is returned by getdirentries() > has little to do with the "physical" directory position. > > I have already done the kernel stuff for some file systems and the libc > changes actually simplify things compared to what is there now. zfs already has a cookie production facility as part of the VFS readdir (or is it dir-read?) method. > > rick > >> Thee other thing to do would be to store some sort >> of strong >> hash of the name and inode# in each telldir entry.. >> we would seek to the saved seek location and seek forward computing >> or >> looking >> for a matching hash. I woudl also add that the man pages talk about >> the >> readdir blocksize a bit and mention the file blocksize (and stat) >> which is often >> way dfferent from 512 bytes.. usually 16k or so these days. >> I found setting the read size to the same as the fs blocksize seemd >> to >> work well. >>> This means that a UFS-style directory cannot be compacted (existing >>> entries moved from higher to lower offsets to fill holes) while it >>> is >>> open for reading. An NFS exported directory is always open for >>> reading. >> yes so a UFS filesystem that is exported could never do garbage >> collection. >>> This also means that duplicate entries can only be returned if that >>> particular filename was deleted and created again. >>> >>> Without kernel support, it is hard to get telldir/seekdir >>> completely >>> reliable. The current libc implementation is wrong since the >>> "holes" >>> within the block just disappear and change the offsets of the >>> following >>> entries; the kernel cannot fix this using entries with d_fileno = 0 >>> since it cannot know, in the general case, how long the deleted >>> entry >>> was in the filesystem-independent dirent format. >> yes it's the filesystem that knows.. we USED to return empty entries >> in the dirent list but that was removed recently think. >>> My previous idea of >>> storing one d_fileno during telldir() is wrong since it will fail >>> if >>> that entry is deleted. >>> >>> If you do not care about memory usage (which probably is already >>> excessive with the current libc implementation), you could store at >>> telldir() time the offset of the current block returned by >>> getdirentries() and the d_fileno of all entries already returned in >>> the >>> current block. >>> >>> The D2410 patch can conceptually work for what Samba needs, >>> stepping >>> back one directory entry. I will comment on it. >>> >> _______________________________________________ >> freebsd-current@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-current >> To unsubscribe, send any mail to >> "freebsd-current-unsubscribe@freebsd.org" >> >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5549907E.1020501>