From: Julian Elischer
Date: Sat, 25 Apr 2015 01:21:00 +0800
To: John Baldwin, freebsd-current@freebsd.org
Subject: Re: readdir/telldir/seekdir problem (i think)
Message-ID: <553A7B7C.2060305@freebsd.org>

On 4/24/15 10:43 PM, John Baldwin wrote:
> On Friday, April 24, 2015 06:42:01 PM Julian
> Elischer wrote:
>> On 4/24/15 6:12 AM, Rick Macklem wrote:
>>> John Baldwin wrote:
>>>> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
>>>>> On 4/23/15 11:20 AM, Julian Elischer wrote:
>>>>>> I'm debugging a problem being seen with samba 3.6.
>>>>>>
>>>>>> basically telldir/seekdir/readdir don't seem to work as
>>>>>> advertised..
>>>>> ok so it looks like readdir() (and friends) is totally broken in
>>>>> the face of deletes unless you read the entire directory at once
>>>>> or reset to the first file before the deletes, or earlier.
>>>> I'm not sure that Samba isn't assuming non-portable behavior.  For
>>>> example:
>>>>
>>>> From
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
>>>>
>>>> If a file is removed from or added to the directory after the most
>>>> recent call to opendir() or rewinddir(), whether a subsequent call
>>>> to readdir() returns an entry for that file is unspecified.
>>>>
>>>> While this doesn't speak directly to your case, it does note that you
>>>> will get inconsistencies if you scan a directory concurrent with add
>>>> and remove.
>>>>
>>>> UFS might kind of work actually since deletes do not compact the
>>>> backing directory, but I suspect NFS and ZFS would not work.  In
>>>> addition, our current NFS support for seekdir is pretty flaky and
>>>> can't be fixed without changes to return the seek offset for each
>>>> directory entry (I believe that the projects/ino64 patches include
>>>> this since they are breaking the ABI of the relevant structures
>>>> already).  The ABI breakage makes this a very non-trivial task.
>>>> However, even if you have that per-item cookie, it is likely
>>>> meaningless in the face of filesystems that use any sort of more
>>>> advanced structure than an array (such as trees, etc.) to store
>>>> directory entries.
>>>> POSIX specifically mentions this in the rationale for seekdir:
>>>>
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
>>>>
>>>> One of the perceived problems of implementation is that returning to
>>>> a given point in a directory is quite difficult to describe
>>>> formally, in spite of its intuitive appeal, when systems that use
>>>> B-trees, hashing functions, or other similar mechanisms to order
>>>> their directories are considered.  The definition of seekdir() and
>>>> telldir() does not specify whether, when using these interfaces, a
>>>> given directory entry will be seen at all, or more than once.
>>>>
>>>> In fact, given that quote, I would argue that what Samba is doing is
>>>> non-portable.  This would seem to indicate that a conforming seekdir
>>>> could just change readdir to immediately return EOF until you call
>>>> rewinddir.
>>>>
>>> Btw, Linux somehow makes readdir()/unlink() work for NFS. I haven't looked,
>>> but I strongly suspect that it reads the entire directory upon either opendir()
>>> or the first readdir().
>>>
>>> Oh, and I hate to say it, but I suspect Linux defines the "standard" on
>>> this and not POSIX. (In other words, if it works on Linux, it isn't broken;-)
>>>
>>> rick
>> here's an interesting datapoint. If the test program is run on
>> kFreeBSD using glibc, it runs without flaw.
>>
>> OS-X (bsd derived libc) HFS+ fails
>> FreeBSD libc (UFS) fails
>> FreeBSD libc (ZFS) fails
>> FreeBSD glibc succeeds
>> Centos 6.5 glibc succeeds
>>
>> some NFS tests would be nice to do too I guess...
>> glibc authors seem to have done something right.. it even copes with
>> FreeBSD kernel..
> It's probably just reading the entire directory and caching it until
> rewinddir is called.  FreeBSD's libc does this if you have a unionfs
> mount.  It would be a trivial change to always do this, it just means
> you will never notice any concurrent deletes, adds, or renames until
> you call rewinddir again.
> At that point you might as well have the client just do rewinddir
> each time.  You are just moving the caching that Samba should be doing
> to be portable from samba into libc.  I'm not sure that's really an
> improvement so much as shuffling deck chairs.
>
> Also, that is going to keep giving you directory entries for the files
> you've already removed (unless you patch libc to explicitly hack around
> that case by stating each entry and skipping ones that fail with ENOENT
> which would add a lot of overhead).

So I rewrote/ported glibc telldir/readdir to our FreeBSD..

Firstly, a slight addition.. BSD libc also fails on tmpfs (I found that out by accident.. I thought I was on UFS and forgot I had a tmpfs there).

The ported glibc readdir/friends works on all three, and it is not caching the whole thing.. I'm doing the deletes in batches of 5, and every 5 deletes I see it doing:

 95985 testit2  RET   write 30/0x1e
 95985 testit2  CALL  write(0x1,0x801007000,0x18)
 95985 testit2  GIO   fd 1 wrote 24 bytes
       "file file-1756 returned
       "
 95985 testit2  RET   write 24/0x18
 95985 testit2  CALL  write(0x1,0x801007000,0x1d)
 95985 testit2  GIO   fd 1 wrote 29 bytes
       "Seeking back to location 144
       "
 95985 testit2  RET   write 29/0x1d
 95985 testit2  CALL  lseek(0x3,0x90,SEEK_SET)
 95985 testit2  RET   lseek 144/0x90
 95985 testit2  CALL  write(0x1,0x801007000,0x1e)
 95985 testit2  GIO   fd 1 wrote 30 bytes
       "telldir assigned location 144
       "
 95985 testit2  RET   write 30/0x1e
 95985 testit2  CALL  getdents(0x3,0x801008000,0x20000)
 95985 testit2  RET   getdents 79464/0x13668
 95985 testit2  CALL  write(0x1,0x801007000,0x26)
 95985 testit2  GIO   fd 1 wrote 38 bytes
       "readdir (144) returned file file-1756
       "

so the matrix appears to be:

              ZFS   UFS   TMPFS  EXT2FS
 BSD GLIBC    OK    OK    OK     .
 BSD LIBC     BAD   BAD   BAD    .
 LINUX GLIBC  .     .     .      OK
 KFREEBSD     OK    OK    .      .

(tmpfs and ZFS on 10.1, UFS on i386/-current (March); those are the machines I have..)

The BSD libc appeared to work on ZFS until I had more files... I think because the block size is bigger... then it failed.
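
[Sketch, for readers following along: the access pattern being discussed (read a few entries, note the position with telldir(), delete those entries, seekdir() back to the noted position, carry on) could look roughly like the C below. This is not the testit source from the URL in this mail. The names populate(), count_entries(), delete_in_batches() and BATCH_MAX are made up for illustration, and the file naming only imitates the "file-NNNN" names visible in the trace. Whether anything is left behind afterwards depends on the libc and the filesystem, which is the whole point of the thread, so the sketch reports counts rather than assuming an outcome.]

```c
#define _XOPEN_SOURCE 700	/* telldir()/seekdir(), dirfd(), unlinkat() */
#include <sys/stat.h>
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH_MAX 64	/* arbitrary cap on the batch size for this sketch */

/* Create 'dir' if needed and fill it with file-0 .. file-(nfiles-1). */
int
populate(const char *dir, int nfiles)
{
	char path[PATH_MAX];

	if (mkdir(dir, 0700) != 0 && errno != EEXIST)
		return (-1);
	for (int i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		int fd = open(path, O_CREAT | O_WRONLY, 0600);
		if (fd < 0)
			return (-1);
		close(fd);
	}
	return (0);
}

/* Count entries other than "." and ".." using a fresh directory stream. */
int
count_entries(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	int n = 0;

	if (d == NULL)
		return (-1);
	while ((de = readdir(d)) != NULL)
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
			n++;
	closedir(d);
	return (n);
}

/*
 * Read up to 'batch' names, remember the position with telldir(), unlink
 * those names, seekdir() back to the remembered position, and repeat until
 * readdir() reports end of directory.  Returns the number of successful
 * unlinks, or -1 if the directory cannot be opened.
 */
int
delete_in_batches(const char *dir, int batch)
{
	char names[BATCH_MAX][NAME_MAX + 1];
	struct dirent *de;
	DIR *d;
	int deleted = 0, n;

	assert(batch > 0 && batch <= BATCH_MAX);
	if ((d = opendir(dir)) == NULL)
		return (-1);
	do {
		n = 0;
		while (n < batch && (de = readdir(d)) != NULL) {
			if (strcmp(de->d_name, ".") == 0 ||
			    strcmp(de->d_name, "..") == 0)
				continue;
			snprintf(names[n++], sizeof(names[0]), "%s", de->d_name);
		}
		long pos = telldir(d);	/* where the next batch should start */
		for (int i = 0; i < n; i++)
			if (unlinkat(dirfd(d), names[i], 0) == 0)
				deleted++;
		seekdir(d, pos);	/* resume after the deletes */
	} while (n == batch);
	closedir(d);
	return (deleted);
}
```

[To try it: populate("test2", 40000), then delete_in_batches("test2", 5), then count_entries("test2"). A non-zero count at the end means some entries were skipped after a seekdir().]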
I've put both versions and the test program at:
https://people.freebsd.org/~julian/readdir/

testit is linked with dir.c, which is our code extracted out to a standalone file.
testit2 is the same test program linked with dir2.c, which is the glibc-based code smashed into our format and massively cleaned up.

Both will write 40,000 files to the directory 'test2' and then delete them in chunks...
The glibc-inspired one hasn't been seen to fail yet, but I'm not sure what it's doing is actually kosher.
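
[Sketch, for comparison: the portable alternative raised earlier in the thread is for the caller to do its own caching, i.e. read every name first and only then start deleting, so that nothing depends on seekdir() behaviour across deletes. A minimal version, assuming the whole list of names fits in memory, might look like the C below. snapshot_names() and delete_snapshot() are hypothetical helpers written for this note; they are not Samba or libc functions.]

```c
#define _XOPEN_SOURCE 700	/* strdup(), unlinkat(), O_DIRECTORY */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Collect every entry name in 'path' except "." and ".." into a malloc'd
 * array stored in *out.  Returns the number of names, or -1 on error.
 */
int
snapshot_names(const char *path, char ***out)
{
	struct dirent *de;
	char **names = NULL;
	int n = 0, cap = 0;
	DIR *d;

	assert(path != NULL && out != NULL);
	if ((d = opendir(path)) == NULL)
		return (-1);
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		if (n == cap) {		/* grow the array geometrically */
			int ncap = cap ? cap * 2 : 64;
			char **tmp = realloc(names, ncap * sizeof(*tmp));
			if (tmp == NULL)
				goto fail;
			names = tmp;
			cap = ncap;
		}
		if ((names[n] = strdup(de->d_name)) == NULL)
			goto fail;
		n++;
	}
	closedir(d);
	*out = names;
	return (n);
fail:
	while (n > 0)
		free(names[--n]);
	free(names);
	closedir(d);
	return (-1);
}

/*
 * Unlink every name in the snapshot, relative to 'path', and free the
 * snapshot.  Returns the number of successful unlinks, or -1 if 'path'
 * cannot be opened (the snapshot is not freed in that case).
 */
int
delete_snapshot(const char *path, char **names, int n)
{
	int dfd = open(path, O_RDONLY | O_DIRECTORY);
	int removed = 0;

	if (dfd < 0)
		return (-1);
	for (int i = 0; i < n; i++) {
		if (unlinkat(dfd, names[i], 0) == 0)
			removed++;
		free(names[i]);
	}
	free(names);
	close(dfd);
	return (removed);
}
```

[The trade-off is the one already pointed out above: the caller sees a snapshot, so files created or removed by someone else after the scan go unnoticed until the directory is scanned again.]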