From owner-freebsd-current@FreeBSD.ORG Tue May 5 11:04:47 2015 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 337D91FA; Tue, 5 May 2015 11:04:47 +0000 (UTC) Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id A21F31120; Tue, 5 May 2015 11:04:46 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CsBACWokhV/95baINWBoNfXAWDGMJogUwKhTdOAoFxEgEBAQEBAQGBCoQgAQEBAwEBAQEaBisgCwUWGAICDRkCKQEJGAENBggHBAEcBIgDCA2wEZNTAQEBAQEFAQEBAQEBAQEBGYEhihiEMwEBDQ8BMweCaIFFBZYwgkaBTINVPYYOim2DUyOBZR8DHIFtIjEHgQQ5gQEBAQE X-IronPort-AV: E=Sophos;i="5.13,372,1427774400"; d="scan'208";a="208694941" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-jnhn.mail.uoguelph.ca with ESMTP; 05 May 2015 07:04:44 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id E2C09B3F45; Tue, 5 May 2015 07:04:43 -0400 (EDT) Date: Tue, 5 May 2015 07:04:43 -0400 (EDT) From: Rick Macklem To: Julian Elischer Cc: Jilles Tjoelker , Jilles Tjoelker , "current@freebsd.org" Message-ID: <1741454718.31580103.1430823883918.JavaMail.root@uoguelph.ca> In-Reply-To: <55482CAF.8010905@freebsd.org> Subject: Re: seekdir/readdir patch .. Call for Review. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.95.10] X-Mailer: Zimbra 7.2.6_GA_2926 (ZimbraWebClient - FF3.0 (Win)/7.2.6_GA_2926) X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 May 2015 11:04:47 -0000 Julian Elischer wrote: > On 5/5/15 8:42 AM, Rick Macklem wrote: > > Julian Elischer wrote: > >> On 5/3/15 10:33 PM, Jilles Tjoelker wrote: > >>> On Fri, May 01, 2015 at 07:17:42PM +0300, Konstantin Belousov > >>> wrote: > >>>> On Fri, May 01, 2015 at 03:04:51PM +0800, Julian Elischer wrote: > >>>>> if you are interested in readdir(3), seekdir(3) and telldir(3) > >>>>> then > >>>>> you should look at > >>>>> https://reviews.freebsd.org/D2410 > >>>>> this patches around a problem in seekdir() that breaks Samba. > >>>>> Seekdir(3) will not work as expected when files prior to the > >>>>> point of > >>>>> interest in directory have been deleted since the directory was > >>>>> opened. > >>>>> Windows clients using Samba cause both these things to happen, > >>>>> causing > >>>>> the next readdir(3) after the bad seekdir(3) to skip some > >>>>> entries > >>>>> and > >>>>> return the wrong file. > >>>>> Samba only needs to step back a single directory entry in the > >>>>> case > >>>>> where it reads an entry and then discovers it can't fit it into > >>>>> the > >>>>> buffer it is sending to the windows client. It turns out we can > >>>>> reliably cater to Samba's requirement because the "last > >>>>> returned > >>>>> element" is always still in memory, so with a little care, we > >>>>> can > >>>>> set our filepointer back to it safely. (once) > >>>>> seekdir and readdir (and telldir()) need a complete rewrite > >>>>> along > >>>>> with > >>>>> getdirentries() but that is more than a small edit like this. > >>>> Can you explain your expectations from the whole readdir() vs. > >>>> parallel > >>>> directory modifications interaction ? From what I understood so > >>>> far, > >>>> there is unlocked modification of the container and parallel > >>>> iterator > >>>> over the same container. IMO, in such situation, whatever > >>>> tweaks > >>>> you > >>>> apply to the iterator, it is still cannot be made reliable. > >>>> Before making single-purpose changes to the libc readdir and > >>>> seekdir > >>>> code, or to the kernel code, it would be useful to state exact > >>>> behaviour > >>>> of the dirent machinery we want to see. No, 'make samba works in > >>>> my > >>>> situation' does not sound good enough. > >>> Consider the subsequence of entries that existed at opendir() > >>> time > >>> and > >>> were not removed until now. This subsequence is clearly defined > >>> and > >>> does > >>> not have concurrency problems. The order of this subsequence must > >>> remain > >>> unchanged and seekdir() must be correct with respect to this > >>> subsequence. > >>> > >>> Additionally, two other kinds of entries may be returned. New > >>> entries > >>> may be inserted anywhere in between the entries of the > >>> subsequence, > >>> and > >>> removed entries may be returned as if they were still part of the > >>> subsequence (so that not every readdir() needs a system call). > >>> > >>> A simple implementation for UFS-style directories is to store the > >>> offset > >>> in the directory (all bits of it, not masking off the lower 9 > >>> bits). > >>> This needs d_off or similar in struct dirent. The kernel > >>> getdirentries() > >>> then needs a similar loop as the old libc seekdir() to go from > >>> the > >>> start > >>> of the 512-byte directory block to the desired entry (since an > >>> entry may > >>> not exist at the stored offset within the directory block). > >>> > >>> This means that a UFS-style directory cannot be compacted > >>> (existing > >>> entries moved from higher to lower offsets to fill holes) while > >>> it > >>> is > >>> open for reading. An NFS exported directory is always open for > >>> reading. > >>> > >>> This also means that duplicate entries can only be returned if > >>> that > >>> particular filename was deleted and created again. > >>> > >>> Without kernel support, it is hard to get telldir/seekdir > >>> completely > >>> reliable. The current libc implementation is wrong since the > >>> "holes" > >>> within the block just disappear and change the offsets of the > >>> following > >>> entries; the kernel cannot fix this using entries with d_fileno = > >>> 0 > >>> since it cannot know, in the general case, how long the deleted > >>> entry > >>> was in the filesystem-independent dirent format. My previous idea > >>> of > >>> storing one d_fileno during telldir() is wrong since it will fail > >>> if > >>> that entry is deleted. > >>> > >>> If you do not care about memory usage (which probably is already > >>> excessive with the current libc implementation), you could store > >>> at > >>> telldir() time the offset of the current block returned by > >>> getdirentries() and the d_fileno of all entries already returned > >>> in > >>> the > >>> current block. > >>> > >>> The D2410 patch can conceptually work for what Samba needs, > >>> stepping > >>> back one directory entry. I will comment on it. > >>> > >> how long do I have to wait until I can commit this and was kib's > >> comment a > >> "do not commit"? > >> > > What about the bug Jilles reports in D2410? > > - He said you might fix the problem by having telldir move the > > entry > > to the head of the list when it has a hit. However, this means > > that > > an "old" entry gets modified. Is it possible for this "modified" > > entry to be a match and get modified again, and so on...? > > > > I will comment on the patch once you decide how to deal with Jilles > > bug. > I don't think is is a "bug" as such.. > it wasn't a case I was looking to fix and it is just as it was > before. > I'd rephrase it as: "Jilles asks that we also fix the case where the > previous telldir response is > a recycling of an old telldir response." > For the case where it hits a matching entry, the entry at the head of the list is some older entry. Your fixtelldir() changes that older entry. Is that what you want? > In actual fact this scenario will almost never happen. > because the previous time the telldir call returned that location, > the 'fixtelldir' function would have later been called on the > result, > which > would have been modified to point to the start of the NEXT buffer, > and as such would not get matched on the next telldir to the same > place, as that would be looking for the first location after the end > of the previous buffer. OR it does match because we seeked back to > that location, o which case we will get the same buffer alredy fixed. If at this point, it fixes it again by replacing the values with the same ones, then I guess it may be ok. I'll have to take another look at this. I may remove myself from the review, since I'll admit I don't understand the implications of this patch well enough. I'm looking for weird cases where it would cause a regression, because to me, if it doesn't break anything not already broken, I don't have a problem with it. I will comment in phabricator soon (maybe to-night). rick ps: This should probably go in the phabricator comments if it isn't already there. > The two addresses are logically equivalent, > but you can only know that after you have loaded the next buffer. > > i.e. the first time telldir is called on location X it returns > seek=123, location= 4096 (one past end of buffer) > then a read happens and it gets converted to: > seek= 124 location=0 (beginning of next buffer.. > then some more reads happen or so and we > seek back to 124,0 > and do a new telldir, in which case we get a 'pre-fixed' value of > 124,0 not 123,4096. > so while fixtelldir is not called, it doesn't matter. > > If we seek back FURTHER than 124,0 then we are into the "unreliable > unfixed zone". > Assuming that by some luck we don't get confused (there were no > deletes) and we then work forwards back to 123,4096 and do a telldir > again, > we will NOT match the old one, which was changed to be 124,0 > so we will allocate a new one at 123,4096, > which in turn will get 'fixed' to be 124,0 So it worked. the only > 'less > than optimal' thing here is that we now have two entries saying > 124,0. > But the case is so unusual and in a totally unsupported mode of > operation, that as far as I know no-one uses, that adding code > to merge the two entries is just not worth it. It still returns the > correct values, just wastes some memory. Anyone who wants to > do a DOS wasting these 16 bytes of memory, has to do it via writing > code to do it. so.. why would one DOS oneself? > > we COULD mode the item up front but it'd never get matched unless > there had not been a matching read following the first telldir which > is really unlikely. > > > rick > > > >> _______________________________________________ > >> freebsd-current@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-current > >> To unsubscribe, send any mail to > >> "freebsd-current-unsubscribe@freebsd.org" > >> > > > >