From owner-freebsd-bugs@FreeBSD.ORG Thu Feb 5 05:17:07 2015
Date: Thu, 5 Feb 2015 16:16:55 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: bugzilla-noreply@freebsd.org
Cc: freebsd-bugs@freebsd.org
Subject: Re: [Bug 197336] find command cannot see more than 32765
 subdirectories when using ZFS
Message-ID: <20150205150044.M1011@besplex.bde.org>

On Wed, 4 Feb 2015 bugzilla-noreply@freebsd.org wrote:

> Created attachment 152566
> --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=152566&action=edit
> python script to generate a bunch of subdirectories with files in them

This may be considered a feature -- it detected a bad script that created too many
files.

> When a directory has more than 32765 subdirectories in it, the find command
> fails to find all of the contents if the find command is executed in a ZFS
> filesystem.

FreeBSD only supports file systems that support at most 32767 links.  This
is mainly a problem for subdirectories, since each subdirectory has a ".."
link to the same parent.  FreeBSD could support at most 65535 links, but
raising the limit would break API compatibility (although the API is already
broken), and going beyond that would break binary compatibility.  The limit
of 65535 is from nlink_t being 16 bits unsigned, and the limit of 32767 is a
bug that has survived for more than 20 years to keep the API bug-for-bug
compatible.

There are many bugs in this support.  Most are in individual file systems.
At the top level, the only known bug is that LINK_MAX is defined at all (as
32767).  Defining it means that the limit {LINK_MAX}, i.e.,
pathconf(path, _PC_LINK_MAX), is the same for all files on all file systems,
but FreeBSD supports many file systems with widely varying {LINK_MAX}.

Some file systems actually implement {LINK_MAX} correctly as the limit that
applies to them:
- this is easy if it is <= LINK_MAX.  If it is < LINK_MAX, this is
  incompatible with the definition of LINK_MAX, but any software that is
  naive or broken enough to use LINK_MAX probably won't notice any problem.
- if it is > LINK_MAX but <= 65535, then returning the correct limit in
  pathconf() is again incompatible with LINK_MAX being smaller, and this now
  breaks the naive/broken software (e.g., arrays sized with LINK_MAX may be
  overrun).
- if it is > 65535, then FreeBSD cannot support the file system properly.
  However, if there are no files with more than 65535 links at mount time,
  then it is good enough to maintain this invariant.  The python script
  should break trying to create the 65536th file in this case (even earlier
  on file systems with a smaller limit).
If there is just one file with more than 65535 links, then the file is not
supported, and perhaps all operations on it should fail, starting with
stat() to see what it is, since stat() cannot return its correct link count
and has no way of reporting this error except by failing the whole syscall.

zfs apparently supports a large number of links, but has many bugs:
- in pathconf() it says that {LINK_MAX} is INT_MAX for all files
- in stat() (VOP_GETATTR()) it breaks this even for files with a link count
  between 32768 and 65535 inclusive, by clamping to LINK_MAX = 32767.

This inconsistency explains the behaviour seen.  The python script might be
sophisticated to a fault, yet believe the broken {LINK_MAX}: it might do
fancy splitting of subdirectories to avoid hitting the limit, but not do any
splitting since the advertised limit is large.  Then find might be confused
by stat() returning the clamped number of links.  I suspect that the actual
reasons are more complicated.  find doesn't use link counts much directly,
but it uses fts, which probably makes critical use of them.

nfs is much more broken than zfs here.  The server file system may support
anything for {LINK_MAX} and st_nlink.  nfs seems to blindly assign the
server values (except for {LINK_MAX} in the v2 case, where it invents a
value).  So if {LINK_MAX} > 65535 on the server, the large server value is
normally returned (not truncated, since rlim_t is large enough for
anything).  This matches the zfs behaviour of returning a
larger-than-possible value.  But if st_nlink > 65535 on the server, it is
blindly truncated to a value <= 65535 (possibly 0, but not negative since
nlink_t is unsigned.  Oops, va_nlink is still short, so negative values
occur too).  This is more dangerous than the clamping in zfs.  nfs mostly
uses va_nlink internally, and uses it in a critical way for at least the
test (va_nlink > 1).  Truncation to a signed value breaks this for all
values that were between 32768 and 65535 before truncation.
Truncation to an unsigned value would have only broken it for 65536;
similarly for all values equal mod 65536 (or 32768).

> If the same command is executed in another filesystem that FreeBSD supports
> that also supports large counts of subdirectories, the find command sees
> everything.  I've confirmed the correct behavior with both Reiserfs and
> unionfs.  So it appears to be something about the interaction between find
> and ZFS that triggers the bug.

It is impossible for the other file systems to work much better.  Perhaps
they work up to 65535, or have the correct {LINK_MAX} and the python script
is smart enough to avoid it.  I doubt that python messes with {LINK_MAX},
but creation of subdirectories should stop when the advertised limit is
hit, and python or the script should handle that, possibly just by
stopping.

Bruce