Date: Mon, 11 Mar 2002 16:16:48 -0500 (EST)
From: Robert Watson <rwatson@FreeBSD.ORG>
To: Harti Brandt <brandt@fokus.gmd.de>
Cc: Garance A Drosihn <drosih@rpi.edu>, Poul-Henning Kamp <phk@critter.freebsd.dk>, arch@FreeBSD.ORG
Subject: Re: Increasing the size of dev_t and ino_t
Message-ID: <Pine.NEB.3.96L.1020311160835.46602A-100000@fledge.watson.org>
In-Reply-To: <20020311172142.K1371-100000@beagle.fokus.gmd.de>
On Mon, 11 Mar 2002, Harti Brandt wrote:

> I suppose the AFS volumes themselves have some kind of unique identifier,
> otherwise there would be no way to tell that you are mounting the same
> volume in different places, there wouldn't even be the notion of 'the
> same volume'. Given that, it should be simple to map between those AFS
> volume identifiers and st_dev's. How this mapping is done depends on the
> kind of the volume id. If you have 33,000 mounts in your system, adding a
> uint32_t to each of these mounts will not be your main problem.

AFS, Coda, and various other "global scale" filesystems rely on a much larger unique identifier space than the traditional 64-bit (dev_t, ino_t) pair. Coda, for example, uses a 96-bit "Vice ID" which is per-realm. That is partitioned into volume IDs and individual file IDs, which are similar to "filesystems" and "inode numbers".

However, the problem occurs because our mount system doesn't scale to the level required for Coda or AFS to function. As such, Coda and AFS have their own light-weight mounting scheme inside the filesystem implementation, so it appears to the kernel as though it's a single huge filesystem, rather than a composite of many filesystems. In AFS, these mountpoints are stored in symlinks identifying the realm and volume name of the target.

The complicating factor comes when you try to map the 96-bit ID (plus realm) into the 32-bit inode number. FreeBSD runs fine, but some applications that assume the POSIX convention, namely that equal (device number, inode number) pairs identify the same file, behave poorly. For example, GNU tar may find collisions and assume files are hard links when they are not. Linux, on the other hand, uses the inode numbers within the kernel, and may panic if there is a collision.

The "uniqueness" aspect for these numbers is a serious scaling problem: global filesystems can and will name trillions of file system objects. Squeezing them into a single 32-bit number, or even a pair, simply doesn't work.
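To make the collision concrete, here is a minimal sketch in C, not taken from Coda or AFS. The struct layout (three 32-bit fields), the names vice_id, vice_to_ino32 and vice_equal, and the XOR fold itself are all assumptions chosen for illustration. Any map from 96 bits to 32 bits must collide by the pigeonhole principle; a better-mixing hash only makes the collisions harder to hit by accident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 96-bit identifier, modelled as three 32-bit fields.
 * The real on-wire layout is an assumption here. */
struct vice_id {
    uint32_t volume;
    uint32_t vnode;
    uint32_t unique;
};

/* Deliberately naive fold of 96 bits into a 32-bit inode number.
 * Illustrative only: not the hash any real filesystem uses. */
static uint32_t
vice_to_ino32(const struct vice_id *id)
{
    assert(id != NULL);
    return id->volume ^ id->vnode ^ id->unique;
}

/* Authoritative comparison on the full identifier. */
static bool
vice_equal(const struct vice_id *a, const struct vice_id *b)
{
    assert(a != NULL && b != NULL);
    return a->volume == b->volume &&
           a->vnode == b->vnode &&
           a->unique == b->unique;
}
```

With this fold, the distinct identifiers {1, 2, 3} and {3, 2, 1} produce the same 32-bit value, which is exactly the situation in which a tool keyed on inode numbers would mistake two files for hard links of each other.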
Moving to a 64-bit inode number in FreeBSD would reduce the chances of a collision dramatically, and probably enough that the risk would become acceptable.

A preferred solution approximates the POSIX conventions but allows for a special call into the filesystem to check collision cases. I actually implemented this on FreeBSD at one point. The filesystem implementation attempts to maintain a unique inode number by hashing the vice ID. For applications maintaining tables, such as tar, a collision can be resolved by calling samefile() or fsamefile(), which compare the vnode pointers, or call into the individual filesystem to inquire using a VOP. In this manner, the efficiency gains are largely still present, except that the identical values are a hint as opposed to a guarantee.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services
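The "hint, not guarantee" lookup described above could be sketched roughly as follows. This is a user-space illustration under stated assumptions, not the FreeBSD implementation: struct file_obj, its true_id field, same_object() and find_hard_link() are hypothetical names, and same_object() merely stands in for an authoritative check such as the samefile() call mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record an archiver might keep per file it has seen. */
struct file_obj {
    uint32_t dev;      /* st_dev-like value */
    uint32_t ino;      /* st_ino-like value, possibly a hash */
    uint64_t true_id;  /* stand-in for what only the filesystem knows */
};

/* Hypothetical stand-in for an authoritative samefile()-style check. */
static bool
same_object(const struct file_obj *a, const struct file_obj *b)
{
    return a->true_id == b->true_id;
}

/* Return the earlier entry that really is the same object as f,
 * or NULL if there is none. A matching (dev, ino) pair is treated
 * only as a hint that must be confirmed. */
static const struct file_obj *
find_hard_link(const struct file_obj *seen, size_t nseen,
               const struct file_obj *f)
{
    assert(f != NULL);
    assert(seen != NULL || nseen == 0);
    for (size_t i = 0; i < nseen; i++) {
        if (seen[i].dev != f->dev || seen[i].ino != f->ino)
            continue;              /* cheap rejection, the common case */
        if (same_object(&seen[i], f))
            return &seen[i];       /* hint confirmed */
        /* hint was a collision: keep scanning */
    }
    return NULL;
}
```

The expensive check runs only when the cheap (dev, ino) comparison matches, so the table lookup stays as fast as before for the common case while collisions no longer produce false hard links.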