Date: Sat, 25 Jun 2011 10:04:20 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Benjamin Kaduk <kaduk@MIT.EDU>
Cc: Garance A Drosehn <gad@freebsd.org>, freebsd-fs@freebsd.org, Robert Watson <rwatson@freebsd.org>, shadow@gmail.com
Subject: Re: [rfc] 64-bit inode numbers
Message-ID: <1182998178.1062689.1309010660304.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <alpine.GSO.1.10.1106242244170.6818@multics.mit.edu>

Benjamin Kaduk wrote:
> Hmm, several messages regarding AFS that I will try to address at
> once.
>
>
> On Fri, 24 Jun 2011, Kostik Belousov wrote:
> > On Thu, Jun 23, 2011 at 06:05:56PM -0400, Garance A Drosehn wrote:
> >> Consider the thread "Increasing the size of dev_t and ino_t" from
> >> freebsd-arch in 2002:
> >>
> >> http://docs.freebsd.org/mail/archive/2002/freebsd-arch/20020317.freebsd-arch.html
> >>
> >> In particular, this message by Robert Watson:
> >>
> >> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=139853+0+archive/2002/freebsd-arch/20020317.freebsd-arch
> >>
> >> I just participated in an online conference for OpenAFS, and while
> >> it isn't exactly taking the world by storm, I keep thinking it
> >> would be useful if FreeBSD could map individual AFS volumes to
> >> unique dev_t identifiers. And given the way AFS is implemented (as
> >> a global filesystem with many cells all reachable at the same
> >> time), and given the way most sites deploy AFS (with thousands or
> >> tens-of-thousands of individual AFS volumes *per site*), that adds
> >> up to a lot of values for dev_t.
> >>
> >> The upcoming release of OpenAFS should include a working and pretty
> >> stable AFS client for FreeBSD, so having a larger dev_t would have
> >> a more immediate application than it did back in 2002.
> > Am I right that the issue is the uniqueness of the dev_t for each
> > AFS volume, as reported by stat(2)?
> >
> > Shouldn't the AFS client synthesize the dev_t for each new volume
> > mounted? It seems that the current 32bit dev_t would be enough,
> > since I do not expect to see hundreds of thousands of mounts
> > on a single system.
>
> The current OpenAFS implementation only presents a single mountpoint,
> /afs, and does not really distinguish between different mounted
> volumes. This is not ideal, and we would like to be able to make each
> volume appear as a separate device if there's a good way to do so.
> The technical challenge of doing this while still only having a single
> mount method for AFS is not something that I've looked at, there being
> more pressing issues on my plate.
>
With a single mount point in the client (struct mount *), if the st_dev
remains the same throughout the mountpoint, then all st_ino's must be
unique (i.e. no duplicate ino# == 2 or similar) or fts(3) complains
about cycles in the tree and gives up. (Shows up when you do "ls -lR".)

On the other hand, if st_dev changes within the single client
mountpoint, then the value of d_ino in the directory entry for it (I've
heard of this being referred to as the "mounted on inode#") must be
different than the st_ino reported for the object via stat(2), or
getcwd() gets confused, if I recall correctly.

> >
> > Please note that we do not guarantee dev_t stability across reboots
> > even for real devices.
>
> Hmm, this is somewhat annoying, as the AFS global namespace does
> provide a stable unique identifier for files/directories using a
> unique cell ID, volume ID, per-file ID, and uniquifier. Being able to
> directly use the cell/volume information for a dev_t would be quite
> convenient.
>
>
> On Fri, 24 Jun 2011, Bruce Evans wrote:
> >
> > mnt_stat.f_fsid is generated from the dev_t, and tries to give
> > stability across reboots. Otherwise, IIRC, nfs mounts break if the
> > server is rebooted. Not only the dev_t part, but other things in
> > f_fsid, depend on the order of initialization, but the ids usually
> > end up the same if you don't reconfigure much on the server.
> >
> > f_fsid also has a problem with uniqueness, but that is mainly because
> > it wants to be unique when truncated to a 16-bit dev_t. dev_t is only
> > 16 bits in some versions of Linux, including in the FreeBSD i386
> > Linux emulator (I can see traces of 32-bit dev_t in Linux-2.6.10 but
> > not in the FreeBSD emulator).
> >
> > I hope AFS ids could be implemented like fsids and not need to
> > literally match foreign ids, but if they are synthesized then they
> > might be harder than fsids to keep invariant across reboots.
>
> I'm not sure how one would have a chance of keeping things invariant
> across reboots other than to use the cell/volume IDs in some fashion.
> That said, the AFS client maintains its own copy of these unique IDs
> in the fs-specific vnode area, and should be able to talk to the
> server just fine if the fsids end up faked. Keeping the fake fsids
> consistent if a file goes in and out of the local cache may be a
> different issue, though.
>
>
> On Fri, 24 Jun 2011, Rick Macklem wrote:
>
> > Garance A Drosehn wrote:
> >> The AFS cell at RPI has approximately 40,000 AFS volumes, and each
> >> volume should have its own dev_t (IMO). That's just counting the
> >> collection of AFS volumes which are on RPI file servers, and any
> >> user sitting on one computer could access AFS volumes which are
> >> made available by other sites (aka "AFS cells"). Most RPI users
> >> would only have access to maybe 1/4 of those volumes which exist
> >> at RPI, but we do know that individual users have run 'find' over
> >> the entire RPI cell looking for whatever they're looking for. I
> >> once did a run of 'md5deep' on the entire RPI cell, thanks to a
> >> symlink which I didn't realize was in my home directory!
>
> We have almost 50,000 volumes in the athena cell, here.
>
> >>
> > Note that it is the value in mnt_stat.f_fsid that needs to be unique
> > w.r.t. other mount points in the machine. If AFS appears to be one
> > mount point in the FreeBSD client, then the only issue I know of is
> > how the client is expected to handle changes in dev_t within the
> > mount
>
> Er, how is the client expected to communicate these changes? As
> mentioned above, I believe we currently present only a single device
> and mountpoint, which is suboptimal.
> (Actually, it looks like we don't even initialize mnt_stat.f_fsid at
> all if I'm reading the current code correctly. Oops.)
> I would love to be able to present volume mountpoints as actually
> being mountpoints.
>
> > point. fts(3) and friends will assume that it is a mount point
> > crossing when st_dev changes. It will then expect the funny rule
> > that the d_ino in dirent will not be the same as st_ino.
> >
> > What I do for NFSv4 is synthesize the mnt_stat.f_fsid value and
> > return that as st_dev for the mounted volume until I see the fsid
> > returned by the server change. Below that point, I return the fsid
> > from the server as st_dev so long as it isn't the same as the
>
> I think I'm confused. You're ... walking a directory hierarchy, and
> return a fake st_dev value but hold onto the fsid value from the
> server, then when the fsid from the server changes (due to a ...
> different NFS mount?), start reporting that new fsid and throw away
> the fake st_dev value? Can you point me at the code that is doing
> this?
>
> > synthesized one. That way, fts(3) and friends figure out the mount
> > point crossings within the server.
> >
> > "ls -lR" will usually find problems if this is broken.
> >> So one person can easily trigger the access of 10,000 AFS volumes
> >> on one computer using one command. That might sound terrifying if
> >> you imagine it as being 10,000 NFS mounts, but accessing AFS
> >> volumes isn't the same amount of work as auto-mounting NFS
> >> filesystems. So ignore whatever problems you might expect to see
> >> with 10,000 filesystems mounted on one computer. Just realize that
> >> it is very easy for a single user to access tens of thousands of
> >> AFS volumes from one computer, and it would be "most correct"
> >> (programming-wise) if all of those AFS volumes were to get a unique
> >> value for dev_t.
> >> And of course it's even easier for a remote-access system to access
> >> tens-of-thousands of AFS volumes, since it would have a few dozen
> >> users logged in at the same time.
> >>
> >
>
> I guess, at the end of the day, it's not clear to me what OpenAFS
> should do when we finally get around to exposing AFS volume
> mountpoints as device mountpoints to userland. Reusing existing
> globally-unique AFS ID information would be nice, but how to cleanly
> transform that to a smaller unique ID for the particular machine in
> question?
>
> -Ben Kaduk
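
One possible shape of an answer to that closing question, as a rough
userland sketch only: keep a per-machine table that hands out a small
local ID the first time a (cell ID, volume ID) pair is seen and returns
the same ID on later lookups. Every name below (devmap, devmap_lookup,
DEVMAP_BASE, the table size) is hypothetical and is not OpenAFS or
FreeBSD API. The sketch assumes 32-bit cell and volume IDs, never evicts
entries, and makes no attempt to keep IDs stable across reboots, which
matches the caveat above that dev_t stability is not guaranteed even for
real devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is OpenAFS or FreeBSD API. */
#define DEVMAP_SLOTS 1024u        /* table size, must be a power of two */
#define DEVMAP_BASE  0x41000000u  /* arbitrary nonzero base for local IDs */

struct devmap_entry {
	uint32_t cell;      /* AFS cell ID (assumed to fit in 32 bits) */
	uint32_t volume;    /* AFS volume ID (assumed to fit in 32 bits) */
	uint32_t local_id;  /* synthesized per-machine ID; 0 marks an empty slot */
};

struct devmap {
	struct devmap_entry slot[DEVMAP_SLOTS];
	uint32_t next_id;   /* next local ID to hand out */
};

static void
devmap_init(struct devmap *m)
{
	assert(m != NULL);
	for (size_t i = 0; i < DEVMAP_SLOTS; i++)
		m->slot[i].local_id = 0;
	m->next_id = DEVMAP_BASE;
}

/*
 * Return the local ID for (cell, volume), assigning a fresh one the first
 * time the pair is seen.  Uses open addressing with linear probing; since
 * entries are never deleted, an empty slot ends the search.  Returns 0 if
 * the table is full.
 */
static uint32_t
devmap_lookup(struct devmap *m, uint32_t cell, uint32_t volume)
{
	/* Simple multiplicative mix of the two IDs; unsigned wraparound is fine. */
	uint32_t h = (cell * 2654435761u) ^ (volume * 40503u);

	assert(m != NULL);
	for (uint32_t probe = 0; probe < DEVMAP_SLOTS; probe++) {
		struct devmap_entry *e =
		    &m->slot[(h + probe) & (DEVMAP_SLOTS - 1)];

		if (e->local_id == 0) {
			/* First sighting of this pair: claim the slot. */
			e->cell = cell;
			e->volume = volume;
			e->local_id = m->next_id++;
			return (e->local_id);
		}
		if (e->cell == cell && e->volume == volume)
			return (e->local_id);
	}
	return (0);
}
```

A caller would run devmap_init() once and then call
devmap_lookup(&map, cell, volume) wherever a per-volume st_dev is
needed. Because IDs are assigned in first-seen order, they depend on
access order; keeping them consistent as files go in and out of the
local cache would need the table to outlive the cache entries, which
this sketch does by never removing anything.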