From owner-freebsd-arch Sat May 4 0:58:43 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
    by hub.freebsd.org (Postfix) with ESMTP id A4CE337B416
    for ; Sat, 4 May 2002 00:58:37 -0700 (PDT)
Received: from fledge.watson.org (fledge.pr.watson.org [192.0.2.3])
    by fledge.watson.org (8.12.3/8.12.3) with SMTP id g447wMb5023842;
    Sat, 4 May 2002 03:58:22 -0400 (EDT)
    (envelope-from robert@fledge.watson.org)
Date: Sat, 4 May 2002 03:58:21 -0400 (EDT)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: cjclark@alum.mit.edu
Cc: arch@freebsd.org
Subject: Re: df(1) Broken in jail(8)
In-Reply-To: <20020503203340.A74245@blossom.cjclark.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Fri, 3 May 2002, Crist J. Clark wrote:

> The df(1) utility is broken in jail(8) environments. I could understand
> if it was totally broken, there are things you can't and shouldn't be
> able to do in a jail(8). However, df(1) behavior is inconsistent.

Arguably yes.

> The getmntinfo(3) function (via the getfsstat(2) call) works in a
> jail(8). When the output is generated from its output, df(1) works (but
> the info isn't offset to the jail(8)'s root). However, when one
> specifies individual filesystems or uses the '-t' option, the
> information on the mount point is gathered using a statfs(2) call.
> Since this takes a path, which will be offset to the jail(8) root when
> processed, as an argument, the results are basically broken.

This is a property of the way VFS works with regard to pathnames. In
VFS, pathnames don't really exist, there are just vnodes. :-)
getmntinfo() and related calls cache a pathname at mount-time, which is
then regurgitated later when requested.
That path isn't just wrong in jail, it's frequently wrong outside of
jail. For example,

  mount /dev/ad0s1e /mnt/this/is/one/path
  mv /mnt/this/is /mnt/this/was
  df

The path reported is now wrong. As you've observed, when you use
chroot(), it's also wrong. In fact, there are countless ways to make it
wrong. I don't think it's reasonable to ever expect it to be right given
how our VFS works. Nothing says, for example, that you can't mount from
within a chroot() -- I do it all the time in at least one diskless
environment I work with.

> There are several ways to fix this, and I've come here for opinions.
>
> 1) One can not use statfs(2) for '-t,' but stick with
>    getmntinfo(3)'s info only. But it makes some sense to stick with
>    statfs(2) for file arguments provided to df(1). This is fairly
>    easy to implement.

It depends what you want df to do. Typically, df and mount report the
path where the filesystem was mounted. That's what you get with
statfs(). In fact, I know a number of people who abuse df to figure out
what filesystem a directory is actually in. Viz.

  cd /usr/obj
  df .

As pointed out, this can be wrong in a variety of situations.

> 2) One can remove the ability to use df(1) at all in a jail(8). It
>    could be argued that there is no real reason to be able to use
>    things like getfsstat(2) or statfs(2) in a jail(8) (but what else
>    might this break?). This is easy to do.

statfs() is used by several applications that you wouldn't expect. Try
turning off statfs() sometime and see what happens. Applications like to
query the block size, check for free space, etc.

> 3) One can fix getfsstat(2) and statfs(2) so they are
>    "jail(8)-aware." That is, getfsstat(2) knows only to return info
>    on filesystems mounted at or above the jail(8)'s root. Both calls
>    learn how to offset their mountpoint names to the jail root. This
>    is harder.

This is effectively impossible, or at least, very difficult.
Suppose I construct a file system hierarchy like so:

  mount -t procfs proc /proc
  mount /dev/ad0s1e /usr
  mount /dev/ad1s1a /jailroot
  mount /dev/ad1s1e /jailroot/var
  mount /dev/ad1s1e /jailroot/usr
  mount -t procfs proc /jailroot/proc
  mount -t devfs devfs /jailroot/dev

For each filesystem in jail, how do I determine which to report? Now
imagine some of those were mounted under various chroot()'s as the
system booted. Imagine they have different blocksizes, etc, etc.

Interestingly, the trustedbsd_mac tree does permit filtering of
getmntinfo() and statfs(), and accepts the application breakage, but it
relies on policies that have a much more well-defined notion of what
should be visible. For example, it knows how to block display of
information in df based on confidentiality labels: if a filesystem was
mounted by a high-confidentiality process, a low-confidentiality process
won't be able to statfs it, and won't be able to see it in the list of
mounted filesystems. But that is possible because the decision is based
on a tuple with well-defined parameters: (process sensitivity label,
filesystem sensitivity label). With jail, you can't "just look" at the
struct mount and trivially determine if it's visible to a jail or not,
because jail visibility is subject to the whims of VFS at any given
moment. Filesystems are welcome to spit back vnodes you wouldn't expect
-- for example, for the longest time procfs would return the actual
vnode of the executable run most recently by the process. It didn't
return a procfs vnode, but a real reference to p_vtextvp.

> And is anyone already working on this?

There are some patches in the PR collection. I've generally objected to
them. The best hack I've seen so far simply restricts statfs() and
getmntinfo() operations to returning information only on the filesystem
matching the root directory of the current process. Sure, this is broken
for /proc or /dev in jail, among other things, but it's a simple hack
and unbreaks many applications.

It may be there are cool solutions that we haven't thought of yet that
avoid the properties I've described above. However, I'd appreciate the
chance to review any changes you come up with before you commit them, so
I can look out for the usual complications.

Just as a suggestion when experimenting with this: generate some
statistics on how much getmntinfo() and statfs() are invoked by your
system over the course of a 24 hour period of normal system use.
Instrument both system calls so that you get printfs to /dev/log
indicating when they were called, and print out p_comm. I think you'll
be surprised by the result, especially if you run windowing systems (or
ls).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message