From owner-freebsd-arch  Sat May  4  0:58:43 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id A4CE337B416
	for <arch@freebsd.org>; Sat,  4 May 2002 00:58:37 -0700 (PDT)
Received: from fledge.watson.org (fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.12.3/8.12.3) with SMTP id g447wMb5023842;
	Sat, 4 May 2002 03:58:22 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Sat, 4 May 2002 03:58:21 -0400 (EDT)
From: Robert Watson <rwatson@freebsd.org>
X-Sender: robert@fledge.watson.org
To: cjclark@alum.mit.edu
Cc: arch@freebsd.org
Subject: Re: df(1) Broken in jail(8)
In-Reply-To: <20020503203340.A74245@blossom.cjclark.org>
Message-ID: <Pine.NEB.3.96L.1020504034549.21461h-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG


On Fri, 3 May 2002, Crist J. Clark wrote:

> The df(1) utility is broken in jail(8) environments. I could understand
> if it was totally broken, there are things you can't and shouldn't be
> able to do in a jail(8). However, df(1) behavior is inconsistent. 

Arguably yes.

> The getmntinfo(3) function (via the getfsstat(2) call) works in a
> jail(8). When the output is generated from its output, df(1) works (but
> the info isn't offset to the jail(8)'s root). However, when one
> specifies individual filesystems or uses the '-t' option, the
> information on the mount point is gathered using a statfs(2)  call.
> Since this takes a path, which will be offset to the jail(8)  root when
> processed, as an argument, the results are basically broken. 

This is a property of the way VFS works with regard to pathnames.  In
VFS, pathnames don't really exist; there are just vnodes.  :-)
getmntinfo() and related calls return a pathname that was cached at mount
time and is simply regurgitated when requested.  That path isn't just
wrong in jail, it's frequently wrong outside of jail.  For example,
 
  mount /dev/ad0s1e /mnt/this/is/one/path
  mv /mnt/this/is /mnt/this/was
  df 
  
The path reported is now wrong.  As you've observed, when you use
chroot(), it's also wrong.  In fact, there are countless ways to make it
wrong.  I don't think it's reasonable to ever expect it to be right given
how our VFS works.  Nothing says, for example, that you can't mount from
within a chroot() -- I do it all the time in at least one diskless
environment I work with. 
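
To make the mechanism concrete, here is a toy model in Python.  It is
not kernel code: the class and method names (ToyVFS, mount, rename, df)
are invented for illustration.  It only shows how a string recorded at
mount time drifts away from where the covered vnode can be reached.

```python
# Toy model only.  ToyVFS and its methods are hypothetical names and do
# not correspond to any kernel interface.

class ToyVFS:
    def __init__(self):
        self.paths = {"/": 0}   # live namespace: path -> vnode id
        self.mounts = []        # (device, covered vnode id, cached path string)
        self._next_vnode = 1

    def mkdir(self, path):
        self.paths[path] = self._next_vnode
        self._next_vnode += 1

    def mount(self, device, path):
        # The lookup resolves the path to a vnode; the string itself is
        # kept only as a label and is never updated afterwards.
        self.mounts.append((device, self.paths[path], path))

    def rename(self, old, new):
        # Renaming a directory changes every path at or below it,
        # while the vnodes stay the same.
        moved = {}
        for path, vnode in self.paths.items():
            if path == old or path.startswith(old + "/"):
                moved[new + path[len(old):]] = vnode
            else:
                moved[path] = vnode
        self.paths = moved

    def df(self):
        # Report the cached string, as df does.
        return [(device, cached) for device, _vnode, cached in self.mounts]

    def current_path(self, vnode):
        return next(p for p, v in self.paths.items() if v == vnode)


vfs = ToyVFS()
for d in ("/mnt", "/mnt/this", "/mnt/this/is",
          "/mnt/this/is/one", "/mnt/this/is/one/path"):
    vfs.mkdir(d)
vfs.mount("/dev/ad0s1e", "/mnt/this/is/one/path")
vfs.rename("/mnt/this/is", "/mnt/this/was")

device, cached = vfs.df()[0]
covered = vfs.mounts[0][1]
print(cached)                     # /mnt/this/is/one/path
print(vfs.current_path(covered))  # /mnt/this/was/one/path
```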

> There are several ways to fix this, and I've come here for opinions. 
> 
>   1) One can not use statfs(2) for '-t,' but stick with
>      getmntinfo(3)'s info only. But it makes some sense to stick with
>      statfs(2) for file arguments provided to df(1). This is fairly
>      easy to implement.

It depends what you want df to do.  Typically, df and mount report the
path where the filesystem was mounted.  That's what you get with statfs(). 
In fact, I know a number of people who abuse df to figure out what
filesystem a directory is actually in.  Viz.

  cd /usr/obj
  df .

As pointed out, this can be wrong in a variety of situations.
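
For comparison, a sketch of answering the same question without the
cached mount name: walk up from the directory until the device number
changes.  This is an illustration in Python, not what df does, and the
function name filesystem_root is made up.  It assumes distinct
filesystems report distinct st_dev values, which may not hold for every
stacking or loopback filesystem.

```python
import os

def filesystem_root(path):
    """Return the topmost ancestor of path that has the same st_dev."""
    path = os.path.realpath(path)
    dev = os.stat(path).st_dev
    while path != os.sep:
        parent = os.path.dirname(path)
        if os.stat(parent).st_dev != dev:
            break               # parent is on another filesystem
        path = parent
    return path

print(filesystem_root("."))
```

The result is a path as the calling process sees it right now, so it is
already relative to any chroot() in effect, unlike the string recorded
at mount time.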

>   2) One can remove the ability to use df(1) at all in a jail(8). It
>      could be argued that there is no real reason to be able to use
>      things like getfsstat(2) or statfs(2) in a jail(8) (but what else
>      might this break?). This is easy to do.

statfs() is used by several applications that you wouldn't expect.  Try
turning off statfs() sometime and see what happens.  Applications like to
query the block size, check for free space, etc.
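
A typical query of that kind, sketched in Python.  os.statvfs() wraps
statvfs(3); to my understanding that is layered over statfs(2) on
FreeBSD, but treat that as an assumption.  The function name
space_report is invented for the example.

```python
import os

def space_report(path="."):
    st = os.statvfs(path)
    return {
        # Fundamental block size in bytes.
        "block_size": st.f_frsize,
        # Space available to unprivileged users, in bytes.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Total size of the filesystem, in bytes.
        "total_bytes": st.f_blocks * st.f_frsize,
    }

print(space_report("."))
```

An installer or editor that checks free space before writing does
something like this, which is why turning the call off in a jail breaks
programs that have nothing to do with df.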

>   3) One can fix getfsstat(2) and statfs(2) so they are
>      "jail(8)-aware." That is, getfsstat(2) knows only to return info
>      on filesystems mounted at or above the jail(8)'s root. Both calls
>      learn how to offset their mountpoint names to the jail root. This
>      is harder.

This is effectively impossible, or at least, very difficult.  Suppose I
construct a file system hierarchy like so:

  mount -t procfs proc /proc
  mount /dev/ad0s1e /usr
  mount /dev/ad1s1a /jailroot
  mount /dev/ad1s1e /jailroot/var
  mount /dev/ad1s1e /jailroot/usr
  mount -t procfs proc /jailroot/proc
  mount -t devfs devfs /jailroot/dev

For each filesystem in jail, how do I determine which to report?  Now
imagine some of those were mounted under various chroot()'s as the system
booted.  Imagine they have different blocksizes, etc, etc.
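
A sketch of the obvious approach, prefix matching on the recorded mount
names, and of how it falls over.  This is illustrative Python, not a
proposed patch; visible_by_prefix is a made-up name, and the second
list assumes that a mount made from inside a chroot() records the name
as the chrooted process gave it.

```python
def visible_by_prefix(recorded_names, jail_root):
    """Naive filter: keep names under jail_root and strip the prefix."""
    prefix = jail_root.rstrip("/")
    visible = []
    for name in recorded_names:
        if name == prefix:
            visible.append("/")
        elif name.startswith(prefix + "/"):
            visible.append(name[len(prefix):])
    return visible

# Names as recorded when everything is mounted from the host.
host_view = ["/proc", "/usr", "/jailroot", "/jailroot/var",
             "/jailroot/usr", "/jailroot/proc", "/jailroot/dev"]

# Same layout, but everything below /jailroot was mounted by a process
# already chroot()ed there (assumed recording behaviour, see above).
chroot_view = ["/proc", "/usr", "/jailroot", "/var",
               "/usr", "/proc", "/dev"]

from_host = visible_by_prefix(host_view, "/jailroot")
from_chroot = visible_by_prefix(chroot_view, "/jailroot")
print(from_host)    # ['/', '/var', '/usr', '/proc', '/dev']
print(from_chroot)  # ['/']
```

In the second case the filter loses four filesystems that are in the
jail, and nothing in the strings distinguishes the host's /usr from the
jail's /usr.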

Interestingly, the trustedbsd_mac tree does permit filtering of
getmntinfo() and statfs(), and accepts the application breakage, but it
relies on policies that have a much more well-defined notion of what
should be visible.  For example, it knows how to block display of
information in df based on confidentiality labels: if a filesystem was
mounted by a high-confidentiality process, a low confidentiality process
won't be able to statfs it, and won't be able to see it in the list of
mounted filesystems.  But that is possible because the decision is based on
a tuple with well-defined parameters: (process sensitivity label,
filesystem sensitivity label).  With jail, you can't "just look" at the
struct mount and trivially determine if it's visible to a jail or not,
because jail visibility is subject to the whims of VFS at any given
moment.  Filesystems are welcome to spit back vnodes you wouldn't expect
-- for example, for the longest time procfs would return the actual vnode
of the executable run most recently by the process.  It didn't return a
procfs vnode, but a real reference to p_vtextvp.
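
The label-based check is easy to sketch because both inputs are known
at decision time.  The following Python is a toy with two levels and
invented names (LEVELS, may_see, filter_mounts, and the /secret mount);
it is not the trustedbsd_mac interface.

```python
# Two confidentiality levels, ordered.  Purely illustrative.
LEVELS = {"low": 0, "high": 1}

def may_see(process_label, filesystem_label):
    # The decision depends only on the tuple
    # (process sensitivity label, filesystem sensitivity label).
    return LEVELS[process_label] >= LEVELS[filesystem_label]

def filter_mounts(process_label, mounts):
    return [name for name, label in mounts if may_see(process_label, label)]

# Hypothetical mount list: (mount name, label of the mounting process).
mounts = [("/", "low"), ("/usr", "low"), ("/secret", "high")]

print(filter_mounts("low", mounts))   # ['/', '/usr']
print(filter_mounts("high", mounts))  # ['/', '/usr', '/secret']
```

There is no equivalent two-argument function for jail, because whether
a mount is reachable from the jail root is not a stored property of the
mount.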

> And is anyone already working on this?

There are some patches in the PR collection.  I've generally objected to
them.  The best hack I've seen so far simply restricts statfs() and
getmntinfo() operations to returning information only on the filesystem
matching the root directory of the current process.  Sure, this is broken
for /proc or /dev in jail, among other things, but it's a simple hack and
unbreaks many applications.  It may be there are cool solutions that we
haven't thought of yet that avoid the properties I've described above.
However, I'd appreciate the chance to review any changes you come up with
before you commit them, so I can look out for the usual complications.
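
A userland approximation of that hack, to show its shape and its
limits.  This is a Python sketch, not the patch from the PR database;
statvfs_if_root_fs is a made-up name, and comparing st_dev stands in
for "same filesystem as the process root".

```python
import os

def statvfs_if_root_fs(path, root="/"):
    """Answer only for paths on the same filesystem as root."""
    if os.stat(path).st_dev != os.stat(root).st_dev:
        # Anything on another filesystem (e.g. /proc or /dev mounted
        # inside the jail) is refused, which is the known breakage.
        raise PermissionError("not on the root filesystem: " + path)
    return os.statvfs(path)

print(statvfs_if_root_fs("/").f_frsize)
```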

Just as a suggestion when experimenting with this: generate some
statistics on how much getmntinfo() and statfs() are invoked by your
system over the course of a 24 hour period of normal system use.
Instrument both system calls so that you get printfs to /dev/log
indicating when they were called, and print out p_comm.  I think you'll be
surprised by the result, especially if you run windowing systems (or ls). 
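
Once the printfs are in place, tallying them is trivial.  The sketch
below is Python and assumes a hypothetical log line format of
"<syscall> <p_comm>"; adjust the parsing to whatever your printf
actually emits.  The sample lines are invented to exercise the code and
are not measurements.

```python
from collections import Counter

def tally(lines):
    """Count calls per (p_comm, syscall) from "<syscall> <p_comm>" lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("statfs", "getfsstat"):
            counts[(parts[1], parts[0])] += 1
    return counts

# Made-up sample input in the assumed format.
sample = ["statfs ls", "statfs ls", "getfsstat df", "unrelated line here"]
counts = tally(sample)
for (comm, syscall), n in counts.most_common():
    print(comm, syscall, n)
```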

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

