Date:      Sat, 09 May 2009 00:08:35 -0600
From:      Jamie Gritton <jamie@FreeBSD.org>
To:        jail@FreeBSD.org, virtualization@FreeBSD.org
Subject:   Hierarchical jails
Message-ID:  <4A051DE3.30705@FreeBSD.org>

--------------080507020907010909080502
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Here's the first round of hierarchical jails under the new framework.

Instead of creds having either a prison or a NULL pointer, they all have
a prison pointer, with the default being the global "prison0" that
contains information about the real environment.  Jailed root may (if
granted permission) create prisons below its own place in the
hierarchy, but may not alter (or even see) prisons at its own level or
above.
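To illustrate the visibility rule, here is a minimal sketch in C of a
parent-pointer hierarchy.  It assumes a stripped-down struct prison with
only an ID and a parent pointer (the real structure has many more
fields), and the helper prison_is_descendant() is hypothetical, written
in the spirit of the patch's prison_ischild() check rather than copied
from it: a jail may see or alter another jail only if that jail lies
strictly below it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of the prison hierarchy: every jail
 * points at its parent, and prison0 (the real system) has no parent.
 */
struct prison {
	int		 pr_id;		/* flat, system-wide unique ID */
	struct prison	*pr_parent;	/* NULL only for prison0 */
};

/*
 * Return 1 if pr2 lies strictly below pr1 in the hierarchy, else 0.
 * A jail is not its own descendant, so it cannot see itself, its
 * siblings, or its ancestors.
 */
static int
prison_is_descendant(const struct prison *pr1, const struct prison *pr2)
{
	assert(pr1 != NULL && pr2 != NULL);
	/* Walk upward from pr2's parent until we hit pr1 or fall off prison0. */
	for (pr2 = pr2->pr_parent; pr2 != NULL; pr2 = pr2->pr_parent)
		if (pr2 == pr1)
			return (1);
	return (0);
}
```

Since the JID space is flat, pr_id plays no part in the check; only the
parent links matter.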

The JID space is flat, i.e. every prison in the system has a unique ID.
The prison name space is hierarchical, with jails having dot-separated
component names.
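A minimal sketch of how such a dotted name can be taken apart, mirroring
the strrchr() split the patch performs in kern_jail.c: everything before
the last dot names the parent, and the last component is the new jail's
own name.  The function jail_name_split() is hypothetical; like the
patch, it modifies the buffer in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "web.db.cache" into parent "web.db" and child "cache".
 * A name without a dot has no parent part; it is then taken relative
 * to the caller's own jail.
 */
static const char *
jail_name_split(char *name, const char **parentp)
{
	char *p;

	assert(name != NULL && parentp != NULL);
	p = strrchr(name, '.');
	if (p == NULL) {
		*parentp = NULL;	/* no dot: no explicit parent */
		return (name);
	}
	*p = '\0';			/* terminate the parent part */
	*parentp = name;
	return (p + 1);			/* last component: the jail's own name */
}
```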

prison0 contains three fields that used to be system globals: pr_root,
pr_host, and pr_securelevel.  I've kept the globals rootvnode and
hostname, and take care that whenever either a global or its prison0
counterpart is changed, the other changes too (not yet true for
hostname; read on).  But I've removed the global securelevel entirely,
forcing callers to use securelevel_gt() and securelevel_ge() (or, in
very rare cases, to check prison0.pr_securelevel directly).  I did that
because while using the global rootvnode or hostname may be incorrect,
using the wrong securelevel is, well, insecure.  Actually it would be
insecure to use the wrong rootvnode too, but I'm not convinced removing
that global is worth the headache.
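A sketch of what the securelevel helpers check, assuming simplified
stand-in structures (not the kernel's definitions) and hypothetical
*_sketch names.  By the usual FreeBSD convention these helpers return
EPERM when the test is true: securelevel_gt() when the securelevel
exceeds the given level, securelevel_ge() when it is at least that
level.  Because children are always kept at least as restricted as
their ancestors, looking at the caller's own prison (prison0 for
unjailed processes) should be enough.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct prison { int pr_securelevel; };
struct ucred { struct prison *cr_prison; };

/* EPERM if the caller's securelevel is greater than "level", else 0. */
static int
securelevel_gt_sketch(const struct ucred *cr, int level)
{
	assert(cr != NULL && cr->cr_prison != NULL);
	return (cr->cr_prison->pr_securelevel > level ? EPERM : 0);
}

/* EPERM if the caller's securelevel is at least "level", else 0. */
static int
securelevel_ge_sketch(const struct ucred *cr, int level)
{
	assert(cr != NULL && cr->cr_prison != NULL);
	return (cr->cr_prison->pr_securelevel >= level ? EPERM : 0);
}
```

Every credential carries a prison pointer, so no NULL-prison special
case is needed.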

Other globals are subsumed into prison0, but they were only ever part of
the jail system anyway: the various jail-related permission bits and
such administrative things as prisoncount.
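Because prison0 now carries the full set of permission bits
(PR_ALLOW_ALL in the patch), a top-level jail can be granted any of
them, while a deeper jail is limited by its parent.  Here is a minimal
sketch of that check, in the spirit of the patch's
(PR_ALLOW_ALL & pr_flags & ~ppr->pr_flags) test; the SK_ALLOW_* values
and check_allow_flags() are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the patch defines its own PR_ALLOW_* bits. */
#define	SK_ALLOW_SET_HOSTNAME	0x01
#define	SK_ALLOW_SYSVIPC	0x02
#define	SK_ALLOW_RAW_SOCKETS	0x04
#define	SK_ALLOW_JAILS		0x08
#define	SK_ALLOW_ALL		0x0f

/*
 * A child may be granted only permission bits its parent already
 * holds: any requested bit missing from the parent's flags is EPERM.
 */
static int
check_allow_flags(unsigned parent_flags, unsigned requested)
{
	assert((requested & ~SK_ALLOW_ALL) == 0);
	return ((requested & ~parent_flags) != 0 ? EPERM : 0);
}
```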

The prison hierarchy keeps track of restrictions placed on prisons, and
will reflect them downward so a child jail is always at least as
restricted as its ancestors.  It doesn't go the other way though: if a
prison's restrictions are loosened, the children stay as they are.
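The one-way propagation can be sketched as a recursive walk, assuming a
hypothetical first-child/next-sibling node (the patch itself uses a
pr_children list and FOREACH_PRISON_DESCENDANT_LOCKED).  Setting a
jail's securelevel raises every descendant that is below the new value,
but never lowers anyone, so loosening a parent leaves its children
untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified prison node with a child/sibling list. */
struct prison {
	int		 pr_securelevel;
	struct prison	*pr_child;	/* first child, or NULL */
	struct prison	*pr_sibling;	/* next sibling, or NULL */
};

/* Raise every descendant of pr to at least "level"; never lower one. */
static void
restrict_descendants(struct prison *pr, int level)
{
	struct prison *tpr;

	for (tpr = pr->pr_child; tpr != NULL; tpr = tpr->pr_sibling) {
		if (tpr->pr_securelevel < level)
			tpr->pr_securelevel = level;
		restrict_descendants(tpr, level);
	}
}

/* Set a jail's securelevel; children are only ever raised, not lowered. */
static void
set_securelevel(struct prison *pr, int level)
{
	assert(pr != NULL);
	pr->pr_securelevel = level;
	restrict_descendants(pr, level);
}
```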

This patch doesn't include anything for userland, and hierarchical jails
won't work without the userland changes (because jails don't have
permission to create sub-jails by default, and jail(2) can't grant that
permission).  A userland patch will follow soon, very similar to the
version I posted here recently.

- Jamie


--------------080507020907010909080502
Content-Type: text/plain;
 name="jh.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="jh.diff"

Index: lib/libc/sys/jail.2
===================================================================
--- lib/libc/sys/jail.2	(revision 191896)
+++ lib/libc/sys/jail.2	(working copy)
@@ -25,7 +25,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd April 29, 2009
+.Dd May 8, 2009
 .Dt JAIL 2
 .Os
 .Sh NAME
@@ -283,7 +283,7 @@
 It is possible to identify a process as jailed by examining
 .Dq Li /proc/<pid>/status :
 it will show a field near the end of the line, either as
-a single hyphen for a process at large, or the hostname currently
+a single hyphen for a process at large, or the name currently
 set for the prison for jailed processes.
 .Sh ERRORS
 The
@@ -292,7 +292,10 @@
 will fail if:
 .Bl -tag -width Er
 .It Bq Er EPERM
-This process is not allowed to create a jail.
+This process is not allowed to create a jail, either because it is not
+the super-user, or the
+.Va security.jail.allow_jails
+sysctl MIB is not set.
 .It Bq Er EFAULT
 .Fa jail
 points to an address outside the allocated address space of the process.
@@ -308,7 +311,10 @@
 will fail if:
 .Bl -tag -width Er
 .It Bq Er EPERM
-This process is not allowed to create a jail.
+This process is not allowed to create a jail, either because it is not
+the super-user, or the
+.Va security.jail.allow_jails
+sysctl MIB is not set.
 .It Bq Er EPERM
 A jail parameter was set to a less restrictive value then the current
 environment.
@@ -429,4 +435,4 @@
 who contributed it to
 .Fx .
 .An James Gritton
-added the extensible jail parameters.
+added the extensible jail parameters and hierarchical jails.
Index: sys/ufs/ufs/ufs_vnops.c
===================================================================
--- sys/ufs/ufs/ufs_vnops.c	(revision 191896)
+++ sys/ufs/ufs/ufs_vnops.c	(working copy)
@@ -61,7 +61,6 @@
 #include <sys/lockf.h>
 #include <sys/conf.h>
 #include <sys/acl.h>
-#include <sys/jail.h>
 
 #include <machine/mutex.h>
 
Index: sys/kern/kern_jail.c
===================================================================
--- sys/kern/kern_jail.c	(revision 191896)
+++ sys/kern/kern_jail.c	(working copy)
@@ -41,6 +41,7 @@
 #include <sys/errno.h>
 #include <sys/sysproto.h>
 #include <sys/malloc.h>
+#include <sys/osd.h>
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/taskqueue.h>
@@ -48,7 +49,6 @@
 #include <sys/jail.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
-#include <sys/osd.h>
 #include <sys/sx.h>
 #include <sys/namei.h>
 #include <sys/mount.h>
@@ -74,61 +74,43 @@
 SYSCTL_NODE(_security, OID_AUTO, jail, CTLFLAG_RW, 0,
     "Jail rules");
 
-int	jail_set_hostname_allowed = 1;
-SYSCTL_INT(_security_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW,
-    &jail_set_hostname_allowed, 0,
-    "Processes in jail can set their hostnames");
+/* prison0 describes what is "real" about the system. */
+struct prison prison0 = {
+	.pr_id		= 0,
+	.pr_name	= "0",
+	.pr_ref		= 1,
+	.pr_uref	= 1,
+	.pr_path	= "/",
+	.pr_securelevel	= -1,
+	.pr_children	= LIST_HEAD_INITIALIZER(&prison0.pr_children),
+	.pr_flags	= PR_ALLOW_ALL,
+	.pr_def_perms	= PR_ALLOW_SET_HOSTNAME |
+			  PR_RESTRICT_SOCKET_UNIXIPROUTE,
+	.pr_def_enforce_statfs = 2,
+#if defined(INET) || defined(INET6)
+	.pr_def_max_af_ips = 255,
+#endif
+};
+MTX_SYSINIT(prison0, &prison0.pr_mtx, "jail mutex", MTX_DEF);
 
-int	jail_socket_unixiproute_only = 1;
-SYSCTL_INT(_security_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW,
-    &jail_socket_unixiproute_only, 0,
-    "Processes in jail are limited to creating UNIX/IP/route sockets only");
-
-int	jail_sysvipc_allowed = 0;
-SYSCTL_INT(_security_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW,
-    &jail_sysvipc_allowed, 0,
-    "Processes in jail can use System V IPC primitives");
-
-static int jail_enforce_statfs = 2;
-SYSCTL_INT(_security_jail, OID_AUTO, enforce_statfs, CTLFLAG_RW,
-    &jail_enforce_statfs, 0,
-    "Processes in jail cannot see all mounted file systems");
-
-int	jail_allow_raw_sockets = 0;
-SYSCTL_INT(_security_jail, OID_AUTO, allow_raw_sockets, CTLFLAG_RW,
-    &jail_allow_raw_sockets, 0,
-    "Prison root can create raw sockets");
-
-int	jail_chflags_allowed = 0;
-SYSCTL_INT(_security_jail, OID_AUTO, chflags_allowed, CTLFLAG_RW,
-    &jail_chflags_allowed, 0,
-    "Processes in jail can alter system file flags");
-
-int	jail_mount_allowed = 0;
-SYSCTL_INT(_security_jail, OID_AUTO, mount_allowed, CTLFLAG_RW,
-    &jail_mount_allowed, 0,
-    "Processes in jail can mount/unmount jail-friendly file systems");
-
-int	jail_max_af_ips = 255;
-SYSCTL_INT(_security_jail, OID_AUTO, jail_max_af_ips, CTLFLAG_RW,
-    &jail_max_af_ips, 0,
-    "Number of IP addresses a jail may have at most per address family");
-
-/* allprison, lastprid, and prisoncount are protected by allprison_lock. */
+/* allprison and lastprid are protected by allprison_lock. */
 struct	sx allprison_lock;
 SX_SYSINIT(allprison_lock, &allprison_lock, "allprison");
 struct	prisonlist allprison = TAILQ_HEAD_INITIALIZER(allprison);
 int	lastprid = 0;
-int	prisoncount = 0;
 
 static int do_jail_attach(struct thread *td, struct prison *pr);
 static void prison_complete(void *context, int pending);
 static void prison_deref(struct prison *pr, int flags);
+static char *prison_path(struct prison *pr1, struct prison *pr2);
+static void prison_remove1(struct prison *pr);
 #ifdef INET
 static int _prison_check_ip4(struct prison *pr, struct in_addr *ia);
+static int prison_restrict_ip4(struct prison *pr, struct in_addr *newip4);
 #endif
 #ifdef INET6
 static int _prison_check_ip6(struct prison *pr, struct in6_addr *ia6);
+static int prison_restrict_ip6(struct prison *pr, struct in6_addr *newip6);
 #endif
 static int sysctl_jail_list(SYSCTL_HANDLER_ARGS);
 
@@ -139,7 +121,46 @@
 #define	PD_LIST_SLOCKED	0x08
 #define	PD_LIST_XLOCKED	0x10
 
+/*
+ * Parameter names corresponding to PR_* flag values
+ */
+static char *pr_flag_names[] = {
+	[0] = "persist",
 #ifdef INET
+	[2] = "ipv4",
+#endif
+#ifdef INET6
+	[3] = "ipv6",
+#endif
+	[16] = "perm.set_hostname_allowed",
+	"perm.sysvipc_allowed",
+	"perm.allow_raw_sockets",
+	"perm.chflags_allowed",
+	"perm.mount_allowed",
+	"perm.allow_quotas",
+	"perm.allow_jails",
+	"perm.socket_unixiproute_only",
+};
+
+static char *pr_flag_nonames[] = {
+	[0] = "nopersist",
+#ifdef INET
+	[2] = "noipv4",
+#endif
+#ifdef INET6
+	[3] = "noipv6",
+#endif
+	[16] = "perm.noset_hostname_allowed",
+	"perm.nosysvipc_allowed",
+	"perm.noallow_raw_sockets",
+	"perm.nochflags_allowed",
+	"perm.nomount_allowed",
+	"perm.noallow_quotas",
+	"perm.noallow_jails",
+	"perm.nosocket_unixiproute_only",
+};
+
+#ifdef INET
 static int
 qcmp_v4(const void *ip1, const void *ip2)
 {
@@ -277,7 +298,7 @@
 			return (error);
 		tmplen = MAXPATHLEN + MAXHOSTNAMELEN + MAXHOSTNAMELEN;
 #ifdef INET
-		if (j.ip4s > jail_max_af_ips)
+		if (j.ip4s > td->td_ucred->cr_prison->pr_max_af_ips)
 			return (EINVAL);
 		tmplen += j.ip4s * sizeof(struct in_addr);
 #else
@@ -285,7 +306,7 @@
 			return (EINVAL);
 #endif
 #ifdef INET6
-		if (j.ip6s > jail_max_af_ips)
+		if (j.ip6s > td->td_ucred->cr_prison->pr_max_af_ips)
 			return (EINVAL);
 		tmplen += j.ip6s * sizeof(struct in6_addr);
 #else
@@ -420,23 +441,24 @@
 #endif
 	struct vfsopt *opt;
 	struct vfsoptlist *opts;
-	struct prison *pr, *deadpr, *tpr;
+	struct prison *pr, *deadpr, *mypr, *ppr, *tpr;
 	struct vnode *root;
 	char *errmsg, *host, *name, *p, *path;
 	void *op;
-	int created, cuflags, error, errmsg_len, errmsg_pos;
-	int gotslevel, jid, len;
+	size_t namelen, onamelen;
+	int created, cuflags, descend, enforce, error, errmsg_len, errmsg_pos;
+	int gotenforce, gotslevel, fi, jid, len;
 	int slevel, vfslocked;
 #if defined(INET) || defined(INET6)
-	int ii;
+	int ii, ij, gotmaxips, maxips;
 #endif
 #ifdef INET
-	int ip4s;
+	int ip4s, ip4a, redo_ip4;
 #endif
 #ifdef INET6
-	int ip6s;
+	int ip6s, ip6a, redo_ip6;
 #endif
-	unsigned pr_flags, ch_flags;
+	unsigned pr_flags, ch_flags, tflags;
 	char numbuf[12];
 
 	error = priv_check(td, PRIV_JAIL_SET);
@@ -444,6 +466,9 @@
 		error = priv_check(td, PRIV_JAIL_ATTACH);
 	if (error)
 		return (error);
+	mypr = ppr = td->td_ucred->cr_prison;
+	if ((flags & JAIL_CREATE) && !(mypr->pr_flags & PR_ALLOW_JAILS))
+		return (EPERM);
 	if (flags & ~JAIL_SET_MASK)
 		return (EINVAL);
 
@@ -461,12 +486,15 @@
 	if (error)
 		return (error);
 #ifdef INET
+	ip4a = 0;
 	ip4 = NULL;
 #endif
 #ifdef INET6
+	ip6a = 0;
 	ip6 = NULL;
 #endif
 
+ again:
 	error = vfs_copyopt(opts, "jid", &jid, sizeof(jid));
 	if (error == ENOENT)
 		jid = 0;
@@ -481,9 +509,33 @@
 	else
 		gotslevel = 1;
 
+	error = vfs_copyopt(opts, "perm.enforce_statfs", &enforce,
+	    sizeof(enforce));
+	gotenforce = error == 0;
+	if (gotenforce) {
+		if (enforce < 0 || enforce > 2)
+			return (EINVAL);
+	} else if (error != ENOENT)
+		goto done_free;
+
+#if defined(INET) || defined(INET6)
+	error = vfs_copyopt(opts, "perm.max_af_ips", &maxips, sizeof(maxips));
+	gotmaxips = error == 0;
+	if (gotmaxips) {
+		if (maxips < 1)
+			return (EINVAL);
+	} else if (error != ENOENT)
+		goto done_free;
+#endif
+
 	pr_flags = ch_flags = 0;
-	vfs_flagopt(opts, "persist", &pr_flags, PR_PERSIST);
-	vfs_flagopt(opts, "nopersist", &ch_flags, PR_PERSIST);
+	for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]);
+	    fi++) {
+		if (pr_flag_names[fi] == NULL)
+			continue;
+		vfs_flagopt(opts, pr_flag_names[fi], &pr_flags, 1 << fi);
+		vfs_flagopt(opts, pr_flag_nonames[fi], &ch_flags, 1 << fi);
+	}
 	ch_flags |= pr_flags;
 	if ((flags & (JAIL_CREATE | JAIL_UPDATE | JAIL_ATTACH)) == JAIL_CREATE
 	    && !(pr_flags & PR_PERSIST)) {
@@ -524,6 +576,7 @@
 		}
 	}
 
+	/* This might be the second time around for this option. */
 #ifdef INET
 	error = vfs_getopt(opts, "ip4.addr", &op, &ip4s);
 	if (error == ENOENT)
@@ -533,43 +586,54 @@
 	else if (ip4s & (sizeof(*ip4) - 1)) {
 		error = EINVAL;
 		goto done_free;
-	} else if (ip4s > 0) {
-		ip4s /= sizeof(*ip4);
-		if (ip4s > jail_max_af_ips) {
-			error = EINVAL;
-			vfs_opterror(opts, "too many IPv4 addresses");
-			goto done_errmsg;
-		}
-		ip4 = malloc(ip4s * sizeof(*ip4), M_PRISON, M_WAITOK);
-		bcopy(op, ip4, ip4s * sizeof(*ip4));
-		/*
-		 * IP addresses are all sorted but ip[0] to preserve the
-		 * primary IP address as given from userland.  This special IP
-		 * is used for unbound outgoing connections as well for
-		 * "loopback" traffic.
-		 */
-		if (ip4s > 1)
-			qsort(ip4 + 1, ip4s - 1, sizeof(*ip4), qcmp_v4);
-		/*
-		 * Check for duplicate addresses and do some simple zero and
-		 * broadcast checks. If users give other bogus addresses it is
-		 * their problem.
-		 *
-		 * We do not have to care about byte order for these checks so
-		 * we will do them in NBO.
-		 */
-		for (ii = 0; ii < ip4s; ii++) {
-			if (ip4[ii].s_addr == INADDR_ANY ||
-			    ip4[ii].s_addr == INADDR_BROADCAST) {
+	} else {
+		ch_flags |= PR_IP4_USER;
+		pr_flags |= PR_IP4_USER;
+		if (ip4s > 0) {
+			ip4s /= sizeof(*ip4);
+			if (gotmaxips && ip4s > maxips) {
 				error = EINVAL;
-				goto done_free;
+				vfs_opterror(opts, "too many IPv4 addresses");
+				goto done_errmsg;
 			}
-			if ((ii+1) < ip4s &&
-			    (ip4[0].s_addr == ip4[ii+1].s_addr ||
-			     ip4[ii].s_addr == ip4[ii+1].s_addr)) {
-				error = EINVAL;
-				goto done_free;
+			if (ip4a < ip4s) {
+				ip4a = ip4s;
+				free(ip4, M_PRISON);
+				ip4 = NULL;
 			}
+			if (ip4 == NULL)
+				ip4 = malloc(ip4a * sizeof(*ip4), M_PRISON,
+				    M_WAITOK);
+			bcopy(op, ip4, ip4s * sizeof(*ip4));
+			/*
+			 * IP addresses are all sorted but ip[0] to preserve
+			 * the primary IP address as given from userland.
+			 * This special IP is used for unbound outgoing
+			 * connections as well for "loopback" traffic.
+			 */
+			if (ip4s > 1)
+				qsort(ip4 + 1, ip4s - 1, sizeof(*ip4), qcmp_v4);
+			/*
+			 * Check for duplicate addresses and do some simple
+			 * zero and broadcast checks. If users give other bogus
+			 * addresses it is their problem.
+			 *
+			 * We do not have to care about byte order for these
+			 * checks so we will do them in NBO.
+			 */
+			for (ii = 0; ii < ip4s; ii++) {
+				if (ip4[ii].s_addr == INADDR_ANY ||
+				    ip4[ii].s_addr == INADDR_BROADCAST) {
+					error = EINVAL;
+					goto done_free;
+				}
+				if ((ii+1) < ip4s &&
+				    (ip4[0].s_addr == ip4[ii+1].s_addr ||
+				     ip4[ii].s_addr == ip4[ii+1].s_addr)) {
+					error = EINVAL;
+					goto done_free;
+				}
+			}
 		}
 	}
 #endif
@@ -583,29 +647,40 @@
 	else if (ip6s & (sizeof(*ip6) - 1)) {
 		error = EINVAL;
 		goto done_free;
-	} else if (ip6s > 0) {
-		ip6s /= sizeof(*ip6);
-		if (ip6s > jail_max_af_ips) {
-			error = EINVAL;
-			vfs_opterror(opts, "too many IPv6 addresses");
-			goto done_errmsg;
-		}
-		ip6 = malloc(ip6s * sizeof(*ip6), M_PRISON, M_WAITOK);
-		bcopy(op, ip6, ip6s * sizeof(*ip6));
-		if (ip6s > 1)
-			qsort(ip6 + 1, ip6s - 1, sizeof(*ip6), qcmp_v6);
-		for (ii = 0; ii < ip6s; ii++) {
-			if (IN6_IS_ADDR_UNSPECIFIED(&ip6[0])) {
+	} else {
+		ch_flags |= PR_IP6_USER;
+		pr_flags |= PR_IP6_USER;
+		if (ip6s > 0) {
+			ip6s /= sizeof(*ip6);
+			if (gotmaxips && ip6s > maxips) {
 				error = EINVAL;
-				goto done_free;
+				vfs_opterror(opts, "too many IPv6 addresses");
+				goto done_errmsg;
 			}
-			if ((ii+1) < ip6s &&
-			    (IN6_ARE_ADDR_EQUAL(&ip6[0], &ip6[ii+1]) ||
-			     IN6_ARE_ADDR_EQUAL(&ip6[ii], &ip6[ii+1])))
-			{
-				error = EINVAL;
-				goto done_free;
+			if (ip6a < ip6s) {
+				ip6a = ip6s;
+				free(ip6, M_PRISON);
+				ip6 = NULL;
 			}
+			if (ip6 == NULL)
+				ip6 = malloc(ip6a * sizeof(*ip6), M_PRISON,
+				    M_WAITOK);
+			bcopy(op, ip6, ip6s * sizeof(*ip6));
+			if (ip6s > 1)
+				qsort(ip6 + 1, ip6s - 1, sizeof(*ip6), qcmp_v6);
+			for (ii = 0; ii < ip6s; ii++) {
+				if (IN6_IS_ADDR_UNSPECIFIED(&ip6[0])) {
+					error = EINVAL;
+					goto done_free;
+				}
+				if ((ii+1) < ip6s &&
+				    (IN6_ARE_ADDR_EQUAL(&ip6[0], &ip6[ii+1]) ||
+				     IN6_ARE_ADDR_EQUAL(&ip6[ii], &ip6[ii+1])))
+				{
+					error = EINVAL;
+					goto done_free;
+				}
+			}
 		}
 	}
 #endif
@@ -627,13 +702,15 @@
 			error = EINVAL;
 			goto done_free;
 		}
-		if (len > MAXPATHLEN) {
-			error = ENAMETOOLONG;
-			goto done_free;
-		}
 		if (len < 2 || (len == 2 && path[0] == '/'))
 			path = NULL;
 		else {
+			/* Leave room for a real-root full pathname. */
+			if (len + (path[0] == '/' && strcmp(mypr->pr_path, "/")
+			    ? strlen(mypr->pr_path) : 0) > MAXPATHLEN) {
+				error = ENAMETOOLONG;
+				goto done_free;
+			}
 			NDINIT(&nd, LOOKUP, MPSAFE | FOLLOW, UIO_SYSSPACE,
 			    path, td);
 			error = namei(&nd);
@@ -683,7 +760,13 @@
 	}
 	pr = NULL;
 	if (jid != 0) {
-		/* See if a requested jid already exists. */
+		/*
+		 * See if a requested jid already exists.  There is an
+		 * information leak here if the jid exists but is not within
+		 * the caller's jail hierarchy.  Jail creators will get EEXIST
+		 * even though they cannot see the jail, and CREATE | UPDATE
+		 * will return ENOENT which is not normally a valid error.
+		 */
 		if (jid < 0) {
 			error = EINVAL;
 			vfs_opterror(opts, "negative jid");
@@ -691,6 +774,7 @@
 		}
 		pr = prison_find(jid);
 		if (pr != NULL) {
+			ppr = pr->pr_parent;
 			/* Create: jid must not exist. */
 			if (cuflags == JAIL_CREATE) {
 				mtx_unlock(&pr->pr_mtx);
@@ -699,7 +783,10 @@
 				    jid);
 				goto done_unlock_list;
 			}
-			if (pr->pr_uref == 0) {
+			if (!prison_ischild(mypr, pr)) {
+				mtx_unlock(&pr->pr_mtx);
+				pr = NULL;
+			} else if (pr->pr_uref == 0) {
 				if (!(flags & JAIL_DYING)) {
 					mtx_unlock(&pr->pr_mtx);
 					error = ENOENT;
@@ -717,7 +804,7 @@
 					 * name.
 					 */
 					if (name == NULL)
-						name = pr->pr_name;
+						name = prison_name(mypr, pr);
 				}
 			}
 		}
@@ -738,12 +825,42 @@
 	 * because that is the jail being updated).
 	 */
 	if (name != NULL) {
+		p = strrchr(name, '.');
+		if (p != NULL) {
+			/*
+			 * This is a hierarchical name.  Split it into the
+			 * parent and child names, and make sure the parent
+			 * exists or matches an already found jail.
+			 */
+			*p = '\0';
+			if (pr != NULL) {
+				if (strncmp(name, ppr->pr_name, p - name) ||
+				    ppr->pr_name[p - name] != '\0') {
+					mtx_unlock(&pr->pr_mtx);
+					error = EINVAL;
+					vfs_opterror(opts,
+					    "cannot change jail's parent");
+					goto done_unlock_list;
+				}
+			} else {
+				ppr = prison_find_name(mypr, name);
+				if (ppr == NULL) {
+					error = ENOENT;
+					vfs_opterror(opts,
+					    "jail \"%s\" not found", name);
+					goto done_unlock_list;
+				}
+				mtx_unlock(&ppr->pr_mtx);
+			}
+			name = p + 1;
+		}
 		if (name[0] != '\0') {
+			namelen = strlen(ppr->pr_name) + 1;
+ name_again:
 			deadpr = NULL;
- name_again:
-			TAILQ_FOREACH(tpr, &allprison, pr_list) {
+			FOREACH_PRISON_CHILD(ppr, tpr) {
 				if (tpr != pr && tpr->pr_ref > 0 &&
-				    !strcmp(tpr->pr_name, name)) {
+				    !strcmp(tpr->pr_name + namelen, name)) {
 					if (pr == NULL &&
 					    cuflags != JAIL_CREATE) {
 						mtx_lock(&tpr->pr_mtx);
@@ -763,7 +880,7 @@
 						/*
 						 * Create, or update(jid):
 						 * name must not exist in an
-						 * active jail.
+						 * active sibling jail.
 						 */
 						error = EEXIST;
 						if (pr != NULL)
@@ -810,6 +927,15 @@
 	/* If there's no prison to update, create a new one and link it in. */
 	if (pr == NULL) {
 		created = 1;
+		mtx_lock(&ppr->pr_mtx);
+		if (ppr->pr_ref == 0 || (ppr->pr_flags & PR_REMOVE)) {
+			mtx_unlock(&ppr->pr_mtx);
+			error = ENOENT;
+			goto done_unlock_list;
+		}
+		ppr->pr_ref++;
+		ppr->pr_uref++;
+		mtx_unlock(&ppr->pr_mtx);
 		pr = malloc(sizeof(*pr), M_PRISON, M_WAITOK | M_ZERO);
 		if (jid == 0) {
 			/* Find the next free jid. */
@@ -829,7 +955,9 @@
 					vfs_opterror(opts,
 					    "no available jail IDs");
 					free(pr, M_PRISON);
-					goto done_unlock_list;
+					prison_deref(ppr, PD_DEREF |
+					    PD_DEUREF | PD_LIST_XLOCKED);
+					goto done_releroot;
 				}
 				jid++;
 				goto findnext;
@@ -848,24 +976,56 @@
 		}
 		if (tpr == NULL)
 			TAILQ_INSERT_TAIL(&allprison, pr, pr_list);
-		prisoncount++;
+		LIST_INSERT_HEAD(&ppr->pr_children, pr, pr_sibling);
+		for (tpr = ppr; tpr != NULL; tpr = tpr->pr_parent)
+			tpr->pr_prisoncount++;
 
+		pr->pr_parent = ppr;
 		pr->pr_id = jid;
+
+		/* Set some default values, and inherit some from the parent. */
 		if (name == NULL)
 			name = "";
 		if (path == NULL) {
 			path = "/";
-			root = rootvnode;
+			root = mypr->pr_root;
 			vref(root);
 		}
+#ifdef INET
+		pr->pr_flags |= ppr->pr_flags & PR_IP4;
+		pr->pr_ip4s = ppr->pr_ip4s;
+		if (ppr->pr_ip4 != NULL) {
+			pr->pr_ip4 = malloc(pr->pr_ip4s *
+			    sizeof(struct in_addr), M_PRISON, M_WAITOK);
+			bcopy(ppr->pr_ip4, pr->pr_ip4,
+			    pr->pr_ip4s * sizeof(*pr->pr_ip4));
+		}
+#endif
+#ifdef INET6
+		pr->pr_flags |= ppr->pr_flags & PR_IP6;
+		pr->pr_ip6s = ppr->pr_ip6s;
+		if (ppr->pr_ip6 != NULL) {
+			pr->pr_ip6 = malloc(pr->pr_ip6s *
+			    sizeof(struct in6_addr), M_PRISON, M_WAITOK);
+			bcopy(ppr->pr_ip6, pr->pr_ip6,
+			    pr->pr_ip6s * sizeof(*pr->pr_ip6));
+		}
+#endif
+		pr->pr_securelevel = ppr->pr_securelevel;
+		pr->pr_flags |= ppr->pr_def_perms;
+		pr->pr_enforce_statfs = ppr->pr_def_enforce_statfs;
+#if defined(INET) || defined(INET6)
+		pr->pr_max_af_ips = ppr->pr_def_max_af_ips;
+#endif
 
-		mtx_init(&pr->pr_mtx, "jail mutex", NULL, MTX_DEF);
+		LIST_INIT(&pr->pr_children);
+		mtx_init(&pr->pr_mtx, "jail mutex", NULL, MTX_DEF | MTX_DUPOK);
 
 		/*
 		 * Allocate a dedicated cpuset for each jail.
 		 * Unlike other initial settings, this may return an erorr.
 		 */
-		error = cpuset_create_root(td, &pr->pr_cpuset);
+		error = cpuset_create_root(ppr, &pr->pr_cpuset);
 		if (error) {
 			prison_deref(pr, PD_LIST_XLOCKED);
 			goto done_releroot;
@@ -887,103 +1047,425 @@
 	}
 
 	/* Do final error checking before setting anything. */
-	error = 0;
+	if (gotslevel) {
+		if (slevel < ppr->pr_securelevel) {
+			error = EPERM;
+			goto done_deref_locked;
+		}
+	}
+	if (gotenforce) {
+		if (enforce < ppr->pr_enforce_statfs) {
+			error = EPERM;
+			goto done_deref_locked;
+		}
+	}
 #if defined(INET) || defined(INET6)
-	if (
-#ifdef INET
-	    ip4s > 0
-#ifdef INET6
-	    ||
+	if (gotmaxips) {
+		if (maxips > ppr->pr_max_af_ips) {
+			error = EPERM;
+			goto done_deref_locked;
+		}
+	}
 #endif
-#endif
-#ifdef INET6
-	    ip6s > 0
-#endif
-	    )
-		/*
-		 * Check for conflicting IP addresses.  We permit them if there
-		 * is no more than 1 IP on each jail.  If there is a duplicate
-		 * on a jail with more than one IP stop checking and return
-		 * error.
-		 */
-		TAILQ_FOREACH(tpr, &allprison, pr_list) {
-			if (tpr == pr || tpr->pr_uref == 0)
-				continue;
 #ifdef INET
-			if ((ip4s > 0 && tpr->pr_ip4s > 1) ||
-			    (ip4s > 1 && tpr->pr_ip4s > 0))
-				for (ii = 0; ii < ip4s; ii++)
+	if (ch_flags & PR_IP4_USER) {
+		if (!gotmaxips && ip4s > pr->pr_max_af_ips) {
+			error = EINVAL;
+			vfs_opterror(opts, "too many IPv4 addresses");
+			goto done_deref_locked;
+		}
+		if (ppr->pr_flags & PR_IP4) {
+			if (!(pr_flags & PR_IP4_USER)) {
+				/*
+				 * Silently ignore attempts to make the IP
+				 * addresses unrestricted when the parent is
+				 * restricted; in other words, interpret
+				 * "unrestricted" as "as unrestricted as
+				 * possible".
+				 */
+				ip4s = ppr->pr_ip4s;
+				if (ip4s == 0) {
+					free(ip4, M_PRISON);
+					ip4 = NULL;
+				} else if (ip4s <= ip4a) {
+					/* Inherit the parent's address(es). */
+					bcopy(ppr->pr_ip4, ip4,
+					    ip4s * sizeof(*ip4));
+				} else {
+					/*
+					 * There's no room for the parent's
+					 * address list.  Allocate some more.
+					 */
+					ip4a = ip4s;
+					free(ip4, M_PRISON);
+					ip4 = malloc(ip4a * sizeof(*ip4),
+					    M_PRISON, M_NOWAIT);
+					if (ip4 != NULL)
+						bcopy(ppr->pr_ip4, ip4,
+						    ip4s * sizeof(*ip4));
+					else {
+						/* Allocation failed without
+						 * sleeping.  Unlocking the
+						 * prison now will invalidate
+						 * some checks and prematurely
+						 * show an unfinished new jail.
+						 * So let go of everything and
+						 * start over.
+						 */
+						prison_deref(pr, created
+						    ? PD_LOCKED |
+						      PD_LIST_XLOCKED
+						    : PD_DEREF | PD_LOCKED |
+						      PD_LIST_XLOCKED);
+						if (root != NULL) {
+							vfslocked =
+							    VFS_LOCK_GIANT(
+							    root->v_mount);
+							vrele(root);
+							VFS_UNLOCK_GIANT(
+							    vfslocked);
+						}
+						ip4 = malloc(ip4a *
+						    sizeof(*ip4), M_PRISON,
+						    M_WAITOK);
+						goto again;
+					}
+				}
+			} else if (ip4s > 0) {
+				/*
+				 * Make sure the new set of IP addresses is a
+				 * subset of the parent's list.  Don't worry
+				 * about the parent being unlocked, as any
+				 * setting is done with allprison_lock held.
+				 */
+				for (ij = 0; ij < ppr->pr_ip4s; ij++)
+					if (ip4[0].s_addr ==
+					    ppr->pr_ip4[ij].s_addr)
+						break;
+				if (ij == ppr->pr_ip4s) {
+					error = EPERM;
+					goto done_deref_locked;
+				}
+				if (ip4s > 1) {
+					for (ii = ij = 1; ii < ip4s; ii++) {
+						if (ip4[ii].s_addr ==
+						    ppr->pr_ip4[0].s_addr)
+							continue;
+						for (; ij < ppr->pr_ip4s; ij++)
+						    if (ip4[ii].s_addr ==
+							ppr->pr_ip4[ij].s_addr)
+							    break;
+					}
+					if (ij == ppr->pr_ip4s) {
+						error = EPERM;
+						goto done_deref_locked;
+					}
+				}
+			}
+		}
+		if (ip4s > 0) {
+			/*
+			 * Check for conflicting IP addresses.  We permit them
+			 * if there is no more than one IP on each jail.  If
+			 * there is a duplicate on a jail with more than one
+			 * IP stop checking and return error.
+			 */
+			FOREACH_PRISON_DESCENDANT(&prison0, tpr, descend) {
+				if (tpr == pr || tpr->pr_uref == 0) {
+					descend = 0;
+					continue;
+				}
+				if (!(tpr->pr_flags & PR_IP4_USER))
+					continue;
+				descend = 0;
+				if (tpr->pr_ip4 == NULL ||
+				    (ip4s == 1 && tpr->pr_ip4s == 1))
+					continue;
+				for (ii = 0; ii < ip4s; ii++) {
 					if (_prison_check_ip4(tpr,
 					    &ip4[ii]) == 0) {
-						error = EINVAL;
+						error = EADDRINUSE;
 						vfs_opterror(opts,
 						    "IPv4 addresses clash");
 						goto done_deref_locked;
 					}
+				}
+			}
+		}
+	}
 #endif
 #ifdef INET6
-			if ((ip6s > 0 && tpr->pr_ip6s > 1) ||
-			    (ip6s > 1 && tpr->pr_ip6s > 0))
-				for (ii = 0; ii < ip6s; ii++)
+	if (ch_flags & PR_IP6_USER) {
+		if (!gotmaxips && ip6s > pr->pr_max_af_ips) {
+			error = EINVAL;
+			vfs_opterror(opts, "too many IPv6 addresses");
+			goto done_deref_locked;
+		}
+		if (ppr->pr_flags & PR_IP6) {
+			if (!(pr_flags & PR_IP6_USER)) {
+				/*
+				 * Silently ignore attempts to make the IP
+				 * addresses unrestricted when the parent is
+				 * restricted.
+				 */
+				ip6s = ppr->pr_ip6s;
+				if (ip6s == 0) {
+					free(ip6, M_PRISON);
+					ip6 = NULL;
+				} else if (ip6s <= ip6a) {
+					/* Inherit the parent's address(es). */
+					bcopy(ppr->pr_ip6, ip6,
+					    ip6s * sizeof(*ip6));
+				} else {
+					/*
+					 * There's no room for the parent's
+					 * address list.
+					 */
+					ip6a = ip6s;
+					free(ip6, M_PRISON);
+					ip6 = malloc(ip6a * sizeof(*ip6),
+					    M_PRISON, M_NOWAIT);
+					if (ip6 != NULL)
+						bcopy(ppr->pr_ip6, ip6,
+						    ip6s * sizeof(*ip6));
+					else {
+						prison_deref(pr, created
+						    ? PD_LOCKED |
+						      PD_LIST_XLOCKED
+						    : PD_DEREF | PD_LOCKED |
+						      PD_LIST_XLOCKED);
+						if (root != NULL) {
+							vfslocked =
+							    VFS_LOCK_GIANT(
+							    root->v_mount);
+							vrele(root);
+							VFS_UNLOCK_GIANT(
+							    vfslocked);
+						}
+						ip6 = malloc(ip6a *
+						    sizeof(*ip6), M_PRISON,
+						    M_WAITOK);
+						goto again;
+					}
+				}
+			} else if (ip6s > 0) {
+				/*
+				 * Make sure the new set of IP addresses is a
+				 * subset of the parent's list.
+				 */
+				for (ij = 0; ij < ppr->pr_ip6s; ij++)
+					if (IN6_ARE_ADDR_EQUAL(&ip6[0],
+					    &ppr->pr_ip6[ij]))
+						break;
+				if (ij == ppr->pr_ip6s) {
+					error = EPERM;
+					goto done_deref_locked;
+				}
+				if (ip6s > 1) {
+					for (ii = ij = 1; ii < ip6s; ii++) {
+						if (IN6_ARE_ADDR_EQUAL(&ip6[ii],
+						    &ppr->pr_ip6[0]))
+							continue;
+						for (; ij < ppr->pr_ip6s; ij++)
+							if (IN6_ARE_ADDR_EQUAL(
+							    &ip6[ii],
+							    &ppr->pr_ip6[ij]))
+								break;
+					}
+					if (ij == ppr->pr_ip6s) {
+						error = EPERM;
+						goto done_deref_locked;
+					}
+				}
+			}
+		}
+		if (ip6s > 0) {
+			/* Check for conflicting IP addresses. */
+			FOREACH_PRISON_DESCENDANT(&prison0, tpr, descend) {
+				if (tpr == pr || tpr->pr_uref == 0) {
+					descend = 0;
+					continue;
+				}
+				if (!(tpr->pr_flags & PR_IP6_USER))
+					continue;
+				descend = 0;
+				if (tpr->pr_ip6 == NULL ||
+				    (ip6s == 1 && tpr->pr_ip6s == 1))
+					continue;
+				for (ii = 0; ii < ip6s; ii++) {
 					if (_prison_check_ip6(tpr,
 					    &ip6[ii]) == 0) {
-						error = EINVAL;
+						error = EADDRINUSE;
 						vfs_opterror(opts,
 						    "IPv6 addresses clash");
 						goto done_deref_locked;
 					}
-#endif
+				}
+			}
 		}
+	}
 #endif
-	if (error == 0 && name != NULL) {
+	if (name != NULL) {
 		/* Give a default name of the jid. */
 		if (name[0] == '\0')
 			snprintf(name = numbuf, sizeof(numbuf), "%d", jid);
 		else if (strtoul(name, &p, 10) != jid && *p == '\0') {
 			error = EINVAL;
 			vfs_opterror(opts, "name cannot be numeric");
+			goto done_deref_locked;
 		}
+		if (strlen(ppr->pr_name) + strlen(name) + 2 >
+		    sizeof(pr->pr_name)) {
+			error = ENAMETOOLONG;
+			goto done_deref_locked;
+		}
 	}
-	if (error) {
- done_deref_locked:
-		/*
-		 * Some parameter had an error so do not set anything.
-		 * If this is a new jail, it will go away without ever
-		 * having been seen.
-		 */
-		prison_deref(pr, created
-		    ? PD_LOCKED | PD_LIST_XLOCKED
-		    : PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED);
-		goto done_releroot;
+	if ((PR_ALLOW_ALL & pr_flags & ~ppr->pr_flags) |
+	    (PR_RESTRICT_ALL & ch_flags & ~pr_flags & ppr->pr_flags)) {
+		error = EPERM;
+		goto done_deref_locked;
 	}
 
 	/* Set the parameters of the prison. */
 #ifdef INET
-	if (ip4s >= 0) {
-		pr->pr_ip4s = ip4s;
-		free(pr->pr_ip4, M_PRISON);
-		pr->pr_ip4 = ip4;
-		ip4 = NULL;
+	redo_ip4 = 0;
+	if (ch_flags & PR_IP4_USER) {
+		if (pr_flags & PR_IP4_USER) {
+			/* Some restriction set. */
+			pr->pr_flags |= PR_IP4;
+			if (ip4s >= 0) {
+				free(pr->pr_ip4, M_PRISON);
+				pr->pr_ip4s = ip4s;
+				pr->pr_ip4 = ip4;
+				ip4 = NULL;
+			}
+		} else if (ppr->pr_flags & PR_IP4) {
+			/* This restriction cleared, but keep inherited. */
+			free(pr->pr_ip4, M_PRISON);
+			pr->pr_ip4s = ip4s;
+			pr->pr_ip4 = ip4;
+			ip4 = NULL;
+		} else {
+			/* Restriction cleared, now unrestricted. */
+			pr->pr_flags &= ~PR_IP4;
+			free(pr->pr_ip4, M_PRISON);
+			pr->pr_ip4s = 0;
+		}
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			if (prison_restrict_ip4(tpr, NULL)) {
+				redo_ip4 = 1;
+				descend = 0;
+			}
+		}
 	}
 #endif
 #ifdef INET6
-	if (ip6s >= 0) {
-		pr->pr_ip6s = ip6s;
-		free(pr->pr_ip6, M_PRISON);
-		pr->pr_ip6 = ip6;
-		ip6 = NULL;
+	redo_ip6 = 0;
+	if (ch_flags & PR_IP6_USER) {
+		if (pr_flags & PR_IP6_USER) {
+			/* Some restriction set. */
+			pr->pr_flags |= PR_IP6;
+			if (ip6s >= 0) {
+				free(pr->pr_ip6, M_PRISON);
+				pr->pr_ip6s = ip6s;
+				pr->pr_ip6 = ip6;
+				ip6 = NULL;
+			}
+		} else if (ppr->pr_flags & PR_IP6) {
+			/* This restriction cleared, but keep inherited. */
+			free(pr->pr_ip6, M_PRISON);
+			pr->pr_ip6s = ip6s;
+			pr->pr_ip6 = ip6;
+			ip6 = NULL;
+		} else {
+			/* Restriction cleared, now unrestricted. */
+			pr->pr_flags &= ~PR_IP6;
+			free(pr->pr_ip6, M_PRISON);
+			pr->pr_ip6s = 0;
+		}
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			if (prison_restrict_ip6(tpr, NULL)) {
+				redo_ip6 = 1;
+				descend = 0;
+			}
+		}
 	}
 #endif
-	if (gotslevel)
+	if (gotslevel) {
 		pr->pr_securelevel = slevel;
-	if (name != NULL)
-		strlcpy(pr->pr_name, name, sizeof(pr->pr_name));
+		/* Set all child jails to be at least this level. */
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend)
+			if (tpr->pr_securelevel < slevel)
+				tpr->pr_securelevel = slevel;
+	}
+	if (gotenforce) {
+		pr->pr_enforce_statfs = enforce;
+		if (pr->pr_def_enforce_statfs < enforce)
+			pr->pr_def_enforce_statfs = enforce;
+		/* Pass this restriction on to the children. */
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend)
+			if (tpr->pr_enforce_statfs < enforce) {
+				tpr->pr_enforce_statfs = enforce;
+				if (tpr->pr_def_enforce_statfs < enforce)
+					tpr->pr_def_enforce_statfs = enforce;
+			}
+	}
+#if defined(INET) || defined(INET6)
+	if (gotmaxips) {
+		pr->pr_max_af_ips = maxips;
+		if (pr->pr_def_max_af_ips > maxips)
+			pr->pr_def_max_af_ips = maxips;
+		/* Pass this restriction on to the children. */
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend)
+			if (tpr->pr_max_af_ips > maxips) {
+				tpr->pr_max_af_ips = maxips;
+				if (tpr->pr_def_max_af_ips > maxips)
+					tpr->pr_def_max_af_ips = maxips;
+			}
+	}
+#endif
+	if (name != NULL) {
+		onamelen = strlen(pr->pr_name);
+		if (ppr == &prison0)
+			strlcpy(pr->pr_name, name, sizeof(pr->pr_name));
+		else
+			snprintf(pr->pr_name, sizeof(pr->pr_name), "%s.%s",
+			    ppr->pr_name, name);
+		namelen = strlen(pr->pr_name);
+		/* Change this component of child names. */
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			bcopy(tpr->pr_name + onamelen, tpr->pr_name + namelen,
+			    strlen(tpr->pr_name + onamelen) + 1);
+			bcopy(pr->pr_name, tpr->pr_name, namelen);
+		}
+	}
 	if (path != NULL) {
-		strlcpy(pr->pr_path, path, sizeof(pr->pr_path));
+		/* Try to keep a real-rooted full pathname. */
+		if (path[0] == '/' && strcmp(mypr->pr_path, "/"))
+			snprintf(pr->pr_path, sizeof pr->pr_path, "%s%s",
+			    mypr->pr_path, path);
+		else
+			strlcpy(pr->pr_path, path, sizeof(pr->pr_path));
 		pr->pr_root = root;
 	}
 	if (host != NULL)
 		strlcpy(pr->pr_host, host, sizeof(pr->pr_host));
+	if ((tflags = PR_ALLOW_ALL & ch_flags & ~pr_flags)) {
+		/* Clear allow bits on sysctl and all children. */
+		pr->pr_def_perms &= ~tflags;
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			tpr->pr_flags &= ~tflags;
+			tpr->pr_def_perms &= ~tflags;
+		}
+	}
+	if ((tflags = PR_RESTRICT_ALL & pr_flags)) {
+		/* Set restrict bits on sysctl and all children. */
+		pr->pr_def_perms |= tflags;
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			tpr->pr_flags |= tflags;
+			tpr->pr_def_perms |= tflags;
+		}
+	}
 	/*
 	 * Persistent prisons get an extra reference, and prisons losing their
 	 * persist flag lose that reference.  Only do this for existing prisons
@@ -1002,6 +1484,44 @@
 	pr->pr_flags = (pr->pr_flags & ~ch_flags) | pr_flags;
 	mtx_unlock(&pr->pr_mtx);
 
+	/* Locks may have prevented a complete restriction of child IP
+	 * addresses.  If so, allocate some more memory and try again.
+	 */
+#ifdef INET
+	while (redo_ip4) {
+		ip4s = pr->pr_ip4s;
+		ip4 = malloc(ip4s * sizeof(*ip4), M_PRISON, M_WAITOK);
+		mtx_lock(&pr->pr_mtx);
+		redo_ip4 = 0;
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			if (prison_restrict_ip4(tpr, ip4)) {
+				if (ip4 != NULL)
+					ip4 = NULL;
+				else
+					redo_ip4 = 1;
+			}
+		}
+		mtx_unlock(&pr->pr_mtx);
+	}
+#endif
+#ifdef INET6
+	while (redo_ip6) {
+		ip6s = pr->pr_ip6s;
+		ip6 = malloc(ip6s * sizeof(*ip6), M_PRISON, M_WAITOK);
+		mtx_lock(&pr->pr_mtx);
+		redo_ip6 = 0;
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) {
+			if (prison_restrict_ip6(tpr, ip6)) {
+				if (ip6 != NULL)
+					ip6 = NULL;
+				else
+					redo_ip6 = 1;
+			}
+		}
+		mtx_unlock(&pr->pr_mtx);
+	}
+#endif
+
 	/* Let the modules do their work. */
 	sx_downgrade(&allprison_lock);
 	if (created) {
@@ -1054,6 +1574,11 @@
 	td->td_retval[0] = pr->pr_id;
 	goto done_errmsg;
 
+ done_deref_locked:
+	prison_deref(pr, created
+	    ? PD_LOCKED | PD_LIST_XLOCKED
+	    : PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED);
+	goto done_releroot;
  done_unlock_list:
 	sx_xunlock(&allprison_lock);
  done_releroot:
@@ -1131,6 +1656,7 @@
 }
 
 SYSCTL_JAIL_PARAM(, jid, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail ID");
+SYSCTL_JAIL_PARAM(, parent, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail parent ID");
 SYSCTL_JAIL_PARAM_STRING(, name, CTLFLAG_RW, MAXHOSTNAMELEN, "Jail name");
 SYSCTL_JAIL_PARAM(, cpuset, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail cpuset ID");
 SYSCTL_JAIL_PARAM_STRING(, path, CTLFLAG_RD, MAXPATHLEN, "Jail root path");
@@ -1147,16 +1673,44 @@
 
 #ifdef INET
 SYSCTL_JAIL_PARAM_NODE(ip4, "Jail IPv4 address virtualization");
+SYSCTL_JAIL_PARAM(, noip4, CTLTYPE_INT | CTLFLAG_RW,
+    "BN", "Jail w/ no IPv4 address virtualization");
 SYSCTL_JAIL_PARAM_STRUCT(_ip4, addr, CTLFLAG_RW, sizeof(struct in_addr),
     "S,in_addr,a", "Jail IPv4 addresses");
 #endif
 #ifdef INET6
 SYSCTL_JAIL_PARAM_NODE(ip6, "Jail IPv6 address virtualization");
+SYSCTL_JAIL_PARAM(, noip6, CTLTYPE_INT | CTLFLAG_RW,
+    "BN", "Jail w/ no IPv6 address virtualization");
 SYSCTL_JAIL_PARAM_STRUCT(_ip6, addr, CTLFLAG_RW, sizeof(struct in6_addr),
     "S,in6_addr,a", "Jail IPv6 addresses");
 #endif
 
+SYSCTL_JAIL_PARAM_NODE(perm, "Jail permissions");
+SYSCTL_JAIL_PARAM(_perm, set_hostname_allowed, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may set hostname");
+SYSCTL_JAIL_PARAM(_perm, sysvipc_allowed, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may use SYSV IPC");
+SYSCTL_JAIL_PARAM(_perm, allow_raw_sockets, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may create raw sockets");
+SYSCTL_JAIL_PARAM(_perm, chflags_allowed, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may alter system file flags");
+SYSCTL_JAIL_PARAM(_perm, mount_allowed, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may mount/unmount jail-friendly file systems");
+SYSCTL_JAIL_PARAM(_perm, allow_quotas, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may set file quotas");
+SYSCTL_JAIL_PARAM(_perm, allow_jails, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail may create child jails");
+SYSCTL_JAIL_PARAM(_perm, socket_unixiproute_only, CTLTYPE_INT | CTLFLAG_RW,
+    "B", "Jail limited to creating UNIX/IPv4/IPv6/route sockets only");
+SYSCTL_JAIL_PARAM(_perm, enforce_statfs, CTLTYPE_INT | CTLFLAG_RW,
+    "I", "Jail cannot see all mounted file systems");
+#if defined(INET) || defined(INET6)
+SYSCTL_JAIL_PARAM(_perm, max_af_ips, CTLTYPE_INT | CTLFLAG_RW,
+    "I", "Number of IP addresses a jail may have at most per address family");
+#endif
 
+
 /*
  * struct jail_get_args {
  *	struct iovec *iovp;
@@ -1188,28 +1742,21 @@
 int
 kern_jail_get(struct thread *td, struct uio *optuio, int flags)
 {
-	struct prison *pr;
+	struct prison *pr, *mypr;
 	struct vfsopt *opt;
 	struct vfsoptlist *opts;
 	char *errmsg, *name;
-	int error, errmsg_len, errmsg_pos, i, jid, len, locked, pos;
+	int error, errmsg_len, errmsg_pos, fi, i, jid, len, locked, pos;
 
 	if (flags & ~JAIL_GET_MASK)
 		return (EINVAL);
-	if (jailed(td->td_ucred)) {
-		/*
-		 * Don't allow a jailed process to see any jails,
-		 * not even its own.
-		 */
-		vfs_opterror(opts, "jail not found");
-		return (ENOENT);
-	}
 
 	/* Get the parameter list. */
 	error = vfs_buildopts(optuio, &opts);
 	if (error)
 		return (error);
 	errmsg_pos = vfs_getopt_pos(opts, "errmsg");
+	mypr = td->td_ucred->cr_prison;
 
 	/*
 	 * Find the prison specified by one of: lastjid, jid, name.
@@ -1218,7 +1765,7 @@
 	error = vfs_copyopt(opts, "lastjid", &jid, sizeof(jid));
 	if (error == 0) {
 		TAILQ_FOREACH(pr, &allprison, pr_list) {
-			if (pr->pr_id > jid) {
+			if (pr->pr_id > jid && prison_ischild(mypr, pr)) {
 				mtx_lock(&pr->pr_mtx);
 				if (pr->pr_ref > 0 &&
 				    (pr->pr_uref > 0 || (flags & JAIL_DYING)))
@@ -1237,7 +1784,7 @@
 	error = vfs_copyopt(opts, "jid", &jid, sizeof(jid));
 	if (error == 0) {
 		if (jid != 0) {
-			pr = prison_find(jid);
+			pr = prison_find_child(mypr, jid);
 			if (pr != NULL) {
 				if (pr->pr_uref == 0 && !(flags & JAIL_DYING)) {
 					mtx_unlock(&pr->pr_mtx);
@@ -1261,7 +1808,7 @@
 			error = EINVAL;
 			goto done_unlock_list;
 		}
-		pr = prison_find_name(name);
+		pr = prison_find_name(mypr, name);
 		if (pr != NULL) {
 			if (pr->pr_uref == 0 && !(flags & JAIL_DYING)) {
 				mtx_unlock(&pr->pr_mtx);
@@ -1290,14 +1837,18 @@
 	error = vfs_setopt(opts, "jid", &pr->pr_id, sizeof(pr->pr_id));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
-	error = vfs_setopts(opts, "name", pr->pr_name);
+	i = pr->pr_parent == mypr ? 0 : pr->pr_parent->pr_id;
+	error = vfs_setopt(opts, "parent", &i, sizeof(i));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
+	error = vfs_setopts(opts, "name", prison_name(mypr, pr));
+	if (error != 0 && error != ENOENT)
+		goto done_deref;
 	error = vfs_setopt(opts, "cpuset", &pr->pr_cpuset->cs_id,
 	    sizeof(pr->pr_cpuset->cs_id));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
-	error = vfs_setopts(opts, "path", pr->pr_path);
+	error = vfs_setopts(opts, "path", prison_path(mypr, pr));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
 #ifdef INET
@@ -1319,14 +1870,29 @@
 	error = vfs_setopts(opts, "host.hostname", pr->pr_host);
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
-	i = pr->pr_flags & PR_PERSIST ? 1 : 0;
-	error = vfs_setopt(opts, "persist", &i, sizeof(i));
+	error = vfs_setopt(opts, "perm.enforce_statfs", &pr->pr_enforce_statfs,
+	    sizeof(pr->pr_enforce_statfs));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
-	i = !i;
-	error = vfs_setopt(opts, "nopersist", &i, sizeof(i));
+#if defined(INET) || defined(INET6)
+	error = vfs_setopt(opts, "perm.max_af_ips", &pr->pr_max_af_ips,
+	    sizeof(pr->pr_max_af_ips));
 	if (error != 0 && error != ENOENT)
 		goto done_deref;
+#endif
+	for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]);
+	    fi++) {
+		if (pr_flag_names[fi] == NULL)
+			continue;
+		i = (pr->pr_flags & (1 << fi)) ? 1 : 0;
+		error = vfs_setopt(opts, pr_flag_names[fi], &i, sizeof(i));
+		if (error != 0 && error != ENOENT)
+			goto done_deref;
+		i = !i;
+		error = vfs_setopt(opts, pr_flag_nonames[fi], &i, sizeof(i));
+		if (error != 0 && error != ENOENT)
+			goto done_deref;
+	}
 	i = (pr->pr_uref == 0);
 	error = vfs_setopt(opts, "dying", &i, sizeof(i));
 	if (error != 0 && error != ENOENT)
@@ -1402,6 +1968,159 @@
 }
 
 /*
+ * Jail permission sysctls.  These are companions to the jail parameters of
+ * similar names, and provide the default values for child jails.
+ */
+
+static int
+sysctl_jail_perm(SYSCTL_HANDLER_ARGS)
+{
+	struct prison *pr, *cpr;
+	int descend, error, i;
+
+	pr = req->td->td_ucred->cr_prison;
+
+	/* Get the current flag value, and convert it to a boolean. */
+	i = (pr->pr_def_perms & arg2) ? 1 : 0;
+	error = sysctl_handle_int(oidp, &i, 0, req);
+	if (error || !req->newptr)
+		return (error);
+	i = i ? arg2 : 0;
+	/* Do not allow more than the current prison itself can do. */
+	sx_slock(&allprison_lock);
+	mtx_lock(&pr->pr_mtx);
+	if ((i & PR_ALLOW_ALL & ~pr->pr_flags) |
+	    (arg2 & PR_RESTRICT_ALL & pr->pr_flags & ~i)) {
+		mtx_unlock(&pr->pr_mtx);
+		sx_sunlock(&allprison_lock);
+		return (EPERM);
+	}
+	pr->pr_def_perms = (pr->pr_def_perms & ~arg2) | i;
+	/* Reflect restrictions to child jails. */
+	if ((arg2 & PR_ALLOW_ALL & ~i) | (arg2 & PR_RESTRICT_ALL & i))
+		FOREACH_PRISON_DESCENDANT_LOCKED(pr, cpr, descend) {
+			cpr->pr_flags = (cpr->pr_flags & ~arg2) | i;
+			cpr->pr_def_perms = (cpr->pr_def_perms & ~arg2) | i;
+		}
+	mtx_unlock(&pr->pr_mtx);
+	sx_sunlock(&allprison_lock);
+	return (0);
+}
+
+SYSCTL_PROC(_security_jail, OID_AUTO, set_hostname_allowed,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_SET_HOSTNAME, sysctl_jail_perm, "I",
+    "Processes in jail can set their hostnames");
+SYSCTL_PROC(_security_jail, OID_AUTO, socket_unixiproute_only,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_RESTRICT_SOCKET_UNIXIPROUTE, sysctl_jail_perm, "I",
+    "Processes in jail are limited to creating UNIX/IP/route sockets only");
+SYSCTL_PROC(_security_jail, OID_AUTO, sysvipc_allowed,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_SYSVIPC, sysctl_jail_perm, "I",
+    "Processes in jail can use System V IPC primitives");
+SYSCTL_PROC(_security_jail, OID_AUTO, allow_raw_sockets,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_RAW_SOCKETS, sysctl_jail_perm, "I",
+    "Prison root can create raw sockets");
+SYSCTL_PROC(_security_jail, OID_AUTO, chflags_allowed,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_CHFLAGS, sysctl_jail_perm, "I",
+    "Processes in jail can alter system file flags");
+SYSCTL_PROC(_security_jail, OID_AUTO, mount_allowed,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_MOUNT, sysctl_jail_perm, "I",
+    "Processes in jail can mount/unmount jail-friendly file systems");
+SYSCTL_PROC(_security_jail, OID_AUTO, allow_quotas,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_QUOTAS, sysctl_jail_perm, "I",
+    "Processes in jail can set file quotas");
+SYSCTL_PROC(_security_jail, OID_AUTO, allow_jails,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, PR_ALLOW_JAILS, sysctl_jail_perm, "I",
+    "Processes in jail can create child jails");
+
+static int
+sysctl_jail_enforce_statfs(SYSCTL_HANDLER_ARGS)
+{
+	struct prison *pr, *cpr;
+	int descend, error, i;
+
+	pr = req->td->td_ucred->cr_prison;
+
+	i = pr->pr_def_enforce_statfs;
+	error = sysctl_handle_int(oidp, &i, 0, req);
+	if (error || !req->newptr)
+		return (error);
+	if (i < 0 || i > 2)
+		return (EINVAL);
+	/* Do not allow more than the current prison itself can do. */
+	sx_slock(&allprison_lock);
+	mtx_lock(&pr->pr_mtx);
+	if (i < pr->pr_enforce_statfs) {
+		mtx_unlock(&pr->pr_mtx);
+		sx_sunlock(&allprison_lock);
+		return (EPERM);
+	}
+	pr->pr_def_enforce_statfs = i;
+	/* Reflect restrictions to child jails. */
+	FOREACH_PRISON_DESCENDANT_LOCKED(pr, cpr, descend)
+		if (cpr->pr_enforce_statfs < i) {
+			cpr->pr_enforce_statfs = i;
+			if (cpr->pr_def_enforce_statfs < i)
+				cpr->pr_def_enforce_statfs = i;
+		}
+	mtx_unlock(&pr->pr_mtx);
+	sx_sunlock(&allprison_lock);
+	return (0);
+}
+SYSCTL_PROC(_security_jail, OID_AUTO, enforce_statfs,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, 0, sysctl_jail_enforce_statfs, "I",
+    "Processes in jail cannot see all mounted file systems");
+
+#if defined(INET) || defined(INET6)
+static int
+sysctl_jail_max_af_ips(SYSCTL_HANDLER_ARGS)
+{
+	struct prison *pr, *cpr;
+	int descend, error, i;
+
+	pr = req->td->td_ucred->cr_prison;
+
+	i = pr->pr_def_max_af_ips;
+	error = sysctl_handle_int(oidp, &i, 0, req);
+	if (error || !req->newptr)
+		return (error);
+	if (i < 1)
+		return (EINVAL);
+	/* Do not allow more than the current prison itself can do. */
+	sx_slock(&allprison_lock);
+	mtx_lock(&pr->pr_mtx);
+	if (i > pr->pr_max_af_ips) {
+		mtx_unlock(&pr->pr_mtx);
+		sx_sunlock(&allprison_lock);
+		return (EPERM);
+	}
+	pr->pr_def_max_af_ips = i;
+	/* Reflect restrictions to child jails. */
+	FOREACH_PRISON_DESCENDANT_LOCKED(pr, cpr, descend)
+		if (cpr->pr_max_af_ips > i) {
+			cpr->pr_max_af_ips = i;
+			if (cpr->pr_def_max_af_ips > i)
+				cpr->pr_def_max_af_ips = i;
+		}
+	mtx_unlock(&pr->pr_mtx);
+	sx_sunlock(&allprison_lock);
+	return (0);
+}
+SYSCTL_PROC(_security_jail, OID_AUTO, max_af_ips,
+    CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE,
+    NULL, 0, sysctl_jail_max_af_ips, "I",
+    "Number of IP addresses a jail may have at most per address family");
+#endif
+
+/*
  * struct jail_remove_args {
  *	int jid;
  * };
@@ -1409,21 +2128,61 @@
 int
 jail_remove(struct thread *td, struct jail_remove_args *uap)
 {
-	struct prison *pr;
-	struct proc *p;
-	int deuref, error;
+	struct prison *pr, *cpr, *lpr, *tpr;
+	int descend, error;
 
 	error = priv_check(td, PRIV_JAIL_REMOVE);
 	if (error)
 		return (error);
 
 	sx_xlock(&allprison_lock);
-	pr = prison_find(uap->jid);
+	pr = prison_find_child(td->td_ucred->cr_prison, uap->jid);
 	if (pr == NULL) {
 		sx_xunlock(&allprison_lock);
 		return (EINVAL);
 	}
 
+	/* Remove all descendants of this prison, then remove this prison. */
+	pr->pr_ref++;
+	pr->pr_flags |= PR_REMOVE;
+	if (!LIST_EMPTY(&pr->pr_children)) {
+		mtx_unlock(&pr->pr_mtx);
+		lpr = NULL;
+		FOREACH_PRISON_DESCENDANT(pr, cpr, descend) {
+			mtx_lock(&cpr->pr_mtx);
+			if (cpr->pr_ref > 0) {
+				tpr = cpr;
+				cpr->pr_ref++;
+				cpr->pr_flags |= PR_REMOVE;
+			} else {
+				/* Already removed - do not do it again. */
+				tpr = NULL;
+			}
+			mtx_unlock(&cpr->pr_mtx);
+			if (lpr != NULL) {
+				mtx_lock(&lpr->pr_mtx);
+				prison_remove1(lpr);
+				sx_xlock(&allprison_lock);
+			}
+			lpr = tpr;
+		}
+		if (lpr != NULL) {
+			mtx_lock(&lpr->pr_mtx);
+			prison_remove1(lpr);
+			sx_xlock(&allprison_lock);
+		}
+		mtx_lock(&pr->pr_mtx);
+	}
+	prison_remove1(pr);
+	return (0);
+}
+
+static void
+prison_remove1(struct prison *pr)
+{
+	struct proc *p;
+	int deuref;
+
 	/* If the prison was persistent, it is not anymore. */
 	deuref = 0;
 	if (pr->pr_flags & PR_PERSIST) {
@@ -1432,17 +2191,18 @@
 		pr->pr_flags &= ~PR_PERSIST;
 	}
 
-	/* If there are no references left, remove the prison now. */
-	if (pr->pr_ref == 0) {
+	/*
+	 * jail_remove added a reference.  If that's the only one, remove
+	 * the prison now.
+	 */
+	KASSERT(pr->pr_ref > 0,
+	    ("prison_remove1 removing a dead prison (jid=%d)", pr->pr_id));
+	if (pr->pr_ref == 1) {
 		prison_deref(pr,
 		    deuref | PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED);
-		return (0);
+		return;
 	}
 
-	/*
-	 * Keep a temporary reference to make sure this prison sticks around.
-	 */
-	pr->pr_ref++;
 	mtx_unlock(&pr->pr_mtx);
 	sx_xunlock(&allprison_lock);
 	/*
@@ -1457,9 +2217,8 @@
 		PROC_UNLOCK(p);
 	}
 	sx_sunlock(&allproc_lock);
-	/* Remove the temporary reference. */
+	/* Remove the temporary reference added by jail_remove. */
 	prison_deref(pr, deuref | PD_DEREF);
-	return (0);
 }
 
 
@@ -1479,7 +2238,7 @@
 		return (error);
 
 	sx_slock(&allprison_lock);
-	pr = prison_find(uap->jid);
+	pr = prison_find_child(td->td_ucred->cr_prison, uap->jid);
 	if (pr == NULL) {
 		sx_sunlock(&allprison_lock);
 		return (EINVAL);
@@ -1501,6 +2260,7 @@
 static int
 do_jail_attach(struct thread *td, struct prison *pr)
 {
+	struct prison *ppr;
 	struct proc *p;
 	struct ucred *newcred, *oldcred;
 	int vfslocked, error;
@@ -1528,6 +2288,7 @@
 	/*
 	 * Reparent the newly attached process to this jail.
 	 */
+	ppr = td->td_ucred->cr_prison;
 	p = td->td_proc;
 	error = cpuset_setproc_update_set(p, pr->pr_cpuset);
 	if (error)
@@ -1555,6 +2316,7 @@
 	p->p_ucred = newcred;
 	PROC_UNLOCK(p);
 	crfree(oldcred);
+	prison_deref(ppr, PD_DEREF | PD_DEUREF);
 	return (0);
  e_unlock:
 	VOP_UNLOCK(pr->pr_root, 0);
@@ -1562,7 +2324,7 @@
 	VFS_UNLOCK_GIANT(vfslocked);
  e_revert_osd:
 	/* Tell modules this thread is still in its old jail after all. */
-	(void)osd_jail_call(td->td_ucred->cr_prison, PR_METHOD_ATTACH, td);
+	(void)osd_jail_call(ppr, PR_METHOD_ATTACH, td);
 	prison_deref(pr, PD_DEREF | PD_DEUREF);
 	return (error);
 }
@@ -1588,18 +2350,42 @@
 }
 
 /*
- * Look for the named prison.  Returns a locked prison or NULL.
+ * Find a prison that is a descendant of mypr.  Returns a locked prison or NULL.
  */
 struct prison *
-prison_find_name(const char *name)
+prison_find_child(struct prison *mypr, int prid)
 {
+	struct prison *pr;
+	int descend;
+
+	sx_assert(&allprison_lock, SX_LOCKED);
+	FOREACH_PRISON_DESCENDANT(mypr, pr, descend) {
+		if (pr->pr_id == prid) {
+			mtx_lock(&pr->pr_mtx);
+			if (pr->pr_ref > 0)
+				return (pr);
+			mtx_unlock(&pr->pr_mtx);
+		}
+	}
+	return (NULL);
+}
+
+/*
+ * Look for the name relative to mypr.  Returns a locked prison or NULL.
+ */
+struct prison *
+prison_find_name(struct prison *mypr, const char *name)
+{
 	struct prison *pr, *deadpr;
+	size_t mylen;
+	int descend;
 
 	sx_assert(&allprison_lock, SX_LOCKED);
+	mylen = mypr == &prison0 ? 0 : strlen(mypr->pr_name) + 1;
  again:
 	deadpr = NULL;
-	TAILQ_FOREACH(pr, &allprison, pr_list) {
-		if (!strcmp(pr->pr_name, name)) {
+	FOREACH_PRISON_DESCENDANT(mypr, pr, descend) {
+		if (!strcmp(pr->pr_name + mylen, name)) {
 			mtx_lock(&pr->pr_mtx);
 			if (pr->pr_ref > 0) {
 				if (pr->pr_uref > 0)
@@ -1609,7 +2395,7 @@
 			mtx_unlock(&pr->pr_mtx);
 		}
 	}
-	/* There was no valid prison - perhaps there was a dying one */
+	/* There was no valid prison - perhaps there was a dying one. */
 	if (deadpr != NULL) {
 		mtx_lock(&deadpr->pr_mtx);
 		if (deadpr->pr_ref == 0) {
@@ -1663,66 +2449,87 @@
 static void
 prison_deref(struct prison *pr, int flags)
 {
+	struct prison *ppr, *tpr;
 	int vfslocked;
 
 	if (!(flags & PD_LOCKED))
 		mtx_lock(&pr->pr_mtx);
+	/* Decrement the user references in a separate loop. */
 	if (flags & PD_DEUREF) {
-		pr->pr_uref--;
+		for (tpr = pr;; tpr = tpr->pr_parent) {
+			if (tpr != pr)
+				mtx_lock(&tpr->pr_mtx);
+			if (--tpr->pr_uref > 0)
+				break;
+			KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
+			mtx_unlock(&tpr->pr_mtx);
+		}
 		/* Done if there were only user references to remove. */
 		if (!(flags & PD_DEREF)) {
-			mtx_unlock(&pr->pr_mtx);
+			mtx_unlock(&tpr->pr_mtx);
 			if (flags & PD_LIST_SLOCKED)
 				sx_sunlock(&allprison_lock);
 			else if (flags & PD_LIST_XLOCKED)
 				sx_xunlock(&allprison_lock);
 			return;
 		}
+		if (tpr != pr) {
+			mtx_unlock(&tpr->pr_mtx);
+			mtx_lock(&pr->pr_mtx);
+		}
 	}
-	if (flags & PD_DEREF)
-		pr->pr_ref--;
-	/* If the prison still has references, nothing else to do. */
-	if (pr->pr_ref > 0) {
-		mtx_unlock(&pr->pr_mtx);
-		if (flags & PD_LIST_SLOCKED)
-			sx_sunlock(&allprison_lock);
-		else if (flags & PD_LIST_XLOCKED)
-			sx_xunlock(&allprison_lock);
-		return;
-	}
 
-	KASSERT(pr->pr_uref == 0,
-	    ("%s: Trying to remove an active prison (jid=%d).", __func__,
-	    pr->pr_id));
-	mtx_unlock(&pr->pr_mtx);
-	if (flags & PD_LIST_SLOCKED) {
-		if (!sx_try_upgrade(&allprison_lock)) {
-			sx_sunlock(&allprison_lock);
-			sx_xlock(&allprison_lock);
+	for (;;) {
+		if (flags & PD_DEREF)
+			pr->pr_ref--;
+		/* If the prison still has references, nothing else to do. */
+		if (pr->pr_ref > 0) {
+			mtx_unlock(&pr->pr_mtx);
+			if (flags & PD_LIST_SLOCKED)
+				sx_sunlock(&allprison_lock);
+			else if (flags & PD_LIST_XLOCKED)
+				sx_xunlock(&allprison_lock);
+			return;
 		}
-	} else if (!(flags & PD_LIST_XLOCKED))
-		sx_xlock(&allprison_lock);
 
-	TAILQ_REMOVE(&allprison, pr, pr_list);
-	prisoncount--;
-	sx_xunlock(&allprison_lock);
+		mtx_unlock(&pr->pr_mtx);
+		if (flags & PD_LIST_SLOCKED) {
+			if (!sx_try_upgrade(&allprison_lock)) {
+				sx_sunlock(&allprison_lock);
+				sx_xlock(&allprison_lock);
+			}
+		} else if (!(flags & PD_LIST_XLOCKED))
+			sx_xlock(&allprison_lock);
 
-	if (pr->pr_root != NULL) {
-		vfslocked = VFS_LOCK_GIANT(pr->pr_root->v_mount);
-		vrele(pr->pr_root);
-		VFS_UNLOCK_GIANT(vfslocked);
-	}
-	mtx_destroy(&pr->pr_mtx);
+		TAILQ_REMOVE(&allprison, pr, pr_list);
+		LIST_REMOVE(pr, pr_sibling);
+		ppr = pr->pr_parent;
+		for (tpr = ppr; tpr != NULL; tpr = tpr->pr_parent)
+			tpr->pr_prisoncount--;
+		sx_downgrade(&allprison_lock);
+
+		if (pr->pr_root != NULL) {
+			vfslocked = VFS_LOCK_GIANT(pr->pr_root->v_mount);
+			vrele(pr->pr_root);
+			VFS_UNLOCK_GIANT(vfslocked);
+		}
+		mtx_destroy(&pr->pr_mtx);
 #ifdef INET
-	free(pr->pr_ip4, M_PRISON);
+		free(pr->pr_ip4, M_PRISON);
 #endif
 #ifdef INET6
-	free(pr->pr_ip6, M_PRISON);
+		free(pr->pr_ip6, M_PRISON);
 #endif
-	if (pr->pr_cpuset != NULL)
-		cpuset_rel(pr->pr_cpuset);
-	osd_jail_exit(pr);
-	free(pr, M_PRISON);
+		if (pr->pr_cpuset != NULL)
+			cpuset_rel(pr->pr_cpuset);
+		osd_jail_exit(pr);
+		free(pr, M_PRISON);
+
+		/* Removing a prison frees a reference on its parent. */
+		pr = ppr;
+		mtx_lock(&pr->pr_mtx);
+		flags = PD_DEREF | PD_LIST_SLOCKED;
+	}
 }
 
 void
@@ -1768,10 +2575,94 @@
 
 #ifdef INET
 /*
+ * Restrict a prison's IP address list with its parent's, possibly replacing
+ * it.  Return true if the replacement buffer was used (or would have been).
+ */
+static int
+prison_restrict_ip4(struct prison *pr, struct in_addr *newip4)
+{
+	int ii, ij, used;
+	struct prison *ppr;
+
+	ppr = pr->pr_parent;
+	if (!(pr->pr_flags & PR_IP4_USER)) {
+		/* This has no user settings, so just copy the parent's list. */
+		if (pr->pr_ip4s < ppr->pr_ip4s) {
+			/*
+			 * There's no room for the parent's list.  Use the
+			 * new list buffer, which is assumed to be big enough
+			 * (if it was passed).  If there's no buffer, try to
+			 * allocate one.
+			 */
+			used = 1;
+			if (newip4 == NULL) {
+				newip4 = malloc(ppr->pr_ip4s * sizeof(*newip4),
+				    M_PRISON, M_NOWAIT);
+				if (newip4 != NULL)
+					used = 0;
+			}
+			if (newip4 != NULL) {
+				pr->pr_ip4s = ppr->pr_ip4s;
+				free(pr->pr_ip4, M_PRISON);
+				pr->pr_ip4 = newip4;
+				bcopy(ppr->pr_ip4, newip4,
+				    pr->pr_ip4s * sizeof(*newip4));
+				pr->pr_flags |= PR_IP4;
+			}
+			return (used);
+		}
+		pr->pr_ip4s = ppr->pr_ip4s;
+		if (pr->pr_ip4s > 0)
+			bcopy(ppr->pr_ip4, pr->pr_ip4,
+			    pr->pr_ip4s * sizeof(*pr->pr_ip4));
+		else if (pr->pr_ip4 != NULL) {
+			free(pr->pr_ip4, M_PRISON);
+			pr->pr_ip4 = NULL;
+		}
+		pr->pr_flags =
+			(pr->pr_flags & ~PR_IP4) | (ppr->pr_flags & PR_IP4);
+	} else if (pr->pr_ip4s > 0 && (ppr->pr_flags & PR_IP4)) {
+		/* Remove addresses that aren't in the parent. */
+		for (ij = 0; ij < ppr->pr_ip4s; ij++)
+			if (pr->pr_ip4[0].s_addr == ppr->pr_ip4[ij].s_addr)
+				break;
+		if (ij == ppr->pr_ip4s)
+			bcopy(pr->pr_ip4 + 1, pr->pr_ip4,
+			    --pr->pr_ip4s * sizeof(*pr->pr_ip4));
+		for (ii = ij = 1; ii < pr->pr_ip4s; ii++) {
+			if (pr->pr_ip4[ii].s_addr == ppr->pr_ip4[0].s_addr)
+				continue;
+			for (; ij < ppr->pr_ip4s; ij++) {
+				if (qcmp_v4(&pr->pr_ip4[ii],
+				    &ppr->pr_ip4[ij]) <= 0)
+					break;
+			}
+			if (ij == ppr->pr_ip4s) {
+				pr->pr_ip4s = ii;
+				break;
+			}
+			if (qcmp_v4(&pr->pr_ip4[ii], &ppr->pr_ip4[ij]) > 0) {
+				if (ii < --pr->pr_ip4s)
+					bcopy(pr->pr_ip4 + ii + 1,
+					    pr->pr_ip4 + ii,
+					    (pr->pr_ip4s - ii) *
+					    sizeof(*pr->pr_ip4));
+				ii--;
+			}
+		}
+		if (pr->pr_ip4s == 0) {
+			free(pr->pr_ip4, M_PRISON);
+			pr->pr_ip4 = NULL;
+		}
+	}
+	return (0);
+}
+
+/*
  * Pass back primary IPv4 address of this jail.
  *
- * If not jailed return success but do not alter the address.  Caller has to
- * make sure to initialize it correctly (e.g. INADDR_ANY).
+ * If not restricted return success but do not alter the address.  Caller has
+ * to make sure to initialize it correctly (e.g. INADDR_ANY).
  *
  * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv4.
  * Address returned in NBO.
@@ -1784,10 +2675,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia != NULL, ("%s: ia is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP4))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP4)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip4 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -1799,12 +2694,36 @@
 }
 
 /*
+ * Return true if pr1 and pr2 have the same IPv4 address restrictions.
+ */
+int
+prison_equal_ip4(struct prison *pr1, struct prison *pr2)
+{
+	if (pr1 == pr2)
+		return (1);
+
+	/*
+	 * jail_set maintains an exclusive hold on allprison_lock while it
+	 * changes the IP addresses, so only a shared hold is needed.  This is
+	 * easier than locking the two prisons which would require finding the
+	 * proper locking order and end up needing allprison_lock anyway.
+	 */
+	sx_slock(&allprison_lock);
+	while (pr1 != &prison0 && !(pr1->pr_flags & PR_IP4_USER))
+		pr1 = pr1->pr_parent;
+	while (pr2 != &prison0 && !(pr2->pr_flags & PR_IP4_USER))
+		pr2 = pr2->pr_parent;
+	sx_sunlock(&allprison_lock);
+	return (pr1 == pr2);
+}
+
+/*
  * Make sure our (source) address is set to something meaningful to this
  * jail.
  *
- * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if
- * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv4.
- * Address passed in in NBO and returned in NBO.
+ * Returns 0 if jail doesn't restrict IPv4 or if address belongs to jail,
+ * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail
+ * doesn't allow IPv4.  Address passed in in NBO and returned in NBO.
  */
 int
 prison_local_ip4(struct ucred *cred, struct in_addr *ia)
@@ -1816,10 +2735,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia != NULL, ("%s: ia is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP4))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP4)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip4 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -1861,10 +2784,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia != NULL, ("%s: ia is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP4))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP4)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip4 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -1886,9 +2813,9 @@
 /*
  * Check if given address belongs to the jail referenced by cred/prison.
  *
- * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if
- * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv4.
- * Address passed in in NBO.
+ * Returns 0 if jail doesn't restrict IPv4 or if address belongs to jail,
+ * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail
+ * doesn't allow IPv4.  Address passed in in NBO.
  */
 static int
 _prison_check_ip4(struct prison *pr, struct in_addr *ia)
@@ -1929,10 +2856,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia != NULL, ("%s: ia is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP4))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP4)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip4 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -1945,11 +2876,93 @@
 #endif
 
 #ifdef INET6
+static int
+prison_restrict_ip6(struct prison *pr, struct in6_addr *newip6)
+{
+	int ii, ij, used;
+	struct prison *ppr;
+
+	ppr = pr->pr_parent;
+	if (!(pr->pr_flags & PR_IP6_USER)) {
+		/* This has no user settings, so just copy the parent's list. */
+		if (pr->pr_ip6s < ppr->pr_ip6s) {
+			/*
+			 * There's no room for the parent's list.  Use the
+			 * new list buffer, which is assumed to be big enough
+			 * (if it was passed).  If there's no buffer, try to
+			 * allocate one.
+			 */
+			used = 1;
+			if (newip6 == NULL) {
+				newip6 = malloc(ppr->pr_ip6s * sizeof(*newip6),
+				    M_PRISON, M_NOWAIT);
+				if (newip6 != NULL)
+					used = 0;
+			}
+			if (newip6 != NULL) {
+				pr->pr_ip6s = ppr->pr_ip6s;
+				free(pr->pr_ip6, M_PRISON);
+				pr->pr_ip6 = newip6;
+				bcopy(ppr->pr_ip6, newip6,
+				    ppr->pr_ip6s * sizeof(*newip6));
+				pr->pr_flags |= PR_IP6;
+			}
+			return (used);
+		}
+		pr->pr_ip6s = ppr->pr_ip6s;
+		if (pr->pr_ip6s > 0)
+			bcopy(ppr->pr_ip6, pr->pr_ip6,
+			    pr->pr_ip6s * sizeof(*newip6));
+		else if (pr->pr_ip6 != NULL) {
+			free(pr->pr_ip6, M_PRISON);
+			pr->pr_ip6 = NULL;
+		}
+		pr->pr_flags = (pr->pr_flags & ~PR_IP6) |
+		    (ppr->pr_flags & PR_IP6);
+	} else if (pr->pr_ip6s > 0 && (ppr->pr_flags & PR_IP6)) {
+		/* Remove addresses that aren't in the parent. */
+		for (ij = 0; ij < ppr->pr_ip6s; ij++)
+			if (IN6_ARE_ADDR_EQUAL(&pr->pr_ip6[0],
+			    &ppr->pr_ip6[ij]))
+				break;
+		if (ij == ppr->pr_ip6s)
+			bcopy(pr->pr_ip6 + 1, pr->pr_ip6,
+			    --pr->pr_ip6s * sizeof(*pr->pr_ip6));
+		for (ii = ij = 1; ii < pr->pr_ip6s; ii++) {
+			if (IN6_ARE_ADDR_EQUAL(&pr->pr_ip6[ii],
+			    &ppr->pr_ip6[0]))
+				continue;
+			for (; ij < ppr->pr_ip6s; ij++) {
+				if (qcmp_v6(&pr->pr_ip6[ii],
+				    &ppr->pr_ip6[ij]) <= 0)
+					break;
+			}
+			if (ij == ppr->pr_ip6s) {
+				pr->pr_ip6s = ii;
+				break;
+			}
+			if (qcmp_v6(&pr->pr_ip6[ii], &ppr->pr_ip6[ij]) > 0) {
+				if (ii < --pr->pr_ip6s)
+					bcopy(pr->pr_ip6 + ii + 1,
+					    pr->pr_ip6 + ii,
+					    (pr->pr_ip6s - ii) *
+					    sizeof(*pr->pr_ip6));
+				ii--;
+			}
+		}
+		if (pr->pr_ip6s == 0) {
+			free(pr->pr_ip6, M_PRISON);
+			pr->pr_ip6 = NULL;
+		}
+	}
+	return (0);
+}
+
 /*
  * Pass back primary IPv6 address for this jail.
  *
- * If not jailed return success but do not alter the address.  Caller has to
- * make sure to initialize it correctly (e.g. IN6ADDR_ANY_INIT).
+ * If not restricted return success but do not alter the address.  Caller has
+ * to make sure to initialize it correctly (e.g. IN6ADDR_ANY_INIT).
  *
  * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv6.
  */
@@ -1961,10 +2974,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP6))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP6)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip6 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -1976,13 +2993,32 @@
 }
 
 /*
+ * Return true if pr1 and pr2 have the same IPv6 address restrictions.
+ */
+int
+prison_equal_ip6(struct prison *pr1, struct prison *pr2)
+{
+	if (pr1 == pr2)
+		return (1);
+
+	sx_slock(&allprison_lock);
+	while (pr1 != &prison0 && !(pr1->pr_flags & PR_IP6_USER))
+		pr1 = pr1->pr_parent;
+	while (pr2 != &prison0 && !(pr2->pr_flags & PR_IP6_USER))
+		pr2 = pr2->pr_parent;
+	sx_sunlock(&allprison_lock);
+	return (pr1 == pr2);
+}
+
+/*
  * Make sure our (source) address is set to something meaningful to this jail.
  *
  * v6only should be set based on (inp->inp_flags & IN6P_IPV6_V6ONLY != 0)
  * when needed while binding.
  *
- * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if
- * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv6.
+ * Returns 0 if jail doesn't restrict IPv6 or if address belongs to jail,
+ * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail
+ * doesn't allow IPv6.
  */
 int
 prison_local_ip6(struct ucred *cred, struct in6_addr *ia6, int v6only)
@@ -1993,10 +3029,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP6))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP6)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip6 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -2037,10 +3077,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP6))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP6)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip6 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -2062,8 +3106,9 @@
 /*
  * Check if given address belongs to the jail referenced by cred/prison.
  *
- * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if
- * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv6.
+ * Returns 0 if jail doesn't restrict IPv6 or if address belongs to jail,
+ * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail
+ * doesn't allow IPv6.
  */
 static int
 _prison_check_ip6(struct prison *pr, struct in6_addr *ia6)
@@ -2104,10 +3149,14 @@
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 	KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__));
 
-	if (!jailed(cred))
+	pr = cred->cr_prison;
+	if (!(pr->pr_flags & PR_IP6))
 		return (0);
-	pr = cred->cr_prison;
 	mtx_lock(&pr->pr_mtx);
+	if (!(pr->pr_flags & PR_IP6)) {
+		mtx_unlock(&pr->pr_mtx);
+		return (0);
+	}
 	if (pr->pr_ip6 == NULL) {
 		mtx_unlock(&pr->pr_mtx);
 		return (EAFNOSUPPORT);
@@ -2128,34 +3177,42 @@
 int
 prison_check_af(struct ucred *cred, int af)
 {
+	struct prison *pr;
 	int error;
 
 	KASSERT(cred != NULL, ("%s: cred is NULL", __func__));
 
-
-	if (!jailed(cred))
-		return (0);
-
+	pr = cred->cr_prison;
 	error = 0;
 	switch (af)
 	{
 #ifdef INET
 	case AF_INET:
-		if (cred->cr_prison->pr_ip4 == NULL)
-			error = EAFNOSUPPORT;
+		if (pr->pr_flags & PR_IP4)
+		{
+			mtx_lock(&pr->pr_mtx);
+			if ((pr->pr_flags & PR_IP4) && pr->pr_ip4 == NULL)
+				error = EAFNOSUPPORT;
+			mtx_unlock(&pr->pr_mtx);
+		}
 		break;
 #endif
 #ifdef INET6
 	case AF_INET6:
-		if (cred->cr_prison->pr_ip6 == NULL)
-			error = EAFNOSUPPORT;
+		if (pr->pr_flags & PR_IP6)
+		{
+			mtx_lock(&pr->pr_mtx);
+			if ((pr->pr_flags & PR_IP6) && pr->pr_ip6 == NULL)
+				error = EAFNOSUPPORT;
+			mtx_unlock(&pr->pr_mtx);
+		}
 		break;
 #endif
 	case AF_LOCAL:
 	case AF_ROUTE:
 		break;
 	default:
-		if (jail_socket_unixiproute_only)
+		if (pr->pr_flags & PR_RESTRICT_SOCKET_UNIXIPROUTE)
 			error = EAFNOSUPPORT;
 	}
 	return (error);
@@ -2165,9 +3222,9 @@
  * Check if given address belongs to the jail referenced by cred (wrapper to
  * prison_check_ip[46]).
  *
- * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if
- * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow
- * the address family.  IPv4 Address passed in in NBO.
+ * Returns 0 if jail doesn't restrict the address family or if address belongs
+ * to jail, EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if
+ * the jail doesn't allow the address family.  IPv4 Address passed in in NBO.
  */
 int
 prison_if(struct ucred *cred, struct sockaddr *sa)
@@ -2199,7 +3256,7 @@
 		break;
 #endif
 	default:
-		if (jailed(cred) && jail_socket_unixiproute_only)
+		if (cred->cr_prison->pr_flags & PR_RESTRICT_SOCKET_UNIXIPROUTE)
 			error = EAFNOSUPPORT;
 	}
 	return (error);
@@ -2212,13 +3269,20 @@
 prison_check(struct ucred *cred1, struct ucred *cred2)
 {
 
-	if (jailed(cred1)) {
-		if (!jailed(cred2))
-			return (ESRCH);
-		if (cred2->cr_prison != cred1->cr_prison)
-			return (ESRCH);
-	}
+	return ((cred1->cr_prison == cred2->cr_prison ||
+	    prison_ischild(cred1->cr_prison, cred2->cr_prison)) ? 0 : ESRCH);
+}
 
+/*
+ * Return 1 if pr2 is a descendant (at any depth) of pr1, otherwise 0.
+ */
+int
+prison_ischild(struct prison *pr1, struct prison *pr2)
+{
+
+	for (pr2 = pr2->pr_parent; pr2 != NULL; pr2 = pr2->pr_parent)
+		if (pr1 == pr2)
+			return (1);
 	return (0);
 }
 
@@ -2229,7 +3293,7 @@
 jailed(struct ucred *cred)
 {
 
-	return (cred->cr_prison != NULL);
+	return (cred->cr_prison != &prison0);
 }
 
 /*
@@ -2265,12 +3329,12 @@
 	struct statfs *sp;
 	size_t len;
 
-	if (!jailed(cred) || jail_enforce_statfs == 0)
+	pr = cred->cr_prison;
+	if (pr->pr_enforce_statfs == 0)
 		return (0);
-	pr = cred->cr_prison;
 	if (pr->pr_root->v_mount == mp)
 		return (0);
-	if (jail_enforce_statfs == 2)
+	if (pr->pr_enforce_statfs == 2)
 		return (ENOENT);
 	/*
 	 * If jail's chroot directory is set to "/" we should be able to see
@@ -2300,9 +3364,9 @@
 	struct prison *pr;
 	size_t len;
 
-	if (!jailed(cred) || jail_enforce_statfs == 0)
+	pr = cred->cr_prison;
+	if (pr->pr_enforce_statfs == 0)
 		return;
-	pr = cred->cr_prison;
 	if (prison_canseemount(cred, mp) != 0) {
 		bzero(sp->f_mntonname, sizeof(sp->f_mntonname));
 		strlcpy(sp->f_mntonname, "[restricted]",
@@ -2416,6 +3480,13 @@
 	case PRIV_MQ_ADMIN:
 
 		/*
+		 * Jail operations within a jail work on child jails.
+		 */
+	case PRIV_JAIL_ATTACH:
+	case PRIV_JAIL_SET:
+	case PRIV_JAIL_REMOVE:
+
+		/*
 		 * Jail implements its own inter-process limits, so allow
 		 * root processes in jail to change scheduling on other
 		 * processes in the same jail.  Likewise for signalling.
@@ -2467,7 +3538,7 @@
 		 * setting system flags.
 		 */
 	case PRIV_VFS_SYSFLAGS:
-		if (jail_chflags_allowed)
+		if (cred->cr_prison->pr_flags & PR_ALLOW_CHFLAGS)
 			return (0);
 		else
 			return (EPERM);
@@ -2480,7 +3551,7 @@
 	case PRIV_VFS_UNMOUNT:
 	case PRIV_VFS_MOUNT_NONUSER:
 	case PRIV_VFS_MOUNT_OWNER:
-		if (jail_mount_allowed)
+		if (cred->cr_prison->pr_flags & PR_ALLOW_MOUNT)
 			return (0);
 		else
 			return (EPERM);
@@ -2503,7 +3574,7 @@
 		 * Conditionally allow creating raw sockets in jail.
 		 */
 	case PRIV_NETINET_RAW:
-		if (jail_allow_raw_sockets)
+		if (cred->cr_prison->pr_flags & PR_ALLOW_RAW_SOCKETS)
 			return (0);
 		else
 			return (EPERM);
@@ -2526,11 +3597,61 @@
 	}
 }
 
+/*
+ * Return the part of pr2's name that is relative to pr1, or the whole name
+ * if it does not directly follow.
+ */
+
+char *
+prison_name(struct prison *pr1, struct prison *pr2)
+{
+	char *name;
+
+	/* Jails see themselves as "0" (if they see themselves at all). */
+	if (pr1 == pr2)
+		return ("0");
+	name = pr2->pr_name;
+	if (prison_ischild(pr1, pr2)) {
+		/*
+		 * pr1 isn't locked (and allprison_lock may not be either)
+		 * so its length can't be counted on.  But the number of dots
+		 * can be counted on - and counted.
+		 */
+		for (; pr1 != &prison0; pr1 = pr1->pr_parent)
+			name = strchr(name, '.') + 1;
+	}
+	return (name);
+}
+
+/*
+ * Return the part of pr2's path that is relative to pr1, or the whole path
+ * if it does not directly follow.
+ */
+static char *
+prison_path(struct prison *pr1, struct prison *pr2)
+{
+	char *path1, *path2;
+	int len1;
+
+	path1 = pr1->pr_path;
+	path2 = pr2->pr_path;
+	if (!strcmp(path1, "/"))
+		return (path2);
+	len1 = strlen(path1);
+	if (strncmp(path1, path2, len1))
+		return (path2);
+	if (path2[len1] == '\0')
+		return ("/");
+	if (path2[len1] == '/')
+		return (path2 + len1);
+	return (path2);
+}
+
 static int
 sysctl_jail_list(SYSCTL_HANDLER_ARGS)
 {
 	struct xprison *xp;
-	struct prison *pr;
+	struct prison *pr, *cpr;
 #ifdef INET
 	struct in_addr *ip4 = NULL;
 	int ip4s = 0;
@@ -2539,62 +3660,60 @@
 	struct in_addr *ip6 = NULL;
 	int ip6s = 0;
 #endif
-	int error;
+	int descend, error;
 
-	if (jailed(req->td->td_ucred))
-		return (0);
-
 	xp = malloc(sizeof(*xp), M_TEMP, M_WAITOK);
+	pr = req->td->td_ucred->cr_prison;
 	error = 0;
 	sx_slock(&allprison_lock);
-	TAILQ_FOREACH(pr, &allprison, pr_list) {
+	FOREACH_PRISON_DESCENDANT(pr, cpr, descend) {
  again:
-		mtx_lock(&pr->pr_mtx);
+		mtx_lock(&cpr->pr_mtx);
 #ifdef INET
-		if (pr->pr_ip4s > 0) {
-			if (ip4s < pr->pr_ip4s) {
-				ip4s = pr->pr_ip4s;
-				mtx_unlock(&pr->pr_mtx);
+		if (cpr->pr_ip4s > 0) {
+			if (ip4s < cpr->pr_ip4s) {
+				ip4s = cpr->pr_ip4s;
+				mtx_unlock(&cpr->pr_mtx);
 				ip4 = realloc(ip4, ip4s *
 				    sizeof(struct in_addr), M_TEMP, M_WAITOK);
 				goto again;
 			}
-			bcopy(pr->pr_ip4, ip4,
-			    pr->pr_ip4s * sizeof(struct in_addr));
+			bcopy(cpr->pr_ip4, ip4,
+			    cpr->pr_ip4s * sizeof(struct in_addr));
 		}
 #endif
 #ifdef INET6
-		if (pr->pr_ip6s > 0) {
-			if (ip6s < pr->pr_ip6s) {
-				ip6s = pr->pr_ip6s;
-				mtx_unlock(&pr->pr_mtx);
+		if (cpr->pr_ip6s > 0) {
+			if (ip6s < cpr->pr_ip6s) {
+				ip6s = cpr->pr_ip6s;
+				mtx_unlock(&cpr->pr_mtx);
 				ip6 = realloc(ip6, ip6s *
 				    sizeof(struct in6_addr), M_TEMP, M_WAITOK);
 				goto again;
 			}
-			bcopy(pr->pr_ip6, ip6,
-			    pr->pr_ip6s * sizeof(struct in6_addr));
+			bcopy(cpr->pr_ip6, ip6,
+			    cpr->pr_ip6s * sizeof(struct in6_addr));
 		}
 #endif
-		if (pr->pr_ref == 0) {
-			mtx_unlock(&pr->pr_mtx);
+		if (cpr->pr_ref == 0) {
+			mtx_unlock(&cpr->pr_mtx);
 			continue;
 		}
 		bzero(xp, sizeof(*xp));
 		xp->pr_version = XPRISON_VERSION;
-		xp->pr_id = pr->pr_id;
-		xp->pr_state = pr->pr_uref > 0
+		xp->pr_id = cpr->pr_id;
+		xp->pr_state = cpr->pr_uref > 0
 		    ? PRISON_STATE_ALIVE : PRISON_STATE_DYING;
-		strlcpy(xp->pr_path, pr->pr_path, sizeof(xp->pr_path));
-		strlcpy(xp->pr_host, pr->pr_host, sizeof(xp->pr_host));
-		strlcpy(xp->pr_name, pr->pr_name, sizeof(xp->pr_name));
+		strlcpy(xp->pr_path, prison_path(pr, cpr), sizeof(xp->pr_path));
+		strlcpy(xp->pr_host, cpr->pr_host, sizeof(xp->pr_host));
+		strlcpy(xp->pr_name, prison_name(pr, cpr), sizeof(xp->pr_name));
 #ifdef INET
-		xp->pr_ip4s = pr->pr_ip4s;
+		xp->pr_ip4s = cpr->pr_ip4s;
 #endif
 #ifdef INET6
-		xp->pr_ip6s = pr->pr_ip6s;
+		xp->pr_ip6s = cpr->pr_ip6s;
 #endif
-		mtx_unlock(&pr->pr_mtx);
+		mtx_unlock(&cpr->pr_mtx);
 		error = SYSCTL_OUT(req, xp, sizeof(*xp));
 		if (error)
 			break;
@@ -2649,6 +3768,7 @@
 static void
 db_show_prison(struct prison *pr)
 {
+	int fi;
 #if defined(INET) || defined(INET6)
 	int ii;
 #endif
@@ -2659,6 +3779,7 @@
 	db_printf("prison %p:\n", pr);
 	db_printf(" jid             = %d\n", pr->pr_id);
 	db_printf(" name            = %s\n", pr->pr_name);
+	db_printf(" parent          = %p\n", pr->pr_parent);
 	db_printf(" ref             = %d\n", pr->pr_ref);
 	db_printf(" uref            = %d\n", pr->pr_uref);
 	db_printf(" path            = %s\n", pr->pr_path);
@@ -2666,10 +3787,18 @@
 	    ? pr->pr_cpuset->cs_id : -1);
 	db_printf(" root            = %p\n", pr->pr_root);
 	db_printf(" securelevel     = %d\n", pr->pr_securelevel);
+	db_printf(" child           = %p\n", LIST_FIRST(&pr->pr_children));
+	db_printf(" sibling         = %p\n", LIST_NEXT(pr, pr_sibling));
 	db_printf(" flags           = %x", pr->pr_flags);
-	if (pr->pr_flags & PR_PERSIST)
-		db_printf(" persist");
+	for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]);
+	    fi++)
+		if (pr_flag_names[fi] != NULL && (pr->pr_flags & (1 << fi)))
+			db_printf(" %s", pr_flag_names[fi]);
 	db_printf("\n");
+	db_printf(" enforce_statfs  = %d\n", pr->pr_enforce_statfs);
+#if defined(INET) || defined(INET6)
+	db_printf(" max_af_ips      = %d\n", pr->pr_max_af_ips);
+#endif
 	db_printf(" host.hostname   = %s\n", pr->pr_host);
 #ifdef INET
 	db_printf(" ip4s            = %d\n", pr->pr_ip4s);
@@ -2692,7 +3821,11 @@
 	struct prison *pr;
 
 	if (!have_addr) {
-		/* Show all prisons in the list. */
+		/*
+		 * Show all prisons in the list, and prison0 which is not
+		 * listed.
+		 */
+		db_show_prison(&prison0);
 		TAILQ_FOREACH(pr, &allprison, pr_list) {
 			db_show_prison(pr);
 			if (db_pager_quit)
@@ -2701,18 +3834,22 @@
 		return;
 	}
 
-	/* Look for a prison with the ID and with references. */
-	TAILQ_FOREACH(pr, &allprison, pr_list)
-		if (pr->pr_id == addr && pr->pr_ref > 0)
-			break;
-	if (pr == NULL)
-		/* Look again, without requiring a reference. */
+	if (addr == 0)
+		pr = &prison0;
+	else {
+		/* Look for a prison with the ID and with references. */
 		TAILQ_FOREACH(pr, &allprison, pr_list)
-			if (pr->pr_id == addr)
+			if (pr->pr_id == addr && pr->pr_ref > 0)
 				break;
-	if (pr == NULL)
-		/* Assume address points to a valid prison. */
-		pr = (struct prison *)addr;
+		if (pr == NULL)
+			/* Look again, without requiring a reference. */
+			TAILQ_FOREACH(pr, &allprison, pr_list)
+				if (pr->pr_id == addr)
+					break;
+		if (pr == NULL)
+			/* Assume address points to a valid prison. */
+			pr = (struct prison *)addr;
+	}
 	db_show_prison(pr);
 }
 
Index: sys/kern/sysv_msg.c
===================================================================
--- sys/kern/sysv_msg.c	(revision 191896)
+++ sys/kern/sysv_msg.c	(working copy)
@@ -337,7 +337,7 @@
 {
 	int error;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	if (uap->which < 0 ||
 	    uap->which >= sizeof(msgcalls)/sizeof(msgcalls[0]))
@@ -410,7 +410,7 @@
 	int rval, error, msqix;
 	register struct msqid_kernel *msqkptr;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	msqix = IPCID_TO_IX(msqid);
@@ -564,7 +564,7 @@
 
 	DPRINTF(("msgget(0x%x, 0%o)\n", key, msgflg));
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	mtx_lock(&msq_mtx);
@@ -674,7 +674,7 @@
 	register struct msg *msghdr;
 	short next;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	mtx_lock(&msq_mtx);
@@ -1012,7 +1012,7 @@
 	int msqix, error = 0;
 	short next;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	msqix = IPCID_TO_IX(msqid);
Index: sys/kern/vfs_syscalls.c
===================================================================
--- sys/kern/vfs_syscalls.c	(revision 191896)
+++ sys/kern/vfs_syscalls.c	(working copy)
@@ -164,12 +164,6 @@
 	return (0);
 }
 
-/* XXX PRISON: could be per prison flag */
-static int prison_quotas;
-#if 0
-SYSCTL_INT(_kern_prison, OID_AUTO, quotas, CTLFLAG_RW, &prison_quotas, 0, "");
-#endif
-
 /*
  * Change filesystem quotas.
  */
@@ -198,7 +192,7 @@
 
 	AUDIT_ARG(cmd, uap->cmd);
 	AUDIT_ARG(uid, uap->uid);
-	if (jailed(td->td_ucred) && !prison_quotas)
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_QUOTAS))
 		return (EPERM);
 	NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | MPSAFE | AUDITVNODE1,
 	   UIO_USERSPACE, uap->path, td);
Index: sys/kern/init_main.c
===================================================================
--- sys/kern/init_main.c	(revision 191896)
+++ sys/kern/init_main.c	(working copy)
@@ -53,6 +53,7 @@
 #include <sys/exec.h>
 #include <sys/file.h>
 #include <sys/filedesc.h>
+#include <sys/jail.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/mount.h>
@@ -436,6 +437,7 @@
 	td->td_oncpu = 0;
 	td->td_flags = TDF_INMEM|TDP_KTHREAD;
 	td->td_cpuset = cpuset_thread0();
+	prison0.pr_cpuset = cpuset_ref(td->td_cpuset);
 	p->p_peers = 0;
 	p->p_leader = p;
 
@@ -452,7 +454,7 @@
 	p->p_ucred->cr_ngroups = 1;	/* group 0 */
 	p->p_ucred->cr_uidinfo = uifind(0);
 	p->p_ucred->cr_ruidinfo = uifind(0);
-	p->p_ucred->cr_prison = NULL;	/* Don't jail it. */
+	p->p_ucred->cr_prison = &prison0;
 #ifdef VIMAGE
 	p->p_ucred->cr_vnet = LIST_FIRST(&vnet_head);
 #endif
Index: sys/kern/sysv_sem.c
===================================================================
--- sys/kern/sysv_sem.c	(revision 191896)
+++ sys/kern/sysv_sem.c	(working copy)
@@ -344,7 +344,7 @@
 {
 	int error;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	if (uap->which < 0 ||
 	    uap->which >= sizeof(semcalls)/sizeof(semcalls[0]))
@@ -583,7 +583,7 @@
 
 	DPRINTF(("call to semctl(%d, %d, %d, 0x%p)\n",
 	    semid, semnum, cmd, arg));
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	array = NULL;
@@ -855,7 +855,7 @@
 	struct ucred *cred = td->td_ucred;
 
 	DPRINTF(("semget(0x%x, %d, 0%o)\n", key, nsems, semflg));
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	mtx_lock(&sem_mtx);
@@ -982,7 +982,7 @@
 #endif
 	DPRINTF(("call to semop(%d, %p, %u)\n", semid, sops, nsops));
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	semid = IPCID_TO_IX(semid);	/* Convert back to zero origin */
Index: sys/kern/kern_proc.c
===================================================================
--- sys/kern/kern_proc.c	(revision 191896)
+++ sys/kern/kern_proc.c	(working copy)
@@ -739,8 +739,8 @@
 		/* If jailed(cred), emulate the old P_JAILED flag. */
 		if (jailed(cred)) {
 			kp->ki_flag |= P_JAILED;
-			/* If inside a jail, use 0 as a jail ID. */
-			if (!jailed(curthread->td_ucred))
+			/* If inside the jail, use 0 as a jail ID. */
+			if (cred->cr_prison != curthread->td_ucred->cr_prison)
 				kp->ki_jid = cred->cr_prison->pr_id;
 		}
 	}
Index: sys/kern/kern_linker.c
===================================================================
--- sys/kern/kern_linker.c	(revision 191896)
+++ sys/kern/kern_linker.c	(working copy)
@@ -34,6 +34,7 @@
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/systm.h>
+#include <sys/jail.h>
 #include <sys/malloc.h>
 #include <sys/sysproto.h>
 #include <sys/sysent.h>
@@ -375,7 +376,7 @@
 	int foundfile, error;
 
 	/* Refuse to load modules if securelevel raised */
-	if (securelevel > 0)
+	if (prison0.pr_securelevel > 0)
 		return (EPERM);
 
 	KLD_LOCK_ASSERT();
@@ -580,7 +581,7 @@
 	int error, i;
 
 	/* Refuse to unload modules if securelevel raised. */
-	if (securelevel > 0)
+	if (prison0.pr_securelevel > 0)
 		return (EPERM);
 
 	KLD_LOCK_ASSERT();
Index: sys/kern/sysv_shm.c
===================================================================
--- sys/kern/sysv_shm.c	(revision 191896)
+++ sys/kern/sysv_shm.c	(working copy)
@@ -303,7 +303,7 @@
 	int i;
 	int error = 0;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	mtx_lock(&Giant);
 	shmmap_s = p->p_vmspace->vm_shm;
@@ -357,7 +357,7 @@
 	int rv;
 	int error = 0;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	mtx_lock(&Giant);
 	shmmap_s = p->p_vmspace->vm_shm;
@@ -480,7 +480,7 @@
 	struct shmid_kernel *shmseg;
 	struct oshmid_ds outbuf;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	mtx_lock(&Giant);
 	shmseg = shm_find_segment_by_shmid(uap->shmid);
@@ -542,7 +542,7 @@
 	int error = 0;
 	struct shmid_kernel *shmseg;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 
 	mtx_lock(&Giant);
@@ -823,7 +823,7 @@
 	int segnum, mode;
 	int error;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	mtx_lock(&Giant);
 	mode = uap->shmflg & ACCESSPERMS;
@@ -861,7 +861,7 @@
 #if defined(__i386__) && (defined(COMPAT_FREEBSD4) || defined(COMPAT_43))
 	int error;
 
-	if (!jail_sysvipc_allowed && jailed(td->td_ucred))
+	if (!(td->td_ucred->cr_prison->pr_flags & PR_ALLOW_SYSVIPC))
 		return (ENOSYS);
 	if (uap->which < 0 ||
 	    uap->which >= sizeof(shmcalls)/sizeof(shmcalls[0]))
Index: sys/kern/vfs_mount.c
===================================================================
--- sys/kern/vfs_mount.c	(revision 191896)
+++ sys/kern/vfs_mount.c	(working copy)
@@ -1421,6 +1421,11 @@
 root_mount_done(void)
 {
 
+	/* Keep prison0's root in sync with the global rootvnode. */
+	mtx_lock(&prison0.pr_mtx);
+	prison0.pr_root = rootvnode;
+	vref(prison0.pr_root);
+	mtx_unlock(&prison0.pr_mtx);
 	/*
 	 * Use a mutex to prevent the wakeup being missed and waiting for
 	 * an extra 1 second sleep.
Index: sys/kern/kern_exit.c
===================================================================
--- sys/kern/kern_exit.c	(revision 191896)
+++ sys/kern/kern_exit.c	(working copy)
@@ -454,9 +454,8 @@
 	p->p_xstat = rv;
 	p->p_xthread = td;
 
-	/* In case we are jailed tell the prison that we are gone. */
-	if (jailed(p->p_ucred))
-		prison_proc_free(p->p_ucred->cr_prison);
+	/* Tell the prison that we are gone. */
+	prison_proc_free(p->p_ucred->cr_prison);
 
 #ifdef KDTRACE_HOOKS
 	/*
Index: sys/kern/kern_prot.c
===================================================================
--- sys/kern/kern_prot.c	(revision 191896)
+++ sys/kern/kern_prot.c	(working copy)
@@ -1262,33 +1262,25 @@
  * (securelevel >= level).  Note that the logic is inverted -- these
  * functions return EPERM on "success" and 0 on "failure".
  *
+ * Due to care taken when setting the securelevel, we know that no jail will
+ * be less secure than its parent (or the physical system), so it is sufficient
+ * to test the current jail only.
+ *
  * XXXRW: Possibly since this has to do with privilege, it should move to
  * kern_priv.c.
  */
 int
 securelevel_gt(struct ucred *cr, int level)
 {
-	int active_securelevel;
 
-	active_securelevel = securelevel;
-	KASSERT(cr != NULL, ("securelevel_gt: null cr"));
-	if (cr->cr_prison != NULL)
-		active_securelevel = imax(cr->cr_prison->pr_securelevel,
-		    active_securelevel);
-	return (active_securelevel > level ? EPERM : 0);
+	return (cr->cr_prison->pr_securelevel > level ? EPERM : 0);
 }
 
 int
 securelevel_ge(struct ucred *cr, int level)
 {
-	int active_securelevel;
 
-	active_securelevel = securelevel;
-	KASSERT(cr != NULL, ("securelevel_ge: null cr"));
-	if (cr->cr_prison != NULL)
-		active_securelevel = imax(cr->cr_prison->pr_securelevel,
-		    active_securelevel);
-	return (active_securelevel >= level ? EPERM : 0);
+	return (cr->cr_prison->pr_securelevel >= level ? EPERM : 0);
 }
 
 /*
@@ -1822,7 +1814,7 @@
 		/*
 		 * Free a prison, if any.
 		 */
-		if (jailed(cr))
+		if (cr->cr_prison != NULL)
 			prison_free(cr->cr_prison);
 #ifdef AUDIT
 		audit_cred_destroy(cr);
@@ -1857,8 +1849,7 @@
 		(caddr_t)&src->cr_startcopy));
 	uihold(dest->cr_uidinfo);
 	uihold(dest->cr_ruidinfo);
-	if (jailed(dest))
-		prison_hold(dest->cr_prison);
+	prison_hold(dest->cr_prison);
 #ifdef AUDIT
 	audit_cred_copy(src, dest);
 #endif
Index: sys/kern/kern_descrip.c
===================================================================
--- sys/kern/kern_descrip.c	(revision 191896)
+++ sys/kern/kern_descrip.c	(working copy)
@@ -2363,24 +2363,25 @@
 }
 
 /*
- * Scan all active processes to see if any of them have a current or root
- * directory of `olddp'. If so, replace them with the new mount point.
+ * Scan all active processes and prisons to see if any of them have a current
+ * or root directory of `olddp'. If so, replace them with the new mount point.
  */
 void
 mountcheckdirs(struct vnode *olddp, struct vnode *newdp)
 {
 	struct filedesc *fdp;
+	struct prison *pr;
 	struct proc *p;
 	int nrele;
 
 	if (vrefcnt(olddp) == 1)
 		return;
+	nrele = 0;
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		fdp = fdhold(p);
 		if (fdp == NULL)
 			continue;
-		nrele = 0;
 		FILEDESC_XLOCK(fdp);
 		if (fdp->fd_cdir == olddp) {
 			vref(newdp);
@@ -2392,17 +2393,40 @@
 			fdp->fd_rdir = newdp;
 			nrele++;
 		}
+		if (fdp->fd_jdir == olddp) {
+			vref(newdp);
+			fdp->fd_jdir = newdp;
+			nrele++;
+		}
 		FILEDESC_XUNLOCK(fdp);
 		fddrop(fdp);
-		while (nrele--)
-			vrele(olddp);
 	}
 	sx_sunlock(&allproc_lock);
 	if (rootvnode == olddp) {
-		vrele(rootvnode);
 		vref(newdp);
 		rootvnode = newdp;
+		nrele++;
 	}
+	mtx_lock(&prison0.pr_mtx);
+	if (prison0.pr_root == olddp) {
+		vref(newdp);
+		prison0.pr_root = newdp;
+		nrele++;
+	}
+	mtx_unlock(&prison0.pr_mtx);
+	sx_slock(&allprison_lock);
+	TAILQ_FOREACH(pr, &allprison, pr_list) {
+		mtx_lock(&pr->pr_mtx);
+		if (pr->pr_root == olddp) {
+			vref(newdp);
+			pr->pr_root = newdp;
+			nrele++;
+		}
+		mtx_unlock(&pr->pr_mtx);
+	}
+	sx_sunlock(&allprison_lock);
+	while (nrele--)
+		vrele(olddp);
 }
 
 struct filedesc_to_leader *
Index: sys/kern/kern_fork.c
===================================================================
--- sys/kern/kern_fork.c	(revision 191896)
+++ sys/kern/kern_fork.c	(working copy)
@@ -46,6 +46,7 @@
 #include <sys/sysproto.h>
 #include <sys/eventhandler.h>
 #include <sys/filedesc.h>
+#include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/kthread.h>
 #include <sys/sysctl.h>
@@ -54,7 +55,6 @@
 #include <sys/mutex.h>
 #include <sys/priv.h>
 #include <sys/proc.h>
-#include <sys/jail.h>
 #include <sys/pioctl.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
@@ -455,9 +455,8 @@
 
 	p2->p_ucred = crhold(td->td_ucred);
 
-	/* In case we are jailed tell the prison that we exist. */
-	if (jailed(p2->p_ucred))
-		prison_proc_hold(p2->p_ucred->cr_prison);
+	/* Tell the prison that we exist. */
+	prison_proc_hold(p2->p_ucred->cr_prison);
 
 	PROC_UNLOCK(p2);
 
Index: sys/kern/kern_cpuset.c
===================================================================
--- sys/kern/kern_cpuset.c	(revision 191896)
+++ sys/kern/kern_cpuset.c	(working copy)
@@ -36,6 +36,7 @@
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/sysproto.h>
+#include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
@@ -53,7 +54,6 @@
 #include <sys/limits.h>
 #include <sys/bus.h>
 #include <sys/interrupt.h>
-#include <sys/jail.h>		/* Must come after sys/proc.h */
 
 #include <vm/uma.h>
 
@@ -225,23 +225,16 @@
 
 	KASSERT(td != NULL, ("[%s:%d] td is NULL", __func__, __LINE__));
 	if (set != NULL && jailed(td->td_ucred)) {
-		struct cpuset *rset, *jset;
-		struct prison *pr;
+		struct cpuset *jset, *tset;
 
-		rset = cpuset_refroot(set);
-
-		pr = td->td_ucred->cr_prison;
-		mtx_lock(&pr->pr_mtx);
-		cpuset_ref(pr->pr_cpuset);
-		jset = pr->pr_cpuset;
-		mtx_unlock(&pr->pr_mtx);
-
-		if (jset->cs_id != rset->cs_id) {
+		jset = td->td_ucred->cr_prison->pr_cpuset;
+		for (tset = set; tset != NULL; tset = tset->cs_parent)
+			if (tset == jset)
+				break;
+		if (tset == NULL) {
 			cpuset_rel(set);
 			set = NULL;
 		}
-		cpuset_rel(jset);
-		cpuset_rel(rset);
 	}
 
 	return (set);
@@ -303,7 +296,7 @@
 /*
  * Recursively check for errors that would occur from applying mask to
  * the tree of sets starting at 'set'.  Checks for sets that would become
- * empty as well as RDONLY flags.
+ * empty as well as RDONLY flags.  Do not check jails.
  */
 static int
 cpuset_testupdate(struct cpuset *set, cpuset_t *mask)
@@ -320,14 +313,19 @@
 	CPU_COPY(&set->cs_mask, &newmask);
 	CPU_AND(&newmask, mask);
 	error = 0;
-	LIST_FOREACH(nset, &set->cs_children, cs_siblings) 
+	LIST_FOREACH(nset, &set->cs_children, cs_siblings) {
+		if (nset->cs_flags & CPU_SET_ROOT)
+			continue;
 		if ((error = cpuset_testupdate(nset, &newmask)) != 0)
 			break;
+	}
 	return (error);
 }
 
 /*
- * Applies the mask 'mask' without checking for empty sets or permissions.
+ * Apply the mask 'mask' to the cpuset and its children.  Ignore permission
+ * errors, and replace any empty sets (which may occur under jails) with their
+ * parent's mask.
  */
 static void
 cpuset_update(struct cpuset *set, cpuset_t *mask)
@@ -336,6 +334,8 @@
 
 	mtx_assert(&cpuset_lock, MA_OWNED);
 	CPU_AND(&set->cs_mask, mask);
+	if (CPU_EMPTY(&set->cs_mask))
+		CPU_COPY(mask, &set->cs_mask);
 	LIST_FOREACH(nset, &set->cs_children, cs_siblings) 
 		cpuset_update(nset, &set->cs_mask);
 
@@ -456,25 +456,14 @@
 		struct prison *pr;
 
 		sx_slock(&allprison_lock);
-		pr = prison_find(id);
+		pr = prison_find_child(curthread->td_ucred->cr_prison, id);
 		sx_sunlock(&allprison_lock);
 		if (pr == NULL)
 			return (ESRCH);
-		if (jailed(curthread->td_ucred)) {
-			if (curthread->td_ucred->cr_prison == pr) {
-				cpuset_ref(pr->pr_cpuset);
-				set = pr->pr_cpuset;
-			}
-		} else {
-			cpuset_ref(pr->pr_cpuset);
-			set = pr->pr_cpuset;
-		}
+		cpuset_ref(pr->pr_cpuset);
+		*setp = pr->pr_cpuset;
 		mtx_unlock(&pr->pr_mtx);
-		if (set) {
-			*setp = set;
-			return (0);
-		}
-		return (ESRCH);
+		return (0);
 	}
 	case CPU_WHICH_IRQ:
 		return (0);
@@ -731,21 +720,17 @@
  * In case of no error, returns the set in *setp locked with a reference.
  */
 int
-cpuset_create_root(struct thread *td, struct cpuset **setp)
+cpuset_create_root(struct prison *pr, struct cpuset **setp)
 {
 	struct cpuset *root;
 	struct cpuset *set;
 	int error;
 
-	KASSERT(td != NULL, ("[%s:%d] invalid td", __func__, __LINE__));
+	KASSERT(pr != NULL, ("[%s:%d] invalid pr", __func__, __LINE__));
 	KASSERT(setp != NULL, ("[%s:%d] invalid setp", __func__, __LINE__));
 
-	thread_lock(td);
-	root = cpuset_refroot(td->td_cpuset);
-	thread_unlock(td);
-
-	error = cpuset_create(setp, td->td_cpuset, &root->cs_mask);
-	cpuset_rel(root);
+	root = pr->pr_cpuset;
+	error = cpuset_create(setp, root, &root->cs_mask);
 	if (error)
 		return (error);
 
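In the cpuset changes above, a thread's set is accepted only if the jail's root set lies on the chain reached by following cs_parent from that set. Separately, cpuset_update() now falls back to the parent's mask when narrowing would leave a set empty. Below is a minimal userland sketch of both ideas. struct toy_cpuset and the toy_* functions are hypothetical stand-ins, not kernel code: they use a plain unsigned bitmask instead of cpuset_t, take no locks, and do not recurse into children.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy set: a CPU bitmask plus a link to the parent set. */
struct toy_cpuset {
	struct toy_cpuset	*parent;	/* NULL for the system root */
	unsigned		 mask;		/* bit n set: CPU n allowed */
};

/* Return 1 if 'anc' is 'set' itself or any ancestor of it, else 0. */
static int
toy_cpuset_within(const struct toy_cpuset *set, const struct toy_cpuset *anc)
{
	const struct toy_cpuset *t;

	assert(anc != NULL);
	for (t = set; t != NULL; t = t->parent)
		if (t == anc)
			return (1);
	return (0);
}

/*
 * Narrow one set by 'mask' (the parent's already narrowed mask).  If no
 * CPUs would remain, take the parent's mask instead of leaving it empty.
 */
static void
toy_cpuset_update(struct toy_cpuset *set, unsigned mask)
{
	set->mask &= mask;
	if (set->mask == 0)
		set->mask = mask;
}
```

As an example, take a root of 0xf, a jail set of 0x3 under it and a thread set of 0x1 under the jail. toy_cpuset_within() finds the jail on the thread's chain. Narrowing the thread set by 0x2 leaves it at 0x2 rather than empty.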
Index: sys/kern/vfs_cache.c
===================================================================
--- sys/kern/vfs_cache.c	(revision 191896)
+++ sys/kern/vfs_cache.c	(working copy)
@@ -41,6 +41,7 @@
 #include <sys/param.h>
 #include <sys/filedesc.h>
 #include <sys/fnv_hash.h>
+#include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
@@ -1078,6 +1079,7 @@
 	char *bp;
 	int error, i, slash_prefixed;
 	struct namecache *ncp;
+	struct vnode *pr_root;
 #ifdef KDTRACE_HOOKS
 	struct vnode *startvp = vp;
 #endif
@@ -1130,7 +1132,8 @@
 		buflen--;
 		slash_prefixed = 1;
 	}
-	while (vp != rdir && vp != rootvnode) {
+	pr_root = td->td_ucred->cr_prison->pr_root;
+	while (vp != rdir && vp != pr_root && vp != rootvnode) {
 		if (vp->v_vflag & VV_ROOT) {
 			if (vp->v_iflag & VI_DOOMED) {	/* forced unmount */
 				CACHE_RUNLOCK();
Index: sys/kern/kern_mib.c
===================================================================
--- sys/kern/kern_mib.c	(revision 191896)
+++ sys/kern/kern_mib.c	(working copy)
@@ -52,6 +52,7 @@
 #include <sys/mutex.h>
 #include <sys/jail.h>
 #include <sys/smp.h>
+#include <sys/sx.h>
 #include <sys/unistd.h>
 #include <sys/vimage.h>
 
@@ -228,7 +229,7 @@
 
 	pr = req->td->td_ucred->cr_prison;
 	if (pr != NULL) {
-		if (!jail_set_hostname_allowed && req->newptr)
+		if (!(pr->pr_flags & PR_ALLOW_SET_HOSTNAME) && req->newptr)
 			return (EPERM);
 		/*
 		 * Process is in jail, so make a local copy of jail
@@ -277,55 +278,43 @@
     &regression_securelevel_nonmonotonic, 0, "securelevel may be lowered");
 #endif
 
-int securelevel = -1;
-static struct mtx securelevel_mtx;
-
-MTX_SYSINIT(securelevel_lock, &securelevel_mtx, "securelevel mutex lock",
-    MTX_DEF);
-
 static int
 sysctl_kern_securelvl(SYSCTL_HANDLER_ARGS)
 {
-	struct prison *pr;
-	int error, level;
+	struct prison *pr, *cpr;
+	int descend, error, level;
 
 	pr = req->td->td_ucred->cr_prison;
 
 	/*
-	 * If the process is in jail, return the maximum of the global and
-	 * local levels; otherwise, return the global level.  Perform a
-	 * lockless read since the securelevel is an integer.
+	 * Reading the securelevel is easy, since the current jail's level
+	 * is known to be at least as secure as that of any ancestor jail.
+	 * Perform a lockless read since the securelevel is an integer.
 	 */
-	if (pr != NULL)
-		level = imax(securelevel, pr->pr_securelevel);
-	else
-		level = securelevel;
+	level = pr->pr_securelevel;
 	error = sysctl_handle_int(oidp, &level, 0, req);
 	if (error || !req->newptr)
 		return (error);
+	/* Permit update only if the new securelevel is not below the old. */
+	sx_slock(&allprison_lock);
+	mtx_lock(&pr->pr_mtx);
+	if (!regression_securelevel_nonmonotonic &&
+	    level < pr->pr_securelevel) {
+		mtx_unlock(&pr->pr_mtx);
+		sx_sunlock(&allprison_lock);
+		return (EPERM);
+	}
+	pr->pr_securelevel = level;
 	/*
-	 * Permit update only if the new securelevel exceeds the
-	 * global level, and local level if any.
+	 * Set all child jails to be at least this level, but do not lower
+	 * them (even if regression_securelevel_nonmonotonic).
 	 */
-	if (pr != NULL) {
-		mtx_lock(&pr->pr_mtx);
-		if (!regression_securelevel_nonmonotonic &&
-		    (level < imax(securelevel, pr->pr_securelevel))) {
-			mtx_unlock(&pr->pr_mtx);
-			return (EPERM);
-		}
-		pr->pr_securelevel = level;
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&securelevel_mtx);
-		if (!regression_securelevel_nonmonotonic &&
-		    (level < securelevel)) {
-			mtx_unlock(&securelevel_mtx);
-			return (EPERM);
-		}
-		securelevel = level;
-		mtx_unlock(&securelevel_mtx);
+	FOREACH_PRISON_DESCENDANT_LOCKED(pr, cpr, descend) {
+		if (cpr->pr_securelevel < level)
+			cpr->pr_securelevel = level;
 	}
+	mtx_unlock(&pr->pr_mtx);
+	sx_sunlock(&allprison_lock);
 	return (error);
 }
 
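The rewritten sysctl_kern_securelvl() refuses to lower the caller's own jail level unless regression_securelevel_nonmonotonic is set. It then raises every descendant to at least the new level without ever lowering one. The sketch below models only that rule on a hypothetical first-child/next-sibling tree. It recurses instead of using FOREACH_PRISON_DESCENDANT_LOCKED and takes no locks. All toy_* names are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical jail node: first-child / next-sibling links. */
struct toy_secjail {
	struct toy_secjail	*child;		/* first child, or NULL */
	struct toy_secjail	*sibling;	/* next sibling, or NULL */
	int			 securelevel;
};

/* Raise every descendant of 'pr' to at least 'level'; never lower one. */
static void
toy_raise_descendants(struct toy_secjail *pr, int level)
{
	struct toy_secjail *c;

	for (c = pr->child; c != NULL; c = c->sibling) {
		if (c->securelevel < level)
			c->securelevel = level;
		toy_raise_descendants(c, level);
	}
}

/*
 * Model of the new write path: reject a lower level unless 'nonmonotonic'
 * is set, store the new level, then push it down as a floor.
 */
static int
toy_set_securelevel(struct toy_secjail *pr, int level, int nonmonotonic)
{
	assert(pr != NULL);
	if (!nonmonotonic && level < pr->securelevel)
		return (EPERM);
	pr->securelevel = level;
	toy_raise_descendants(pr, level);
	return (0);
}
```

A grandchild already at level 3 keeps 3 when its grandparent is set to 2. A later nonmonotonic drop of the grandparent to 1 leaves the child at 2.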
Index: sys/kern/vfs_subr.c
===================================================================
--- sys/kern/vfs_subr.c	(revision 191896)
+++ sys/kern/vfs_subr.c	(working copy)
@@ -467,22 +467,14 @@
 		return (EPERM);
 
 	/*
-	 * If the file system was mounted outside a jail and a jailed thread
-	 * tries to access it, deny immediately.
+	 * If the file system was mounted outside the jail of the calling
+	 * thread, deny immediately.
 	 */
-	if (!jailed(mp->mnt_cred) && jailed(td->td_ucred))
+	if (mp->mnt_cred->cr_prison != td->td_ucred->cr_prison &&
+	    !prison_ischild(td->td_ucred->cr_prison, mp->mnt_cred->cr_prison))
 		return (EPERM);
 
 	/*
-	 * If the file system was mounted inside different jail that the jail of
-	 * the calling thread, deny immediately.
-	 */
-	if (jailed(mp->mnt_cred) && jailed(td->td_ucred) &&
-	    mp->mnt_cred->cr_prison != td->td_ucred->cr_prison) {
-		return (EPERM);
-	}
-
-	/*
 	 * If file system supports delegated administration, we don't check
 	 * for the PRIV_VFS_MOUNT_OWNER privilege - it will be better verified
 	 * by the file system itself.
@@ -2900,7 +2892,7 @@
 
 	db_printf("    mnt_cred = { uid=%u ruid=%u",
 	    (u_int)mp->mnt_cred->cr_uid, (u_int)mp->mnt_cred->cr_ruid);
-	if (mp->mnt_cred->cr_prison != NULL)
+	if (jailed(mp->mnt_cred))
 		db_printf(", jail=%d", mp->mnt_cred->cr_prison->pr_id);
 	db_printf(" }\n");
 	db_printf("    mnt_ref = %d\n", mp->mnt_ref);
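The mount check in vfs_subr.c now allows access when the mount belongs to the caller's own jail or to a jail beneath it. The sketch assumes prison_ischild(pr1, pr2) is true when pr2 is a strict descendant of pr1, which is what the call site implies. The struct and toy_* names are hypothetical, and the parent walk is one plausible way to implement that test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical prison node: only the parent link matters here. */
struct toy_prison {
	struct toy_prison	*parent;	/* NULL only for the root */
};

/* Return 1 if 'pr2' is a strict descendant of 'pr1', else 0. */
static int
toy_ischild(const struct toy_prison *pr1, const struct toy_prison *pr2)
{
	for (pr2 = pr2->parent; pr2 != NULL; pr2 = pr2->parent)
		if (pr2 == pr1)
			return (1);
	return (0);
}

/* Model of the new check: usable from the mount's own jail or above it. */
static int
toy_mount_access(const struct toy_prison *caller, const struct toy_prison *mnt)
{
	assert(caller != NULL && mnt != NULL);
	if (mnt != caller && !toy_ischild(caller, mnt))
		return (EPERM);
	return (0);
}
```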
Index: sys/netinet/in_pcb.c
===================================================================
--- sys/netinet/in_pcb.c	(revision 191896)
+++ sys/netinet/in_pcb.c	(working copy)
@@ -600,7 +600,7 @@
 			goto done;
 		}
 
-		if (cred == NULL || !jailed(cred)) {
+		if (cred == NULL || !(cred->cr_prison->pr_flags & PR_IP4)) {
 			laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
 			goto done;
 		}
@@ -644,7 +644,7 @@
 		struct ifnet *ifp;
 
 		/* If not jailed, use the default returned. */
-		if (cred == NULL || !jailed(cred)) {
+		if (cred == NULL || !(cred->cr_prison->pr_flags & PR_IP4)) {
 			ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa;
 			laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
 			goto done;
@@ -709,7 +709,7 @@
 		if (ia == NULL)
 			ia = ifatoia(ifa_ifwithnet(sintosa(&sain)));
 
-		if (cred == NULL || !jailed(cred)) {
+		if (cred == NULL || !(cred->cr_prison->pr_flags & PR_IP4)) {
 #if __FreeBSD_version < 800000
 			if (ia == NULL)
 				ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa;
@@ -1220,7 +1220,8 @@
 				 * Found?
 				 */
 				if (cred == NULL ||
-				    inp->inp_cred->cr_prison == cred->cr_prison)
+				    prison_equal_ip4(cred->cr_prison,
+				    inp->inp_cred->cr_prison))
 					return (inp);
 			}
 		}
@@ -1252,7 +1253,8 @@
 			LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
 				wildcard = 0;
 				if (cred != NULL &&
-				    inp->inp_cred->cr_prison != cred->cr_prison)
+				    !prison_equal_ip4(inp->inp_cred->cr_prison,
+				    cred->cr_prison))
 					continue;
 #ifdef INET6
 				/* XXX inp locking */
@@ -1333,7 +1335,7 @@
 			 * the inp here, without any checks.
 			 * Well unless both bound with SO_REUSEPORT?
 			 */
-			if (jailed(inp->inp_cred))
+			if (inp->inp_cred->cr_prison->pr_flags & PR_IP4)
 				return (inp);
 			if (tmpinp == NULL)
 				tmpinp = inp;
@@ -1378,7 +1380,7 @@
 			    (inp->inp_flags & INP_FAITH) == 0)
 				continue;
 
-			injail = jailed(inp->inp_cred);
+			injail = inp->inp_cred->cr_prison->pr_flags & PR_IP4;
 			if (injail) {
 				if (prison_check_ip4(inp->inp_cred,
 				    &laddr) != 0)
Index: sys/netinet/udp_usrreq.c
===================================================================
--- sys/netinet/udp_usrreq.c	(revision 191896)
+++ sys/netinet/udp_usrreq.c	(working copy)
@@ -988,7 +988,7 @@
 				 * Remember addr if jailed, to prevent
 				 * rebinding.
 				 */
-				if (jailed(td->td_ucred))
+				if (td->td_ucred->cr_prison->pr_flags & PR_IP4)
 					inp->inp_laddr = laddr;
 				inp->inp_lport = lport;
 				if (in_pcbinshash(inp) != 0) {
Index: sys/fs/procfs/procfs_status.c
===================================================================
--- sys/fs/procfs/procfs_status.c	(revision 191896)
+++ sys/fs/procfs/procfs_status.c	(working copy)
@@ -151,10 +151,11 @@
 		sbuf_printf(sb, ",%lu", (u_long)cr->cr_groups[i]);
 	}
 
-	if (jailed(p->p_ucred)) {
-		mtx_lock(&p->p_ucred->cr_prison->pr_mtx);
-		sbuf_printf(sb, " %s", p->p_ucred->cr_prison->pr_host);
-		mtx_unlock(&p->p_ucred->cr_prison->pr_mtx);
+	if (jailed(cr)) {
+		mtx_lock(&cr->cr_prison->pr_mtx);
+		sbuf_printf(sb, " %s",
+		    prison_name(td->td_ucred->cr_prison, cr->cr_prison));
+		mtx_unlock(&cr->cr_prison->pr_mtx);
 	} else {
 		sbuf_printf(sb, " -");
 	}
Index: sys/nfsserver/nfs_srvsock.c
===================================================================
--- sys/nfsserver/nfs_srvsock.c	(revision 191896)
+++ sys/nfsserver/nfs_srvsock.c	(working copy)
@@ -43,6 +43,7 @@
 
 #include <sys/param.h>
 #include <sys/systm.h>
+#include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
@@ -699,6 +700,8 @@
 	nd = malloc(sizeof (struct nfsrv_descript),
 		M_NFSRVDESC, M_WAITOK);
 	nd->nd_cr = crget();
+	nd->nd_cr->cr_prison = &prison0;
+	prison_hold(&prison0);
 	NFSD_LOCK();
 	nd->nd_md = nd->nd_mrep = m;
 	nd->nd_nam2 = nam;
Index: sys/compat/freebsd32/freebsd32_misc.c
===================================================================
--- sys/compat/freebsd32/freebsd32_misc.c	(revision 191896)
+++ sys/compat/freebsd32/freebsd32_misc.c	(working copy)
@@ -112,8 +112,6 @@
 CTASSERT(sizeof(struct stat32) == 96);
 CTASSERT(sizeof(struct sigaction32) == 24);
 
-extern int jail_max_af_ips;
-
 static int freebsd32_kevent_copyout(void *arg, struct kevent *kevp, int count);
 static int freebsd32_kevent_copyin(void *arg, struct kevent *kevp, int count);
 
@@ -2126,7 +2124,7 @@
 			return (error);
 		tmplen = MAXPATHLEN + MAXHOSTNAMELEN + MAXHOSTNAMELEN;
 #ifdef INET
-		if (j32.ip4s > jail_max_af_ips)
+		if (j32.ip4s > td->td_ucred->cr_prison->pr_max_af_ips)
 			return (EINVAL);
 		tmplen += j32.ip4s * sizeof(struct in_addr);
 #else
@@ -2134,7 +2132,7 @@
 			return (EINVAL);
 #endif
 #ifdef INET6
-		if (j32.ip6s > jail_max_af_ips)
+		if (j32.ip6s > td->td_ucred->cr_prison->pr_max_af_ips)
 			return (EINVAL);
 		tmplen += j32.ip6s * sizeof(struct in6_addr);
 #else
Index: sys/compat/linux/linux_mib.c
===================================================================
--- sys/compat/linux/linux_mib.c	(revision 191896)
+++ sys/compat/linux/linux_mib.c	(working copy)
@@ -57,16 +57,18 @@
 	int	pr_use_linux26;	/* flag to determine whether to use 2.6 emulation */
 };
 
+static struct linux_prison lprison0 = {
+	.pr_osname =		"Linux",
+	.pr_osrelease =		"2.6.16",
+	.pr_oss_version =	0x030600,
+	.pr_use_linux26 =	1,
+};
+
 static unsigned linux_osd_jail_slot;
 
 SYSCTL_NODE(_compat, OID_AUTO, linux, CTLFLAG_RW, 0,
 	    "Linux mode");
 
-static struct mtx osname_lock;
-MTX_SYSINIT(linux_osname, &osname_lock, "linux osname", MTX_DEF);
-
-static char	linux_osname[LINUX_MAX_UTSNAME] = "Linux";
-
 static int
 linux_sysctl_osname(SYSCTL_HANDLER_ARGS)
 {
@@ -86,9 +88,6 @@
 	    0, 0, linux_sysctl_osname, "A",
 	    "Linux kernel OS name");
 
-static char	linux_osrelease[LINUX_MAX_UTSNAME] = "2.6.16";
-static int	linux_use_linux26 = 1;
-
 static int
 linux_sysctl_osrelease(SYSCTL_HANDLER_ARGS)
 {
@@ -108,8 +107,6 @@
 	    0, 0, linux_sysctl_osrelease, "A",
 	    "Linux kernel OS release");
 
-static int	linux_oss_version = 0x030600;
-
 static int
 linux_sysctl_oss_version(SYSCTL_HANDLER_ARGS)
 {
@@ -130,69 +127,74 @@
 	    "Linux OSS version");
 
 /*
- * Returns holding the prison mutex if return non-NULL.
+ * Find a prison with Linux info.
+ * Return the Linux info and the (locked) prison.
  */
 static struct linux_prison *
-linux_get_prison(struct thread *td, struct prison **prp)
+linux_find_prison(struct prison *spr, struct prison **prp)
 {
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	KASSERT(td == curthread, ("linux_get_prison() called on !curthread"));
-	*prp = pr = td->td_ucred->cr_prison;
-	if (pr == NULL || !linux_osd_jail_slot)
-		return (NULL);
-	mtx_lock(&pr->pr_mtx);
-	lpr = osd_jail_get(pr, linux_osd_jail_slot);
-	if (lpr == NULL)
+	if (!linux_osd_jail_slot)
+		/* In case osd_register failed. */
+		spr = &prison0;
+	for (pr = spr;; pr = pr->pr_parent) {
+		mtx_lock(&pr->pr_mtx);
+		lpr = (pr == &prison0)
+		    ? &lprison0
+		    : osd_jail_get(pr, linux_osd_jail_slot);
+		if (lpr != NULL)
+			break;
 		mtx_unlock(&pr->pr_mtx);
+	}
+	*prp = pr;
 	return (lpr);
 }
 
 /*
- * Ensure a prison has its own Linux info.  The prison should be locked on
- * entrance and will be locked on exit (though it may get unlocked in the
- * interrim).
+ * Ensure a prison has its own Linux info.  If lprp is non-NULL, point it to
+ * the Linux info and return with the prison locked.
  */
 static int
 linux_alloc_prison(struct prison *pr, struct linux_prison **lprp)
 {
+	struct prison *ppr;
 	struct linux_prison *lpr, *nlpr;
 	int error;
 
 	/* If this prison already has Linux info, return that. */
 	error = 0;
-	mtx_assert(&pr->pr_mtx, MA_OWNED);
-	lpr = osd_jail_get(pr, linux_osd_jail_slot);
-	if (lpr != NULL)
+	lpr = linux_find_prison(pr, &ppr);
+	if (ppr == pr)
 		goto done;
 	/*
 	 * Allocate a new info record.  Then check again, in case something
 	 * changed during the allocation.
 	 */
-	mtx_unlock(&pr->pr_mtx);
+	mtx_unlock(&ppr->pr_mtx);
 	nlpr = malloc(sizeof(struct linux_prison), M_PRISON, M_WAITOK);
-	mtx_lock(&pr->pr_mtx);
-	lpr = osd_jail_get(pr, linux_osd_jail_slot);
-	if (lpr != NULL) {
+	lpr = linux_find_prison(pr, &ppr);
+	if (ppr == pr) {
 		free(nlpr, M_PRISON);
 		goto done;
 	}
+	/* Inherit the initial values from the ancestor. */
+	mtx_lock(&pr->pr_mtx);
 	error = osd_jail_set(pr, linux_osd_jail_slot, nlpr);
-	if (error)
-		free(nlpr, M_PRISON);
-	else {
+	if (error == 0) {
+		bcopy(lpr, nlpr, sizeof(*lpr));
 		lpr = nlpr;
-		mtx_lock(&osname_lock);
-		strncpy(lpr->pr_osname, linux_osname, LINUX_MAX_UTSNAME);
-		strncpy(lpr->pr_osrelease, linux_osrelease, LINUX_MAX_UTSNAME);
-		lpr->pr_oss_version = linux_oss_version;
-		lpr->pr_use_linux26 = linux_use_linux26;
-		mtx_unlock(&osname_lock);
+	} else {
+		free(nlpr, M_PRISON);
+		lpr = NULL;
 	}
-done:
+	mtx_unlock(&ppr->pr_mtx);
+ done:
 	if (lprp != NULL)
 		*lprp = lpr;
+	else
+		mtx_unlock(&pr->pr_mtx);
 	return (error);
 }
 
@@ -202,7 +204,6 @@
 static int
 linux_prison_create(void *obj, void *data)
 {
-	int error;
 	struct prison *pr = obj;
 	struct vfsoptlist *opts = data;
 
@@ -212,10 +213,7 @@
 	 * Inherit a prison's initial values from its parent
 	 * (different from NULL which also inherits changes).
 	 */
-	mtx_lock(&pr->pr_mtx);
-	error = linux_alloc_prison(pr, NULL);
-	mtx_unlock(&pr->pr_mtx);
-	return (error);
+	return (linux_alloc_prison(pr, NULL));
 }
 
 static int
@@ -223,8 +221,7 @@
 {
 	struct vfsoptlist *opts = data;
 	char *osname, *osrelease;
-	size_t len;
-	int error, oss_version;
+	int error, len, oss_version;
 
 	/* Check that the parameters are correct. */
 	(void)vfs_flagopt(opts, "linux", NULL, 0);
@@ -263,8 +260,7 @@
 	struct prison *pr = obj;
 	struct vfsoptlist *opts = data;
 	char *osname, *osrelease;
-	size_t len;
-	int error, gotversion, nolinux, oss_version, yeslinux;
+	int error, gotversion, len, nolinux, oss_version, yeslinux;
 
 	/* Set the parameters, which should be correct. */
 	yeslinux = vfs_flagopt(opts, "linux", NULL, 0);
@@ -281,7 +277,7 @@
 		yeslinux = 1;
 	error = vfs_copyopt(opts, "linux.oss_version", &oss_version,
 	    sizeof(oss_version));
-	gotversion = error == 0;
+	gotversion = (error == 0);
 	yeslinux |= gotversion;
 	if (nolinux) {
 		/* "nolinux": inherit the parent's Linux info. */
@@ -293,7 +289,6 @@
 		 * "linux" or "linux.*":
 		 * the prison gets its own Linux info.
 		 */
-		mtx_lock(&pr->pr_mtx);
 		error = linux_alloc_prison(pr, &lpr);
 		if (error) {
 			mtx_unlock(&pr->pr_mtx);
@@ -328,14 +323,18 @@
 linux_prison_get(void *obj, void *data)
 {
 	struct linux_prison *lpr;
+	struct prison *ppr;
 	struct prison *pr = obj;
 	struct vfsoptlist *opts = data;
 	int error, i;
 
-	mtx_lock(&pr->pr_mtx);
-	/* Tell whether this prison has its own Linux info. */
-	lpr = osd_jail_get(pr, linux_osd_jail_slot);
-	i = lpr != NULL;
+	/*
+	 * Report on the prison that actually has the Linux info.  It's
+	 * kind of bogus to give an ancestor's info, but leave it to the
+	 * caller to check the flag set below.
+	 */
+	lpr = linux_find_prison(pr, &ppr);
+	i = (ppr == pr);
 	error = vfs_setopt(opts, "linux", &i, sizeof(i));
 	if (error != 0 && error != ENOENT)
 		goto done;
@@ -343,39 +342,20 @@
 	error = vfs_setopt(opts, "nolinux", &i, sizeof(i));
 	if (error != 0 && error != ENOENT)
 		goto done;
-	/*
-	 * It's kind of bogus to give the root info, but leave it to the caller
-	 * to check the above flag.
-	 */
-	if (lpr != NULL) {
-		error = vfs_setopts(opts, "linux.osname", lpr->pr_osname);
-		if (error != 0 && error != ENOENT)
-			goto done;
-		error = vfs_setopts(opts, "linux.osrelease", lpr->pr_osrelease);
-		if (error != 0 && error != ENOENT)
-			goto done;
-		error = vfs_setopt(opts, "linux.oss_version",
-		    &lpr->pr_oss_version, sizeof(lpr->pr_oss_version));
-		if (error != 0 && error != ENOENT)
-			goto done;
-	} else {
-		mtx_lock(&osname_lock);
-		error = vfs_setopts(opts, "linux.osname", linux_osname);
-		if (error != 0 && error != ENOENT)
-			goto done;
-		error = vfs_setopts(opts, "linux.osrelease", linux_osrelease);
-		if (error != 0 && error != ENOENT)
-			goto done;
-		error = vfs_setopt(opts, "linux.oss_version",
-		    &linux_oss_version, sizeof(linux_oss_version));
-		if (error != 0 && error != ENOENT)
-			goto done;
-		mtx_unlock(&osname_lock);
-	}
+	error = vfs_setopts(opts, "linux.osname", lpr->pr_osname);
+	if (error != 0 && error != ENOENT)
+		goto done;
+	error = vfs_setopts(opts, "linux.osrelease", lpr->pr_osrelease);
+	if (error != 0 && error != ENOENT)
+		goto done;
+	error = vfs_setopt(opts, "linux.oss_version", &lpr->pr_oss_version,
+	    sizeof(lpr->pr_oss_version));
+	if (error != 0 && error != ENOENT)
+		goto done;
 	error = 0;
 
  done:
-	mtx_unlock(&pr->pr_mtx);
+	mtx_unlock(&ppr->pr_mtx);
 	return (error);
 }
 
@@ -402,11 +382,8 @@
 	if (linux_osd_jail_slot > 0) {
 		/* Copy the system linux info to any current prisons. */
 		sx_xlock(&allprison_lock);
-		TAILQ_FOREACH(pr, &allprison, pr_list) {
-			mtx_lock(&pr->pr_mtx);
+		TAILQ_FOREACH(pr, &allprison, pr_list)
 			(void)linux_alloc_prison(pr, NULL);
-			mtx_unlock(&pr->pr_mtx);
-		}
 		sx_xunlock(&allprison_lock);
 	}
 }
@@ -425,15 +402,9 @@
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		bcopy(lpr->pr_osname, dst, LINUX_MAX_UTSNAME);
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&osname_lock);
-		bcopy(linux_osname, dst, LINUX_MAX_UTSNAME);
-		mtx_unlock(&osname_lock);
-	}
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	bcopy(lpr->pr_osname, dst, LINUX_MAX_UTSNAME);
+	mtx_unlock(&pr->pr_mtx);
 }
 
 int
@@ -442,16 +413,9 @@
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		strlcpy(lpr->pr_osname, osname, LINUX_MAX_UTSNAME);
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&osname_lock);
-		strcpy(linux_osname, osname);
-		mtx_unlock(&osname_lock);
-	}
-
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	strlcpy(lpr->pr_osname, osname, LINUX_MAX_UTSNAME);
+	mtx_unlock(&pr->pr_mtx);
 	return (0);
 }
 
@@ -461,15 +425,9 @@
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		bcopy(lpr->pr_osrelease, dst, LINUX_MAX_UTSNAME);
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&osname_lock);
-		bcopy(linux_osrelease, dst, LINUX_MAX_UTSNAME);
-		mtx_unlock(&osname_lock);
-	}
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	bcopy(lpr->pr_osrelease, dst, LINUX_MAX_UTSNAME);
+	mtx_unlock(&pr->pr_mtx);
 }
 
 int
@@ -479,12 +437,9 @@
 	struct linux_prison *lpr;
 	int use26;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		use26 = lpr->pr_use_linux26;
-		mtx_unlock(&pr->pr_mtx);
-	} else
-		use26 = linux_use_linux26;
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	use26 = lpr->pr_use_linux26;
+	mtx_unlock(&pr->pr_mtx);
 	return (use26);
 }
 
@@ -494,20 +449,10 @@
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		strlcpy(lpr->pr_osrelease, osrelease, LINUX_MAX_UTSNAME);
-		lpr->pr_use_linux26 =
-		    strlen(osrelease) >= 3 && osrelease[2] == '6';
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&osname_lock);
-		strcpy(linux_osrelease, osrelease);
-		linux_use_linux26 =
-		    strlen(osrelease) >= 3 && osrelease[2] == '6';
-		mtx_unlock(&osname_lock);
-	}
-
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	strlcpy(lpr->pr_osrelease, osrelease, LINUX_MAX_UTSNAME);
+	lpr->pr_use_linux26 = strlen(osrelease) >= 3 && osrelease[2] == '6';
+	mtx_unlock(&pr->pr_mtx);
 	return (0);
 }
 
@@ -518,12 +463,9 @@
 	struct linux_prison *lpr;
 	int version;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		version = lpr->pr_oss_version;
-		mtx_unlock(&pr->pr_mtx);
-	} else
-		version = linux_oss_version;
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	version = lpr->pr_oss_version;
+	mtx_unlock(&pr->pr_mtx);
 	return (version);
 }
 
@@ -533,16 +475,9 @@
 	struct prison *pr;
 	struct linux_prison *lpr;
 
-	lpr = linux_get_prison(td, &pr);
-	if (lpr != NULL) {
-		lpr->pr_oss_version = oss_version;
-		mtx_unlock(&pr->pr_mtx);
-	} else {
-		mtx_lock(&osname_lock);
-		linux_oss_version = oss_version;
-		mtx_unlock(&osname_lock);
-	}
-
+	lpr = linux_find_prison(td->td_ucred->cr_prison, &pr);
+	lpr->pr_oss_version = oss_version;
+	mtx_unlock(&pr->pr_mtx);
 	return (0);
 }
 
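linux_find_prison() starts at a prison and walks pr_parent until it finds one with its own Linux info. prison0 always answers from the static lprison0 defaults. Below is a lock-free userland sketch of that lookup. The struct layout and toy_* names are hypothetical. Unlike the kernel version, it neither takes locks nor returns a locked prison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct linux_prison. */
struct toy_lxinfo {
	const char	*osrelease;
};

/* Hypothetical jail node: parent link plus optional private Linux info. */
struct toy_lxprison {
	struct toy_lxprison	*parent;	/* NULL only for the root */
	struct toy_lxinfo	*info;		/* NULL: inherit from an ancestor */
};

/* Root defaults, playing the role of the patch's lprison0. */
static struct toy_lxinfo toy_lxinfo0 = { "2.6.16" };

/*
 * Walk from 'spr' toward the root and return the first Linux info found,
 * storing the prison that owns it in *prp.  The root always answers with
 * the defaults, so the walk cannot fall off the top of the tree.
 */
static struct toy_lxinfo *
toy_find_lxinfo(struct toy_lxprison *spr, struct toy_lxprison **prp)
{
	struct toy_lxprison *pr;

	assert(spr != NULL && prp != NULL);
	for (pr = spr;; pr = pr->parent) {
		if (pr->parent == NULL) {
			*prp = pr;
			return (&toy_lxinfo0);
		}
		if (pr->info != NULL) {
			*prp = pr;
			return (pr->info);
		}
	}
}
```

Since the root always has an answer, callers can drop the old "no Linux info" branches. That is the same simplification the patch makes in the linux_get_*/linux_set_* functions.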
Index: sys/net/rtsock.c
===================================================================
--- sys/net/rtsock.c	(revision 191896)
+++ sys/net/rtsock.c	(working copy)
@@ -373,6 +373,8 @@
 			/*
 			 * As a last resort return the 'default' jail address.
 			 */
+			ia = ((struct sockaddr_in *)rt->rt_ifa->ifa_addr)->
+			    sin_addr;
 			if (prison_get_ip4(cred, &ia) != 0)
 				return (ESRCH);
 		}
@@ -414,6 +416,8 @@
 			/*
 			 * As a last resort return the 'default' jail address.
 			 */
+			ia6 = ((struct sockaddr_in6 *)rt->rt_ifa->ifa_addr)->
+			    sin6_addr;
 			if (prison_get_ip6(cred, &ia6) != 0)
 				return (ESRCH);
 		}
Index: sys/netinet6/in6_pcb.c
===================================================================
--- sys/netinet6/in6_pcb.c	(revision 191896)
+++ sys/netinet6/in6_pcb.c	(working copy)
@@ -666,7 +666,8 @@
 			    inp->inp_lport == lport) {
 				/* Found. */
 				if (cred == NULL ||
-				    inp->inp_cred->cr_prison == cred->cr_prison)
+				    prison_equal_ip6(cred->cr_prison,
+				    inp->inp_cred->cr_prison))
 					return (inp);
 			}
 		}
@@ -698,7 +699,8 @@
 			LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
 				wildcard = 0;
 				if (cred != NULL &&
-				    inp->inp_cred->cr_prison != cred->cr_prison)
+				    !prison_equal_ip6(cred->cr_prison,
+				    inp->inp_cred->cr_prison))
 					continue;
 				/* XXX inp locking */
 				if ((inp->inp_vflag & INP_IPV6) == 0)
@@ -838,7 +840,7 @@
 			 * the inp here, without any checks.
 			 * Well unless both bound with SO_REUSEPORT?
 			 */
-			if (jailed(inp->inp_cred))
+			if (inp->inp_cred->cr_prison->pr_flags & PR_IP6)
 				return (inp);
 			if (tmpinp == NULL)
 				tmpinp = inp;
@@ -878,7 +880,7 @@
 			if (faith && (inp->inp_flags & INP_FAITH) == 0)
 				continue;
 
-			injail = jailed(inp->inp_cred);
+			injail = inp->inp_cred->cr_prison->pr_flags & PR_IP6;
 			if (injail) {
 				if (prison_check_ip6(inp->inp_cred,
 				    laddr) != 0)
Index: sys/contrib/ipfilter/netinet/ip_nat.c
===================================================================
--- sys/contrib/ipfilter/netinet/ip_nat.c	(revision 191896)
+++ sys/contrib/ipfilter/netinet/ip_nat.c	(working copy)
@@ -662,7 +662,11 @@
 		return EPERM;
 	}
 # else
+#  if defined(__FreeBSD_version) && (__FreeBSD_version >= 500034)
+	if (securelevel_ge(curthread->td_ucred, 3) && (mode & FWRITE)) {
+#  else
 	if ((securelevel >= 3) && (mode & FWRITE)) {
+#  endif
 		return EPERM;
 	}
 # endif
Index: sys/contrib/ipfilter/netinet/ip_fil_freebsd.c
===================================================================
--- sys/contrib/ipfilter/netinet/ip_fil_freebsd.c	(revision 191896)
+++ sys/contrib/ipfilter/netinet/ip_fil_freebsd.c	(working copy)
@@ -318,8 +318,10 @@
 #  if (__FreeBSD_version >= 500024)
 struct thread *p;
 #   if (__FreeBSD_version >= 500043)
+#    define	p_cred	td_ucred
 #    define	p_uid	td_ucred->cr_ruid
 #   else
+#    define	p_cred	t_proc->p_cred
 #    define	p_uid	t_proc->p_cred->p_ruid
 #   endif
 #  else
@@ -342,7 +344,11 @@
 	SPL_INT(s);
 
 #if (BSD >= 199306) && defined(_KERNEL)
+# if (__FreeBSD_version >= 500034)
+	if (securelevel_ge(p->p_cred, 3) && (mode & FWRITE))
+# else
 	if ((securelevel >= 3) && (mode & FWRITE))
+# endif
 		return EPERM;
 #endif
 
Index: sys/security/mac_bsdextended/mac_bsdextended.c
===================================================================
--- sys/security/mac_bsdextended/mac_bsdextended.c	(revision 191896)
+++ sys/security/mac_bsdextended/mac_bsdextended.c	(working copy)
@@ -271,8 +271,8 @@
 	}
 
 	if (rule->mbr_subject.mbs_flags & MBS_PRISON_DEFINED) {
-		match = (cred->cr_prison != NULL &&
-		    cred->cr_prison->pr_id == rule->mbr_subject.mbs_prison);
+		match =
+		    (cred->cr_prison->pr_id == rule->mbr_subject.mbs_prison);
 		if (rule->mbr_subject.mbs_neg & MBS_PRISON_DEFINED)
 			match = !match;
 		if (!match)
Index: sys/sys/cpuset.h
===================================================================
--- sys/sys/cpuset.h	(revision 191896)
+++ sys/sys/cpuset.h	(working copy)
@@ -169,6 +169,7 @@
 #define CPU_SET_RDONLY  0x0002  /* No modification allowed. */
 
 extern cpuset_t *cpuset_root;
+struct prison;
 struct proc;
 struct thread;
 
@@ -176,7 +177,7 @@
 struct cpuset *cpuset_ref(struct cpuset *);
 void	cpuset_rel(struct cpuset *);
 int	cpuset_setthread(lwpid_t id, cpuset_t *);
-int	cpuset_create_root(struct thread *, struct cpuset **);
+int	cpuset_create_root(struct prison *, struct cpuset **);
 int	cpuset_setproc_update_set(struct proc *, struct cpuset *);
 
 #else
Index: sys/sys/jail.h
===================================================================
--- sys/sys/jail.h	(revision 191896)
+++ sys/sys/jail.h	(working copy)
@@ -122,8 +122,8 @@
 
 #include <sys/queue.h>
 #include <sys/sysctl.h>
-#include <sys/_lock.h>
-#include <sys/_mutex.h>
+#include <sys/lock.h>
+#include <sys/mutex.h>
 #include <sys/_task.h>
 
 #define JAIL_MAX	999999
@@ -137,8 +137,6 @@
 
 #include <sys/osd.h>
 
-struct cpuset;
-
 /*
  * This structure describes a prison.  It is pointed to by all struct
  * ucreds's of the inmates.  pr_ref keeps track of them and is used to
@@ -162,7 +160,7 @@
 	struct vnode	*pr_root;			/* (c) vnode to rdir */
 	char		 pr_host[MAXHOSTNAMELEN];	/* (p) jail hostname */
 	char		 pr_name[MAXHOSTNAMELEN];	/* (p) admin jail name */
-	void		*pr_spare;			/*     was pr_linux */
+	struct prison	*pr_parent;			/* (c) containing jail */
 	int		 pr_securelevel;		/* (p) securelevel */
 	struct task	 pr_task;			/* (d) destroy task */
 	struct mtx	 pr_mtx;
@@ -171,6 +169,14 @@
 	struct in_addr	*pr_ip4;			/* (p) v4 IPs of jail */
 	int		 pr_ip6s;			/* (p) number of v6 IPs */
 	struct in6_addr	*pr_ip6;			/* (p) v6 IPs of jail */
+	LIST_HEAD(, prison) pr_children;		/* (a) list of child jails */
+	LIST_ENTRY(prison) pr_sibling;			/* (a) next in parent's list */
+	int		 pr_prisoncount;		/* (a) number of child jails */
+	int		 pr_enforce_statfs;		/* (p) statfs permission */
+	int		 pr_max_af_ips;			/* (p) IP address limit */
+	unsigned	 pr_def_perms;			/* (p) child PR_PERM_* flags */
+	int		 pr_def_enforce_statfs;		/* (p) child statfs */
+	int		 pr_def_max_af_ips;		/* (p) child IP limit */
 };
 #endif /* _KERNEL || _WANT_PRISON */
 
@@ -179,7 +185,24 @@
  * Flag bits set via options or internally
  */
 #define	PR_PERSIST	0x00000001	/* Can exist without processes */
+#define	PR_IP4_USER	0x00000004	/* Virtualize IPv4 addresses */
+#define	PR_IP6_USER	0x00000008	/* Virtualize IPv6 addresses */
+
+#define	PR_ALLOW_SET_HOSTNAME		0x00010000
+#define	PR_ALLOW_SYSVIPC		0x00020000
+#define	PR_ALLOW_RAW_SOCKETS		0x00040000
+#define	PR_ALLOW_CHFLAGS		0x00080000
+#define	PR_ALLOW_MOUNT			0x00100000
+#define	PR_ALLOW_QUOTAS			0x00200000
+#define	PR_ALLOW_JAILS			0x00400000
+#define	PR_RESTRICT_SOCKET_UNIXIPROUTE	0x00800000
+
+#define	PR_ALLOW_ALL			0x007f0000
+#define	PR_RESTRICT_ALL			0x00800000
+
 #define	PR_REMOVE	0x01000000	/* In process of being removed */
+#define	PR_IP4		0x02000000	/* Virtualize IPv4 (maybe inherited) */
+#define	PR_IP6		0x04000000	/* Virtualize IPv6 (maybe inherited) */
 
 /*
  * OSD methods
@@ -192,17 +215,67 @@
 #define	PR_MAXMETHOD		5
 
 /*
- * Sysctl-set variables that determine global jail policy
- *
- * XXX MIB entries will need to be protected by a mutex.
+ * Lock/unlock a prison.
+ * XXX These exist not so much for general convenience as to be usable in
+ *     the FOREACH_PRISON_DESCENDANT_LOCKED macro, which cannot use the
+ *     mtx_lock()/mtx_unlock() macros as currently defined.
  */
-extern int	jail_set_hostname_allowed;
-extern int	jail_socket_unixiproute_only;
-extern int	jail_sysvipc_allowed;
-extern int	jail_getfsstat_jailrootonly;
-extern int	jail_allow_raw_sockets;
-extern int	jail_chflags_allowed;
+static __inline void
+prison_lock(struct prison *pr)
+{
+	mtx_lock(&pr->pr_mtx);
+}
 
+static __inline void
+prison_unlock(struct prison *pr)
+{
+	mtx_unlock(&pr->pr_mtx);
+}
+
+/* Traverse a prison's immediate children */
+#define	FOREACH_PRISON_CHILD(ppr, cpr)				\
+	LIST_FOREACH(cpr, &(ppr)->pr_children, pr_sibling)
+
+/*
+ * Preorder traversal of all of a prison's descendants.
+ * This ugly loop allows the macro to be followed by a single block
+ * as expected in a looping primitive.
+ */
+#define	FOREACH_PRISON_DESCENDANT(ppr, cpr, descend)			\
+	for ((cpr) = (ppr), (descend) = 1;				\
+	    ((cpr) = ((descend) && !LIST_EMPTY(&(cpr)->pr_children))	\
+	      ? LIST_FIRST(&(cpr)->pr_children)				\
+	      : (cpr) == (ppr)						\
+		? NULL							\
+		: ((descend) = LIST_NEXT(cpr, pr_sibling) != NULL)	\
+		  ? LIST_NEXT(cpr, pr_sibling)				\
+		  : (cpr)->pr_parent);)					\
+		if (!(descend))						\
+			;						\
+		else
+
+/*
+ * As above, but lock descendants on the way down and unlock on the way up.
+ */
+#define	FOREACH_PRISON_DESCENDANT_LOCKED(ppr, cpr, descend)		\
+	for ((cpr) = (ppr), (descend) = 1;				\
+	    ((cpr) = ((descend) && !LIST_EMPTY(&(cpr)->pr_children))	\
+	      ? LIST_FIRST(&(cpr)->pr_children)				\
+	      : (cpr) == (ppr)						\
+		? NULL							\
+		: (prison_unlock(cpr),					\
+		   (descend) = LIST_NEXT(cpr, pr_sibling) != NULL)	\
+		  ? LIST_NEXT(cpr, pr_sibling)				\
+		  : (cpr)->pr_parent);)					\
+		if ((descend) ? (prison_lock(cpr), 0) : 1)		\
+			;						\
+		else
+
+/*
+ * Attributes of the physical system, and the root of the jail tree.
+ */
+extern struct	prison prison0;
+
 TAILQ_HEAD(prisonlist, prison);
 extern struct	prisonlist allprison;
 extern struct	sx allprison_lock;
@@ -240,18 +313,22 @@
 void prison_enforce_statfs(struct ucred *cred, struct mount *mp,
     struct statfs *sp);
 struct prison *prison_find(int prid);
-struct prison *prison_find_name(const char *name);
+struct prison *prison_find_child(struct prison *mypr, int prid);
+struct prison *prison_find_name(struct prison *mypr, const char *name);
 void prison_free(struct prison *pr);
 void prison_free_locked(struct prison *pr);
 void prison_hold(struct prison *pr);
 void prison_hold_locked(struct prison *pr);
 void prison_proc_hold(struct prison *);
 void prison_proc_free(struct prison *);
+int prison_ischild(struct prison *pr1, struct prison *pr2);
+int prison_equal_ip4(struct prison *, struct prison *);
 int prison_get_ip4(struct ucred *cred, struct in_addr *ia);
 int prison_local_ip4(struct ucred *cred, struct in_addr *ia);
 int prison_remote_ip4(struct ucred *cred, struct in_addr *ia);
 int prison_check_ip4(struct ucred *cred, struct in_addr *ia);
 #ifdef INET6
+int prison_equal_ip6(struct prison *, struct prison *);
 int prison_get_ip6(struct ucred *, struct in6_addr *);
 int prison_local_ip6(struct ucred *, struct in6_addr *, int);
 int prison_remote_ip6(struct ucred *, struct in6_addr *);
@@ -259,6 +336,7 @@
 #endif
 int prison_check_af(struct ucred *cred, int af);
 int prison_if(struct ucred *cred, struct sockaddr *sa);
+char *prison_name(struct prison *pr1, struct prison *pr2);
 int prison_priv_check(struct ucred *cred, int priv);
 int sysctl_jail_param(struct sysctl_oid *, void *, int , struct sysctl_req *);
 
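The stackless preorder loop in FOREACH_PRISON_DESCENDANT is dense. Below is a user-space sketch that keeps the macro's control flow but replaces the <sys/queue.h> LIST accessors with plain pointer fields. All of these names are hypothetical and exist only for illustration: `struct node`, `add_child`, `FOREACH_NODE_DESCENDANT`, `collect_descendants` and `node_ischild`. `node_ischild` is an assumed reading of what `prison_ischild()` (declared above, body not shown here) does, namely an upward walk over parent pointers asking whether pr2 sits below pr1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for struct prison, reduced to the tree links.
 * first_child stands in for LIST_FIRST(&pr->pr_children) and
 * next_sibling for LIST_NEXT(pr, pr_sibling).
 */
struct node {
	const char	*name;
	struct node	*parent;	/* NULL for the root (prison0) */
	struct node	*first_child;
	struct node	*next_sibling;
};

/* Append child to the end of parent's list of children. */
static void
add_child(struct node *parent, struct node *child)
{
	struct node **pp;

	assert(parent != NULL && child != NULL);
	for (pp = &parent->first_child; *pp != NULL; pp = &(*pp)->next_sibling)
		;
	*pp = child;
	child->parent = parent;
	child->next_sibling = NULL;
}

/*
 * Same control flow as FOREACH_PRISON_DESCENDANT.  descend is 1 when cpr
 * was reached by stepping down or across (visit it) and 0 when the walk
 * has just climbed back to a parent (skip it).  No stack is needed: the
 * parent pointers are the way back up, and the walk ends on returning
 * to ppr.
 */
#define	FOREACH_NODE_DESCENDANT(ppr, cpr, descend)			\
	for ((cpr) = (ppr), (descend) = 1;				\
	    ((cpr) = ((descend) && (cpr)->first_child != NULL)		\
	      ? (cpr)->first_child					\
	      : (cpr) == (ppr)						\
		? NULL							\
		: ((descend) = (cpr)->next_sibling != NULL)		\
		  ? (cpr)->next_sibling					\
		  : (cpr)->parent);)					\
		if (!(descend))						\
			;						\
		else

/* Store the names of root's descendants (not root itself) in preorder. */
static size_t
collect_descendants(struct node *root, const char **out, size_t max)
{
	struct node *cpr;
	int descend;
	size_t n = 0;

	FOREACH_NODE_DESCENDANT(root, cpr, descend) {
		if (n < max)
			out[n] = cpr->name;
		n++;
	}
	return (n);
}

/* Upward walk in the spirit of prison_ischild(): is pr2 below pr1? */
static int
node_ischild(const struct node *pr1, const struct node *pr2)
{
	for (pr2 = pr2->parent; pr2 != NULL; pr2 = pr2->parent)
		if (pr1 == pr2)
			return (1);
	return (0);
}
```

For a tree r -> {a -> {a1}, b}, walking from r visits a, a1, b and never yields r itself. Started from a, the walk stops as soon as it climbs back to a, so the sibling b is never touched. Read the same way, the _LOCKED variant locks each prison on the way down and unlocks it only when the walk moves across or up from it, so a prison stays locked while its own subtree is being visited.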
Index: sys/sys/systm.h
===================================================================
--- sys/sys/systm.h	(revision 191896)
+++ sys/sys/systm.h	(working copy)
@@ -45,8 +45,6 @@
 #include <sys/queue.h>
 #include <sys/stdint.h>		/* for people using printf mainly */
 
-extern int securelevel;		/* system security level (see init(8)) */
-
 extern int cold;		/* nonzero if we are doing a cold boot */
 extern int rebooting;		/* boot() has been called. */
 extern const char *panicstr;	/* panic message */

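With the global securelevel removed from systm.h, callers have to ask the credential's prison instead, through securelevel_gt() and securelevel_ge(). The user-space sketch below shows the shape of such a check. Both structures are cut down to the fields the check needs, and effective_securelevel() is a hypothetical helper. The rule that a jail is held to the highest level anywhere on its path up to prison0 is an assumption made for this sketch, not necessarily how the patch stores or propagates pr_securelevel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down prison and credential (the real ones have more). */
struct prison {
	struct prison	*pr_parent;	/* NULL for the root (prison0) */
	int		 pr_securelevel;
};

struct ucred {
	struct prison	*cr_prison;	/* every cred has one; prison0 by default */
};

/*
 * Assumed policy for this sketch: a jail is bound by its own level and by
 * every ancestor's, so take the maximum along the path to the root.
 */
static int
effective_securelevel(const struct prison *pr)
{
	int level = pr->pr_securelevel;

	for (pr = pr->pr_parent; pr != NULL; pr = pr->pr_parent)
		if (pr->pr_securelevel > level)
			level = pr->pr_securelevel;
	return (level);
}

/* EPERM if the caller's effective securelevel is greater than level. */
static int
securelevel_gt(struct ucred *cr, int level)
{
	assert(cr != NULL && cr->cr_prison != NULL);
	return (effective_securelevel(cr->cr_prison) > level ? EPERM : 0);
}

/* EPERM if the caller's effective securelevel is at least level. */
static int
securelevel_ge(struct ucred *cr, int level)
{
	assert(cr != NULL && cr->cr_prison != NULL);
	return (effective_securelevel(cr->cr_prison) >= level ? EPERM : 0);
}
```

A caller that must refuse an operation once the level is above 0 would then do `error = securelevel_gt(cred, 0); if (error != 0) return (error);` instead of testing a global. Because the answer depends on the credential, a jailed process can never be checked against a lower level than its surroundings.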
--------------080507020907010909080502--


