From owner-freebsd-arch  Sun May  6 21:28:52 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id A028A37B424
	for <arch@FreeBSD.org>; Sun,  6 May 2001 21:28:00 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.3/8.11.3) with SMTP id f474Rvf44400
	for <arch@FreeBSD.org>; Mon, 7 May 2001 00:27:57 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 7 May 2001 00:27:57 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: arch@FreeBSD.org
Subject: Patch to eliminate struct pcred
Message-ID: <Pine.NEB.3.96L.1010506235944.43785B-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


Below, please find patches to eliminate struct pcred, as previously
discussed on this list.  Detailed description of the changes is below, but
the quick of it is: pcred and ucred were independent, these patches merge
both into ucred, simplifying a number of cached credential cases (such as
in sigio), and making the ucred the central structure required for almost
all subject-based authorization events.  While I did this, I took the
opportunity to clean up a number of related issues, including changing the
uid/gid helper functions substantially.  If you prefer patches via the
web, they are at:

  http://www.watson.org/~robert/pcred.diff

Any reviews welcome.  An important observation is that, in practice,
almost all pcred write operations involve a ucred copy-on-write, so this
shouldn't increase the number of ucred's in use; it does slightly expand
ucred, but also removes an indirection from the use of ucred in most
environments.  The performance impact is probably a wash.

Detailed description:

o Merge contents of struct pcred into struct ucred.  Specifically, add the
  real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.


Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

Index: compat/linprocfs/linprocfs_misc.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/linprocfs/linprocfs_misc.c,v
retrieving revision 1.24
diff -u -r1.24 linprocfs_misc.c
--- compat/linprocfs/linprocfs_misc.c	2001/05/01 08:11:51	1.24
+++ compat/linprocfs/linprocfs_misc.c	2001/05/06 00:43:51
@@ -444,14 +444,14 @@
 	PROC_LOCK(p);
 	sbuf_printf(&sb, "PPid:\t%d\n",		p->p_pptr ?
 						p->p_pptr->p_pid : 0);
-	sbuf_printf(&sb, "Uid:\t%d %d %d %d\n", p->p_cred->p_ruid,
+	sbuf_printf(&sb, "Uid:\t%d %d %d %d\n", p->p_ucred->cr_ruid,
 			                        p->p_ucred->cr_uid,
-			                        p->p_cred->p_svuid,
+			                        p->p_ucred->cr_svuid,
 			                        /* FreeBSD doesn't have fsuid */
 				                p->p_ucred->cr_uid);
-	sbuf_printf(&sb, "Gid:\t%d %d %d %d\n", p->p_cred->p_rgid,
+	sbuf_printf(&sb, "Gid:\t%d %d %d %d\n", p->p_ucred->cr_rgid,
 			                        p->p_ucred->cr_gid,
-			                        p->p_cred->p_svgid,
+			                        p->p_ucred->cr_svgid,
 			                        /* FreeBSD doesn't have fsgid */
 				                p->p_ucred->cr_gid);
 	sbuf_cat(&sb, "Groups:\t");
@@ -543,7 +543,7 @@
 	char *freepath = NULL;
 
 	p = PFIND(pfs->pfs_pid);
-	if (p == NULL || p->p_cred == NULL || p->p_ucred == NULL) {
+	if (p == NULL || p->p_ucred == NULL) {
 		if (p != NULL)
 			PROC_UNLOCK(p);
 		printf("doexelink: pid %d disappeared\n", pfs->pfs_pid);
Index: compat/linprocfs/linprocfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/linprocfs/linprocfs_vnops.c,v
retrieving revision 1.23
diff -u -r1.23 linprocfs_vnops.c
--- compat/linprocfs/linprocfs_vnops.c	2001/05/04 05:19:22	1.23
+++ compat/linprocfs/linprocfs_vnops.c	2001/05/06 00:43:51
@@ -432,7 +432,7 @@
 		procp = PFIND(pfs->pfs_pid);
 		if (procp == NULL)
 			return (ENOENT);
-		if (procp->p_cred == NULL || procp->p_ucred == NULL) {
+		if (procp->p_ucred == NULL) {
 			PROC_UNLOCK(procp);
 			return (ENOENT);
 		}
Index: compat/linux/linux_misc.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/linux/linux_misc.c,v
retrieving revision 1.101
diff -u -r1.101 linux_misc.c
--- compat/linux/linux_misc.c	2001/05/01 08:11:51	1.101
+++ compat/linux/linux_misc.c	2001/05/06 00:43:52
@@ -958,12 +958,11 @@
 	struct proc *p;
 	struct linux_setgroups_args *uap;
 {
-	struct pcred *pc;
+	struct ucred *newcred, *oldcred = p->p_ucred;
 	linux_gid_t linux_gidset[NGROUPS];
 	gid_t *bsd_gidset;
 	int ngrp, error;
 
-	pc = p->p_cred;
 	ngrp = uap->gidsetsize;
 
 	/*
@@ -972,22 +971,22 @@
 	 * Keep cr_groups[0] unchanged to prevent that.
 	 */
 
-	if ((error = suser_xxx(NULL, p, PRISON_ROOT)) != 0)
+	if ((error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0)
 		return (error);
 
 	if (ngrp >= NGROUPS)
 		return (EINVAL);
 
-	pc->pc_ucred = crcopy(pc->pc_ucred);
+	newcred = crdup(oldcred);
 	if (ngrp > 0) {
 		error = copyin((caddr_t)uap->gidset, (caddr_t)linux_gidset,
 			       ngrp * sizeof(linux_gid_t));
 		if (error)
 			return (error);
 
-		pc->pc_ucred->cr_ngroups = ngrp + 1;
+		newcred->cr_ngroups = ngrp + 1;
 
-		bsd_gidset = pc->pc_ucred->cr_groups;
+		bsd_gidset = newcred->cr_groups;
 		ngrp--;
 		while (ngrp >= 0) {
 			bsd_gidset[ngrp + 1] = linux_gidset[ngrp];
@@ -995,9 +994,13 @@
 		}
 	}
 	else
-		pc->pc_ucred->cr_ngroups = 1;
+		newcred->cr_ngroups = 1;
 
 	setsugid(p);
+
+	p->p_ucred = newcred;
+	crfree(oldcred);
+
 	return (0);
 }
 
@@ -1006,14 +1009,14 @@
 	struct proc *p;
 	struct linux_getgroups_args *uap;
 {
-	struct pcred *pc;
+	struct ucred *cred;
 	linux_gid_t linux_gidset[NGROUPS];
 	gid_t *bsd_gidset;
 	int bsd_gidsetsz, ngrp, error;
 
-	pc = p->p_cred;
-	bsd_gidset = pc->pc_ucred->cr_groups;
-	bsd_gidsetsz = pc->pc_ucred->cr_ngroups - 1;
+	cred = p->p_ucred;
+	bsd_gidset = cred->cr_groups;
+	bsd_gidsetsz = cred->cr_ngroups - 1;
 
 	/*
 	 * cr_groups[0] holds egid. Returning the whole set
Index: compat/svr4/svr4_misc.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/svr4/svr4_misc.c,v
retrieving revision 1.30
diff -u -r1.30 svr4_misc.c
--- compat/svr4/svr4_misc.c	2001/05/01 08:11:52	1.30
+++ compat/svr4/svr4_misc.c	2001/05/06 00:43:54
@@ -1283,7 +1283,7 @@
 			/*
 			 * Decrement the count of procs running with this uid.
 			 */
-			(void)chgproccnt(q->p_cred->p_uidinfo, -1, 0);
+			(void)chgproccnt(q->p_ucred->cr_ruidinfo, -1, 0);
 
 			/*
 			 * Release reference to text vnode.
@@ -1294,13 +1294,8 @@
 			/*
 			 * Free up credentials.
 			 */
-			PROC_LOCK(q);
-			if (--q->p_cred->p_refcnt == 0) {
-				crfree(q->p_ucred);
-				uifree(q->p_cred->p_uidinfo);
-				FREE(q->p_cred, M_SUBPROC);
-				q->p_cred = NULL;
-			}
+			crfree(q->p_ucred);
+			q->p_ucred = NULL;
 
 			/*
 			 * Remove unused arguments
Index: compat/svr4/svr4_sysvec.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/svr4/svr4_sysvec.c,v
retrieving revision 1.20
diff -u -r1.20 svr4_sysvec.c
--- compat/svr4/svr4_sysvec.c	2001/02/24 22:20:02	1.20
+++ compat/svr4/svr4_sysvec.c	2001/05/04 18:25:53
@@ -213,10 +213,10 @@
 	AUXARGS_ENTRY(pos, AT_FLAGS, args->flags);
 	AUXARGS_ENTRY(pos, AT_ENTRY, args->entry);
 	AUXARGS_ENTRY(pos, AT_BASE, args->base);
-	AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_cred->p_ruid);
-	AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_cred->p_svuid);
-	AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_cred->p_rgid);
-	AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_cred->p_svgid);
+	AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid);
+	AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid);
+	AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid);
+	AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid);
 	AUXARGS_ENTRY(pos, AT_NULL, 0);
 	
 	free(imgp->auxargs, M_TEMP);      
Index: ddb/db_ps.c
===================================================================
RCS file: /home/ncvs/src/sys/ddb/db_ps.c,v
retrieving revision 1.22
diff -u -r1.22 db_ps.c
--- ddb/db_ps.c	2001/03/28 09:17:49	1.22
+++ ddb/db_ps.c	2001/05/04 15:35:39
@@ -95,7 +95,7 @@
 
 		db_printf("%5d %8p %8p %4d %5d %5d %06x  %d",
 		    p->p_pid, (volatile void *)p, (void *)p->p_addr,
-		    p->p_cred ? p->p_cred->p_ruid : 0, pp->p_pid,
+		    p->p_ucred ? p->p_ucred->cr_ruid : 0, pp->p_pid,
 		    p->p_pgrp ? p->p_pgrp->pg_id : 0, p->p_flag, p->p_stat);
 		if (p->p_wchan) {
 			db_printf("  %6s %8p", p->p_wmesg, (void *)p->p_wchan);
Index: i386/linux/linux_sysvec.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/linux/linux_sysvec.c,v
retrieving revision 1.78
diff -u -r1.78 linux_sysvec.c
--- i386/linux/linux_sysvec.c	2001/05/01 08:12:52	1.78
+++ i386/linux/linux_sysvec.c	2001/05/06 00:45:18
@@ -186,10 +186,10 @@
 	AUXARGS_ENTRY(pos, AT_ENTRY, args->entry);
 	AUXARGS_ENTRY(pos, AT_BASE, args->base);
 	PROC_LOCK(imgp->proc);
-	AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_cred->p_ruid);
-	AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_cred->p_svuid);
-	AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_cred->p_rgid);
-	AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_cred->p_svgid);
+	AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid);
+	AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid);
+	AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid);
+	AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid);
 	PROC_UNLOCK(imgp->proc);
 	AUXARGS_ENTRY(pos, AT_NULL, 0);
 	
Index: kern/init_main.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/init_main.c,v
retrieving revision 1.168
diff -u -r1.168 init_main.c
--- kern/init_main.c	2001/04/29 02:44:48	1.168
+++ kern/init_main.c	2001/05/04 15:37:01
@@ -85,7 +85,6 @@
 static struct session session0;
 static struct pgrp pgrp0;
 struct	proc proc0;
-static struct pcred cred0;
 static struct procsig procsig0;
 static struct filedesc0 filedesc0;
 static struct plimit limit0;
@@ -321,12 +320,10 @@
 	callout_init(&p->p_slpcallout, 1);
 
 	/* Create credentials. */
-	cred0.p_refcnt = 1;
-	cred0.p_uidinfo = uifind(0);
-	p->p_cred = &cred0;
 	p->p_ucred = crget();
 	p->p_ucred->cr_ngroups = 1;	/* group 0 */
 	p->p_ucred->cr_uidinfo = uifind(0);
+	p->p_ucred->cr_ruidinfo = uifind(0);
 	p->p_ucred->cr_prison = NULL;	/* Don't jail it. */
 
 	/* Create procsig. */
@@ -380,7 +377,7 @@
 	/*
 	 * Charge root for one process.
 	 */
-	(void)chgproccnt(cred0.p_uidinfo, 1, 0);
+	(void)chgproccnt(p->p_ucred->cr_ruidinfo, 1, 0);
 }
 SYSINIT(p0init, SI_SUB_INTRINSIC, SI_ORDER_FIRST, proc0_init, NULL)
 
Index: kern/kern_acct.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_acct.c,v
retrieving revision 1.33
diff -u -r1.33 kern_acct.c
--- kern/kern_acct.c	2001/05/01 08:12:55	1.33
+++ kern/kern_acct.c	2001/05/06 00:45:31
@@ -222,8 +222,8 @@
 	acct.ac_io = encode_comp_t(r->ru_inblock + r->ru_oublock, 0);
 
 	/* (6) The UID and GID of the process */
-	acct.ac_uid = p->p_cred->p_ruid;
-	acct.ac_gid = p->p_cred->p_rgid;
+	acct.ac_uid = p->p_ucred->cr_ruid;
+	acct.ac_gid = p->p_ucred->cr_rgid;
 
 	/* (7) The terminal from which the process was started */
 	if ((p->p_flag & P_CONTROLT) && p->p_pgrp->pg_session->s_ttyp)
Index: kern/kern_descrip.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.100
diff -u -r1.100 kern_descrip.c
--- kern/kern_descrip.c	2001/05/01 08:12:55	1.100
+++ kern/kern_descrip.c	2001/05/06 00:45:32
@@ -525,8 +525,6 @@
 	sigio->sio_pgid = pgid;
 	crhold(curproc->p_ucred);
 	sigio->sio_ucred = curproc->p_ucred;
-	/* It would be convenient if p_ruid was in ucred. */
-	sigio->sio_ruid = curproc->p_cred->p_ruid;
 	sigio->sio_myref = sigiop;
 	s = splhigh();
 	*sigiop = sigio;
Index: kern/kern_exec.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_exec.c,v
retrieving revision 1.126
diff -u -r1.126 kern_exec.c
--- kern/kern_exec.c	2001/05/01 08:12:56	1.126
+++ kern/kern_exec.c	2001/05/06 16:25:06
@@ -104,8 +104,9 @@
 	register struct execve_args *uap;
 {
 	struct nameidata nd, *ndp;
+	struct ucred *oldcred = p->p_ucred, *newcred;
 	register_t *stack_base;
-	int error, len, i;
+	int error, len, i, intrace;
 	struct image_params image_params, *imgp;
 	struct vattr attr;
 	int (*img_first) __P((struct image_params *));
@@ -272,23 +273,31 @@
 		p->p_flag &= ~P_PPWAIT;
 		wakeup((caddr_t)p->p_pptr);
 	}
+	intrace = p->p_flag & P_TRACED;
+	PROC_UNLOCK(p);
 
 	/*
+	 * XXX: Note, the whole execve() is incredibly racey right now
+	 * with regards to debugging and privilege/credential management.
+	 * This MUST be fixed prior to any release.
+	 */
+
+	/*
 	 * Implement image setuid/setgid.
 	 *
 	 * Don't honor setuid/setgid if the filesystem prohibits it or if
 	 * the process is being traced.
 	 */
-	if ((((attr.va_mode & VSUID) && p->p_ucred->cr_uid != attr.va_uid) ||
-	     ((attr.va_mode & VSGID) && p->p_ucred->cr_gid != attr.va_gid)) &&
-	    (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 &&
-	    (p->p_flag & P_TRACED) == 0) {
+	newcred = NULL;
+	if ((((attr.va_mode & VSUID) && oldcred->cr_uid != attr.va_uid) ||
+	     ((attr.va_mode & VSGID) && oldcred->cr_gid != attr.va_gid)) &&
+	    (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && intrace == 0) {
 		PROC_UNLOCK(p);
 		/*
 		 * Turn off syscall tracing for set-id programs, except for
 		 * root.
 		 */
-		if (p->p_tracep && suser(p)) {
+		if (p->p_tracep && suser_xxx(oldcred, NULL, PRISON_ROOT)) {
 			p->p_traceflag = 0;
 			vrele(p->p_tracep);
 			p->p_tracep = NULL;
@@ -296,25 +305,42 @@
 		/*
 		 * Set the new credentials.
 		 */
-		p->p_ucred = crcopy(p->p_ucred);
+		newcred = crdup(p->p_ucred);
 		if (attr.va_mode & VSUID)
-			change_euid(p, attr.va_uid);
+			change_euid(newcred, attr.va_uid);
 		if (attr.va_mode & VSGID)
-			p->p_ucred->cr_gid = attr.va_gid;
+			change_egid(newcred, attr.va_gid);
 		setsugid(p);
 		setugidsafety(p);
 	} else {
-		if (p->p_ucred->cr_uid == p->p_cred->p_ruid &&
-		    p->p_ucred->cr_gid == p->p_cred->p_rgid)
-			p->p_flag &= ~P_SUGID;
+		if (oldcred->cr_uid == oldcred->cr_ruid &&
+		    oldcred->cr_gid == oldcred->cr_rgid)
+			p->p_flag &= ~P_SUGID;	/* XXX locking */
 		PROC_UNLOCK(p);
 	}
 
 	/*
 	 * Implement correct POSIX saved-id behavior.
+	 *
+	 * XXX: determine whether tests and sets should occur on old or
+	 * new credentials.
 	 */
-	p->p_cred->p_svuid = p->p_ucred->cr_uid;
-	p->p_cred->p_svgid = p->p_ucred->cr_gid;
+	if (p->p_ucred->cr_svuid != p->p_ucred->cr_uid ||
+	    p->p_ucred->cr_svgid != p->p_ucred->cr_gid) {
+		if (newcred != NULL)
+			newcred = crdup(p->p_ucred);
+
+		change_svuid(newcred, p->p_ucred->cr_uid);
+		change_svgid(newcred, p->p_ucred->cr_gid);
+	}
+
+	if (newcred != NULL) {
+		struct ucred *oldcred;
+
+		oldcred = p->p_ucred;
+		p->p_ucred = newcred;
+		crfree(oldcred);
+	}
 
 	/*
 	 * Store the vp for use in procfs
Index: kern/kern_exit.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_exit.c,v
retrieving revision 1.126
diff -u -r1.126 kern_exit.c
--- kern/kern_exit.c	2001/05/04 16:13:28	1.126
+++ kern/kern_exit.c	2001/05/06 00:49:14
@@ -514,7 +514,7 @@
 			/*
 			 * Decrement the count of procs running with this uid.
 			 */
-			(void)chgproccnt(p->p_cred->p_uidinfo, -1, 0);
+			(void)chgproccnt(p->p_ucred->cr_ruidinfo, -1, 0);
 
 			/*
 			 * Release reference to text vnode
@@ -539,12 +539,8 @@
 			/*
 			 * Free up credentials.
 			 */
-			if (--p->p_cred->p_refcnt == 0) {
-				crfree(p->p_ucred);
-				uifree(p->p_cred->p_uidinfo);
-				FREE(p->p_cred, M_SUBPROC);
-				p->p_cred = NULL;
-			}
+			crfree(p->p_ucred);
+			p->p_ucred = NULL;
 
 			/*
 			 * Remove unused arguments
Index: kern/kern_fork.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.110
diff -u -r1.110 kern_fork.c
--- kern/kern_fork.c	2001/03/28 11:52:53	1.110
+++ kern/kern_fork.c	2001/05/04 16:34:35
@@ -257,7 +257,7 @@
 	 * exceed the limit. The variable nprocs is the current number of
 	 * processes, maxproc is the limit.
 	 */
-	uid = p1->p_cred->p_ruid;
+	uid = p1->p_ucred->cr_ruid;
 	if ((nprocs >= maxproc - 1 && uid != 0) || nprocs >= maxproc) {
 		tablefull("proc");
 		return (EAGAIN);
@@ -272,7 +272,7 @@
 	 * Increment the count of procs running with this uid. Don't allow
 	 * a nonprivileged user to exceed their current limit.
 	 */
-	ok = chgproccnt(p1->p_cred->p_uidinfo, 1,
+	ok = chgproccnt(p1->p_ucred->cr_ruidinfo, 1,
 		(uid != 0) ? p1->p_rlimit[RLIMIT_NPROC].rlim_cur : 0);
 	if (!ok) {
 		/*
@@ -408,15 +408,9 @@
 	 * We start off holding one spinlock after fork: sched_lock.
 	 */
 	p2->p_spinlocks = 1;
-	PROC_UNLOCK(p2);
-	MALLOC(p2->p_cred, struct pcred *, sizeof(struct pcred),
-	    M_SUBPROC, M_WAITOK);
-	PROC_LOCK(p2);
 	PROC_LOCK(p1);
-	bcopy(p1->p_cred, p2->p_cred, sizeof(*p2->p_cred));
-	p2->p_cred->p_refcnt = 1;
 	crhold(p1->p_ucred);
-	uihold(p1->p_cred->p_uidinfo);
+	p2->p_ucred = p1->p_ucred;
 
 	if (p2->p_args)
 		p2->p_args->ar_ref++;
Index: kern/kern_ktrace.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_ktrace.c,v
retrieving revision 1.52
diff -u -r1.52 kern_ktrace.c
--- kern/kern_ktrace.c	2001/05/01 08:12:56	1.52
+++ kern/kern_ktrace.c	2001/05/06 00:45:34
@@ -531,17 +531,17 @@
 ktrcanset(callp, targetp)
 	struct proc *callp, *targetp;
 {
-	register struct pcred *caller = callp->p_cred;
-	register struct pcred *target = targetp->p_cred;
+	struct ucred *callcr = callp->p_ucred;
+	struct ucred *targetcr = targetp->p_ucred;
 
-	if (prison_check(callp->p_ucred, targetp->p_ucred))
+	if (prison_check(callcr, targetcr))
 		return (0);
-	if ((caller->pc_ucred->cr_uid == target->p_ruid &&
-	     target->p_ruid == target->p_svuid &&
-	     caller->p_rgid == target->p_rgid &&	/* XXX */
-	     target->p_rgid == target->p_svgid &&
+	if ((callcr->cr_uid == targetcr->cr_ruid &&
+	     targetcr->cr_ruid == targetcr->cr_svuid &&
+	     callcr->cr_rgid == targetcr->cr_rgid &&	/* XXX */
+	     targetcr->cr_rgid == targetcr->cr_svgid &&
 	     (targetp->p_traceflag & KTRFAC_ROOT) == 0) ||
-	     caller->pc_ucred->cr_uid == 0)
+	     !suser_xxx(callcr, NULL, PRISON_ROOT))
 		return (1);
 
 	return (0);
Index: kern/kern_proc.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.93
diff -u -r1.93 kern_proc.c
--- kern/kern_proc.c	2001/05/01 08:12:57	1.93
+++ kern/kern_proc.c	2001/05/06 00:45:35
@@ -424,15 +424,15 @@
 	kp->ki_textvp = p->p_textvp;
 	kp->ki_fd = p->p_fd;
 	kp->ki_vmspace = p->p_vmspace;
-	if (p->p_cred) {
-		kp->ki_uid = p->p_cred->pc_ucred->cr_uid;
-		kp->ki_ruid = p->p_cred->p_ruid;
-		kp->ki_svuid = p->p_cred->p_svuid;
-		kp->ki_ngroups = p->p_cred->pc_ucred->cr_ngroups;
-		bcopy(p->p_cred->pc_ucred->cr_groups, kp->ki_groups,
+	if (p->p_ucred) {
+		kp->ki_uid = p->p_ucred->cr_uid;
+		kp->ki_ruid = p->p_ucred->cr_ruid;
+		kp->ki_svuid = p->p_ucred->cr_svuid;
+		kp->ki_ngroups = p->p_ucred->cr_ngroups;
+		bcopy(p->p_ucred->cr_groups, kp->ki_groups,
 		    NGROUPS * sizeof(gid_t));
-		kp->ki_rgid = p->p_cred->p_rgid;
-		kp->ki_svgid = p->p_cred->p_svgid;
+		kp->ki_rgid = p->p_ucred->cr_rgid;
+		kp->ki_svgid = p->p_ucred->cr_svgid;
 	}
 	if (p->p_procsig) {
 		kp->ki_sigignore = p->p_procsig->ps_sigignore;
@@ -653,7 +653,7 @@
 
 			case KERN_PROC_RUID:
 				if (p->p_ucred == NULL || 
-				    p->p_cred->p_ruid != (uid_t)name[0])
+				    p->p_ucred->cr_ruid != (uid_t)name[0])
 					continue;
 				break;
 			}
Index: kern/kern_prot.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_prot.c,v
retrieving revision 1.89
diff -u -r1.89 kern_prot.c
--- kern/kern_prot.c	2001/05/01 08:12:57	1.89
+++ kern/kern_prot.c	2001/05/06 00:45:35
@@ -210,7 +210,7 @@
 	struct getuid_args *uap;
 {
 
-	p->p_retval[0] = p->p_cred->p_ruid;
+	p->p_retval[0] = p->p_ucred->cr_ruid;
 #if defined(COMPAT_43) || defined(COMPAT_SUNOS)
 	p->p_retval[1] = p->p_ucred->cr_uid;
 #endif
@@ -253,7 +253,7 @@
 	struct getgid_args *uap;
 {
 
-	p->p_retval[0] = p->p_cred->p_rgid;
+	p->p_retval[0] = p->p_ucred->cr_rgid;
 #if defined(COMPAT_43) || defined(COMPAT_SUNOS)
 	p->p_retval[1] = p->p_ucred->cr_groups[0];
 #endif
@@ -293,18 +293,18 @@
 	struct proc *p;
 	register struct	getgroups_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *cred = p->p_ucred;
 	register u_int ngrp;
 	int error;
 
 	if ((ngrp = uap->gidsetsize) == 0) {
-		p->p_retval[0] = pc->pc_ucred->cr_ngroups;
+		p->p_retval[0] = cred->cr_ngroups;
 		return (0);
 	}
-	if (ngrp < pc->pc_ucred->cr_ngroups)
+	if (ngrp < cred->cr_ngroups)
 		return (EINVAL);
-	ngrp = pc->pc_ucred->cr_ngroups;
-	if ((error = copyout((caddr_t)pc->pc_ucred->cr_groups,
+	ngrp = cred->cr_ngroups;
+	if ((error = copyout((caddr_t)cred->cr_groups,
 	    (caddr_t)uap->gidset, ngrp * sizeof(gid_t))))
 		return (error);
 	p->p_retval[0] = ngrp;
@@ -427,7 +427,7 @@
 	struct proc *p;
 	struct setuid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	struct ucred *oldcred = p->p_ucred, *newcred;
 	register uid_t uid;
 	int error;
 
@@ -449,16 +449,17 @@
 	 * 3: Change euid last. (after tests in #2 for "appropriate privs")
 	 */
 	uid = uap->uid;
-	if (uid != pc->p_ruid &&		/* allow setuid(getuid()) */
+	if (uid != oldcred->cr_ruid &&		/* allow setuid(getuid()) */
 #ifdef _POSIX_SAVED_IDS
-	    uid != pc->p_svuid &&		/* allow setuid(saved gid) */
+	    uid != oldcred->cr_svuid &&		/* allow setuid(saved gid) */
 #endif
 #ifdef POSIX_APPENDIX_B_4_2_2	/* Use BSD-compat clause from B.4.2.2 */
-	    uid != pc->pc_ucred->cr_uid &&	/* allow setuid(geteuid()) */
+	    uid != oldcred->cr_uid &&		/* allow setuid(geteuid()) */
 #endif
-	    (error = suser_xxx(0, p, PRISON_ROOT)))
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)))
 		return (error);
 
+	newcred = crdup(oldcred);
 #ifdef _POSIX_SAVED_IDS
 	/*
 	 * Do we have "appropriate privileges" (are we root or uid == euid)
@@ -466,16 +467,16 @@
 	 */
 	if (
 #ifdef POSIX_APPENDIX_B_4_2_2	/* Use the clause from B.4.2.2 */
-	    uid == pc->pc_ucred->cr_uid ||
+	    uid == oldcred->cr_uid ||
 #endif
-	    suser_xxx(0, p, PRISON_ROOT) == 0) /* we are using privs */
+	    suser_xxx(oldcred, NULL, PRISON_ROOT) == 0) /* we are using privs */
 #endif
 	{
 		/*
 		 * Set the real uid and transfer proc count to new user.
 		 */
-		if (uid != pc->p_ruid) {
-			change_ruid(p, uid);
+		if (uid != oldcred->cr_ruid) {
+			change_ruid(newcred, uid);
 			setsugid(p);
 		}
 		/*
@@ -485,8 +486,8 @@
 		 * the security of seteuid() depends on it.  B.4.2.2 says it
 		 * is important that we should do this.
 		 */
-		if (pc->p_svuid != uid) {
-			pc->p_svuid = uid;
+		if (uid != oldcred->cr_svuid) {
+			change_svuid(newcred, uid);
 			setsugid(p);
 		}
 	}
@@ -495,10 +496,12 @@
 	 * In all permitted cases, we are changing the euid.
 	 * Copy credentials so other references do not see our changes.
 	 */
-	if (pc->pc_ucred->cr_uid != uid) {
-		change_euid(p, uid);
+	if (uid != oldcred->cr_uid) {
+		change_euid(newcred, uid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -513,23 +516,31 @@
 	struct proc *p;
 	struct seteuid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register uid_t euid;
 	int error;
 
 	euid = uap->euid;
-	if (euid != pc->p_ruid &&		/* allow seteuid(getuid()) */
-	    euid != pc->p_svuid &&		/* allow seteuid(saved uid) */
-	    (error = suser_xxx(0, p, PRISON_ROOT)))
+	/*
+	 * The new effective uid must equal the current real or saved
+	 * uid.  Appropriate privilege may override this restriction.
+	 */
+	if (euid != oldcred->cr_ruid &&		/* allow seteuid(getuid()) */
+	    euid != oldcred->cr_svuid &&	/* allow seteuid(saved uid) */
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)))
 		return (error);
+
 	/*
 	 * Everything's okay, do it.  Copy credentials so other references do
 	 * not see our changes.
 	 */
-	if (pc->pc_ucred->cr_uid != euid) {
-		change_euid(p, euid);
+	newcred = crdup(oldcred);
+	if (oldcred->cr_uid != euid) {
+		change_euid(newcred, euid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -544,7 +555,7 @@
 	struct proc *p;
 	struct setgid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register gid_t gid;
 	int error;
 
@@ -560,16 +571,17 @@
 	 * For notes on the logic here, see setuid() above.
 	 */
 	gid = uap->gid;
-	if (gid != pc->p_rgid &&		/* allow setgid(getgid()) */
+	if (gid != oldcred->cr_rgid &&		/* allow setgid(getgid()) */
 #ifdef _POSIX_SAVED_IDS
-	    gid != pc->p_svgid &&		/* allow setgid(saved gid) */
+	    gid != oldcred->cr_svgid &&		/* allow setgid(saved gid) */
 #endif
 #ifdef POSIX_APPENDIX_B_4_2_2	/* Use BSD-compat clause from B.4.2.2 */
-	    gid != pc->pc_ucred->cr_groups[0] && /* allow setgid(getegid()) */
+	    gid != oldcred->cr_groups[0] && /* allow setgid(getegid()) */
 #endif
-	    (error = suser_xxx(0, p, PRISON_ROOT)))
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)))
 		return (error);
 
+	newcred = crdup(oldcred);
 #ifdef _POSIX_SAVED_IDS
 	/*
 	 * Do we have "appropriate privileges" (are we root or gid == egid)
@@ -577,16 +589,16 @@
 	 */
 	if (
 #ifdef POSIX_APPENDIX_B_4_2_2	/* use the clause from B.4.2.2 */
-	    gid == pc->pc_ucred->cr_groups[0] ||
+	    gid == oldcred->cr_groups[0] ||
 #endif
-	    suser_xxx(0, p, PRISON_ROOT) == 0) /* we are using privs */
+	    suser_xxx(oldcred, NULL, PRISON_ROOT) == 0) /* we are using privs */
 #endif
 	{
 		/*
 		 * Set real gid
 		 */
-		if (pc->p_rgid != gid) {
-			pc->p_rgid = gid;
+		if (oldcred->cr_rgid != gid) {
+			change_rgid(newcred, gid);
 			setsugid(p);
 		}
 		/*
@@ -596,8 +608,8 @@
 		 * the security of setegid() depends on it.  B.4.2.2 says it
 		 * is important that we should do this.
 		 */
-		if (pc->p_svgid != gid) {
-			pc->p_svgid = gid;
+		if (oldcred->cr_svgid != gid) {
+			change_svgid(newcred, gid);
 			setsugid(p);
 		}
 	}
@@ -605,11 +617,12 @@
 	 * In all cases permitted cases, we are changing the egid.
 	 * Copy credentials so other references do not see our changes.
 	 */
-	if (pc->pc_ucred->cr_groups[0] != gid) {
-		pc->pc_ucred = crcopy(pc->pc_ucred);
-		pc->pc_ucred->cr_groups[0] = gid;
+	if (oldcred->cr_groups[0] != gid) {
+		change_egid(newcred, gid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -624,20 +637,27 @@
 	struct proc *p;
 	struct setegid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register gid_t egid;
 	int error;
 
 	egid = uap->egid;
-	if (egid != pc->p_rgid &&		/* allow setegid(getgid()) */
-	    egid != pc->p_svgid &&		/* allow setegid(saved gid) */
-	    (error = suser_xxx(0, p, PRISON_ROOT)))
+	/*
+	 * The new effective gid must be equal to either the current real or
+	 * saved gid.  Appropriate privilege may override this restriction.
+	 */
+	if (egid != oldcred->cr_rgid &&		/* allow setegid(getgid()) */
+	    egid != oldcred->cr_svgid &&	/* allow setegid(saved gid) */
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)))
 		return (error);
-	if (pc->pc_ucred->cr_groups[0] != egid) {
-		pc->pc_ucred = crcopy(pc->pc_ucred);
-		pc->pc_ucred->cr_groups[0] = egid;
+	
+	newcred = crdup(oldcred);
+	if (oldcred->cr_groups[0] != egid) {
+		change_egid(newcred, egid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -653,11 +673,11 @@
 	struct proc *p;
 	struct setgroups_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register u_int ngrp;
 	int error;
 
-	if ((error = suser_xxx(0, p, PRISON_ROOT)))
+	if ((error = suser_xxx(oldcred, NULL, PRISON_ROOT)))
 		return (error);
 	ngrp = uap->gidsetsize;
 	if (ngrp > NGROUPS)
@@ -666,7 +686,7 @@
 	 * XXX A little bit lazy here.  We could test if anything has
 	 * changed before crcopy() and setting P_SUGID.
 	 */
-	pc->pc_ucred = crcopy(pc->pc_ucred);
+	newcred = crdup(oldcred);
 	if (ngrp < 1) {
 		/*
 		 * setgroups(0, NULL) is a legitimate way of clearing the
@@ -674,14 +694,18 @@
 		 * have the egid in the groups[0]).  We risk security holes
 		 * when running non-BSD software if we do not do the same.
 		 */
-		pc->pc_ucred->cr_ngroups = 1;
+		newcred->cr_ngroups = 1;
 	} else {
 		if ((error = copyin((caddr_t)uap->gidset,
-		    (caddr_t)pc->pc_ucred->cr_groups, ngrp * sizeof(gid_t))))
+		    (caddr_t)newcred->cr_groups, ngrp * sizeof(gid_t)))) {
+			crfree(newcred);
 			return (error);
-		pc->pc_ucred->cr_ngroups = ngrp;
+		}
+		newcred->cr_ngroups = ngrp;
 	}
 	setsugid(p);
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -697,31 +721,52 @@
 	register struct proc *p;
 	struct setreuid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register uid_t ruid, euid;
 	int error;
 
 	ruid = uap->ruid;
 	euid = uap->euid;
-	if (((ruid != (uid_t)-1 && ruid != pc->p_ruid && ruid != pc->p_svuid) ||
-	     (euid != (uid_t)-1 && euid != pc->pc_ucred->cr_uid &&
-	     euid != pc->p_ruid && euid != pc->p_svuid)) &&
-	    (error = suser_xxx(0, p, PRISON_ROOT)) != 0)
+	/*
+	 * If an real uid update is requested, the requested real uid must
+	 * be equal to the current real or saved uid.  If an effective uid
+	 * update is requested, the requested euid must be equal to the
+	 * current effective uid, real uid, or saved uid.  Appropriate
+	 * privilege may override these restrictions.
+	 */
+	if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid &&
+	      ruid != oldcred->cr_svuid) ||
+	     (euid != (uid_t)-1 && euid != oldcred->cr_uid &&
+	      euid != oldcred->cr_ruid && euid != oldcred->cr_svuid)) &&
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0)
 		return (error);
 
-	if (euid != (uid_t)-1 && pc->pc_ucred->cr_uid != euid) {
-		change_euid(p, euid);
+	newcred = crdup(oldcred);
+	if (euid != (uid_t)-1 && oldcred->cr_uid != euid) {
+		change_euid(newcred, euid);
 		setsugid(p);
 	}
-	if (ruid != (uid_t)-1 && pc->p_ruid != ruid) {
-		change_ruid(p, ruid);
+	if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) {
+		change_ruid(newcred, ruid);
 		setsugid(p);
 	}
-	if ((ruid != (uid_t)-1 || pc->pc_ucred->cr_uid != pc->p_ruid) &&
-	    pc->p_svuid != pc->pc_ucred->cr_uid) {
-		pc->p_svuid = pc->pc_ucred->cr_uid;
+	/*
+	 * XXX: What is this intended to accomplish?  In which cases should
+	 * it be looking at the old values, and in which, the new values?
+	 *
+	 * Note current behavior is:
+	 * If the ruid update is requested (even if the ruid is not changed)
+	 * or the euid is not equal to the value of the ruid, a difference
+	 * in the svuid and the euid will result in the svuid being
+	 * updated to the new value of the euid.
+	 */
+	if ((ruid != (uid_t)-1 || newcred->cr_uid != newcred->cr_ruid) &&
+	    newcred->cr_svuid != newcred->cr_uid) {
+		change_svuid(newcred, newcred->cr_uid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -737,30 +782,49 @@
 	register struct proc *p;
 	struct setregid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register gid_t rgid, egid;
 	int error;
 
 	rgid = uap->rgid;
 	egid = uap->egid;
-	if (((rgid != (gid_t)-1 && rgid != pc->p_rgid && rgid != pc->p_svgid) ||
-	     (egid != (gid_t)-1 && egid != pc->pc_ucred->cr_groups[0] &&
-	     egid != pc->p_rgid && egid != pc->p_svgid)) &&
-	    (error = suser_xxx(0, p, PRISON_ROOT)) != 0)
+	/*
+	 * If a real gid update is requested, the requested real gid must
+	 * be equal to the current real or saved gid.  If an effective gid
+	 * update is requested, the requested effective gid must be equal
+	 * to the current effective gid, the current real gid, or the
+	 * current saved gid.  Apropriate privilege may override this
+	 * restriction.
+	 */
+	if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid &&
+	    rgid != oldcred->cr_svgid) ||
+	     (egid != (gid_t)-1 && egid != oldcred->cr_groups[0] &&
+	     egid != oldcred->cr_rgid && egid != oldcred->cr_svgid)) &&
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0)
 		return (error);
 
-	if (egid != (gid_t)-1 && pc->pc_ucred->cr_groups[0] != egid) {
-		pc->pc_ucred = crcopy(pc->pc_ucred);
-		pc->pc_ucred->cr_groups[0] = egid;
+	newcred = crdup(oldcred);
+	if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) {
+		change_egid(newcred, egid);
 		setsugid(p);
 	}
-	if (rgid != (gid_t)-1 && pc->p_rgid != rgid) {
-		pc->p_rgid = rgid;
+	if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) {
+		change_rgid(newcred, rgid);
 		setsugid(p);
 	}
-	if ((rgid != (gid_t)-1 || pc->pc_ucred->cr_groups[0] != pc->p_rgid) &&
-	    pc->p_svgid != pc->pc_ucred->cr_groups[0]) {
-		pc->p_svgid = pc->pc_ucred->cr_groups[0];
+	/*
+	 * XXX: What is this intended to accomplish?  In which cases should
+	 * it be looking at the old values, and in which, the new values?
+	 *
+	 * Note current behavior is:
+	 * If the rgid update is requested (even if the rgid is not changed)
+	 * or the egid is not equal to the value of the rgid, a difference
+	 * in the svgid and the egid will result in the svuid being
+	 * updated to the new value of the euid.
+	 */
+	if ((rgid != (gid_t)-1 || newcred->cr_groups[0] != newcred->cr_rgid) &&
+	    newcred->cr_svgid != newcred->cr_groups[0]) {
+		change_svgid(newcred, newcred->cr_groups[0]);
 		setsugid(p);
 	}
 	return (0);
@@ -784,33 +848,40 @@
 	register struct proc *p;
 	struct setresuid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register uid_t ruid, euid, suid;
 	int error;
 
 	ruid = uap->ruid;
 	euid = uap->euid;
 	suid = uap->suid;
-	if (((ruid != (uid_t)-1 && ruid != pc->p_ruid && ruid != pc->p_svuid &&
-	      ruid != pc->pc_ucred->cr_uid) ||
-	     (euid != (uid_t)-1 && euid != pc->p_ruid && euid != pc->p_svuid &&
-	      euid != pc->pc_ucred->cr_uid) ||
-	     (suid != (uid_t)-1 && suid != pc->p_ruid && suid != pc->p_svuid &&
-	      suid != pc->pc_ucred->cr_uid)) &&
-	    (error = suser_xxx(0, p, PRISON_ROOT)) != 0)
+	if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid &&
+	     ruid != oldcred->cr_svuid &&
+	      ruid != oldcred->cr_uid) ||
+	     (euid != (uid_t)-1 && euid != oldcred->cr_ruid &&
+	    euid != oldcred->cr_svuid &&
+	      euid != oldcred->cr_uid) ||
+	     (suid != (uid_t)-1 && suid != oldcred->cr_ruid &&
+	    suid != oldcred->cr_svuid &&
+	      suid != oldcred->cr_uid)) &&
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0)
 		return (error);
-	if (euid != (uid_t)-1 && pc->pc_ucred->cr_uid != euid) {
-		change_euid(p, euid);
+
+	newcred = crdup(oldcred);
+	if (euid != (uid_t)-1 && oldcred->cr_uid != euid) {
+		change_euid(newcred, euid);
 		setsugid(p);
 	}
-	if (ruid != (uid_t)-1 && pc->p_ruid != ruid) {
-		change_ruid(p, ruid);
+	if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) {
+		change_ruid(newcred, ruid);
 		setsugid(p);
 	}
-	if (suid != (uid_t)-1 && pc->p_svuid != suid) {
-		pc->p_svuid = suid;
+	if (suid != (uid_t)-1 && oldcred->cr_svuid != suid) {
+		change_svuid(newcred, suid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -832,35 +903,40 @@
 	register struct proc *p;
 	struct setresgid_args *uap;
 {
-	register struct pcred *pc = p->p_cred;
+	register struct ucred *oldcred = p->p_ucred, *newcred;
 	register gid_t rgid, egid, sgid;
 	int error;
 
 	rgid = uap->rgid;
 	egid = uap->egid;
 	sgid = uap->sgid;
-	if (((rgid != (gid_t)-1 && rgid != pc->p_rgid && rgid != pc->p_svgid &&
-	      rgid != pc->pc_ucred->cr_groups[0]) ||
-	     (egid != (gid_t)-1 && egid != pc->p_rgid && egid != pc->p_svgid &&
-	      egid != pc->pc_ucred->cr_groups[0]) ||
-	     (sgid != (gid_t)-1 && sgid != pc->p_rgid && sgid != pc->p_svgid &&
-	      sgid != pc->pc_ucred->cr_groups[0])) &&
-	    (error = suser_xxx(0, p, PRISON_ROOT)) != 0)
+	if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid &&
+	      rgid != oldcred->cr_svgid &&
+	      rgid != oldcred->cr_groups[0]) ||
+	     (egid != (gid_t)-1 && egid != oldcred->cr_rgid &&
+	      egid != oldcred->cr_svgid &&
+	      egid != oldcred->cr_groups[0]) ||
+	     (sgid != (gid_t)-1 && sgid != oldcred->cr_rgid &&
+	      sgid != oldcred->cr_svgid &&
+	      sgid != oldcred->cr_groups[0])) &&
+	    (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0)
 		return (error);
 
-	if (egid != (gid_t)-1 && pc->pc_ucred->cr_groups[0] != egid) {
-		pc->pc_ucred = crcopy(pc->pc_ucred);
-		pc->pc_ucred->cr_groups[0] = egid;
+	newcred = crdup(oldcred);
+	if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) {
+		change_egid(newcred, egid);
 		setsugid(p);
 	}
-	if (rgid != (gid_t)-1 && pc->p_rgid != rgid) {
-		pc->p_rgid = rgid;
+	if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) {
+		change_rgid(newcred, rgid);
 		setsugid(p);
 	}
-	if (sgid != (gid_t)-1 && pc->p_svgid != sgid) {
-		pc->p_svgid = sgid;
+	if (sgid != (gid_t)-1 && oldcred->cr_svgid != sgid) {
+		change_svgid(newcred, sgid);
 		setsugid(p);
 	}
+	p->p_ucred = newcred;
+	crfree(oldcred);
 	return (0);
 }
 
@@ -877,18 +953,18 @@
 	register struct proc *p;
 	struct getresuid_args *uap;
 {
-	struct pcred *pc = p->p_cred;
+	struct ucred *cred = p->p_ucred;
 	int error1 = 0, error2 = 0, error3 = 0;
 
 	if (uap->ruid)
-		error1 = copyout((caddr_t)&pc->p_ruid,
-		    (caddr_t)uap->ruid, sizeof(pc->p_ruid));
+		error1 = copyout((caddr_t)&cred->cr_ruid,
+		    (caddr_t)uap->ruid, sizeof(cred->cr_ruid));
 	if (uap->euid)
-		error2 = copyout((caddr_t)&pc->pc_ucred->cr_uid,
-		    (caddr_t)uap->euid, sizeof(pc->pc_ucred->cr_uid));
+		error2 = copyout((caddr_t)&cred->cr_uid,
+		    (caddr_t)uap->euid, sizeof(cred->cr_uid));
 	if (uap->suid)
-		error3 = copyout((caddr_t)&pc->p_svuid,
-		    (caddr_t)uap->suid, sizeof(pc->p_svuid));
+		error3 = copyout((caddr_t)&cred->cr_svuid,
+		    (caddr_t)uap->suid, sizeof(cred->cr_svuid));
 	return error1 ? error1 : (error2 ? error2 : error3);
 }
 
@@ -905,18 +981,18 @@
 	register struct proc *p;
 	struct getresgid_args *uap;
 {
-	struct pcred *pc = p->p_cred;
+	struct ucred *cred = p->p_ucred;
 	int error1 = 0, error2 = 0, error3 = 0;
 
 	if (uap->rgid)
-		error1 = copyout((caddr_t)&pc->p_rgid,
-		    (caddr_t)uap->rgid, sizeof(pc->p_rgid));
+		error1 = copyout((caddr_t)&cred->cr_rgid,
+		    (caddr_t)uap->rgid, sizeof(cred->cr_rgid));
 	if (uap->egid)
-		error2 = copyout((caddr_t)&pc->pc_ucred->cr_groups[0],
-		    (caddr_t)uap->egid, sizeof(pc->pc_ucred->cr_groups[0]));
+		error2 = copyout((caddr_t)&cred->cr_groups[0],
+		    (caddr_t)uap->egid, sizeof(cred->cr_groups[0]));
 	if (uap->sgid)
-		error3 = copyout((caddr_t)&pc->p_svgid,
-		    (caddr_t)uap->sgid, sizeof(pc->p_svgid));
+		error3 = copyout((caddr_t)&cred->cr_svgid,
+		    (caddr_t)uap->sgid, sizeof(cred->cr_svgid));
 	return error1 ? error1 : (error2 ? error2 : error3);
 }
 
@@ -1113,10 +1189,10 @@
 	 * Generally, the object credential's ruid or svuid must match the
 	 * subject credential's ruid or euid.
 	 */
-	if (p1->p_cred->p_ruid != p2->p_cred->p_ruid &&
-	    p1->p_cred->p_ruid != p2->p_cred->p_svuid &&
-	    p1->p_ucred->cr_uid != p2->p_cred->p_ruid &&
-	    p1->p_ucred->cr_uid != p2->p_cred->p_svuid) {
+	if (p1->p_ucred->cr_ruid != p2->p_ucred->cr_ruid &&
+	    p1->p_ucred->cr_ruid != p2->p_ucred->cr_svuid &&
+	    p1->p_ucred->cr_uid != p2->p_ucred->cr_ruid &&
+	    p1->p_ucred->cr_uid != p2->p_ucred->cr_svuid) {
 		/* Not permitted, try privilege. */
 		error = suser_xxx(NULL, p1, PRISON_ROOT);
 		if (error)
@@ -1140,9 +1216,9 @@
 	if ((error = prison_check(p1->p_ucred, p2->p_ucred)))
 		return (error);
 
-	if (p1->p_cred->p_ruid == p2->p_cred->p_ruid)
+	if (p1->p_ucred->cr_ruid == p2->p_ucred->cr_ruid)
 		return (0);
-	if (p1->p_ucred->cr_uid == p2->p_cred->p_ruid)
+	if (p1->p_ucred->cr_uid == p2->p_ucred->cr_ruid)
 		return (0);
 
 	if (!suser_xxx(0, p1, PRISON_ROOT)) {
@@ -1178,9 +1254,9 @@
 
 	/* not owned by you, has done setuid (unless you're root) */
 	/* add a CAP_SYS_PTRACE here? */
-	if (p1->p_cred->pc_ucred->cr_uid != p2->p_cred->p_ruid ||
-	    p1->p_cred->p_ruid != p2->p_cred->p_ruid ||
-	    p1->p_cred->p_svuid != p2->p_cred->p_ruid ||
+	if (p1->p_ucred->cr_uid != p2->p_ucred->cr_ruid ||
+	    p1->p_ucred->cr_ruid != p2->p_ucred->cr_ruid ||
+	    p1->p_ucred->cr_svuid != p2->p_ucred->cr_ruid ||
 	    p2->p_flag & P_SUGID) {
 		if ((error = suser_xxx(0, p1, PRISON_ROOT)))
 			return (error);
@@ -1308,6 +1384,7 @@
 	*newcr = *cr;
 	mtx_init(&newcr->cr_mtx, "ucred", MTX_DEF);
 	uihold(newcr->cr_uidinfo);
+	uihold(newcr->cr_ruidinfo);
 	if (jailed(newcr))
 		prison_hold(newcr->cr_prison);
 	newcr->cr_ref = 1;
@@ -1375,48 +1452,123 @@
 }
 
 /*
- * Helper function to change the effective uid of a process
+ * change_euid(): Change a process's effective uid.
+ * Arguments: struct ucred *newcred, uid_t euid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_uid and newcred->cr_uidinfo will be modified.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
  */
 void
-change_euid(p, euid)
-	struct	proc *p;
-	uid_t	euid;
+change_euid(newcred, euid)
+	struct ucred *newcred;
+	uid_t euid;
 {
-	struct	pcred *pc;
-	struct	uidinfo *uip;
 
-	pc = p->p_cred;
-	/*
-	 * crcopy is essentially a NOP if ucred has a reference count
-	 * of 1, which is true if it has already been copied.
-	 */
-	pc->pc_ucred = crcopy(pc->pc_ucred);
-	uip = pc->pc_ucred->cr_uidinfo;
-	pc->pc_ucred->cr_uid = euid;
-	pc->pc_ucred->cr_uidinfo = uifind(euid);
-	uifree(uip);
+	newcred->cr_uid = euid;
+	uifree(newcred->cr_uidinfo);
+	newcred->cr_uidinfo = uifind(euid);
 }
 
 /*
- * Helper function to change the real uid of a process
- *
- * The per-uid process count for this process is transfered from
- * the old uid to the new uid.
+ * change_egid(): Change a process's effective gid.
+ * Arguments: struct ucred *newcred, gid_t egid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_gid will be modified.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
  */
 void
-change_ruid(p, ruid)
-	struct	proc *p;
-	uid_t	ruid;
+change_egid(newcred, egid)
+	struct ucred *newcred;
+	gid_t egid;
+{
+
+	newcred->cr_groups[0] = egid;
+}
+
+/*
+ * change_ruid(): Change a process's real uid.
+ * Arguments: struct ucred *newcred, uid_t ruid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_ruid will be updated, newcred->cr_ruidinfo
+ *               will be updated, and the old and new cr_ruidinfo proc
+ *               counts will be updated.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
+ */
+void
+change_ruid(newcred, ruid)
+	struct ucred *newcred;
+	uid_t ruid;
+{
+
+	(void)chgproccnt(newcred->cr_ruidinfo, -1, 0);
+	newcred->cr_ruid = ruid;
+	uifree(newcred->cr_ruidinfo);
+	newcred->cr_ruidinfo = uifind(ruid);
+	(void)chgproccnt(newcred->cr_ruidinfo, 1, 0);
+}
+
+/*
+ * change_rgid(): Change a process's real gid.
+ * Arguments: struct ucred *newcred, gid_t rgid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_rgid will be updated.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
+ */
+void
+change_rgid(newcred, rgid)
+	struct ucred *newcred;
+	gid_t rgid;
+{
+
+	newcred->cr_rgid = rgid;
+}
+
+/*
+ * change_svuid(): Change a process's saved uid.
+ * Arguments: struct ucred *newcred, uid_t svuid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_svuid will be updated.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
+ */
+void
+change_svuid(newcred, svuid)
+	struct ucred *newcred;
+	uid_t svuid;
+{
+
+	newcred->cr_svuid = svuid;
+}
+
+/*
+ * change_svgid(): Change a process's saved gid.
+ * Arguments: struct ucred *newcred, gid_t svgid
+ * Returns: none
+ * Locks: none
+ * Side effects: newcred->cr_svgid will be updated.
+ * References: newcred must be an exclusive credential reference for the
+ *             duration of the call.
+ * Notes: none
+ */
+void
+change_svgid(newcred, svgid)
+	struct ucred *newcred;
+	gid_t svgid;
 {
-	struct	pcred *pc;
-	struct	uidinfo *uip;
 
-	pc = p->p_cred;
-	(void)chgproccnt(pc->p_uidinfo, -1, 0);
-	uip = pc->p_uidinfo;
-	/* It is assumed that pcred is not shared between processes */
-	pc->p_ruid = ruid;
-	pc->p_uidinfo = uifind(ruid);
-	(void)chgproccnt(pc->p_uidinfo, 1, 0);
-	uifree(uip);
+	newcred->cr_svgid = svgid;
 }
Index: kern/kern_sig.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.117
diff -u -r1.117 kern_sig.c
--- kern/kern_sig.c	2001/04/27 19:28:23	1.117
+++ kern/kern_sig.c	2001/05/04 16:48:36
@@ -98,14 +98,14 @@
     "Log processes quitting on abnormal signals to syslog(3)");
 
 /*
- * Policy -- Can real uid ruid with ucred uc send a signal to process q?
+ * Policy -- Can ucred cr1 send SIGIO to process cr2?
  */
-#define CANSIGIO(ruid, uc, q) \
-	((uc)->cr_uid == 0 || \
-	    (ruid) == (q)->p_cred->p_ruid || \
-	    (uc)->cr_uid == (q)->p_cred->p_ruid || \
-	    (ruid) == (q)->p_ucred->cr_uid || \
-	    (uc)->cr_uid == (q)->p_ucred->cr_uid)
+#define CANSIGIO(cr1, cr2) \
+	((cr1)->cr_uid == 0 || \
+	    (cr2)->cr_ruid == (cr2)->cr_ruid || \
+	    (cr2)->cr_uid == (cr2)->cr_ruid || \
+	    (cr2)->cr_ruid == (cr2)->cr_uid || \
+	    (cr2)->cr_uid == (cr2)->cr_uid)
 
 int sugid_coredump;
 SYSCTL_INT(_kern, OID_AUTO, sugid_coredump, CTLFLAG_RW, 
@@ -1609,8 +1609,8 @@
 {
 	CTR3(KTR_PROC, "killproc: proc %p (pid %d, %s)",
 		p, p->p_pid, p->p_comm);
-	log(LOG_ERR, "pid %d (%s), uid %d, was killed: %s\n", p->p_pid, p->p_comm,
-		p->p_cred && p->p_ucred ? p->p_ucred->cr_uid : -1, why);
+	log(LOG_ERR, "pid %d (%s), uid %d, was killed: %s\n", p->p_pid,
+	    p->p_comm, p->p_ucred ? p->p_ucred->cr_uid : -1, why);
 	PROC_LOCK(p);
 	psignal(p, SIGKILL);
 	PROC_UNLOCK(p);
@@ -1649,7 +1649,7 @@
 			log(LOG_INFO,
 			    "pid %d (%s), uid %d: exited on signal %d%s\n",
 			    p->p_pid, p->p_comm,
-			    p->p_cred && p->p_ucred ? p->p_ucred->cr_uid : -1,
+			    p->p_ucred ? p->p_ucred->cr_uid : -1,
 			    sig &~ WCOREFLAG,
 			    sig & WCOREFLAG ? " (core dumped)" : "");
 	} else {
@@ -1869,8 +1869,7 @@
 		
 	if (sigio->sio_pgid > 0) {
 		PROC_LOCK(sigio->sio_proc);
-		if (CANSIGIO(sigio->sio_ruid, sigio->sio_ucred,
-		             sigio->sio_proc))
+		if (CANSIGIO(sigio->sio_ucred, sigio->sio_proc->p_ucred))
 			psignal(sigio->sio_proc, sig);
 		PROC_UNLOCK(sigio->sio_proc);
 	} else if (sigio->sio_pgid < 0) {
@@ -1878,7 +1877,7 @@
 
 		LIST_FOREACH(p, &sigio->sio_pgrp->pg_members, p_pglist) {
 			PROC_LOCK(p);
-			if (CANSIGIO(sigio->sio_ruid, sigio->sio_ucred, p) &&
+			if (CANSIGIO(sigio->sio_ucred, p->p_ucred) &&
 			    (checkctty == 0 || (p->p_flag & P_CONTROLT)))
 				psignal(p, sig);
 			PROC_UNLOCK(p);
Index: kern/uipc_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v
retrieving revision 1.65
diff -u -r1.65 uipc_usrreq.c
--- kern/uipc_usrreq.c	2001/05/01 08:12:59	1.65
+++ kern/uipc_usrreq.c	2001/05/06 00:45:37
@@ -988,8 +988,8 @@
 	if (cm->cmsg_type == SCM_CREDS) {
 		cmcred = (struct cmsgcred *)(cm + 1);
 		cmcred->cmcred_pid = p->p_pid;
-		cmcred->cmcred_uid = p->p_cred->p_ruid;
-		cmcred->cmcred_gid = p->p_cred->p_rgid;
+		cmcred->cmcred_uid = p->p_ucred->cr_ruid;
+		cmcred->cmcred_gid = p->p_ucred->cr_rgid;
 		cmcred->cmcred_euid = p->p_ucred->cr_uid;
 		cmcred->cmcred_ngroups = MIN(p->p_ucred->cr_ngroups,
 							CMGROUP_MAX);
Index: kern/vfs_syscalls.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_syscalls.c,v
retrieving revision 1.189
diff -u -r1.189 vfs_syscalls.c
--- kern/vfs_syscalls.c	2001/04/29 02:44:49	1.189
+++ kern/vfs_syscalls.c	2001/05/04 16:53:44
@@ -1711,8 +1711,8 @@
 	 * rather than to modify the potentially shared process structure.
 	 */
 	tmpcred = crdup(cred);
-	tmpcred->cr_uid = p->p_cred->p_ruid;
-	tmpcred->cr_groups[0] = p->p_cred->p_rgid;
+	tmpcred->cr_uid = cred->cr_ruid;
+	tmpcred->cr_groups[0] = cred->cr_rgid;
 	p->p_ucred = tmpcred;
 	NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | NOOBJ, UIO_USERSPACE,
 	    SCARG(uap, path), p);
@@ -3799,7 +3799,7 @@
 	}
 	cnt = auio.uio_resid;
 	error = VOP_SETEXTATTR(vp, attrnamespace, attrname, &auio,
-	    p->p_cred->pc_ucred, p);
+	    p->p_ucred, p);
 	cnt -= auio.uio_resid;
 	p->p_retval[0] = cnt;
 done:
@@ -3912,7 +3912,7 @@
 	}
 	cnt = auio.uio_resid;
 	error = VOP_GETEXTATTR(vp, attrnamespace, attrname, &auio,
-	    p->p_cred->pc_ucred, p);
+	    p->p_ucred, p);
 	cnt -= auio.uio_resid;
 	p->p_retval[0] = cnt;
 done:
@@ -3995,7 +3995,7 @@
 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
 
 	error = VOP_SETEXTATTR(vp, attrnamespace, attrname, NULL,
-	    p->p_cred->pc_ucred, p);
+	    p->p_ucred, p);
 
 	VOP_UNLOCK(vp, 0, p);
 	vn_finished_write(mp);
Index: miscfs/procfs/procfs_status.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_status.c,v
retrieving revision 1.29
diff -u -r1.29 procfs_status.c
--- miscfs/procfs/procfs_status.c	2001/05/01 08:13:09	1.29
+++ miscfs/procfs/procfs_status.c	2001/05/06 00:45:44
@@ -153,11 +153,11 @@
 
 	ps += snprintf(ps, psbuf + sizeof(psbuf) - ps, " %lu %lu %lu", 
 		(u_long)cr->cr_uid,
-		(u_long)p->p_cred->p_ruid,
-		(u_long)p->p_cred->p_rgid);
+		(u_long)cr->cr_ruid,
+		(u_long)cr->cr_rgid);
 	DOCHECK();
 
-	/* egid (p->p_cred->p_svgid) is equal to cr_ngroups[0] 
+	/* egid (cr->cr_svgid) is equal to cr_ngroups[0] 
 	   see also getegid(2) in /sys/kern/kern_prot.c */
 
 	for (i = 0; i < cr->cr_ngroups; i++) {
Index: miscfs/procfs/procfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_vnops.c,v
retrieving revision 1.95
diff -u -r1.95 procfs_vnops.c
--- miscfs/procfs/procfs_vnops.c	2001/05/01 08:13:09	1.95
+++ miscfs/procfs/procfs_vnops.c	2001/05/06 00:45:44
@@ -404,7 +404,7 @@
 		procp = PFIND(pfs->pfs_pid);
 		if (procp == NULL)
 			return (ENOENT);
-		if (procp->p_cred == NULL || procp->p_ucred == NULL) {
+		if (procp->p_ucred == NULL) {
 			PROC_UNLOCK(procp);
 			return (ENOENT);
 		}
@@ -942,8 +942,7 @@
 	 */
 	case Pfile:
 		procp = PFIND(pfs->pfs_pid);
-		if (procp == NULL || procp->p_cred == NULL ||
-		    procp->p_ucred == NULL) {
+		if (procp == NULL || procp->p_ucred == NULL) {
 			if (procp != NULL)
 				PROC_UNLOCK(procp);
 			printf("procfs_readlink: pid %d disappeared\n",
Index: nfs/nfs_lock.c
===================================================================
RCS file: /home/ncvs/src/sys/nfs/nfs_lock.c,v
retrieving revision 1.4
diff -u -r1.4 nfs_lock.c
--- nfs/nfs_lock.c	2001/05/01 08:13:14	1.4
+++ nfs/nfs_lock.c	2001/05/06 00:47:01
@@ -236,9 +236,11 @@
 
 	/* Let root, or someone who once was root (lockd generally
 	 * switches to the daemon uid once it is done setting up) make 
-	 * this call
+	 * this call.
+	 *
+	 * XXX
 	 */
-	if ((error = suser(p)) != 0 && p->p_cred->p_svuid != 0)
+	if ((error = suser(p)) != 0 && p->p_ucred->cr_svuid != 0)
 		return (error);
 
 	/* the version should match, or we're out of sync */
Index: posix4/p1003_1b.c
===================================================================
RCS file: /home/ncvs/src/sys/posix4/p1003_1b.c,v
retrieving revision 1.8
diff -u -r1.8 p1003_1b.c
--- posix4/p1003_1b.c	2001/05/01 08:13:16	1.8
+++ posix4/p1003_1b.c	2001/05/06 00:47:11
@@ -68,16 +68,17 @@
 /*
  * This is stolen from CANSIGNAL in kern_sig:
  *
- * Can process p, with pcred pc, do "write flavor" operations to process q?
+ * Can process with credential cr1 do "write flavor" operations to credential
+ * cr2.  This check needs to use generalized checks.
  */
-#define CAN_AFFECT(p, pc, q) \
-	((pc)->pc_ucred->cr_uid == 0 || \
-	    (pc)->p_ruid == (q)->p_cred->p_ruid || \
-	    (pc)->pc_ucred->cr_uid == (q)->p_cred->p_ruid || \
-	    (pc)->p_ruid == (q)->p_ucred->cr_uid || \
-	    (pc)->pc_ucred->cr_uid == (q)->p_ucred->cr_uid)
+#define CAN_AFFECT(cr1, cr2) \
+	((cr1)->cr_uid == 0 || \
+	    (c1)->cr_ruid == (cr2)->cr_ruid || \
+	    (c1)->cr_uid == (cr2)->cr_ruid || \
+	    (c1)->cr_ruid == (cr2)->cr_uid || \
+	    (c1)->cr_uid == (cr2)->cr_uid)
 #else
-#define CAN_AFFECT(p, pc, q) ((pc)->pc_ucred->cr_uid == 0)
+#define CAN_AFFECT(cr1, cr2) ((cr1)->cr_uid == 0)
 #endif
 
 /*
@@ -99,7 +100,7 @@
 	{
 		/* Enforce permission policy.
 		 */
-		if (CAN_AFFECT(p, p->p_cred, other_proc))
+		if (CAN_AFFECT(p->p_ucred, other_proc->p_ucred))
 			*pp = other_proc;
 		else
 			ret = EPERM;
Index: sys/filedesc.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/filedesc.h,v
retrieving revision 1.26
diff -u -r1.26 filedesc.h
--- sys/filedesc.h	2000/11/18 21:01:04	1.26
+++ sys/filedesc.h	2001/05/04 15:52:27
@@ -117,7 +117,6 @@
 	struct	sigio **sio_myref;	/* location of the pointer that holds
 					 * the reference to this structure */
 	struct	ucred *sio_ucred;	/* current credentials */
-	uid_t	sio_ruid;		/* real user id */
 	pid_t	sio_pgid;		/* pgid for signals */
 };
 #define	sio_proc	sio_u.siu_proc
Index: sys/proc.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/proc.h,v
retrieving revision 1.161
diff -u -r1.161 proc.h
--- sys/proc.h	2001/04/27 19:28:25	1.161
+++ sys/proc.h	2001/05/03 19:55:27
@@ -156,7 +156,7 @@
 	LIST_ENTRY(proc) p_list;	/* (d) List of all processes. */
 
 	/* substructures: */
-	struct	pcred *p_cred;		/* (c + k) Process owner's identity. */
+	struct	ucred *p_ucred;		/* (c + k) Process owner's identity. */
 	struct	filedesc *p_fd;		/* (b) Ptr to open files structure. */
 	struct	pstats *p_stats;	/* (b) Accounting/statistics (CPU). */
 	struct	plimit *p_limit;	/* (m) Process limits. */
@@ -166,7 +166,6 @@
 #define	p_sigignore	p_procsig->ps_sigignore
 #define	p_sigcatch	p_procsig->ps_sigcatch
 
-#define	p_ucred		p_cred->pc_ucred
 #define	p_rlimit	p_limit->pl_rlimit
 
 	int	p_flag;			/* (c) P_* flags. */
@@ -336,23 +335,6 @@
 #define	P_CAN_SEE	1
 #define	P_CAN_SCHED	3
 #define	P_CAN_DEBUG	4
-
-/*
- * MOVE TO ucred.h?
- *
- * Shareable process credentials (always resident).  This includes a reference
- * to the current user credentials as well as real and saved ids that may be
- * used to change ids.
- */
-struct	pcred {
-	struct	ucred *pc_ucred;	/* Current credentials. */
-	uid_t	p_ruid;			/* Real user id. */
-	uid_t	p_svuid;		/* Saved effective user id. */
-	gid_t	p_rgid;			/* Real group id. */
-	gid_t	p_svgid;		/* Saved effective group id. */
-	int	p_refcnt;		/* Number of references. */
-	struct	uidinfo *p_uidinfo;	/* Per uid resource consumption. */
-};
 
 #ifdef _KERNEL
 
Index: sys/ucred.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/ucred.h,v
retrieving revision 1.23
diff -u -r1.23 ucred.h
--- sys/ucred.h	2001/05/01 08:13:18	1.23
+++ sys/ucred.h	2001/05/06 00:47:17
@@ -50,9 +50,14 @@
 struct ucred {
 	u_int	cr_ref;			/* reference count */
 	uid_t	cr_uid;			/* effective user id */
+	uid_t	cr_ruid;		/* real user id */
+	uid_t	cr_svuid;		/* saved user id */
 	short	cr_ngroups;		/* number of groups */
 	gid_t	cr_groups[NGROUPS];	/* groups */
-	struct	uidinfo *cr_uidinfo;	/* per uid resource consumption */
+	gid_t	cr_rgid;		/* real group id */
+	gid_t	cr_svgid;		/* saved user id */
+	struct	uidinfo *cr_uidinfo;	/* per euid resource consumption */
+	struct	uidinfo *cr_ruidinfo;	/* per ruid resource consumption */
 	struct	prison *cr_prison;	/* jail(4) */
 	struct	mtx cr_mtx;		/* protect refcount */
 };
@@ -77,8 +82,12 @@
 
 struct proc;
 
-void		change_euid __P((struct proc *p, uid_t euid));
-void		change_ruid __P((struct proc *p, uid_t ruid));
+void		change_euid __P((struct ucred *newcred, uid_t euid));
+void		change_egid __P((struct ucred *newcred, gid_t egid));
+void		change_ruid __P((struct ucred *newcred, uid_t ruid));
+void		change_rgid __P((struct ucred *newcred, uid_t rgid));
+void		change_svuid __P((struct ucred *newcred, uid_t svuid));
+void		change_svgid __P((struct ucred *newcred, gid_t svgid));
 struct ucred	*crcopy __P((struct ucred *cr));
 struct ucred	*crdup __P((struct ucred *cr));
 void		crfree __P((struct ucred *cr));
Index: ufs/ufs/ufs_extattr.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_extattr.c,v
retrieving revision 1.31
diff -u -r1.31 ufs_extattr.c
--- ufs/ufs/ufs_extattr.c	2001/04/29 02:45:28	1.31
+++ ufs/ufs/ufs_extattr.c	2001/05/04 18:22:17
@@ -621,7 +621,7 @@
 	auio.uio_rw = UIO_READ;
 	auio.uio_procp = (struct proc *) p;
 
-	VOP_LEASE(backing_vnode, p, p->p_cred->pc_ucred, LEASE_WRITE);
+	VOP_LEASE(backing_vnode, p, p->p_ucred, LEASE_WRITE);
 	vn_lock(backing_vnode, LK_SHARED | LK_NOPAUSE | LK_RETRY, p);
 	error = VOP_READ(backing_vnode, &auio, IO_NODELOCKED,
 	    ump->um_extattr.uepm_ucred);
@@ -702,7 +702,7 @@
 	 * Processes with privilege, but in jail, are not allowed to
 	 * configure extended attributes.
 	 */
-	if ((error = suser_xxx(p->p_cred->pc_ucred, p, 0))) {
+	if ((error = suser_xxx(p->p_ucred, p, 0))) {
 		if (filename_vp != NULL)
 			VOP_UNLOCK(filename_vp, 0, p);
 		return (error);
Index: ufs/ufs/ufs_vfsops.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vfsops.c,v
retrieving revision 1.24
diff -u -r1.24 ufs_vfsops.c
--- ufs/ufs/ufs_vfsops.c	2001/05/01 08:13:19	1.24
+++ ufs/ufs/ufs_vfsops.c	2001/05/06 00:47:20
@@ -108,14 +108,14 @@
 	int cmd, type, error;
 
 	if (uid == -1)
-		uid = p->p_cred->p_ruid;
+		uid = p->p_ucred->cr_ruid;
 	cmd = cmds >> SUBCMDSHIFT;
 
 	switch (cmd) {
 	case Q_SYNC:
 		break;
 	case Q_GETQUOTA:
-		if (uid == p->p_cred->p_ruid)
+		if (uid == p->p_ucred->cr_ruid)
 			break;
 		/* fall through */
 	default:


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 13: 8:14 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from meow.osd.bsdi.com (meow.osd.bsdi.com [204.216.28.88])
	by hub.freebsd.org (Postfix) with ESMTP
	id 90F3B37B422; Mon,  7 May 2001 13:08:05 -0700 (PDT)
	(envelope-from jhb@FreeBSD.org)
Received: from laptop.baldwin.cx (john@jhb-laptop.osd.bsdi.com [204.216.28.241])
	by meow.osd.bsdi.com (8.11.2/8.11.2) with ESMTP id f47K7uG88251;
	Mon, 7 May 2001 13:07:57 -0700 (PDT)
	(envelope-from jhb@FreeBSD.org)
Message-ID: <XFMail.010507130340.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.NEB.3.96L.1010506235944.43785B-100000@fledge.watson.org>
Date: Mon, 07 May 2001 13:03:40 -0700 (PDT)
From: John Baldwin <jhb@FreeBSD.org>
To: Robert Watson <rwatson@FreeBSD.org>
Subject: RE: Patch to eliminate struct pcred
Cc: arch@FreeBSD.org
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


On 07-May-01 Robert Watson wrote:

> Index: compat/svr4/svr4_misc.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/compat/svr4/svr4_misc.c,v
> retrieving revision 1.30
> diff -u -r1.30 svr4_misc.c
> --- compat/svr4/svr4_misc.c   2001/05/01 08:11:52     1.30
> +++ compat/svr4/svr4_misc.c   2001/05/06 00:43:54
> @@ -1294,13 +1294,8 @@
>                       /*
>                        * Free up credentials.
>                        */
> -                     PROC_LOCK(q);
> -                     if (--q->p_cred->p_refcnt == 0) {
> -                             crfree(q->p_ucred);
> -                             uifree(q->p_cred->p_uidinfo);
> -                             FREE(q->p_cred, M_SUBPROC);
> -                             q->p_cred = NULL;
> -                     }
> +                     crfree(q->p_ucred);
> +                     q->p_ucred = NULL;

Removing the proc lock here looks suspicious, but I think it might mirror a
change I just made to kern_exit.c in wait1(), in which case it is ok.

> Index: kern/kern_exec.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_exec.c,v
> retrieving revision 1.126
> diff -u -r1.126 kern_exec.c
> --- kern/kern_exec.c  2001/05/01 08:12:56     1.126
> +++ kern/kern_exec.c  2001/05/06 16:25:06
> @@ -104,8 +104,9 @@
>       register struct execve_args *uap;
>  {
>       struct nameidata nd, *ndp;
> +     struct ucred *oldcred = p->p_ucred, *newcred;
>       register_t *stack_base;
> -     int error, len, i;
> +     int error, len, i, intrace;
>       struct image_params image_params, *imgp;
>       struct vattr attr;
>       int (*img_first) __P((struct image_params *));
> @@ -272,23 +273,31 @@
>               p->p_flag &= ~P_PPWAIT;
>               wakeup((caddr_t)p->p_pptr);
>       }
> +     intrace = p->p_flag & P_TRACED;
> +     PROC_UNLOCK(p);

This unlock is busted since we then try to unlock this lock again later on
since you didn't remove the other unlocks.  Also, this whole caching of the
intrace flag is bogus too.  If you read a value and release the lock, then you
have now lost the ability to safely make decisions on the value you just read. 
You have to hold the lock over both reading the value and deciding what to do
based on that value so that the entire thing is an "atomic" operation.  For
now, I would just revert the intrace changes to check the flag directly like
the code does now and not add in this proc unlock.

>       /*
> +      * XXX: Note, the whole execve() is incredibly racey right now
> +      * with regards to debugging and privilege/credential management.
> +      * This MUST be fixed prior to any release.
> +      */
> +
> +     /*
>        * Implement image setuid/setgid.
>        *
>        * Don't honor setuid/setgid if the filesystem prohibits it or if
>        * the process is being traced.
>        */
> -     if ((((attr.va_mode & VSUID) && p->p_ucred->cr_uid != attr.va_uid) ||
> -          ((attr.va_mode & VSGID) && p->p_ucred->cr_gid != attr.va_gid)) &&
> -         (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 &&
> -         (p->p_flag & P_TRACED) == 0) {
> +     newcred = NULL;
> +     if ((((attr.va_mode & VSUID) && oldcred->cr_uid != attr.va_uid) ||
> +          ((attr.va_mode & VSGID) && oldcred->cr_gid != attr.va_gid)) &&
> +         (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && intrace == 0) {
>               PROC_UNLOCK(p);
>               /*
>                * Turn off syscall tracing for set-id programs, except for
>                * root.
>                */
> -             if (p->p_tracep && suser(p)) {
> +             if (p->p_tracep && suser_xxx(oldcred, NULL, PRISON_ROOT)) {
>                       p->p_traceflag = 0;
>                       vrele(p->p_tracep);
>                       p->p_tracep = NULL;
> @@ -296,25 +305,42 @@
>               /*
>                * Set the new credentials.
>                */
> -             p->p_ucred = crcopy(p->p_ucred);
> +             newcred = crdup(p->p_ucred);
>               if (attr.va_mode & VSUID)
> -                     change_euid(p, attr.va_uid);
> +                     change_euid(newcred, attr.va_uid);
>               if (attr.va_mode & VSGID)
> -                     p->p_ucred->cr_gid = attr.va_gid;
> +                     change_egid(newcred, attr.va_gid);
>               setsugid(p);
>               setugidsafety(p);
>       } else {
> -             if (p->p_ucred->cr_uid == p->p_cred->p_ruid &&
> -                 p->p_ucred->cr_gid == p->p_cred->p_rgid)
> -                     p->p_flag &= ~P_SUGID;
> +             if (oldcred->cr_uid == oldcred->cr_ruid &&
> +                 oldcred->cr_gid == oldcred->cr_rgid)
> +                     p->p_flag &= ~P_SUGID;  /* XXX locking */
>               PROC_UNLOCK(p);
>       }
>  
>       /*
>        * Implement correct POSIX saved-id behavior.
> +      *
> +      * XXX: determine whether tests and sets should occur on old or
> +      * new credentials.
>        */
> -     p->p_cred->p_svuid = p->p_ucred->cr_uid;
> -     p->p_cred->p_svgid = p->p_ucred->cr_gid;
> +     if (p->p_ucred->cr_svuid != p->p_ucred->cr_uid ||
> +         p->p_ucred->cr_svgid != p->p_ucred->cr_gid) {
> +             if (newcred != NULL)
> +                     newcred = crdup(p->p_ucred);
> +
> +             change_svuid(newcred, p->p_ucred->cr_uid);
> +             change_svgid(newcred, p->p_ucred->cr_gid);
> +     }
> +
> +     if (newcred != NULL) {
> +             struct ucred *oldcred;
> +
> +             oldcred = p->p_ucred;
> +             p->p_ucred = newcred;
> +             crfree(oldcred);
> +     }
>  
>       /*
>        * Store the vp for use in procfs

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 14:17:14 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88])
	by hub.freebsd.org (Postfix) with ESMTP id 41A0C37B424
	for <arch@freebsd.org>; Mon,  7 May 2001 14:17:09 -0700 (PDT)
	(envelope-from riel@conectiva.com.br)
Received: from surriel.ddts.net (unknown [200.181.137.248])
	by netbank.com.br (Postfix) with ESMTP
	id 8CA724680C; Mon,  7 May 2001 18:17:59 -0300 (BRST)
Received: from localhost (ekpitz@localhost [127.0.0.1])
	by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f47LGvi17187;
	Mon, 7 May 2001 18:16:58 -0300
Date: Mon, 7 May 2001 18:16:57 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@imladris.rielhome.conectiva
To: arch@freebsd.org
Cc: linux-mm@kvack.org, Matt Dillon <dillon@earth.backplane.com>,
	sfkaplan@cs.amherst.edu
Subject: on load control / process swapping
Message-ID: <Pine.LNX.4.21.0105061924160.582-100000@imladris.rielhome.conectiva>
X-spambait: aardvark@kernelnewbies.org
X-spammeplease: aardvark@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hi,

after staring at the code for a long long time, I finally
figured out exactly why FreeBSD's load control code (the
process swapping in vm_glue.c) can never work in many
scenarios.

In short, the process suspension / wake up code only does
load control in the sense that system load is reduced, but
absolutely no effort is made to ensure that individual
programs can run without thrashing. This, of course, kind of
defeats the purpose of doing load control in the first place.


To see this situation in some more detail, lets first look
at how the current process suspension code has evolved over
time.  Early paging Unixes, including earlier BSDs, had a
rate-limited clock algorithm for the pageout code, where
the VM subsystem would only scan (and page) memory out at
a rate of fastscan pages per second.

Whenever the paging system wasn't able to keep up, free
memory would get below a certain threshold and memory load
control (in the form of process suspension) kicked in.  As
soon as free memory (averaged over a few seconds) got over
this threshold, processes get swapped in again.  Because of
the exact "speed limit" for the paging code, this would give
a slow rotation of memory-resident progesses at a paging rate
well below the thashing threshold.


More modern Unixes, like FreeBSD, NetBSD or Linux, however,
don't have the artificial speed limit on pageout.  This means
the pageout code can go on freeing memory until well beyond
the trashing point of the system.  It also means that the
amount of free memory is no longer any indication of whether
the system is thrashing or not.

Add to that the fact that the classical load control in BSD
resumes a suspended task whenever the system is above the
(now not very meaningful) free memory threshold, regardless
of whether the resident tasks have had the opportunity to
make any progress ... which of course only encourages more
thrashing instead of letting the system work itself out of
the overload situation.


Any solution will have to address the following points:

1) allow the resident processes to stay resident long
   enough to make progess
2) make sure the resident processes aren't thrashing,
   that is, don't let new processes back in memory if
   none of the currently resident processes is "ready"
   to be suspended
3) have a mechanism to detect thrashing in a VM
   subsystem which isn't rate-limited  (hard?)

and, for extra brownie points:
4) fairness, small processes can be paged in and out
   faster, so we can suspend&resume them faster; this
   has the side effect of leaving the proverbial root
   shell more usable
5) make sure already resident processes cannot create
   a situation that'll keep the swapped out tasks out
   of memory forever ... but don't kill performance either,
   since bad performance means we cannot get out of the
   bad situation we're in


Points 1), 2) and 4) are relatively easy to address by simply
keeping resident tasks unswappable for a long enough time that
they've been able to do real work in an environment where
3) indicates we're not thrashing.


3) is the hard part. We know we're not thrashing when we don't
have ongoing page faults all the time, but (say) only 50% of the
time. However, I still have no idea to determine when we _are_
thrashing, since a system which always has 10 ongoing page faults
may still be functioning without thrashing...  This is the part
where I cannot hand a ready solution but where we have to figure
out a solution together.

(and it's also the reason I cannot "send a patch" ... I know the
current scheme cannot possibly work all the time, I understand why,
but I just don't have a solution to the problem ... yet)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 15:50:38 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id 1881037B422
	for <arch@freebsd.org>; Mon,  7 May 2001 15:50:34 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f47MoKe68863;
	Mon, 7 May 2001 15:50:20 -0700 (PDT)
	(envelope-from dillon)
Date: Mon, 7 May 2001 15:50:20 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105072250.f47MoKe68863@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References:  <Pine.LNX.4.21.0105061924160.582-100000@imladris.rielhome.conectiva>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:In short, the process suspension / wake up code only does
:load control in the sense that system load is reduced, but
:absolutely no effort is made to ensure that individual
:programs can run without thrashing. This, of course, kind of
:defeats the purpose of doing load control in the first place.
:
:
:To see this situation in some more detail, lets first look
:at how the current process suspension code has evolved over
:time.  Early paging Unixes, including earlier BSDs, had a
:rate-limited clock algorithm for the pageout code, where
:the VM subsystem would only scan (and page) memory out at
:a rate of fastscan pages per second.
:
:Whenever the paging system wasn't able to keep up, free
:memory would get below a certain threshold and memory load
:control (in the form of process suspension) kicked in.  As
:soon as free memory (averaged over a few seconds) got over
:this threshold, processes get swapped in again.  Because of
:the exact "speed limit" for the paging code, this would give
:a slow rotation of memory-resident progesses at a paging rate
:well below the thashing threshold.
:
:More modern Unixes, like FreeBSD, NetBSD or Linux, however,
:don't have the artificial speed limit on pageout.  This means
:the pageout code can go on freeing memory until well beyond
:the trashing point of the system.  It also means that the
:amount of free memory is no longer any indication of whether
:the system is thrashing or not.
:
:Add to that the fact that the classical load control in BSD
:resumes a suspended task whenever the system is above the
:(now not very meaningful) free memory threshold, regardless
:of whether the resident tasks have had the opportunity to
:make any progress ... which of course only encourages more
:thrashing instead of letting the system work itself out of
:the overload situation.
:
:
:Any solution will have to address the following points:
:
:1) allow the resident processes to stay resident long
:   enough to make progess

    This is accomplished as a side effect to the way the page queues
    are handled.  A page placed in the active queue is not allowed
    to be moved out of that queue for a minimum period of time based
    on page aging.  See line 500 or so of vm_pageout.c (in -stable) .

    Thus when a process wakes up and pages a bunch of pages in, those
    pages are guarenteed to stay in-core for a period of time no matter
    what level of memory stress is occuring.

:2) make sure the resident processes aren't thrashing,
:   that is, don't let new processes back in memory if
:   none of the currently resident processes is "ready"
:   to be suspended

    When a process is swapped out, the process is removed from the run
    queue and the P_INMEM flag is cleared.  The process is only woken up
    when faultin() is called (vm_glue.c line 312).  faultin() is only
    called from the scheduler() (line 340 of vm_glue.c) and the scheduler
    only runs when the VM system indicates a minimum number of free pages
    are available (vm_page_count_min()), which you can adjust with
    the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings
    on how much memory the system has).

    So what occurs is that the system comes under extreme memory pressure
    and starts to swapout blocked processes.  This reduces memory pressure
    over time.  When memory pressure is sufficiently reudced the scheduler
    wakes up a swapped-out process (one at a time).

    There might be some fine tuning that we can do here, such as try to
    choose a better process to swapout (right now it's priority based which
    isn't the best way to do it).

:3) have a mechanism to detect thrashing in a VM
:   subsystem which isn't rate-limited  (hard?)

    In FreeBSD, rate-limiting is a function of a lightly loaded system.
    We rate-limit page laundering (pageouts).  However, if the rate-limited
    laundering is not sufficient to reach our free + cache page targets,
    we take another laundering loop and this time do not limit it at all.

    Thus under heavy memory pressure, no real rate limiting occurs.  The
    system will happily pagein and pageout megabytes/sec.  The reason we
    do this is because David Greenman and John Dyson found a long time
    ago that attempting to rate limit paging does not actually solve the
    thrashing problem, it actually makes it worse... So they solved the
    problem another way (see my answers for #1 and #2).  It isn't the
    paging operations themselves that cause thrashing.

:and, for extra brownie points:
:4) fairness, small processes can be paged in and out
:   faster, so we can suspend&resume them faster; this
:   has the side effect of leaving the proverbial root
:   shell more usable

    Small process can contribute to thrashing as easily as large
    processes can under extreme memory pressure... for example,
    take an overloaded shell machine.  *ALL* processes are 'small'
    processes in that case, or most of them are, and in great numbers
    they can be the cause.  So no test that specifically checks the
    size of the process can be used to give it any sort of priority.

    Additionally, *idle* small processes are also great contributers 
    to the VM subsystem in regards to clearing out idle pages.  For
    example, on a heavily loaded shell machine more then 80% of the
    'small processes' have been idle for long periods of time and it
    is exactly our ability to page them out that allows us to extend
    the machine's operational life and move the thrashing threshold
    farther away.  The last thing we want to do is make a 'fix' that
    prevents us from paging out idle small processes.  It would kill
    the machine.

:5) make sure already resident processes cannot create
:   a situation that'll keep the swapped out tasks out
:   of memory forever ... but don't kill performance either,
:   since bad performance means we cannot get out of the
:   bad situation we're in

    When the system starts swapping processes out, it continues to swap
    them out until memory pressure goes down.  With memory pressure down
    processes are swapped back in again one at a time, typically in FIFO
    order.  So this situation will generally not occur.

    Basically we have all the algorithms in place to deal with thrashing.
    I'm sure that there are a few places where we can optimize things...
    for example, we can certainly tune the swapout algorithm itself.

						-Matt

:regards,
:
:Rik
:--
:Virtual memory is like a game you can't win;

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 16:35:34 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
	by hub.freebsd.org (Postfix) with ESMTP id 54FE537B422
	for <arch@freebsd.org>; Mon,  7 May 2001 16:35:27 -0700 (PDT)
	(envelope-from riel@conectiva.com.br)
Received: from burns.conectiva (burns.conectiva [10.0.0.4])
	by perninha.conectiva.com.br (Postfix) with SMTP id D180516B1C
	for <arch@freebsd.org>; Mon,  7 May 2001 20:35:25 -0300 (EST)
Received: (qmail 13083 invoked by uid 0); 7 May 2001 23:33:57 -0000
Received: from duckman.distro.conectiva (HELO duckman.conectiva.com.br) (root@10.0.17.2)
  by burns.conectiva with SMTP; 7 May 2001 23:33:57 -0000
Received: from localhost (riel@localhost)
	by duckman.conectiva.com.br (8.11.3/8.11.3) with ESMTP id f47NZPF02739;
	Mon, 7 May 2001 20:35:25 -0300
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Date: Mon, 7 May 2001 20:35:25 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-X-Sender:  <riel@duckman.distro.conectiva>
To: Matt Dillon <dillon@earth.backplane.com>
Cc: <arch@freebsd.org>, <linux-mm@kvack.org>,
	<sfkaplan@cs.amherst.edu>
Subject: Re: on load control / process swapping
In-Reply-To: <200105072250.f47MoKe68863@earth.backplane.com>
Message-ID: <Pine.LNX.4.33.0105071956180.18102-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Mon, 7 May 2001, Matt Dillon wrote:

> :1) allow the resident processes to stay resident long
> :   enough to make progess
>
>     This is accomplished as a side effect to the way the page queues
>     are handled.  A page placed in the active queue is not allowed
>     to be moved out of that queue for a minimum period of time based
>     on page aging.  See line 500 or so of vm_pageout.c (in -stable) .
>
>     Thus when a process wakes up and pages a bunch of pages in, those
>     pages are guarenteed to stay in-core for a period of time no matter
>     what level of memory stress is occuring.

I don't see anything limiting the speed at which the active list
is scanned over and over again. OTOH, you are right that a failure
to deactivate enough pages will trigger the swapout code .....

This sure is a subtle interaction ;)

> :2) make sure the resident processes aren't thrashing,
> :   that is, don't let new processes back in memory if
> :   none of the currently resident processes is "ready"
> :   to be suspended
>
>     When a process is swapped out, the process is removed from the run
>     queue and the P_INMEM flag is cleared.  The process is only woken up
>     when faultin() is called (vm_glue.c line 312).  faultin() is only
>     called from the scheduler() (line 340 of vm_glue.c) and the scheduler
>     only runs when the VM system indicates a minimum number of free pages
>     are available (vm_page_count_min()), which you can adjust with
>     the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings
>     on how much memory the system has).

But ... is this a good enough indication that the processes
currently resident have enough memory available to make any
progress ?

Especially if all the currently resident processes are waiting
in page faults, won't that make it easier for the system to find
pages to swap out, etc... ?

One thing I _am_ wondering though: the pageout and the pagein
thresholds are different. Can't this lead to problems where we
always hit both the pageout threshold -and- the pagein threshold
and the system thrashes swapping processes in and out ?

> :3) have a mechanism to detect thrashing in a VM
> :   subsystem which isn't rate-limited  (hard?)
>
>     In FreeBSD, rate-limiting is a function of a lightly loaded system.
>     We rate-limit page laundering (pageouts).  However, if the rate-limited
>     laundering is not sufficient to reach our free + cache page targets,
>     we take another laundering loop and this time do not limit it at all.
>
>     Thus under heavy memory pressure, no real rate limiting occurs.  The
>     system will happily pagein and pageout megabytes/sec.  The reason we
>     do this is because David Greenman and John Dyson found a long time
>     ago that attempting to rate limit paging does not actually solve the
>     thrashing problem, it actually makes it worse... So they solved the
>     problem another way (see my answers for #1 and #2).  It isn't the
>     paging operations themselves that cause thrashing.

Agreed on all points ... I'm just wondering how well 1) and 2)
still work after all the changes that were made to the VM in
the last few years.  They sure are subtle ...

> :and, for extra brownie points:
> :4) fairness, small processes can be paged in and out
> :   faster, so we can suspend&resume them faster; this
> :   has the side effect of leaving the proverbial root
> :   shell more usable
>
>     Small process can contribute to thrashing as easily as large
>     processes can under extreme memory pressure... for example,
>     take an overloaded shell machine.  *ALL* processes are 'small'
>     processes in that case, or most of them are, and in great numbers
>     they can be the cause.  So no test that specifically checks the
>     size of the process can be used to give it any sort of priority.

There's a test related to 2) though ... A small process needs
to be in memory less time than a big process in order to make
progress, so it can be swapped out earlier.

It can also be swapped back in earlier, giving small processes
shorter "time slices" for swapping than what large processes
have.  I'm not quite sure how much this would matter, though...

> :5) make sure already resident processes cannot create
> :   a situation that'll keep the swapped out tasks out
> :   of memory forever ... but don't kill performance either,
> :   since bad performance means we cannot get out of the
> :   bad situation we're in
>
>     When the system starts swapping processes out, it continues to swap
>     them out until memory pressure goes down.  With memory pressure down
>     processes are swapped back in again one at a time, typically in FIFO
>     order.  So this situation will generally not occur.
>
>     Basically we have all the algorithms in place to deal with thrashing.
>     I'm sure that there are a few places where we can optimize things...
>     for example, we can certainly tune the swapout algorithm itself.

Interesting, FreeBSD indeed _does_ seem to have all of the things in
place (though the interactions between the various parts seem to be
carefully hidden ;)).

They indeed should work for lots of scenarios, but things like the
subtlety of some of the code and the fact that the swapin and
swapout thresholds are fairly unrelated look a bit worrying...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 17:56:21 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id 285E937B423
	for <arch@freebsd.org>; Mon,  7 May 2001 17:56:16 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f480u1Q71866;
	Mon, 7 May 2001 17:56:01 -0700 (PDT)
	(envelope-from dillon)
Date: Mon, 7 May 2001 17:56:01 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105080056.f480u1Q71866@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: <arch@freebsd.org>, <linux-mm@kvack.org>,
	<sfkaplan@cs.amherst.edu>
Subject: Re: on load control / process swapping
References:  <Pine.LNX.4.33.0105071956180.18102-100000@duckman.distro.conectiva>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


:>     to be moved out of that queue for a minimum period of time based
:>     on page aging.  See line 500 or so of vm_pageout.c (in -stable) .
:>
:>     Thus when a process wakes up and pages a bunch of pages in, those
:>     pages are guarenteed to stay in-core for a period of time no matter
:>     what level of memory stress is occuring.
:
:I don't see anything limiting the speed at which the active list
:is scanned over and over again. OTOH, you are right that a failure
:to deactivate enough pages will trigger the swapout code .....
:
:This sure is a subtle interaction ;)

    Look at the loop line 1362 of vm_pageout.c.  Note that it enforces
    a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable
    to clean sufficient pages in two loops.  The tsleep is not woken up
    by anyone while waiting that 1/2 second becuase vm_pages_needed has
    not been cleared yet.  This is what is limiting the page queue scan.

:>     When a process is swapped out, the process is removed from the run
:>     queue and the P_INMEM flag is cleared.  The process is only woken up
:>     when faultin() is called (vm_glue.c line 312).  faultin() is only
:>     called from the scheduler() (line 340 of vm_glue.c) and the scheduler
:>     only runs when the VM system indicates a minimum number of free pages
:>     are available (vm_page_count_min()), which you can adjust with
:>     the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings
:>     on how much memory the system has).
:
:But ... is this a good enough indication that the processes
:currently resident have enough memory available to make any
:progress ?

    Yes.  Consider detecting the difference between a large process accessing
    its pages randomly, and a small process accessing a relatively small
    set of pages over and over again.  Now consider what happens when the
    system gets overloaded.  The small process will be able to access its
    pages enough that they will get page priority over the larger process.
    The larger process, due to the more random accesses (or simply the fact
    that it is accessing a larger set of pages) will tend to stall more on
    pagein I/O which has the side effect of reducing the large process's
    access rate on all of its pages.  The result:  small processes get more
    priority just by being small.

:Especially if all the currently resident processes are waiting
:in page faults, won't that make it easier for the system to find
:pages to swap out, etc... ?
:
:One thing I _am_ wondering though: the pageout and the pagein
:thresholds are different. Can't this lead to problems where we
:always hit both the pageout threshold -and- the pagein threshold
:and the system thrashes swapping processes in and out ?

    The system will not page out a page it has just paged in due to the
    center-of-the-road initialization of act_count (the page aging).
    My experience at BEST was that both pagein and pageout activity
    occured simultaniously, but the fact had no detrimental effect on
    the system.  You have to treat the pagein and pageout operations
    independantly because, in fact, they are only weakly related to each
    other.  The only optimization you make, to reduce thrashing, is to
    not allow a just-paged-in page to immediately turn around and be paged
    out.

    I could probably make this work even better by setting the vm_page_t's
    act_count to its max value when paging in from swap.  I'll think about
    doing that.

    The pagein and pageout rates have nothing to do with thrashing, per say,
    and should never be arbitrarily limited.   Consider the difference
    between a system that is paing heavily and a system with only two small
    processes (like cp's) competing for disk I/O.  Insofar as I/O goes,
    there is no difference.  You can have a perfectly running system with
    high pagein and pageout rates.  It's only when the paging I/O starts
    to eat into pages that are in active use where thrashing begins to occur.
    Think of a hotdog being eaten from both ends by two lovers.  Memory
    pressure (active VM pages) eat away at one end, pageout I/O eats away
    at the other.  You don't get fireworks until they meet.

:>     ago that attempting to rate limit paging does not actually solve the
:>     thrashing problem, it actually makes it worse... So they solved the
:>     problem another way (see my answers for #1 and #2).  It isn't the
:>     paging operations themselves that cause thrashing.
:
:Agreed on all points ... I'm just wondering how well 1) and 2)
:still work after all the changes that were made to the VM in
:the last few years.  They sure are subtle ...

    The algorithms mostly stayed the same.  Much of the work was to remove
    artificial limitations that were reducing performance (due to the
    existance of greater amounts of memory, faster disks, and so forth...).
    I also spent a good deal of time removing 'restart' cases from the code
    that was causing a lot of cpu-wasteage in certain cases.  What few
    restart cases remain just don't occur all that often.  And I've done
    other things like extend the heuristics we already use for read()/write()
    to the VM system and change heuristic variables into per-vm-map elements
    rather then sharing them with read/write within the vnode.  Etc.

:>     Small process can contribute to thrashing as easily as large
:>     processes can under extreme memory pressure... for example,
:>     take an overloaded shell machine.  *ALL* processes are 'small'
:>     processes in that case, or most of them are, and in great numbers
:>     they can be the cause.  So no test that specifically checks the
:>     size of the process can be used to give it any sort of priority.
:
:There's a test related to 2) though ... A small process needs
:to be in memory less time than a big process in order to make
:progress, so it can be swapped out earlier.

    Not necessarily.  It depends whether the small process is cpu-bound
    or interactive.  A cpu-bound small process should be allowed to run
    and not swapped out.  An interactive small process can be safely
    swapped if idle for a period of time, because it can be swapped back
    in very quickly.  It should not be swapped if it isn't idle (someone is
    typing, for example), because that would just waste disk I/O paging out
    and then paging right back in.  You never want to swapout a small
    process gratuitously simply because it is small.

:It can also be swapped back in earlier, giving small processes
:shorter "time slices" for swapping than what large processes
:have.  I'm not quite sure how much this would matter, though...

    Both swapin and swapout activities are demand paged, but will be
    clustered if possible.  I don't think there would be any point
    trying to conditionalize the algorithm based on the size of the
    process.  The size has its own indirect positive effects which I
    think are sufficient.

:Interesting, FreeBSD indeed _does_ seem to have all of the things in
:place (though the interactions between the various parts seem to be
:carefully hidden ;)).
:
:They indeed should work for lots of scenarios, but things like the
:subtlety of some of the code and the fact that the swapin and
:swapout thresholds are fairly unrelated look a bit worrying...
:
:regards,
:
:Rik

    I don't think it's possible to write a nice neat thrash-handling
    algorithm.  It's a bunch of algorithms all working together, all
    closely tied to the VM page cache.  Each taken alone is fairly easy
    to describe and understand.  All of them together result in complex
    interactions that are very easy to break if you make a mistake.  It
    usually takes me a couple of tries to get a solution to a problem in
    place without breaking something else (performance-wise) in the
    process.  For example, I fubar'd heavy load performance for a month
    in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Mon May  7 21:44:21 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id E94C437B423
	for <arch@FreeBSD.org>; Mon,  7 May 2001 21:44:01 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.3/8.11.3) with SMTP id f484hwf62354
	for <arch@FreeBSD.org>; Tue, 8 May 2001 00:43:58 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Tue, 8 May 2001 00:43:58 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: arch@FreeBSD.org
Subject: securelevel -> securelevel_check()
Message-ID: <Pine.NEB.3.96L.1010508003718.11741U-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


One of the features requested for jailNG a number of times, most recently
by Matt Dillon, has been to introduce support for per-jail securelevels.
This would permit jail securelevels to float above the system securelevel,
and allow the jail securelevel to be lowered from outside the jail.  This
would offer a number of benefits, largely in the form of permitting more
sane use of file system flags within the jail.  To do this, it is
necessary to modify securelevel checks to attempt to go to a process-local
(well, credential-local) securelevel.  The first step in this process is
to abstract out securelevel checks to a central securelevel_check(cred,
maxlevel) call.

The attached patch does this for most of the kernel, excluding ipfilter
since that's contributed code.  In some cases, converting from global
securelevel to credential securelevel introduces ambiguities: should the
process credential be used, or the file descriptor credential, for
example.  These concerns existed in a number of cases already.  I may not
have them all right, but would welcome comments.

After this is in place, I will produce an updated jailNG patch that
incorporates a new managed per-jail securelevel variable.  When a
securelevel check is performed, the global value is used if the process is
not in jail.  If in jail, the greater of local and global securelevels
will be used.  Securelevel modification using the normal kern.securelevel
mib will now point to global securelevel outside of jail, and local
securelevel within.  kern.securelevel will only allow the securelevel to
be raised, never lowered.  The jail.instance.*.securelevel variable will
allow the securelevel to be lowered from outside the jail; however, due to
the check semantics, in effect per-jail securelevels will be at least the
global level, preventing jails from being used to circumvent the global
securelevel.

As I've indicated in the past, I'm not a great fan of securelevels, but
this seemed like a reasonable feature request to me, and it has
substantial utility, especially where the administrator may want to make
use of schg and related flags within the jail, but be able to disassemble
the jail (or modify it) without rebooting to lower the global securelevel.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


? compile/GENERIC
Index: alpha/alpha/mem.c
===================================================================
RCS file: /home/ncvs/src/sys/alpha/alpha/mem.c,v
retrieving revision 1.34
diff -u -r1.34 mem.c
--- alpha/alpha/mem.c	2001/03/26 12:39:47	1.34
+++ alpha/alpha/mem.c	2001/05/08 04:31:08
@@ -114,12 +114,16 @@
 static int
 mmopen(dev_t dev, int flags, int fmt, struct proc *p)
 {
+	int error;
 
 	switch (minor(dev)) {
 	case 0:
 	case 1:
-		if ((flags & FWRITE) && securelevel > 0)
-			return (EPERM);
+		if (flags & FWRITE) {
+			error = securelevel_check(p->p_ucred, 0);
+			if (error)
+				return (error);
+		}
 		break;
 	case 32:
 #ifdef PERFMON
Index: alpha/alpha/sys_machdep.c
===================================================================
RCS file: /home/ncvs/src/sys/alpha/alpha/sys_machdep.c,v
retrieving revision 1.10
diff -u -r1.10 sys_machdep.c
--- alpha/alpha/sys_machdep.c	2001/05/01 08:11:48	1.10
+++ alpha/alpha/sys_machdep.c	2001/05/08 04:31:08
@@ -114,8 +114,9 @@
 	if (error)
 		return (error);
 
-	if (securelevel > 0)
-		return (EPERM);
+	error = securelevel_check(p->p_ucred, 0);
+	if (error)
+		return (ERROR);
 
 	error = suser(p);
 	if (error)
Index: cam/scsi/scsi_pass.c
===================================================================
RCS file: /home/ncvs/src/sys/cam/scsi/scsi_pass.c,v
retrieving revision 1.28
diff -u -r1.28 scsi_pass.c
--- cam/scsi/scsi_pass.c	2001/03/27 05:45:11	1.28
+++ cam/scsi/scsi_pass.c	2001/05/08 04:31:13
@@ -37,6 +37,7 @@
 #include <sys/conf.h>
 #include <sys/errno.h>
 #include <sys/devicestat.h>
+#include <sys/proc.h>
 
 #include <cam/cam.h>
 #include <cam/cam_ccb.h>
@@ -368,9 +369,10 @@
 	/*
 	 * Don't allow access when we're running at a high securelvel.
 	 */
-	if (securelevel > 1) {
+	error = securelevel_check(p->p_ucred, 1);
+	if (error) {
 		splx(s);
-		return(EPERM);
+		return (error);
 	}
 
 	/*
Index: dev/pci/pci_user.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/pci/pci_user.c,v
retrieving revision 1.2
diff -u -r1.2 pci_user.c
--- dev/pci/pci_user.c	2001/03/26 12:40:30	1.2
+++ dev/pci/pci_user.c	2001/05/08 04:31:22
@@ -39,6 +39,7 @@
 #include <sys/kernel.h>
 #include <sys/queue.h>
 #include <sys/types.h>
+#include <sys/proc.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
@@ -87,8 +88,12 @@
 static int
 pci_open(dev_t dev, int oflags, int devtype, struct proc *p)
 {
-	if ((oflags & FWRITE) && securelevel > 0) {
-		return EPERM;
+	int error;
+
+	if (oflags & FWRITE) {
+		error = securelevel_check(p->p_ucred, 0);
+		if (error)
+			return (error);
 	}
 	return 0;
 }
Index: dev/random/randomdev.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/random/randomdev.c,v
retrieving revision 1.28
diff -u -r1.28 randomdev.c
--- dev/random/randomdev.c	2001/05/01 08:12:03	1.28
+++ dev/random/randomdev.c	2001/05/08 04:31:23
@@ -45,6 +45,7 @@
 #include <sys/sysctl.h>
 #include <sys/uio.h>
 #include <sys/unistd.h>
+#include <sys/proc.h>
 #include <sys/vnode.h>
 
 #include <machine/bus.h>
@@ -140,17 +141,29 @@
 static int
 random_open(dev_t dev, int flags, int fmt, struct proc *p)
 {
-	if ((flags & FWRITE) && (securelevel > 0 || suser(p)))
-		return EPERM;
-	else
+	int error;
+
+	if (flags & FWRITE) {
+		error = securelevel_check(p->p_ucred, 0);
+		if (error)
+			return error;
+
+		error = suser(p);
+		return error;
+	} else
 		return 0;
 }
 
 static int
 random_close(dev_t dev, int flags, int fmt, struct proc *p)
 {
-	if ((flags & FWRITE) && !(securelevel > 0 || suser(p)))
-		random_reseed();
+	int error;
+
+	if (flags & FWRITE) {
+		if (!(securelevel_check(p->p_ucred, 0) ||
+		    suser(p)))
+			random_reseed();
+	}
 	return 0;
 }
 
Index: dev/syscons/syscons.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/syscons/syscons.c,v
retrieving revision 1.357
diff -u -r1.357 syscons.c
--- dev/syscons/syscons.c	2001/05/01 08:12:05	1.357
+++ dev/syscons/syscons.c	2001/05/08 04:31:26
@@ -995,8 +995,9 @@
 	error = suser(p);
 	if (error != 0)
 	    return error;
-	if (securelevel > 0)
-	    return EPERM;
+	error = securelevel_check(p->p_ucred, 0);
+	if (error != 0)
+	    return error;
 #ifdef __i386__
 	p->p_md.md_regs->tf_eflags |= PSL_IOPL;
 #endif
Index: i386/i386/mem.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/mem.c,v
retrieving revision 1.88
diff -u -r1.88 mem.c
--- i386/i386/mem.c	2001/03/26 12:40:48	1.88
+++ i386/i386/mem.c	2001/05/08 04:31:27
@@ -113,15 +113,19 @@
 	switch (minor(dev)) {
 	case 0:
 	case 1:
-		if ((flags & FWRITE) && securelevel > 0)
-			return (EPERM);
+		if (flags & FWRITE) {
+			error = securelevel_check(p->p_ucred, 0);
+			if (error)
+				return (error);
+		}
 		break;
 	case 14:
 		error = suser(p);
 		if (error != 0)
 			return (error);
-		if (securelevel > 0)
-			return (EPERM);
+		error = securelevel_check(p->p_ucred, 0);
+		if (error)
+			return (error);
 		p->p_md.md_regs->tf_eflags |= PSL_IOPL;
 		break;
 	}
Index: i386/i386/sys_machdep.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/sys_machdep.c,v
retrieving revision 1.55
diff -u -r1.55 sys_machdep.c
--- i386/i386/sys_machdep.c	2001/05/01 08:12:47	1.55
+++ i386/i386/sys_machdep.c	2001/05/08 04:31:27
@@ -179,8 +179,9 @@
 
 	if ((error = suser(p)) != 0)
 		return (error);
-	if (securelevel > 0)
-		return (EPERM);
+	error = securelevel_check(p->p_ucred, 0);
+	if (error)
+		return (error);
 	/*
 	 * XXX 
 	 * While this is restricted to root, we should probably figure out
Index: i386/isa/spigot.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/isa/spigot.c,v
retrieving revision 1.48
diff -u -r1.48 spigot.c
--- i386/isa/spigot.c	2001/05/01 08:12:51	1.48
+++ i386/isa/spigot.c	2001/05/08 04:31:27
@@ -182,8 +182,9 @@
 	error = suser(p);
 	if (error != 0)
 		return error;
-	if (securelevel > 0)
-		return EPERM;
+	error = securelevel(p->p_ucred, 0);
+	if (error)
+		return error;
 #endif
 
 	ss->flags |= OPEN;
@@ -238,8 +239,9 @@
 		error = suser(p);
 		if (error != 0)
 			return error;
-		if (securelevel > 0)
-			return EPERM;
+		error = securelevel(p->p_ucred, 0);
+		if (error != 0)
+			return error;
 #endif
 		p->p_md.md_regs->tf_eflags |= PSL_IOPL;
 		break;
Index: i386/linux/linux_machdep.c
===================================================================
RCS file: /home/ncvs/src/sys/i386/linux/linux_machdep.c,v
retrieving revision 1.16
diff -u -r1.16 linux_machdep.c
--- i386/linux/linux_machdep.c	2001/05/01 08:12:52	1.16
+++ i386/linux/linux_machdep.c	2001/05/08 04:31:28
@@ -472,8 +472,8 @@
 		return (EINVAL);
 	if ((error = suser(p)) != 0)
 		return (error);
-	if (securelevel > 0)
-		return (EPERM);
+	if ((error = securelevel_check(p->p_ucred, 0)) != 0)
+		return (error);
 	p->p_md.md_regs->tf_eflags = (p->p_md.md_regs->tf_eflags & ~PSL_IOPL) |
 	    (args->level * (PSL_IOPL / 3));
 	return (0);
Index: ia64/ia64/mem.c
===================================================================
RCS file: /home/ncvs/src/sys/ia64/ia64/mem.c,v
retrieving revision 1.3
diff -u -r1.3 mem.c
--- ia64/ia64/mem.c	2001/03/26 12:40:56	1.3
+++ ia64/ia64/mem.c	2001/05/08 04:31:32
@@ -113,12 +113,16 @@
 static int
 mmopen(dev_t dev, int flags, int fmt, struct proc *p)
 {
+	int error;
 
 	switch (minor(dev)) {
 	case 0:
 	case 1:
-		if ((flags & FWRITE) && securelevel > 0)
-			return (EPERM);
+		if (flags & FWRITE) {
+			error = securelevel_check(p->p_ucred, 0);
+			if (error)
+				return (error);
+		}
 		break;
 	case 32:
 #ifdef PERFMON
Index: kern/kern_linker.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_linker.c,v
retrieving revision 1.59
diff -u -r1.59 kern_linker.c
--- kern/kern_linker.c	2001/03/22 08:58:45	1.59
+++ kern/kern_linker.c	2001/05/08 04:31:33
@@ -292,8 +292,9 @@
     int foundfile, error = 0;
 
     /* Refuse to load modules if securelevel raised */
-    if (securelevel > 0)
-	return EPERM; 
+    error = securelevel_check(curproc->p_ucred, 0);
+    if (error)
+	return error;
 
     lf = linker_find_file_by_name(filename);
     if (lf) {
@@ -420,8 +421,9 @@
     int i;
 
     /* Refuse to unload modules if securelevel raised */
-    if (securelevel > 0)
-	return EPERM; 
+    error = securelevel_check(curproc->p_ucred, 0);
+    if (error)
+	return error; 
 
     KLD_DPF(FILE, ("linker_file_unload: lf->refs=%d\n", file->refs));
     lockmgr(&lock, LK_EXCLUSIVE, 0, curproc);
@@ -673,8 +675,9 @@
 
     p->p_retval[0] = -1;
 
-    if (securelevel > 0)	/* redundant, but that's OK */
-	return EPERM;
+    error = securelevel_check(p->p_ucred, 0);	/* redundant, but that's OK */
+    if (error)
+	return error;
 
     if ((error = suser(p)) != 0)
 	return error;
@@ -716,8 +719,9 @@
     linker_file_t lf;
     int error = 0;
 
-    if (securelevel > 0)	/* redundant, but that's OK */
-	return EPERM;
+    error = securelevel_check(p->p_ucred, 0);	/* redundant, but that's OK */
+    if (error)
+	return error;
 
     if ((error = suser(p)) != 0)
 	return error;
Index: kern/kern_prot.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_prot.c,v
retrieving revision 1.89
diff -u -r1.89 kern_prot.c
--- kern/kern_prot.c	2001/05/01 08:12:57	1.89
+++ kern/kern_prot.c	2001/05/08 04:31:34
@@ -984,6 +984,22 @@
 	return (0);
 }
 
+/*
+ * Given a securelevel requirement, test whether securelevel state
+ * meets the requirement.
+ */
+int
+securelevel_check(cred, maxlevel)
+	struct ucred *cred;
+	int maxlevel;
+{
+
+	/* XXX: In the future, this will be protected by a mutex. */
+	if (securelevel > maxlevel)
+		return (EPERM);
+	return (0);
+}
+
 static int suser_permitted = 1;
 
 SYSCTL_INT(_kern, OID_AUTO, suser_permitted, CTLFLAG_RW, &suser_permitted, 0,
@@ -1189,8 +1205,11 @@
 	}
 
 	/* can't trace init when securelevel > 0 */
-	if (securelevel > 0 && p2->p_pid == 1)
-		return (EPERM);
+	if (p2->p_pid == 1) {
+		error = securelevel_check(p1->p_ucred, 0);
+		if (error)
+			return (error);
+	}
 
 	return (0);
 }
Index: kern/kern_sysctl.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.106
diff -u -r1.106 kern_sysctl.c
--- kern/kern_sysctl.c	2001/03/08 01:20:43	1.106
+++ kern/kern_sysctl.c	2001/05/08 04:31:34
@@ -1013,9 +1013,15 @@
 	}
 
 	/* If writing isn't allowed */
-	if (req->newptr && (!(oid->oid_kind & CTLFLAG_WR) ||
-	    ((oid->oid_kind & CTLFLAG_SECURE) && securelevel > 0)))
-		return (EPERM);
+	if (req->newptr) {
+		if (!(oid->oid_kind & CTLFLAG_WR))
+			return (EPERM);
+		if (oid->oid_kind & CTLFLAG_SECURE) {
+			error = securelevel_check(req->p->p_ucred, 0);
+			if (error)
+				return (error);
+		}
+	}
 
 	/* Most likely only root can write */
 	if (!(oid->oid_kind & CTLFLAG_ANYBODY) &&
Index: kern/kern_time.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_time.c,v
retrieving revision 1.73
diff -u -r1.73 kern_time.c
--- kern/kern_time.c	2001/05/01 08:12:57	1.73
+++ kern/kern_time.c	2001/05/08 04:31:35
@@ -103,7 +103,7 @@
 	 * than one second, nor more than once per second. This allows
 	 * a miscreant to make the clock march double-time, but no worse.
 	 */
-	if (securelevel > 1) {
+	if (securelevel_check(curproc->p_ucred, 1)) {
 		if (delta.tv_sec < 0 || delta.tv_usec < 0) {
 			/*
 			 * Update maxtime to latest time we've seen.
Index: miscfs/procfs/procfs_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_subr.c,v
retrieving revision 1.33
diff -u -r1.33 procfs_subr.c
--- miscfs/procfs/procfs_subr.c	2001/05/01 08:13:09	1.33
+++ miscfs/procfs/procfs_subr.c	2001/05/08 04:31:35
@@ -250,14 +250,17 @@
 	struct proc *curp = uio->uio_procp;
 	struct pfsnode *pfs = VTOPFS(vp);
 	struct proc *p;
-	int rtval;
+	int rtval, error;
 
 	p = PFIND(pfs->pfs_pid);
 	if (p == NULL)
 		return (EINVAL);
 	PROC_UNLOCK(p);
-	if (p->p_pid == 1 && securelevel > 0 && uio->uio_rw == UIO_WRITE)
-		return (EACCES);
+	if (p->p_pid == 1 && uio->uio_rw == UIO_WRITE) {
+		error = securelevel_check(curp->p_ucred, 0);
+		if (error)
+			return (EACCES);
+	}
 
 	mp_fixme("pfs_lockowner needs a lock");
 	while (pfs->pfs_lockowner) {
Index: miscfs/specfs/spec_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/miscfs/specfs/spec_vnops.c,v
retrieving revision 1.157
diff -u -r1.157 spec_vnops.c
--- miscfs/specfs/spec_vnops.c	2001/04/30 14:35:35	1.157
+++ miscfs/specfs/spec_vnops.c	2001/05/08 04:31:36
@@ -176,13 +176,16 @@
 		 * When running in secure mode, do not allow opens
 		 * for writing if the device is mounted
 		 */
-		if (securelevel >= 1 && vfs_mountedon(vp))
-			return (EPERM);
+		error = securelevel_check(ap->a_cred, 0);
+		if (error && vfs_mountedon(vp))
+			return (error);
 
 		/*
 		 * When running in very secure mode, do not allow
 		 * opens for writing of any devices.
 		 */
+		error = securelevel_check(ap->a_cred, 1);
+			return (error);
 		if (securelevel >= 2)
 			return (EPERM);
 	}
Index: netinet/ip_dummynet.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/ip_dummynet.c,v
retrieving revision 1.39
diff -u -r1.39 ip_dummynet.c
--- netinet/ip_dummynet.c	2001/02/10 00:10:18	1.39
+++ netinet/ip_dummynet.c	2001/05/08 04:31:53
@@ -1817,8 +1817,11 @@
     struct dn_pipe *p, tmp_pipe;
 
     /* Disallow sets in really-really secure mode. */
-    if (sopt->sopt_dir == SOPT_SET && securelevel >= 3)
-	return (EPERM);
+    if (sopt->sopt_dir == SOPT_SET) {
+	error = securelevel_check(curproc->p_ucred, 2);
+	if (error)
+	    return (error);
+    }
 
     switch (sopt->sopt_name) {
     default :
Index: netinet/ip_fw.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/ip_fw.c,v
retrieving revision 1.164
diff -u -r1.164 ip_fw.c
--- netinet/ip_fw.c	2001/04/06 06:52:25	1.164
+++ netinet/ip_fw.c	2001/05/08 04:31:55
@@ -43,6 +43,7 @@
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
 #include <sys/ucred.h>
+#include <sys/proc.h>
 #include <net/if.h>
 #include <net/route.h>
 #include <netinet/in.h>
@@ -1841,9 +1842,12 @@
 	 * Disallow modifications in really-really secure mode, but still allow
 	 * the logging counters to be reset.
 	 */
-	if (securelevel >= 3 && (sopt->sopt_name == IP_FW_ADD ||
-	    (sopt->sopt_dir == SOPT_SET && sopt->sopt_name != IP_FW_RESETLOG)))
-			return (EPERM);
+	if (sopt->sopt_name == IP_FW_ADD || (sopt->sopt_dir == SOPT_SET &&
+	    sopt->sopt_name != IP_FW_RESETLOG)) {
+		error = securelevel_check(curproc->p_ucred, 2);
+		if (error)
+			return (error);
+	}
 	error = 0;
 
 	switch (sopt->sopt_name) {
Index: pc98/pc98/syscons.c
===================================================================
RCS file: /home/ncvs/src/sys/pc98/pc98/syscons.c,v
retrieving revision 1.159
diff -u -r1.159 syscons.c
--- pc98/pc98/syscons.c	2001/05/01 08:13:15	1.159
+++ pc98/pc98/syscons.c	2001/05/08 04:31:58
@@ -997,8 +997,9 @@
 	error = suser(p);
 	if (error != 0)
 	    return error;
-	if (securelevel > 0)
-	    return EPERM;
+	error = securelevel(p->p_ucred, 0);
+	if (error != 0)
+	    return error;
 #ifdef __i386__
 	p->p_md.md_regs->tf_eflags |= PSL_IOPL;
 #endif
Index: sys/systm.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/systm.h,v
retrieving revision 1.139
diff -u -r1.139 systm.h
--- sys/systm.h	2001/04/27 19:28:25	1.139
+++ sys/systm.h	2001/05/08 04:31:59
@@ -164,6 +164,7 @@
 /* flags for suser_xxx() */
 #define PRISON_ROOT	1
 
+int	securelevel_check __P((struct ucred *cred, int maxlevel));
 int	suser __P((struct proc *));
 int	suser_xxx __P((struct ucred *cred, struct proc *proc, int flag));
 int	u_cansee __P((struct ucred *u1, struct ucred *u2));
Index: ufs/ufs/ufs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.166
diff -u -r1.166 ufs_vnops.c
--- ufs/ufs/ufs_vnops.c	2001/05/01 09:12:39	1.166
+++ ufs/ufs/ufs_vnops.c	2001/05/08 04:32:03
@@ -482,7 +482,7 @@
 		if (!suser_xxx(cred, NULL, 0)) {
 			if ((ip->i_flags
 			    & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) &&
-			    securelevel > 0)
+			    securelevel_check(p->p_ucred, 0))
 				return (EPERM);
 			/* Snapshot flag cannot be set or cleared */
 			if (((vap->va_flags & SF_SNAPSHOT) != 0 &&
Index: vm/vm_mmap.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_mmap.c,v
retrieving revision 1.118
diff -u -r1.118 vm_mmap.c
--- vm/vm_mmap.c	2001/05/01 08:13:21	1.118
+++ vm/vm_mmap.c	2001/05/08 04:32:03
@@ -333,7 +333,8 @@
 			 * other securelevel.
 			 * XXX this will have to go
 			 */
-			if (securelevel >= 1)
+			error = securelevel_check(p->p_ucred, 0);
+			if (error)
 				disablexworkaround = 1;
 			else
 				disablexworkaround = suser(p);


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8  8:35:23 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from blount.mail.mindspring.net (blount.mail.mindspring.net [207.69.200.226])
	by hub.freebsd.org (Postfix) with ESMTP id 954D637B423
	for <freebsd-arch@FreeBSD.ORG>; Tue,  8 May 2001 08:35:19 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from mindspring.com (pool0302.cvx21-bradley.dialup.earthlink.net [209.179.193.47])
	by blount.mail.mindspring.net (8.9.3/8.8.5) with ESMTP id LAA04345;
	Tue, 8 May 2001 11:35:06 -0400 (EDT)
Message-ID: <3AF8123F.632C02E6@mindspring.com>
Date: Tue, 08 May 2001 08:35:27 -0700
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Bosko Milekic <bmilekic@technokratis.com>,
	freebsd-arch@FreeBSD.ORG
Subject: Re: Mbuf slab [new allocator]
References: <20010503195904.A53281@technokratis.com> <200105051833.f45IXiW49096@earth.backplane.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Matt Dillon wrote:
> Bosko Milekic wrote:
> : Anyone interested in the mbuf subsystem code should
> : probably read this.  Others may still read it, but it
> : is somewhat longer than your average Email, so consider
> : this a warning. :-)  Also, although I tried my best to
> : cover most issues here, feel free to let me know if I
> : should clarify some points.
> :
> :  Not so long ago, as I'm sure some of you remember,
> : Alfred committed a patch
> : ...
> 
>     Sounds good.  You know the motto - first make it work,
> then make it fast.

SLAB allocators are inherently pessimal for symmetry and
kernel preemption, which is to say, this change would be
inherently bad for SMP.


I also personally think SLAB allocators are _not_ the way
to go in the long run (or even in the short run).

I would point you guys to:

	UNIX Internals: The New Frontiers
	Uresh Vahalia
	Chapter 12

Specifically, I suggest looking at the Dynix Allocator; the
author likes the SLAB allocators, and when I was reviewing
the book for Prentice Hall prior to its publication, we
differed significantly on some aspects of Chapter 12.

The Dynix allocator is still the best bet for optimal
concurrency; a combination of the Dynaix allocator and a
zone allocator would probably be the best we could hope
for in the near term, without a total rewrite taking cache
coloring into account.


Note that the _primary factor_, IMO, limiting the number
of processors usable by SVR4 prior to degrading unacceptably,
is the use of a SLAB allocator, which places all processors
into the same contention zone.


If you guys _insist_ on going to a SLAB allocator, _at least_
do it right -- one of the few benefits of a SLAB allocator is
the ability to perform allocations at interrupt level, if it
is correctly implemented.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8  8:46:58 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1BEFB37B422; Tue,  8 May 2001 08:46:54 -0700 (PDT)
	(envelope-from keichii@peorth.iteration.net)
Received: by peorth.iteration.net (Postfix, from userid 1001)
	id B01A8595E8; Tue,  8 May 2001 10:46:51 -0500 (CDT)
Date: Tue, 8 May 2001 10:46:51 -0500
From: "Michael C . Wu" <keichii@iteration.net>
To: Brian Dean <bsd@bsdhome.com>
Cc: freebsd-arch@freebsd.org, small@freebsd.org
Subject: Re: rc.diskless* patches
Message-ID: <20010508104651.B38957@peorth.iteration.net>
Reply-To: "Michael C . Wu" <keichii@peorth.iteration.net>
Mail-Followup-To: "Michael C . Wu" <keichii@iteration.net>,
	Brian Dean <bsd@bsdhome.com>, freebsd-arch@freebsd.org,
	small@freebsd.org
References: <20010502225656.A1173@vger.bsdhome.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010502225656.A1173@vger.bsdhome.com>; from bsd@bsdhome.com on Wed, May 02, 2001 at 10:56:56PM -0400
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8  5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wed, May 02, 2001 at 10:56:56PM -0400, Brian Dean scribbled:
| I've put together some patches to the diskless startup code that I'd
| like to commit.  I've made both -stable and -current versions of the
| patches.  I've tested the -stable patches, but I have not tested the
| -current patches, hopefully someone can do that and get back to me.
| My -current environment is not working at the moment.
| 
| The patches do three things:

[snip]

| My patches are at:
| 	http://people.freebsd.org/~bsd/diskless

I think this is fine, there should be no difference
to users.  Perhaps -small will think so too.


Michael,
-- 
+-----------------------------------------------------------+
| keichii@iteration.net         | keichii@freebsd.org       |
| http://iteration.net/~keichii | Yes, BSD is a conspiracy. |
+-----------------------------------------------------------+

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8  9:22: 5 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id 8308537B422
	for <freebsd-arch@FreeBSD.ORG>; Tue,  8 May 2001 09:21:59 -0700 (PDT)
	(envelope-from bright@fw.wintelcom.net)
Received: (from bright@localhost)
	by fw.wintelcom.net (8.10.0/8.10.0) id f48GLlP06751;
	Tue, 8 May 2001 09:21:47 -0700 (PDT)
Date: Tue, 8 May 2001 09:21:47 -0700
From: Alfred Perlstein <bright@wintelcom.net>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: Matt Dillon <dillon@earth.backplane.com>,
	Bosko Milekic <bmilekic@technokratis.com>, freebsd-arch@FreeBSD.ORG
Subject: Re: Mbuf slab [new allocator]
Message-ID: <20010508092146.L18676@fw.wintelcom.net>
References: <20010503195904.A53281@technokratis.com> <200105051833.f45IXiW49096@earth.backplane.com> <3AF8123F.632C02E6@mindspring.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3AF8123F.632C02E6@mindspring.com>; from tlambert2@mindspring.com on Tue, May 08, 2001 at 08:35:27AM -0700
X-all-your-base: are belong to us.
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

* Terry Lambert <tlambert2@mindspring.com> [010508 08:35] wrote:
> Matt Dillon wrote:
> > Bosko Milekic wrote:
> > : Anyone interested in the mbuf subsystem code should
> > : probably read this.  Others may still read it, but it
> > : is somewhat longer than your average Email, so consider
> > : this a warning. :-)  Also, although I tried my best to
> > : cover most issues here, feel free to let me know if I
> > : should clarify some points.
> > :
> > :  Not so long ago, as I'm sure some of you remember,
> > : Alfred committed a patch
> > : ...
> > 
> >     Sounds good.  You know the motto - first make it work,
> > then make it fast.
> 
> SLAB allocators are inherently pessimal for symmetry and
> kernel preemption, which is to say, this change would be
> inherently bad for SMP.
> 
> 
> I also personally think SLAB allocators are _not_ the way
> to go in the long run (or even in the short run).
> 
> I would point you guys to:
> 
> 	UNIX Internals: The New Frontiers
> 	Uresh Vahalia
> 	Chapter 12

Terry, I know. :)

http://people.freebsd.org/~alfred/memcache/

/*

Slab and mp caching allocator.

The concepts used here are a combination of the slab, Dynix and
Horde allocators.

...


Of course it still needs a lot of work.

-Alfred


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8 13:53:39 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from beastie.mckusick.com (beastie.mckusick.com [209.31.233.184])
	by hub.freebsd.org (Postfix) with ESMTP id 3D32037B422
	for <arch@FreeBSD.ORG>; Tue,  8 May 2001 13:53:35 -0700 (PDT)
	(envelope-from mckusick@mckusick.com)
Received: from beastie.mckusick.com (localhost [127.0.0.1])
	by beastie.mckusick.com (8.9.3/8.9.3) with ESMTP id NAA08757;
	Tue, 8 May 2001 13:52:58 -0700 (PDT)
	(envelope-from mckusick@beastie.mckusick.com)
Message-Id: <200105082052.NAA08757@beastie.mckusick.com>
To: Matt Dillon <dillon@earth.backplane.com>
Subject: Re: on load control / process swapping 
Cc: Rik van Riel <riel@conectiva.com.br>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
In-Reply-To: Your message of "Mon, 07 May 2001 15:50:20 PDT."
             <200105072250.f47MoKe68863@earth.backplane.com> 
Date: Tue, 08 May 2001 13:52:58 -0700
From: Kirk McKusick <mckusick@mckusick.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I know that FreeBSD will swap out sleeping processes, but will it
ever swap out running processes? The old BSD VM system would do so
(we called it hard swapping). It is possible to get a set of running
processes that simply do not all fit in memory, and the only way
for them to make forward progress is to cycle them through memory.

As to the size issue, we used to be biased towards the processes
with large resident set sizes in kicking things out. In general,
swapping out small things does not buy you much memory and it
annoys more users. To avoid picking on the biggest, each time we
needed to kick something out, we would find the five biggest, and 
kick out the one that had been memory resident the longest. The
effect is to go round-robin among the big processes. Note that
this algorithm allows you to kick out shells, if they are the
biggest processes. Also note that this is a last ditch algorithm
used only after there are no more idle processes available to
kick out. Our decision that we had had to kick out running
processes was: (1) no idle processes available to swap, (2) load
average over one (if there is just one process, kicking it out
does not solve the problem :-), (3) paging rate above a specified
threshhold over the entire previous 30 seconds (e.g., been bad 
for a long time and not getting better in the short term), and
(4) paging rate to/from swap area using more than half the 
available disk bandwidth (if your filesystems are on the same
disk as you swap areas, you can get a false sense of success
because all your process stop paging while they are blocked
waiting for their file data.

	Kirk

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8 17:18:34 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id 2518537B422
	for <arch@FreeBSD.ORG>; Tue,  8 May 2001 17:18:31 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f490IGR87881;
	Tue, 8 May 2001 17:18:16 -0700 (PDT)
	(envelope-from dillon)
Date: Tue, 8 May 2001 17:18:16 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105090018.f490IGR87881@earth.backplane.com>
To: Kirk McKusick <mckusick@mckusick.com>
Cc: Rik van Riel <riel@conectiva.com.br>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping 
References:  <200105082052.NAA08757@beastie.mckusick.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:
:I know that FreeBSD will swap out sleeping processes, but will it
:ever swap out running processes? The old BSD VM system would do so
:(we called it hard swapping). It is possible to get a set of running
:processes that simply do not all fit in memory, and the only way
:for them to make forward progress is to cycle them through memory.

    I looked at the code fairly carefully last night... it doesn't
    swap out running processes and it also does not appear to swap
    out processes blocked in a page-fault (on I/O).  Now, of course
    we can't swap a process out right then (it might be holding locks),
    but I think it would be beneficial to be able to mark the process
    as 'requesting a swapout on return to user mode' or something
    like that.  At the moment what gets picked for swapping is
    hit-or-miss due to the wait states.

:As to the size issue, we used to be biased towards the processes
:with large resident set sizes in kicking things out. In general,
:swapping out small things does not buy you much memory and it

    The VM system does enforce the 'memoryuse' resource limit when
    the memory load gets heavy.  But once the load goes beyond that
    the VM system doesn't appear to care how big the process is.

:...
:biggest processes. Also note that this is a last ditch algorithm
:used only after there are no more idle processes available to
:kick out. Our decision that we had had to kick out running
:processes was: (1) no idle processes available to swap, (2) load
:average over one (if there is just one process, kicking it out
:does not solve the problem :-), (3) paging rate above a specified
:threshhold over the entire previous 30 seconds (e.g., been bad 
:for a long time and not getting better in the short term), and
:(4) paging rate to/from swap area using more than half the 
:available disk bandwidth (if your filesystems are on the same
:disk as you swap areas, you can get a false sense of success
:because all your process stop paging while they are blocked
:waiting for their file data.
:
:	Kirk

    I don't think we want to kick out running processes.  Thrashing
    by definition means that many of the processes are stuck in 
    disk-wait, usually from a VM fault, and not running.  The other 
    effect of thrashing is, of course, the the cpu idle time goes way
    up due to all the process stalls.  A process that is actually able 
    to run under these circumstances probably has a small run-time footprint
    (at least for whatever operation it is currently doing), so it should
    definitely be allowed to continue to run.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8 19: 8:40 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from netau1.alcanet.com.au (ntp.alcanet.com.au [203.62.196.27])
	by hub.freebsd.org (Postfix) with ESMTP id 4C3C437B423
	for <arch@FreeBSD.ORG>; Tue,  8 May 2001 19:08:34 -0700 (PDT)
	(envelope-from jeremyp@gsmx07.alcatel.com.au)
Received: from mfg1.cim.alcatel.com.au (mfg1.cim.alcatel.com.au [139.188.23.1])
	by netau1.alcanet.com.au (8.9.3 (PHNE_22672)/8.9.3) with ESMTP id MAA04160;
	Wed, 9 May 2001 12:07:48 +1000 (EST)
Received: from gsmx07.alcatel.com.au by cim.alcatel.com.au
 (PMDF V5.2-32 #37641) with ESMTP id <01K3CY9R4TGGRX79H5@cim.alcatel.com.au>;
 Wed, 9 May 2001 12:07:37 +1100
Received: (from jeremyp@localhost)	by gsmx07.alcatel.com.au (8.11.1/8.11.1)
 id f4927hR25482; Wed, 09 May 2001 12:07:43 +1000 (EST envelope-from jeremyp)
Content-return: prohibited
Date: Wed, 09 May 2001 12:07:43 +1000
From: Peter Jeremy <peter.jeremy@alcatel.com.au>
Subject: Re: on load control / process swapping
In-reply-to: <200105090018.f490IGR87881@earth.backplane.com>; from
 dillon@earth.backplane.com on Tue, May 08, 2001 at 05:18:16PM -0700
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Kirk McKusick <mckusick@mckusick.com>,
	Rik van Riel <riel@conectiva.com.br>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Mail-Followup-To: Matt Dillon <dillon@earth.backplane.com>,
	Kirk McKusick <mckusick@mckusick.com>,
	Rik van Riel <riel@conectiva.com.br>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Message-id: <20010509120743.Y59150@gsmx07.alcatel.com.au>
MIME-version: 1.0
Content-type: text/plain; charset=us-ascii
Content-disposition: inline
User-Agent: Mutt/1.2.5i
References: <200105082052.NAA08757@beastie.mckusick.com>
 <200105090018.f490IGR87881@earth.backplane.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On 2001-May-08 17:18:16 -0700, Matt Dillon <dillon@earth.backplane.com> wrote:
>    I don't think we want to kick out running processes.  Thrashing
>    by definition means that many of the processes are stuck in 
>    disk-wait, usually from a VM fault, and not running.  The other 
>    effect of thrashing is, of course, the the cpu idle time goes way
>    up due to all the process stalls.  A process that is actually able 
>    to run under these circumstances probably has a small run-time footprint
>    (at least for whatever operation it is currently doing), so it should
>    definitely be allowed to continue to run.

I don't think this follows.  A program that does something like:
{
	extern char	memory[BIG_NUMBER];
	int		i;

	for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE)
		memory[i]++;
}
will thrash nicely (assuming BIG_NUMBER is large compared to the
currently available physical memory).  Occasionally, it will be
runnable - at which stage it has a footprint of only two pages, but
after executing a couple of instructions, it'll have another page
fault.  Old pages will remain resident for some time before they age
enough to be paged out.  If the VM system is stressed, swapping this
process out completely would seem to be a win.

Whilst this code is artificial, a process managing a very large hash
table will have similar behaviour.

Given that most (all?) recent CPU's have cheap hi-resolution clocks,
would it be worthwhile for the VM system to maintain a per-process
page fault rate?  (average clock cycles before a process faults).  If
you ignore spikes due to process initialisation etc, a process that
faults very quickly after being given the CPU wants a working set size
that is larger than the VM system currently allows.  The fault rate
would seem to be proportional to the ratio between the wanted WSS and
allowed RSS.  This would seem to be a useful parameter to help decide
which process to swap out - in an ideal world the VM subsystem would
swap processes to keep the WSS of all in-core processes at about the
size of non-kernel RAM.

Peter

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Tue May  8 22: 9:49 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id BCB7E37B422; Tue,  8 May 2001 22:09:46 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.3/8.11.3) with SMTP id f4959hf80730;
	Wed, 9 May 2001 01:09:43 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Wed, 9 May 2001 01:09:43 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: John Baldwin <jhb@FreeBSD.org>
Cc: arch@FreeBSD.org
Subject: RE: Patch to eliminate struct pcred
In-Reply-To: <XFMail.010507130340.jhb@FreeBSD.org>
Message-ID: <Pine.NEB.3.96L.1010509010510.11741r-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


John,

Thanks for your comments.  As you point out, the srv4 exit change is
replicated from your kern_exit change of similar ilk.  It might be nice to
revisit whatever rationale there was for breaking out the srv4 exit code,
and see if we can just rely on a wrapped exit1(), which is the approach
taken by the linuxulator.  This would reduce code replication. 

I've likewise removed the intrace cached process flag, and increased the
size of the "there's a race condition here" warning in the execve() code.
As noted in the comment, and as you've indicated, we need to address this
more broad locking problems that result in security issues before we
un-giat this and a number of other calls (in particular, any operations
involving inter-process activities such as tracing, debugging, and
signalling).  While modifying the code, I cleaned up the sv[ug]id
modification code there -- I need to dig up a copy of POSIX.1 to verify
that the new (and the old) behavior are consistent with the requirements. 
I've also added a comment indicating that we may want to set P_SUGID in
the event that we do update the saved id's. 

I've also updated the patch to take into account my recent posix4 commits. 

The revised patch is available at: 

  http://www.watson.org/~robert/pcred.2.diff

Tomorrow I plan to run some more heavy-duty tests, and re-review the code. 
After that, I'd like to go ahead and commit, assuming no further reviews
will be coming in. 

Thanks,

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9  1: 3:19 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from Awfulhak.org (awfulhak.demon.co.uk [194.222.196.252])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8C72937B422; Wed,  9 May 2001 01:03:09 -0700 (PDT)
	(envelope-from brian@Awfulhak.org)
Received: from hak.lan.Awfulhak.org (root@hak.lan.Awfulhak.org [172.16.0.12])
	by Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4989YW15445;
	Wed, 9 May 2001 09:09:35 +0100 (BST)
	(envelope-from brian@lan.Awfulhak.org)
Received: from hak.lan.Awfulhak.org (brian@localhost [127.0.0.1])
	by hak.lan.Awfulhak.org (8.11.3/8.11.3) with ESMTP id f49833B84293;
	Wed, 9 May 2001 09:03:04 +0100 (BST)
	(envelope-from brian@hak.lan.Awfulhak.org)
Message-Id: <200105090803.f49833B84293@hak.lan.Awfulhak.org>
X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4
To: Brian Somers <brian@FreeBSD.org>
Cc: cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org,
	brian@Awfulhak.org, freebsd-arch@FreeBSD.org
Subject: Re: cvs commit: src/etc rc 
In-Reply-To: Message from Brian Somers <brian@FreeBSD.org> 
   of "Wed, 09 May 2001 00:24:47 PDT." <200105090724.f497OlW22190@freefall.freebsd.org> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 09 May 2001 09:03:03 +0100
From: Brian Somers <brian@Awfulhak.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> brian       2001/05/09 00:24:47 PDT
> 
>   Modified files:        (Branch: RELENG_4)
>     etc                  rc 
>   Log:
>   Remove sockets as well as regular files in /var/run and /var/spool/lock
>   at boot time.  This restores the pre-4.3 behaviour.
>   
>   Revision    Changes    Path
>   1.212.2.25  +2 -2      src/etc/rc

I think maybe this should just remove everything ?  Comments ?

-- 
Brian <brian@Awfulhak.org>                        <brian@[uk.]FreeBSD.org>
      <http://www.Awfulhak.org>                   <brian@[uk.]OpenBSD.org>
Don't _EVER_ lose your sense of humour !

Index: rc
===================================================================
RCS file: /home/ncvs/src/etc/rc,v
retrieving revision 1.261
diff -u -r1.261 rc
--- rc	2001/04/15 13:44:05	1.261
+++ rc	2001/05/09 08:07:55
@@ -312,9 +312,12 @@
 			cd "$dir" && for file in .* *
 			do
 				[ ."$file" = .. -o ."$file" = ... ] && continue
-				[ -d "$file" -a ! -L "$file" ] &&
+				if [ -d "$file" -a ! -L "$file" ]
+				then
 					purgedir "$file"
-				[ -f "$file" -o -S "$file" ] && rm -f -- "$file"
+				else
+					rm -f -- "$file"
+				fi
 			done
 		)
 		done


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9  9:22:10 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from meow.osd.bsdi.com (meow.osd.bsdi.com [204.216.28.88])
	by hub.freebsd.org (Postfix) with ESMTP
	id 958CC37B423; Wed,  9 May 2001 09:22:06 -0700 (PDT)
	(envelope-from jhb@FreeBSD.org)
Received: from laptop.baldwin.cx (john@jhb-laptop.osd.bsdi.com [204.216.28.241])
	by meow.osd.bsdi.com (8.11.2/8.11.2) with ESMTP id f49GM2G52464;
	Wed, 9 May 2001 09:22:02 -0700 (PDT)
	(envelope-from jhb@FreeBSD.org)
Message-ID: <XFMail.010509092108.jhb@FreeBSD.org>
X-Mailer: XFMail 1.4.0 on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To: <Pine.NEB.3.96L.1010509010510.11741r-100000@fledge.watson.org>
Date: Wed, 09 May 2001 09:21:08 -0700 (PDT)
From: John Baldwin <jhb@FreeBSD.org>
To: Robert Watson <rwatson@FreeBSD.org>
Subject: RE: Patch to eliminate struct pcred
Cc: arch@FreeBSD.org
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


On 09-May-01 Robert Watson wrote:
> 
> John,
> 
> Thanks for your comments.  As you point out, the srv4 exit change is
> replicated from your kern_exit change of similar ilk.  It might be nice to
> revisit whatever rationale there was for breaking out the srv4 exit code,
> and see if we can just rely on a wrapped exit1(), which is the approach
> taken by the linuxulator.  This would reduce code replication. 

Yes, it does need to be wrapped.  I think it is unwrapped because we got it
from NetBSD and that may be how they do things.  *shrug*

> I've likewise removed the intrace cached process flag,

Thanks.

Some comments:

@@ -274,21 +275,31 @@
...
-           (p->p_flag & P_TRACED) == 0) {
+           p->p_flag & P_TRACED) {
...

It looks like you've inverted the sense of that test.

What is the XXX: locking comment about here:

@@ -296,25 +307,50 @@
+                       p->p_flag &= ~P_SUGID;  /* XXX locking */
                PROC_UNLOCK(p);

The process is locked when that flag is cleared.

Looks fine otherwise.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9 10:23:51 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP
	id 61B5037B422; Wed,  9 May 2001 10:23:48 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.3/8.11.3) with SMTP id f49HNif90229;
	Wed, 9 May 2001 13:23:45 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Wed, 9 May 2001 13:23:44 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: John Baldwin <jhb@FreeBSD.org>
Cc: arch@FreeBSD.org
Subject: RE: Patch to eliminate struct pcred
In-Reply-To: <XFMail.010509092108.jhb@FreeBSD.org>
Message-ID: <Pine.NEB.3.96L.1010509132233.11741w-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


On Wed, 9 May 2001, John Baldwin wrote:

> Some comments:
> 
> @@ -274,21 +275,31 @@
> ...
> -           (p->p_flag & P_TRACED) == 0) {
> +           p->p_flag & P_TRACED) {
> ...
> 
> It looks like you've inverted the sense of that test.

Oops, nice catch.  I've now fixed that.

> What is the XXX: locking comment about here:
> 
> @@ -296,25 +307,50 @@
> +                       p->p_flag &= ~P_SUGID;  /* XXX locking */
>                 PROC_UNLOCK(p);
> 
> The process is locked when that flag is cleared.

This is from an earlier incarnation where I had the locking rearranged
some.  It no longer applies, so I've removed it.

A patch with those changes (only) is available at:

    http://www.watson.org/~robert/pcred.3.diff

Thanks again,

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9 11:50:43 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from sax.sax.de (sax.sax.de [193.175.26.33])
	by hub.freebsd.org (Postfix) with ESMTP
	id C9BD437B423; Wed,  9 May 2001 11:50:36 -0700 (PDT)
	(envelope-from j@uriah.heep.sax.de)
Received: (from uucp@localhost)
	by sax.sax.de (8.9.3/8.9.3) with UUCP id UAA28359;
	Wed, 9 May 2001 20:50:34 +0200 (CEST)
Received: (from j@localhost)
	by uriah.heep.sax.de (8.11.3/8.11.3) id f49IgFx28974;
	Wed, 9 May 2001 20:42:15 +0200 (MET DST)
	(envelope-from j)
Date: Wed, 9 May 2001 20:42:15 +0200
From: J Wunsch <j@uriah.heep.sax.de>
To: cvs-all@FreeBSD.org, freebsd-arch@FreeBSD.org
Subject: Re: cvs commit: src/etc rc
Message-ID: <20010509204214.A28936@uriah.heep.sax.de>
Reply-To: Joerg Wunsch <joerg_wunsch@uriah.heep.sax.de>
References: <brian@FreeBSD.org> <200105090803.f49833B84293@hak.lan.Awfulhak.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0.1i
In-Reply-To: <200105090803.f49833B84293@hak.lan.Awfulhak.org>; from brian@Awfulhak.org on Wed, May 09, 2001 at 09:03:03AM +0100
X-Phone: +49-351-2012 669
X-PGP-Fingerprint: DC 47 E6 E4 FF A6 E9 8F  93 21 E0 7D F9 12 D6 4E
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

As Brian Somers wrote:

[/var/run at boottime]

> I think maybe this should just remove everything ?  Comments ?

I think so.  Solaris 8 is even using a tmpfs for /var/run.

Anybody who stores something in /var/run and expects it to survive a
reboot needs to change his mind.

To quote hier(9):

      run/       system information files describing various info
                 about system since it was booted
                              ^^^^^^^^^^^^^^^^^^^
-- 
cheers, J"org               .-.-.   --... ...--   -.. .  DL8DTL

http://www.sax.de/~joerg/                        NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9 12:41:57 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id 278DC37B424
	for <arch@FreeBSD.ORG>; Wed,  9 May 2001 12:41:55 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f49JfdD98861;
	Wed, 9 May 2001 12:41:39 -0700 (PDT)
	(envelope-from dillon)
Date: Wed, 9 May 2001 12:41:39 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105091941.f49JfdD98861@earth.backplane.com>
To: Peter Jeremy <peter.jeremy@alcatel.com.au>
Cc: Kirk McKusick <mckusick@mckusick.com>,
	Rik van Riel <riel@conectiva.com.br>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References: <200105082052.NAA08757@beastie.mckusick.com>
 <200105090018.f490IGR87881@earth.backplane.com> <20010509120743.Y59150@gsmx07.alcatel.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


:I don't think this follows.  A program that does something like:
:{
:	extern char	memory[BIG_NUMBER];
:	int		i;
:
:	for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE)
:		memory[i]++;
:}
:will thrash nicely (assuming BIG_NUMBER is large compared to the
:currently available physical memory).  Occasionally, it will be
:runnable - at which stage it has a footprint of only two pages, but

    Why only two pages?  It looks to me like the footprint is BIG_NUMBER
    bytes.

:after executing a couple of instructions, it'll have another page
:fault.  Old pages will remain resident for some time before they age
:enough to be paged out.  If the VM system is stressed, swapping this
:process out completely would seem to be a win.

    Not exactly.  Page aging works both ways.  Just accessing a page
    once does not give it priority over everything else in the page
    queues.

:...
:you ignore spikes due to process initialisation etc, a process that
:faults very quickly after being given the CPU wants a working set size
:that is larger than the VM system currently allows.  The fault rate
:would seem to be proportional to the ratio between the wanted WSS and
:allowed RSS.  This would seem to be a useful parameter to help decide
:which process to swap out - in an ideal world the VM subsystem would
:swap processes to keep the WSS of all in-core processes at about the
:size of non-kernel RAM.
:
:Peter

    Fault rate isn't useful -- maybe faults that require large disk seeks
    would be useful, but just counting the faults themselves is not useful.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9 13:21:46 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP
	id A024237B424; Wed,  9 May 2001 13:21:41 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f49KLdT99914;
	Wed, 9 May 2001 13:21:39 -0700 (PDT)
	(envelope-from dillon)
Date: Wed, 9 May 2001 13:21:39 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105092021.f49KLdT99914@earth.backplane.com>
To: Brian Somers <brian@Awfulhak.org>
Cc: Brian Somers <brian@FreeBSD.ORG>, cvs-committers@FreeBSD.ORG,
	cvs-all@FreeBSD.ORG, brian@Awfulhak.org, freebsd-arch@FreeBSD.ORG
Subject: Re: cvs commit: src/etc rc 
References:  <200105090803.f49833B84293@hak.lan.Awfulhak.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:> brian       2001/05/09 00:24:47 PDT
:> 
:>   Modified files:        (Branch: RELENG_4)
:>     etc                  rc 
:>   Log:
:>   Remove sockets as well as regular files in /var/run and /var/spool/lock
:>   at boot time.  This restores the pre-4.3 behaviour.
:>   
:>   Revision    Changes    Path
:>   1.212.2.25  +2 -2      src/etc/rc
:
:I think maybe this should just remove everything ?  Comments ?
:
:-- 
:Brian <brian@Awfulhak.org>                        <brian@[uk.]FreeBSD.org>

    Yes.  /var/run should be wiped completely.  Programs needing 
    persistent /var storage should use /var/db.  That's why we have
    a /var/db separate from a /var/run.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Wed May  9 23:14: 5 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from dt051n37.san.rr.com (dt051n37.san.rr.com [204.210.32.55])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7F35D37B422; Wed,  9 May 2001 23:13:55 -0700 (PDT)
	(envelope-from DougB@DougBarton.net)
Received: from DougBarton.net (master [10.0.0.2])
	by dt051n37.san.rr.com (8.9.3/8.9.3) with ESMTP id XAA22907;
	Wed, 9 May 2001 23:13:44 -0700 (PDT)
	(envelope-from DougB@DougBarton.net)
Message-ID: <3AFA3198.20314F94@DougBarton.net>
Date: Wed, 09 May 2001 23:13:44 -0700
From: Doug Barton <DougB@DougBarton.net>
Organization: Triborough Bridge & Tunnel Authority
X-Mailer: Mozilla 4.77 [en] (X11; U; Linux 2.2.12 i386)
X-Accept-Language: en
MIME-Version: 1.0
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Brian Somers <brian@Awfulhak.org>, cvs-committers@FreeBSD.org,
	cvs-all@FreeBSD.org, freebsd-arch@FreeBSD.org
Subject: Re: cvs commit: src/etc rc
References: <200105090803.f49833B84293@hak.lan.Awfulhak.org> <200105092021.f49KLdT99914@earth.backplane.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Matt Dillon wrote:
> 
> :> brian       2001/05/09 00:24:47 PDT
> :>
> :>   Modified files:        (Branch: RELENG_4)
> :>     etc                  rc
> :>   Log:
> :>   Remove sockets as well as regular files in /var/run and /var/spool/lock
> :>   at boot time.  This restores the pre-4.3 behaviour.
> :>
> :>   Revision    Changes    Path
> :>   1.212.2.25  +2 -2      src/etc/rc
> :
> :I think maybe this should just remove everything ?  Comments ?
> :
> :--
> :Brian <brian@Awfulhak.org>                        <brian@[uk.]FreeBSD.org>
> 
>     Yes.  /var/run should be wiped completely.  Programs needing
>     persistent /var storage should use /var/db.  That's why we have
>     a /var/db separate from a /var/run.

	If you need another vote, count me in.

-- 
    I need someone really bad. Are you really bad?

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Thu May 10  6:49:11 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from Awfulhak.org (awfulhak.demon.co.uk [194.222.196.252])
	by hub.freebsd.org (Postfix) with ESMTP
	id 67A0D37B422; Thu, 10 May 2001 06:49:07 -0700 (PDT)
	(envelope-from brian@Awfulhak.org)
Received: from hak.lan.Awfulhak.org (root@hak.lan.Awfulhak.org [172.16.0.12])
	by Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4ADn1308700;
	Thu, 10 May 2001 14:49:02 +0100 (BST)
	(envelope-from brian@lan.Awfulhak.org)
Received: from hak.lan.Awfulhak.org (brian@localhost [127.0.0.1])
	by hak.lan.Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4ADn0d32593;
	Thu, 10 May 2001 14:49:00 +0100 (BST)
	(envelope-from brian@hak.lan.Awfulhak.org)
Message-Id: <200105101349.f4ADn0d32593@hak.lan.Awfulhak.org>
X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4
To: Peter Wemm <peter@FreeBSD.org>, freebsd-arch@FreeBSD.org
Cc: Brian Somers <brian@Awfulhak.org>
Subject: linker_search_path()
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 10 May 2001 14:49:00 +0100
From: Brian Somers <brian@Awfulhak.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hi,

The digi driver that I re-wrote recently uses linker_load_file() to 
grab some throwaway data from another purpose-built digi_* module.

At the moment, I use an almost-hard-coded filename of

  snprintf(modfile, MAXPATHLEN, "/boot/kernel/digi_%s.ko", sc->module);

which isn't really very bright.

Can anyone tell me what the plans are for linker_search_path() in 
kern/kern_linker.c ?  There's a comment (written by peter):

    /*
     * There will be a system to look up or guess a file name from
     * a module name.
     * For now we just try to load a file with the same name.
     */
    pathname = linker_search_path(modname);

I wouldn't mind implementing that ``system'' or even making 
linker_search_path() non-static so that I can use it from 
dev/digi/digi.c.

Comments ?

Cheers.
-- 
Brian <brian@Awfulhak.org>                        <brian@[uk.]FreeBSD.org>
      <http://www.Awfulhak.org>                   <brian@[uk.]OpenBSD.org>
Don't _EVER_ lose your sense of humour !


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Fri May 11 16:42:22 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP
	id D90D037B62D; Fri, 11 May 2001 16:42:13 -0700 (PDT)
	(envelope-from tlambert@usr08.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id QAA12297;
	Fri, 11 May 2001 16:42:11 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
 via SMTP by smtp04.primenet.com, id smtpdAAAx6aq3x; Fri May 11 16:41:56 2001
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id QAA04578;
	Fri, 11 May 2001 16:43:00 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200105112343.QAA04578@usr08.primenet.com>
Subject: FreeBSD breaks sockets two ways...
To: freebsd-net@FreeBSD.ORG
Date: Fri, 11 May 2001 23:43:00 +0000 (GMT)
Cc: arch@FreeBSD.ORG
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I have run into two issues, that I find really, really annoying.  This
is in FreeBSD 4.3 and 5.x.  Bot machines are on a local (non-switched)
segment (it works the same with a switch, but taking that out proves
it is not the switch causing the problem).


Primus
------

The first is that when you create a socket, and bind it to a
specific local IP address, and then connect, it fails to
allocate an automatic port private to the socket; specifically:

	int	s;
	struct sockaddr_in sockaddr;

	s = socket(AF_INET, SOCK_STREAM, 0);
	bzero(&sockaddr,sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_addr.s_addr = s_addr2;
	sockaddr.sin_port = 0;
	if (bind(s, (struct sockaddr *) &sockaddr, sizeof(sockaddr)) == -1) {
		perror("bind");
		errx(1, "bind failed");
	}

...in other words, the sockets are all hashed into the same
(global) collsion domain, even though they are _not_ global,
they are specific to a particular IP address.


Secondus
--------

On an OS where the above actually works (e.g. _not_ FreeBSD), when
I make connections from two ports which are the same, but with
different IP addresses, it seems that the MAC address is used by
FreeBSD to differentiate connections, and _not_ the IP/port pair.

This means that on FreeBSD, the incoming connection on two different
source IPs from the same MAC address end up resetting the first
connection, when the second one comes in; instead of getting two
total connections, I end up getting only a single connection.


Both of these seem to be serious screwups in the routing code hash
lookup algorithm, acting as if everything is in the INADDR_ANY
domain, and as if it were keying off the MAC address, and not the
IP address... as it should be.


Has anyone else seen this?  Obviously, it's hard to reproduce
FreeBSD-to-FreeBSD (at least without a BPF program on the client
side to cause the problem)...


I'm primarily interested in a fix for 4.3.


					Thanks,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Fri May 11 16:54:44 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from mail.tgd.net (rand.tgd.net [64.81.67.117])
	by hub.freebsd.org (Postfix) with SMTP id B879637B43F
	for <arch@FreeBSD.ORG>; Fri, 11 May 2001 16:54:38 -0700 (PDT)
	(envelope-from sean@mailhost.tgd.net)
Received: (qmail 67706 invoked by uid 1001); 11 May 2001 23:54:32 -0000
Date: Fri, 11 May 2001 16:54:32 -0700
From: Sean Chittenden <sean-freebsd-stable@chittenden.org>
To: Terry Lambert <tlambert@primenet.com>
Cc: freebsd-net@FreeBSD.ORG, arch@FreeBSD.ORG
Subject: Re: FreeBSD breaks sockets two ways...
Message-ID: <20010511165432.A67648@rand.tgd.net>
References: <200105112343.QAA04578@usr08.primenet.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="NzB8fVQJ5HfG6fxh"
Content-Disposition: inline
In-Reply-To: <200105112343.QAA04578@usr08.primenet.com>; from "tlambert@primenet.com" on Fri, May 11, 2001 at = 11:43:00PM
X-PGP-Key: 0x1EDDFAAD
X-PGP-Fingerprint: C665 A17F 9A56 286C 5CFB 1DEA 9F4F 5CEF 1EDD FAAD
X-Web-Homepage: http://sean.chittenden.org/
X-All-your-base: are belong to us.
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

	Are you sure it's failing to allocate the port?

	I had a similar problem in trying to connect to a service, but
found out that aliasing an IP didn't add the arp entry in the routing
table (local connections were failing).  If I added the arp entry by
hand, everything was happy (is IP aliasing a part of the scneario
you're describing?).

arp -s a.b.c.d 00:60:08:aa:aa:aa pub
arp -s a.b.c.e 00:60:08:aa:aa:ab pub

	A tad annoying, but it seems to work (yeah, I know about the
ethers file, but I refuse to use it).  -sc

On Fri, May 11, 2001 at 11:43:00PM +0000, Terry Lambert wrote:
> I have run into two issues, that I find really, really annoying.  This
> is in FreeBSD 4.3 and 5.x.  Bot machines are on a local (non-switched)
> segment (it works the same with a switch, but taking that out proves
> it is not the switch causing the problem).
>=20
>=20
> Primus
> ------
>=20
> The first is that when you create a socket, and bind it to a
> specific local IP address, and then connect, it fails to
> allocate an automatic port private to the socket; specifically:

--=20
Sean Chittenden

--NzB8fVQJ5HfG6fxh
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Comment: Sean Chittenden <sean@chittenden.org>

iEYEARECAAYFAjr8e7cACgkQn09c7x7d+q2tuQCaA6PwZyW5IG33AgevgaN+n5so
pZkAnRjax8S0kGKdusPUWJ1/dv9si1FN
=tJcl
-----END PGP SIGNATURE-----

--NzB8fVQJ5HfG6fxh--

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Fri May 11 18:46:57 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP
	id B904B37B424; Fri, 11 May 2001 18:46:51 -0700 (PDT)
	(envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id SAA00722;
	Fri, 11 May 2001 18:46:44 -0700 (MST)
Received: from usr06.primenet.com(206.165.6.206)
 via SMTP by smtp03.primenet.com, id smtpdAAAh3a4vb; Fri May 11 18:46:35 2001
Received: (from tlambert@localhost)
	by usr06.primenet.com (8.8.5/8.8.5) id SAA24408;
	Fri, 11 May 2001 18:52:27 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200105120152.SAA24408@usr06.primenet.com>
Subject: Re: FreeBSD breaks sockets two ways...
To: freebsd-net@FreeBSD.ORG
Date: Sat, 12 May 2001 01:52:17 +0000 (GMT)
Cc: arch@FreeBSD.ORG
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

]         Are you sure it's failing to allocate the port?
] 
]         I had a similar problem in trying to connect to a service, but
] found out that aliasing an IP didn't add the arp entry in the routing
] table (local connections were failing).  If I added the arp entry by
] hand, everything was happy (is IP aliasing a part of the scneario
] you're describing?).
] 
] arp -s a.b.c.d 00:60:08:aa:aa:aa pub
] arp -s a.b.c.e 00:60:08:aa:aa:ab pub
] 
]         A tad annoying, but it seems to work (yeah, I know about the
] ethers file, but I refuse to use it).  -sc

Unfortunately, I'm very certain.

I talked to Bill Paul about the gratuitous ARP problem last
night; I was well aware of it; we added the ARP entries by
hand to the target for the aliases on the source machine.

I'm _positive_ on the outbound connection problem (the code
fragment I attached should have done the job, and I've seen
the FreeBSD code that's the problem, but am still pondering
about how to fix it; I think I'll have to do two lookups, or
hang a chain off a hash bucket indexed by IP last (instead
of port).  Hopefully, someone will get to this before I do.

We've also tried by setting up the ARP table for the target
machine, and then written the aforementioned BPF program to
stage the connection attempts from a single client machine.

We did the same thing from a second client on the same segment.

The single client, two IP attempt failed, while the two machine
attempt succeeded.  The only difference in the packets that was
reported by tcpdump was the source MAC address -- otherwise,
they were byte-for-byte identical.

So there is definitely a problem there with the index being by
MAC instead of IP.

Maybe this came in as part of the "aliased IP NFS client being
seen as an attacker by the server" fix?


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12  5: 6:30 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from pcnet1.pcnet.com (pcnet1.pcnet.com [204.213.232.3])
	by hub.freebsd.org (Postfix) with ESMTP id 94A7937B423
	for <arch@FreeBSD.org>; Sat, 12 May 2001 05:06:27 -0700 (PDT)
	(envelope-from eischen@vigrid.com)
Received: (from eischen@localhost)
	by pcnet1.pcnet.com (8.8.7/PCNet) id IAA15937;
	Sat, 12 May 2001 08:05:38 -0400 (EDT)
Date: Sat, 12 May 2001 08:05:38 -0400 (EDT)
From: Daniel Eischen <eischen@vigrid.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: arch@FreeBSD.org
Subject: Re: cvs commit: src/sys/i386/linux linux_sysvec.c
In-Reply-To: <Pine.BSF.4.21.0105121949530.8482-100000@besplex.bde.org>
Message-ID: <Pine.SUN.3.91.1010512075410.14092A-100000@pcnet1.pcnet.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[ Moved to -arch ]

On Sat, 12 May 2001, Bruce Evans wrote:
> On Fri, 11 May 2001, Daniel Eischen wrote:
> 
> > deischen    2001/05/11 20:23:14 PDT
> > 
> >   Modified files:
> >     sys/i386/linux       linux_sysvec.c 
> >   Log:
> >   Preserve the state of the %gs register when setting up the signal
> >   handler in Linux emulation.  According to bde, this is what Linux
> >   does.
> >   
> >   Recent versions of linuxthreads use %gs for thread-specific data,
> >   while FreeBSD uses %fs (mostly because WINE uses %gs).
> 
> I think FreeBSD should use %gs too (except I think segment registers
> should never be used).  Are there different compatibility problems
> with WINE?

Using %gs is OK by me.  I've never used WINE, so I'm not sure how it
uses %gs.  I think Terry raised the issue when we were discussing
which register to use on -arch, and at the time nobody seemed to care
if we used %fs or %gs.

I suppose using %gs would conflict with WINE if it ever relied on our
native threads libraries.  But since Linux uses %gs, and WINE is more
likely to run under Linux than anything else, it would seem safe for
FreeBSD to use %gs also.

-- 
Dan Eischen

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12  7:24: 1 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88])
	by hub.freebsd.org (Postfix) with ESMTP id B8AAB37B443
	for <arch@freebsd.org>; Sat, 12 May 2001 07:23:54 -0700 (PDT)
	(envelope-from riel@conectiva.com.br)
Received: from surriel.ddts.net (1-248.ctame701-1.telepar.net.br [200.181.137.248])
	by netbank.com.br (Postfix) with ESMTP
	id 5D7CF46804; Sat, 12 May 2001 11:25:36 -0300 (BRST)
Received: from localhost (mflznt@localhost [127.0.0.1])
	by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f4CENhi04957;
	Sat, 12 May 2001 11:23:44 -0300
Date: Sat, 12 May 2001 11:23:43 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@imladris.rielhome.conectiva
To: Matt Dillon <dillon@earth.backplane.com>
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
In-Reply-To: <200105080056.f480u1Q71866@earth.backplane.com>
Message-ID: <Pine.LNX.4.21.0105121109210.5468-100000@imladris.rielhome.conectiva>
X-spambait: aardvark@kernelnewbies.org
X-spammeplease: aardvark@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Mon, 7 May 2001, Matt Dillon wrote:

>     Look at the loop line 1362 of vm_pageout.c.  Note that it enforces
>     a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable
>     to clean sufficient pages in two loops.  The tsleep is not woken up
>     by anyone while waiting that 1/2 second becuase vm_pages_needed has
>     not been cleared yet.  This is what is limiting the page queue scan.

Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that
only kicks in when the system is under very heavy memory pressure.

That explains why FreeBSD's thrashing detection code works... ;)

(I'm not convinced, though, that limiting the speed at which we
scan the active list is a good thing. There are some arguments
in favour of speed limiting, but it mostly seems to come down
to a short-cut to thrashing detection...)

> :But ... is this a good enough indication that the processes
> :currently resident have enough memory available to make any
> :progress ?
> 
>     Yes.  Consider detecting the difference between a large process accessing
>     its pages randomly, and a small process accessing a relatively small
>     set of pages over and over again.  Now consider what happens when the
>     system gets overloaded.  The small process will be able to access its
>     pages enough that they will get page priority over the larger process.
>     The larger process, due to the more random accesses (or simply the fact
>     that it is accessing a larger set of pages) will tend to stall more on
>     pagein I/O which has the side effect of reducing the large process's
>     access rate on all of its pages.  The result:  small processes get more
>     priority just by being small.

But if the larger processes never get a chance to make decent
progress without thrashing, won't your system be slowed down
forever by these (thrashing) large processes?

It's nice to protect your small processes from the large ones,
but if the large processes don't get to run to completion the
system will never get out of thrashing...

> :Especially if all the currently resident processes are waiting
> :in page faults, won't that make it easier for the system to find
> :pages to swap out, etc... ?
> :
> :One thing I _am_ wondering though: the pageout and the pagein
> :thresholds are different. Can't this lead to problems where we
> :always hit both the pageout threshold -and- the pagein threshold
> :and the system thrashes swapping processes in and out ?
> 
>     The system will not page out a page it has just paged in due to the
>     center-of-the-road initialization of act_count (the page aging).

Indeed, the speed limiting of the pageout scanning takes care of
this. But still, having the swapout threshold defined as being
short of inactive pages while the swapin threshold uses the number
of free+cache pages as an indication could lead to the situation
where you suspend and wake up processes while it isn't needed.

Or worse, suspending one process which easily fit in memory and
then waking up another process, which cannot be swapped in because
the first process' memory is still sitting in RAM and cannot be
removed yet due to the pageout scan speed limiting (and also cannot
be used, because we suspended the process).

The chance of this happening could be quite big in some situations
because the swapout and swapin thresholds are measuring things that
are only indirectly related...

>     The pagein and pageout rates have nothing to do with thrashing, per say,
>     and should never be arbitrarily limited.

But they are, with the pageout daemon going to sleep for half a
second if it doesn't succeed in freeing enough memory at once.
It even does this if a large part of the memory on the active
list belongs to a process which has just been suspended because
of thrashing...


>     I don't think it's possible to write a nice neat thrash-handling
>     algorithm.  It's a bunch of algorithms all working together, all
>     closely tied to the VM page cache.  Each taken alone is fairly easy
>     to describe and understand.  All of them together result in complex
>     interactions that are very easy to break if you make a mistake.

Heheh, certainly true ;)

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12  7:28:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88])
	by hub.freebsd.org (Postfix) with ESMTP id BCAB337B423
	for <arch@FreeBSD.ORG>; Sat, 12 May 2001 07:28:28 -0700 (PDT)
	(envelope-from riel@conectiva.com.br)
Received: from surriel.ddts.net (1-248.ctame701-1.telepar.net.br [200.181.137.248])
	by netbank.com.br (Postfix) with ESMTP
	id 0CFB746804; Sat, 12 May 2001 11:30:17 -0300 (BRST)
Received: from localhost (svsumc@localhost [127.0.0.1])
	by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f4CESPi05036;
	Sat, 12 May 2001 11:28:26 -0300
Date: Sat, 12 May 2001 11:28:25 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-Sender: riel@imladris.rielhome.conectiva
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Kirk McKusick <mckusick@mckusick.com>, arch@FreeBSD.ORG,
	linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping 
In-Reply-To: <200105090018.f490IGR87881@earth.backplane.com>
Message-ID: <Pine.LNX.4.21.0105121124110.5468-100000@imladris.rielhome.conectiva>
X-spambait: aardvark@kernelnewbies.org
X-spammeplease: aardvark@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Tue, 8 May 2001, Matt Dillon wrote:

> :I know that FreeBSD will swap out sleeping processes, but will it
> :ever swap out running processes? The old BSD VM system would do so
> :(we called it hard swapping). It is possible to get a set of running
> :processes that simply do not all fit in memory, and the only way
> :for them to make forward progress is to cycle them through memory.
> 
>     I looked at the code fairly carefully last night... it doesn't
>     swap out running processes and it also does not appear to swap
>     out processes blocked in a page-fault (on I/O).  Now, of course
>     we can't swap a process out right then (it might be holding locks),
>     but I think it would be beneficial to be able to mark the process
>     as 'requesting a swapout on return to user mode' or something
>     like that.

In the (still very rough) swapping code for Linux I simply do
this as "swapout on next pagefault". The idea behind that is:

1) it's easy, at a page fault we know we can suspend the process

2) if we're thrashing, we want every process to make as much
   progress as possible before it's suspended (swapped out),
   letting the process run until the next page fault means we
   will never suspend a process while it's still able to make
   progress

3) thrashing should be a rare situation, so you don't want to
   complicate fast-path code like "return to userspace"; instead
   we make sure to have as little impact on the rest of the
   kernel code as possible

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12 10:22: 2 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id EE99837B424
	for <arch@freebsd.org>; Sat, 12 May 2001 10:21:56 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.3/8.11.2) id f4CHLSS18553;
	Sat, 12 May 2001 10:21:28 -0700 (PDT)
	(envelope-from dillon)
Date: Sat, 12 May 2001 10:21:28 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105121721.f4CHLSS18553@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References:  <Pine.LNX.4.21.0105121109210.5468-100000@imladris.rielhome.conectiva>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


:
:Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that
:only kicks in when the system is under very heavy memory pressure.
:
:That explains why FreeBSD's thrashing detection code works... ;)
:
:(I'm not convinced, though, that limiting the speed at which we
:scan the active list is a good thing. There are some arguments
:in favour of speed limiting, but it mostly seems to come down
:to a short-cut to thrashing detection...)

    Note that there is a big distinction between limiting the page
    queue scan rate (which we do not do), and sleeping between full
    scans (which we do).  Limiting the page queue scan rate on a
    page-by-page basis does not scale.  Sleeping in between full queue
    scans (in an extreme case) does scale.

						-Matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12 14:17:28 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
	by hub.freebsd.org (Postfix) with ESMTP id 695FF37B423
	for <arch@freebsd.org>; Sat, 12 May 2001 14:17:24 -0700 (PDT)
	(envelope-from riel@conectiva.com.br)
Received: from burns.conectiva (burns.conectiva [10.0.0.4])
	by perninha.conectiva.com.br (Postfix) with SMTP id 2CE5B16C5C
	for <arch@freebsd.org>; Sat, 12 May 2001 18:17:17 -0300 (EST)
Received: (qmail 15923 invoked by uid 0); 12 May 2001 21:15:52 -0000
Received: from duckman.distro.conectiva (HELO duckman.conectiva.com.br) (root@10.0.17.2)
  by burns.conectiva with SMTP; 12 May 2001 21:15:52 -0000
Received: from localhost (riel@localhost)
	by duckman.conectiva.com.br (8.11.3/8.11.3) with ESMTP id f4CLHFK11130;
	Sat, 12 May 2001 18:17:16 -0300
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Date: Sat, 12 May 2001 18:17:15 -0300 (BRST)
From: Rik van Riel <riel@conectiva.com.br>
X-X-Sender:  <riel@duckman.distro.conectiva>
To: Matt Dillon <dillon@earth.backplane.com>
Cc: <arch@freebsd.org>, <linux-mm@kvack.org>,
	<sfkaplan@cs.amherst.edu>
Subject: Re: on load control / process swapping
In-Reply-To: <200105121721.f4CHLSS18553@earth.backplane.com>
Message-ID: <Pine.LNX.4.33.0105121816190.18102-100000@duckman.distro.conectiva>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Sat, 12 May 2001, Matt Dillon wrote:

> :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that
> :only kicks in when the system is under very heavy memory pressure.
> :
> :That explains why FreeBSD's thrashing detection code works... ;)
>
>     Note that there is a big distinction between limiting the page
>     queue scan rate (which we do not do), and sleeping between full
>     scans (which we do).  Limiting the page queue scan rate on a
>     page-by-page basis does not scale.  Sleeping in between full queue
>     scans (in an extreme case) does scale.

I'm not convinced it's doing a very useful thing, though ;)

(see the rest of the email you replied to)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message


From owner-freebsd-arch  Sat May 12 16:58:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id C044937B43E
	for <arch@freebsd.org>; Sat, 12 May 2001 16:58:24 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.3/8.11.2) id f4CNwEr20137;
	Sat, 12 May 2001 16:58:14 -0700 (PDT)
	(envelope-from dillon)
Date: Sat, 12 May 2001 16:58:14 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200105122358.f4CNwEr20137@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References:  <Pine.LNX.4.21.0105121109210.5468-100000@imladris.rielhome.conectiva>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:
:But if the larger processes never get a chance to make decent
:progress without thrashing, won't your system be slowed down
:forever by these (thrashing) large processes?
:
:It's nice to protect your small processes from the large ones,
:but if the large processes don't get to run to completion the
:system will never get out of thrashing...

    Consider the case where you have one large process and many small
    processes.  If you were to skew things to allow the large process to
    run at the cost of all the small processes, you have just inconvenienced
    98% of your users so one ozob can run a big job.  Not only that, but 
    there is no guarentee that the 'big job' will ever finish (a topic of
    many a paper on scheduling, BTW)... what if it's been running for hours
    and still has hours to go?  Do we blow away the rest of the system to
    let it run?  

    What if there are several big jobs?  If you skew things in favor of
    one the others could take 60 seconds *just* to recover their RSS when
    they are finally allowed to run.  So much for timesharing... you
    would have to run each job exclusively for 5-10 minutes at a time
    to get any sort of effiency, which is not practical in a timeshare
    system.  So there is really very little that you can do.

:Indeed, the speed limiting of the pageout scanning takes care of
:this. But still, having the swapout threshold defined as being
:short of inactive pages while the swapin threshold uses the number
:of free+cache pages as an indication could lead to the situation
:where you suspend and wake up processes while it isn't needed.
:
:Or worse, suspending one process which easily fit in memory and
:then waking up another process, which cannot be swapped in because
:the first process' memory is still sitting in RAM and cannot be
:removed yet due to the pageout scan speed limiting (and also cannot
:be used, because we suspended the process).

    We don't suspend running processes, but I do believe FreeBSD is still
    vulnerable to this issue.  Suspending the marked process when it hits the
    vm_fault code is a good idea and would solve the problem.  If the process
    never takes an allocation fault, it probably doesn't have to be swapped
    out.  The normal pageout would suffice for that process.

:>     The pagein and pageout rates have nothing to do with thrashing, per say,
:>     and should never be arbitrarily limited.
:
:But they are, with the pageout daemon going to sleep for half a
:second if it doesn't succeed in freeing enough memory at once.
:It even does this if a large part of the memory on the active
:list belongs to a process which has just been suspended because
:of thrashing...

    No.  I did say the code was complex.  A process which has been
    suspended for thrashing gets all of its pages depressed in priority.
    The page daemon would have no problem recovering the pages.   See
    line 1458 of vm_pageout.c.  This code also enforces the 'memoryuse'
    resource limit (which is perhaps even more important).  It is not
    necessary to try to launder the pages immediately.  Simply depressing
    their priority is sufficient and it allows for quicker recovery when
    the thrashing goes away.  It also allows us to implement the 
    vm.swap_idle_{threshold1,threshold2,enabled} sysctls trivially, which
    results in proactive swapping that is extremely useful in certain
    situations (like shell machines with lots of idle users).

    The pagedaemon gets behind when there are too many
    active pages in the system and the pagedaemon is unable to move them
    to the inactive queue due to the pages still being very active... that is,
    when the active resident set for all processes in the system exceeds
    available memory.  This is what triggers thrashing.  Swapping has the
    side effect of reducing the total active resident set for the system
    as a whole, fixing the thrashing problem. 

						-Matt

:>     I don't think it's possible to write a nice neat thrash-handling
:>     algorithm.  It's a bunch of algorithms all working together, all
:>     closely tied to the VM page cache.  Each taken alone is fairly easy
:>     to describe and understand.  All of them together result in complex
:>     interactions that are very easy to break if you make a mistake.
:
:Heheh, certainly true ;)
:
:cheers,
:
:Rik

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message