From owner-freebsd-arch Sun May 6 21:28:52 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by hub.freebsd.org (Postfix) with ESMTP id A028A37B424
	for ; Sun, 6 May 2001 21:28:00 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3])
	by fledge.watson.org (8.11.3/8.11.3) with SMTP id f474Rvf44400
	for ; Mon, 7 May 2001 00:27:57 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Mon, 7 May 2001 00:27:57 -0400 (EDT)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: arch@FreeBSD.org
Subject: Patch to eliminate struct pcred
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Below, please find patches to eliminate struct pcred, as previously
discussed on this list. A detailed description of the changes is below,
but the quick of it is: pcred and ucred were independent; these patches
merge both into ucred, simplifying a number of cached credential cases
(such as in sigio), and making the ucred the central structure required
for almost all subject-based authorization events. While I did this, I
took the opportunity to clean up a number of related issues, including
changing the uid/gid helper functions substantially. If you prefer
patches via the web, they are at:

    http://www.watson.org/~robert/pcred.diff

Any reviews welcome.

An important observation is that, in practice, almost all pcred write
operations involve a ucred copy-on-write, so this shouldn't increase the
number of ucreds in use; it does slightly expand ucred, but also removes
an indirection from the use of ucred in most environments. The
performance impact is probably a wash.

Detailed description:

o Merge contents of struct pcred into struct ucred.
  Specifically, add the real uid, saved uid, real gid, and saved gid to
  ucred, as well as the pcred->pc_uidinfo, which was associated with the
  real uid, only rename it to cr_ruidinfo so as not to conflict with
  cr_uidinfo, which corresponds to the effective uid.

o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  the original macro that pointed p->p_ucred to p->p_cred->pc_ucred.

o Universally update code so that it makes use of ucred instead of
  pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of
  p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc.

o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.

o Restructure many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:

        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);

  It's not race-free, but better than nothing. There are also races in
  sys_process.c, all inter-process authorization, fork, exec, and exit.

o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.

o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.

o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.

o Clean up fork1() so as to do less work in credential cleanup and
  allocation.

o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparison.

o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior. In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right". More commenting work still
  remains to be done.

o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.

o Modify or add the following uid and gid helper routines:

        change_euid()
        change_egid()
        change_ruid()
        change_rgid()
        change_svuid()
        change_svgid()

  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc. They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference. Each is commented to document its
  reference requirements.

o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.

o Remove lots of (p_pcred==NULL) checks.

o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.

o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds. Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

Index: compat/linprocfs/linprocfs_misc.c
===================================================================
RCS file: /home/ncvs/src/sys/compat/linprocfs/linprocfs_misc.c,v
retrieving revision 1.24
diff -u -r1.24 linprocfs_misc.c
--- compat/linprocfs/linprocfs_misc.c	2001/05/01 08:11:51	1.24
+++ compat/linprocfs/linprocfs_misc.c	2001/05/06 00:43:51
@@ -444,14 +444,14 @@
 	PROC_LOCK(p);
 	sbuf_printf(&sb, "PPid:\t%d\n", p->p_pptr ?
p->p_pptr->p_pid : 0); - sbuf_printf(&sb, "Uid:\t%d %d %d %d\n", p->p_cred->p_ruid, + sbuf_printf(&sb, "Uid:\t%d %d %d %d\n", p->p_ucred->cr_ruid, p->p_ucred->cr_uid, - p->p_cred->p_svuid, + p->p_ucred->cr_svuid, /* FreeBSD doesn't have fsuid */ p->p_ucred->cr_uid); - sbuf_printf(&sb, "Gid:\t%d %d %d %d\n", p->p_cred->p_rgid, + sbuf_printf(&sb, "Gid:\t%d %d %d %d\n", p->p_ucred->cr_rgid, p->p_ucred->cr_gid, - p->p_cred->p_svgid, + p->p_ucred->cr_svgid, /* FreeBSD doesn't have fsgid */ p->p_ucred->cr_gid); sbuf_cat(&sb, "Groups:\t"); @@ -543,7 +543,7 @@ char *freepath = NULL; p = PFIND(pfs->pfs_pid); - if (p == NULL || p->p_cred == NULL || p->p_ucred == NULL) { + if (p == NULL || p->p_ucred == NULL) { if (p != NULL) PROC_UNLOCK(p); printf("doexelink: pid %d disappeared\n", pfs->pfs_pid); Index: compat/linprocfs/linprocfs_vnops.c =================================================================== RCS file: /home/ncvs/src/sys/compat/linprocfs/linprocfs_vnops.c,v retrieving revision 1.23 diff -u -r1.23 linprocfs_vnops.c --- compat/linprocfs/linprocfs_vnops.c 2001/05/04 05:19:22 1.23 +++ compat/linprocfs/linprocfs_vnops.c 2001/05/06 00:43:51 @@ -432,7 +432,7 @@ procp = PFIND(pfs->pfs_pid); if (procp == NULL) return (ENOENT); - if (procp->p_cred == NULL || procp->p_ucred == NULL) { + if (procp->p_ucred == NULL) { PROC_UNLOCK(procp); return (ENOENT); } Index: compat/linux/linux_misc.c =================================================================== RCS file: /home/ncvs/src/sys/compat/linux/linux_misc.c,v retrieving revision 1.101 diff -u -r1.101 linux_misc.c --- compat/linux/linux_misc.c 2001/05/01 08:11:51 1.101 +++ compat/linux/linux_misc.c 2001/05/06 00:43:52 @@ -958,12 +958,11 @@ struct proc *p; struct linux_setgroups_args *uap; { - struct pcred *pc; + struct ucred *newcred, *oldcred = p->p_ucred; linux_gid_t linux_gidset[NGROUPS]; gid_t *bsd_gidset; int ngrp, error; - pc = p->p_cred; ngrp = uap->gidsetsize; /* @@ -972,22 +971,22 @@ * Keep cr_groups[0] unchanged to 
prevent that. */ - if ((error = suser_xxx(NULL, p, PRISON_ROOT)) != 0) + if ((error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0) return (error); if (ngrp >= NGROUPS) return (EINVAL); - pc->pc_ucred = crcopy(pc->pc_ucred); + newcred = crdup(oldcred); if (ngrp > 0) { error = copyin((caddr_t)uap->gidset, (caddr_t)linux_gidset, ngrp * sizeof(linux_gid_t)); if (error) return (error); - pc->pc_ucred->cr_ngroups = ngrp + 1; + newcred->cr_ngroups = ngrp + 1; - bsd_gidset = pc->pc_ucred->cr_groups; + bsd_gidset = newcred->cr_groups; ngrp--; while (ngrp >= 0) { bsd_gidset[ngrp + 1] = linux_gidset[ngrp]; @@ -995,9 +994,13 @@ } } else - pc->pc_ucred->cr_ngroups = 1; + newcred->cr_ngroups = 1; setsugid(p); + + p->p_ucred = newcred; + crfree(oldcred); + return (0); } @@ -1006,14 +1009,14 @@ struct proc *p; struct linux_getgroups_args *uap; { - struct pcred *pc; + struct ucred *cred; linux_gid_t linux_gidset[NGROUPS]; gid_t *bsd_gidset; int bsd_gidsetsz, ngrp, error; - pc = p->p_cred; - bsd_gidset = pc->pc_ucred->cr_groups; - bsd_gidsetsz = pc->pc_ucred->cr_ngroups - 1; + cred = p->p_ucred; + bsd_gidset = cred->cr_groups; + bsd_gidsetsz = cred->cr_ngroups - 1; /* * cr_groups[0] holds egid. Returning the whole set Index: compat/svr4/svr4_misc.c =================================================================== RCS file: /home/ncvs/src/sys/compat/svr4/svr4_misc.c,v retrieving revision 1.30 diff -u -r1.30 svr4_misc.c --- compat/svr4/svr4_misc.c 2001/05/01 08:11:52 1.30 +++ compat/svr4/svr4_misc.c 2001/05/06 00:43:54 @@ -1283,7 +1283,7 @@ /* * Decrement the count of procs running with this uid. */ - (void)chgproccnt(q->p_cred->p_uidinfo, -1, 0); + (void)chgproccnt(q->p_ucred->cr_ruidinfo, -1, 0); /* * Release reference to text vnode. @@ -1294,13 +1294,8 @@ /* * Free up credentials. 
*/ - PROC_LOCK(q); - if (--q->p_cred->p_refcnt == 0) { - crfree(q->p_ucred); - uifree(q->p_cred->p_uidinfo); - FREE(q->p_cred, M_SUBPROC); - q->p_cred = NULL; - } + crfree(q->p_ucred); + q->p_ucred = NULL; /* * Remove unused arguments Index: compat/svr4/svr4_sysvec.c =================================================================== RCS file: /home/ncvs/src/sys/compat/svr4/svr4_sysvec.c,v retrieving revision 1.20 diff -u -r1.20 svr4_sysvec.c --- compat/svr4/svr4_sysvec.c 2001/02/24 22:20:02 1.20 +++ compat/svr4/svr4_sysvec.c 2001/05/04 18:25:53 @@ -213,10 +213,10 @@ AUXARGS_ENTRY(pos, AT_FLAGS, args->flags); AUXARGS_ENTRY(pos, AT_ENTRY, args->entry); AUXARGS_ENTRY(pos, AT_BASE, args->base); - AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_cred->p_ruid); - AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_cred->p_svuid); - AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_cred->p_rgid); - AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_cred->p_svgid); + AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid); + AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid); + AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid); + AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid); AUXARGS_ENTRY(pos, AT_NULL, 0); free(imgp->auxargs, M_TEMP); Index: ddb/db_ps.c =================================================================== RCS file: /home/ncvs/src/sys/ddb/db_ps.c,v retrieving revision 1.22 diff -u -r1.22 db_ps.c --- ddb/db_ps.c 2001/03/28 09:17:49 1.22 +++ ddb/db_ps.c 2001/05/04 15:35:39 @@ -95,7 +95,7 @@ db_printf("%5d %8p %8p %4d %5d %5d %06x %d", p->p_pid, (volatile void *)p, (void *)p->p_addr, - p->p_cred ? p->p_cred->p_ruid : 0, pp->p_pid, + p->p_ucred ? p->p_ucred->cr_ruid : 0, pp->p_pid, p->p_pgrp ? 
p->p_pgrp->pg_id : 0, p->p_flag, p->p_stat); if (p->p_wchan) { db_printf(" %6s %8p", p->p_wmesg, (void *)p->p_wchan); Index: i386/linux/linux_sysvec.c =================================================================== RCS file: /home/ncvs/src/sys/i386/linux/linux_sysvec.c,v retrieving revision 1.78 diff -u -r1.78 linux_sysvec.c --- i386/linux/linux_sysvec.c 2001/05/01 08:12:52 1.78 +++ i386/linux/linux_sysvec.c 2001/05/06 00:45:18 @@ -186,10 +186,10 @@ AUXARGS_ENTRY(pos, AT_ENTRY, args->entry); AUXARGS_ENTRY(pos, AT_BASE, args->base); PROC_LOCK(imgp->proc); - AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_cred->p_ruid); - AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_cred->p_svuid); - AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_cred->p_rgid); - AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_cred->p_svgid); + AUXARGS_ENTRY(pos, AT_UID, imgp->proc->p_ucred->cr_ruid); + AUXARGS_ENTRY(pos, AT_EUID, imgp->proc->p_ucred->cr_svuid); + AUXARGS_ENTRY(pos, AT_GID, imgp->proc->p_ucred->cr_rgid); + AUXARGS_ENTRY(pos, AT_EGID, imgp->proc->p_ucred->cr_svgid); PROC_UNLOCK(imgp->proc); AUXARGS_ENTRY(pos, AT_NULL, 0); Index: kern/init_main.c =================================================================== RCS file: /home/ncvs/src/sys/kern/init_main.c,v retrieving revision 1.168 diff -u -r1.168 init_main.c --- kern/init_main.c 2001/04/29 02:44:48 1.168 +++ kern/init_main.c 2001/05/04 15:37:01 @@ -85,7 +85,6 @@ static struct session session0; static struct pgrp pgrp0; struct proc proc0; -static struct pcred cred0; static struct procsig procsig0; static struct filedesc0 filedesc0; static struct plimit limit0; @@ -321,12 +320,10 @@ callout_init(&p->p_slpcallout, 1); /* Create credentials. */ - cred0.p_refcnt = 1; - cred0.p_uidinfo = uifind(0); - p->p_cred = &cred0; p->p_ucred = crget(); p->p_ucred->cr_ngroups = 1; /* group 0 */ p->p_ucred->cr_uidinfo = uifind(0); + p->p_ucred->cr_ruidinfo = uifind(0); p->p_ucred->cr_prison = NULL; /* Don't jail it. */ /* Create procsig. 
*/ @@ -380,7 +377,7 @@ /* * Charge root for one process. */ - (void)chgproccnt(cred0.p_uidinfo, 1, 0); + (void)chgproccnt(p->p_ucred->cr_ruidinfo, 1, 0); } SYSINIT(p0init, SI_SUB_INTRINSIC, SI_ORDER_FIRST, proc0_init, NULL) Index: kern/kern_acct.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_acct.c,v retrieving revision 1.33 diff -u -r1.33 kern_acct.c --- kern/kern_acct.c 2001/05/01 08:12:55 1.33 +++ kern/kern_acct.c 2001/05/06 00:45:31 @@ -222,8 +222,8 @@ acct.ac_io = encode_comp_t(r->ru_inblock + r->ru_oublock, 0); /* (6) The UID and GID of the process */ - acct.ac_uid = p->p_cred->p_ruid; - acct.ac_gid = p->p_cred->p_rgid; + acct.ac_uid = p->p_ucred->cr_ruid; + acct.ac_gid = p->p_ucred->cr_rgid; /* (7) The terminal from which the process was started */ if ((p->p_flag & P_CONTROLT) && p->p_pgrp->pg_session->s_ttyp) Index: kern/kern_descrip.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_descrip.c,v retrieving revision 1.100 diff -u -r1.100 kern_descrip.c --- kern/kern_descrip.c 2001/05/01 08:12:55 1.100 +++ kern/kern_descrip.c 2001/05/06 00:45:32 @@ -525,8 +525,6 @@ sigio->sio_pgid = pgid; crhold(curproc->p_ucred); sigio->sio_ucred = curproc->p_ucred; - /* It would be convenient if p_ruid was in ucred. 
*/ - sigio->sio_ruid = curproc->p_cred->p_ruid; sigio->sio_myref = sigiop; s = splhigh(); *sigiop = sigio; Index: kern/kern_exec.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_exec.c,v retrieving revision 1.126 diff -u -r1.126 kern_exec.c --- kern/kern_exec.c 2001/05/01 08:12:56 1.126 +++ kern/kern_exec.c 2001/05/06 16:25:06 @@ -104,8 +104,9 @@ register struct execve_args *uap; { struct nameidata nd, *ndp; + struct ucred *oldcred = p->p_ucred, *newcred; register_t *stack_base; - int error, len, i; + int error, len, i, intrace; struct image_params image_params, *imgp; struct vattr attr; int (*img_first) __P((struct image_params *)); @@ -272,23 +273,31 @@ p->p_flag &= ~P_PPWAIT; wakeup((caddr_t)p->p_pptr); } + intrace = p->p_flag & P_TRACED; + PROC_UNLOCK(p); /* + * XXX: Note, the whole execve() is incredibly racey right now + * with regards to debugging and privilege/credential management. + * This MUST be fixed prior to any release. + */ + + /* * Implement image setuid/setgid. * * Don't honor setuid/setgid if the filesystem prohibits it or if * the process is being traced. */ - if ((((attr.va_mode & VSUID) && p->p_ucred->cr_uid != attr.va_uid) || - ((attr.va_mode & VSGID) && p->p_ucred->cr_gid != attr.va_gid)) && - (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && - (p->p_flag & P_TRACED) == 0) { + newcred = NULL; + if ((((attr.va_mode & VSUID) && oldcred->cr_uid != attr.va_uid) || + ((attr.va_mode & VSGID) && oldcred->cr_gid != attr.va_gid)) && + (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && intrace == 0) { PROC_UNLOCK(p); /* * Turn off syscall tracing for set-id programs, except for * root. */ - if (p->p_tracep && suser(p)) { + if (p->p_tracep && suser_xxx(oldcred, NULL, PRISON_ROOT)) { p->p_traceflag = 0; vrele(p->p_tracep); p->p_tracep = NULL; @@ -296,25 +305,42 @@ /* * Set the new credentials. 
*/ - p->p_ucred = crcopy(p->p_ucred); + newcred = crdup(p->p_ucred); if (attr.va_mode & VSUID) - change_euid(p, attr.va_uid); + change_euid(newcred, attr.va_uid); if (attr.va_mode & VSGID) - p->p_ucred->cr_gid = attr.va_gid; + change_egid(newcred, attr.va_gid); setsugid(p); setugidsafety(p); } else { - if (p->p_ucred->cr_uid == p->p_cred->p_ruid && - p->p_ucred->cr_gid == p->p_cred->p_rgid) - p->p_flag &= ~P_SUGID; + if (oldcred->cr_uid == oldcred->cr_ruid && + oldcred->cr_gid == oldcred->cr_rgid) + p->p_flag &= ~P_SUGID; /* XXX locking */ PROC_UNLOCK(p); } /* * Implement correct POSIX saved-id behavior. + * + * XXX: determine whether tests and sets should occur on old or + * new credentials. */ - p->p_cred->p_svuid = p->p_ucred->cr_uid; - p->p_cred->p_svgid = p->p_ucred->cr_gid; + if (p->p_ucred->cr_svuid != p->p_ucred->cr_uid || + p->p_ucred->cr_svgid != p->p_ucred->cr_gid) { + if (newcred != NULL) + newcred = crdup(p->p_ucred); + + change_svuid(newcred, p->p_ucred->cr_uid); + change_svgid(newcred, p->p_ucred->cr_gid); + } + + if (newcred != NULL) { + struct ucred *oldcred; + + oldcred = p->p_ucred; + p->p_ucred = newcred; + crfree(oldcred); + } /* * Store the vp for use in procfs Index: kern/kern_exit.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_exit.c,v retrieving revision 1.126 diff -u -r1.126 kern_exit.c --- kern/kern_exit.c 2001/05/04 16:13:28 1.126 +++ kern/kern_exit.c 2001/05/06 00:49:14 @@ -514,7 +514,7 @@ /* * Decrement the count of procs running with this uid. */ - (void)chgproccnt(p->p_cred->p_uidinfo, -1, 0); + (void)chgproccnt(p->p_ucred->cr_ruidinfo, -1, 0); /* * Release reference to text vnode @@ -539,12 +539,8 @@ /* * Free up credentials. 
*/ - if (--p->p_cred->p_refcnt == 0) { - crfree(p->p_ucred); - uifree(p->p_cred->p_uidinfo); - FREE(p->p_cred, M_SUBPROC); - p->p_cred = NULL; - } + crfree(p->p_ucred); + p->p_ucred = NULL; /* * Remove unused arguments Index: kern/kern_fork.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_fork.c,v retrieving revision 1.110 diff -u -r1.110 kern_fork.c --- kern/kern_fork.c 2001/03/28 11:52:53 1.110 +++ kern/kern_fork.c 2001/05/04 16:34:35 @@ -257,7 +257,7 @@ * exceed the limit. The variable nprocs is the current number of * processes, maxproc is the limit. */ - uid = p1->p_cred->p_ruid; + uid = p1->p_ucred->cr_ruid; if ((nprocs >= maxproc - 1 && uid != 0) || nprocs >= maxproc) { tablefull("proc"); return (EAGAIN); @@ -272,7 +272,7 @@ * Increment the count of procs running with this uid. Don't allow * a nonprivileged user to exceed their current limit. */ - ok = chgproccnt(p1->p_cred->p_uidinfo, 1, + ok = chgproccnt(p1->p_ucred->cr_ruidinfo, 1, (uid != 0) ? p1->p_rlimit[RLIMIT_NPROC].rlim_cur : 0); if (!ok) { /* @@ -408,15 +408,9 @@ * We start off holding one spinlock after fork: sched_lock. 
*/ p2->p_spinlocks = 1; - PROC_UNLOCK(p2); - MALLOC(p2->p_cred, struct pcred *, sizeof(struct pcred), - M_SUBPROC, M_WAITOK); - PROC_LOCK(p2); PROC_LOCK(p1); - bcopy(p1->p_cred, p2->p_cred, sizeof(*p2->p_cred)); - p2->p_cred->p_refcnt = 1; crhold(p1->p_ucred); - uihold(p1->p_cred->p_uidinfo); + p2->p_ucred = p1->p_ucred; if (p2->p_args) p2->p_args->ar_ref++; Index: kern/kern_ktrace.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_ktrace.c,v retrieving revision 1.52 diff -u -r1.52 kern_ktrace.c --- kern/kern_ktrace.c 2001/05/01 08:12:56 1.52 +++ kern/kern_ktrace.c 2001/05/06 00:45:34 @@ -531,17 +531,17 @@ ktrcanset(callp, targetp) struct proc *callp, *targetp; { - register struct pcred *caller = callp->p_cred; - register struct pcred *target = targetp->p_cred; + struct ucred *callcr = callp->p_ucred; + struct ucred *targetcr = targetp->p_ucred; - if (prison_check(callp->p_ucred, targetp->p_ucred)) + if (prison_check(callcr, targetcr)) return (0); - if ((caller->pc_ucred->cr_uid == target->p_ruid && - target->p_ruid == target->p_svuid && - caller->p_rgid == target->p_rgid && /* XXX */ - target->p_rgid == target->p_svgid && + if ((callcr->cr_uid == targetcr->cr_ruid && + targetcr->cr_ruid == targetcr->cr_svuid && + callcr->cr_rgid == targetcr->cr_rgid && /* XXX */ + targetcr->cr_rgid == targetcr->cr_svgid && (targetp->p_traceflag & KTRFAC_ROOT) == 0) || - caller->pc_ucred->cr_uid == 0) + !suser_xxx(callcr, NULL, PRISON_ROOT)) return (1); return (0); Index: kern/kern_proc.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v retrieving revision 1.93 diff -u -r1.93 kern_proc.c --- kern/kern_proc.c 2001/05/01 08:12:57 1.93 +++ kern/kern_proc.c 2001/05/06 00:45:35 @@ -424,15 +424,15 @@ kp->ki_textvp = p->p_textvp; kp->ki_fd = p->p_fd; kp->ki_vmspace = p->p_vmspace; - if (p->p_cred) { - kp->ki_uid = p->p_cred->pc_ucred->cr_uid; - kp->ki_ruid = 
p->p_cred->p_ruid; - kp->ki_svuid = p->p_cred->p_svuid; - kp->ki_ngroups = p->p_cred->pc_ucred->cr_ngroups; - bcopy(p->p_cred->pc_ucred->cr_groups, kp->ki_groups, + if (p->p_ucred) { + kp->ki_uid = p->p_ucred->cr_uid; + kp->ki_ruid = p->p_ucred->cr_ruid; + kp->ki_svuid = p->p_ucred->cr_svuid; + kp->ki_ngroups = p->p_ucred->cr_ngroups; + bcopy(p->p_ucred->cr_groups, kp->ki_groups, NGROUPS * sizeof(gid_t)); - kp->ki_rgid = p->p_cred->p_rgid; - kp->ki_svgid = p->p_cred->p_svgid; + kp->ki_rgid = p->p_ucred->cr_rgid; + kp->ki_svgid = p->p_ucred->cr_svgid; } if (p->p_procsig) { kp->ki_sigignore = p->p_procsig->ps_sigignore; @@ -653,7 +653,7 @@ case KERN_PROC_RUID: if (p->p_ucred == NULL || - p->p_cred->p_ruid != (uid_t)name[0]) + p->p_ucred->cr_ruid != (uid_t)name[0]) continue; break; } Index: kern/kern_prot.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_prot.c,v retrieving revision 1.89 diff -u -r1.89 kern_prot.c --- kern/kern_prot.c 2001/05/01 08:12:57 1.89 +++ kern/kern_prot.c 2001/05/06 00:45:35 @@ -210,7 +210,7 @@ struct getuid_args *uap; { - p->p_retval[0] = p->p_cred->p_ruid; + p->p_retval[0] = p->p_ucred->cr_ruid; #if defined(COMPAT_43) || defined(COMPAT_SUNOS) p->p_retval[1] = p->p_ucred->cr_uid; #endif @@ -253,7 +253,7 @@ struct getgid_args *uap; { - p->p_retval[0] = p->p_cred->p_rgid; + p->p_retval[0] = p->p_ucred->cr_rgid; #if defined(COMPAT_43) || defined(COMPAT_SUNOS) p->p_retval[1] = p->p_ucred->cr_groups[0]; #endif @@ -293,18 +293,18 @@ struct proc *p; register struct getgroups_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *cred = p->p_ucred; register u_int ngrp; int error; if ((ngrp = uap->gidsetsize) == 0) { - p->p_retval[0] = pc->pc_ucred->cr_ngroups; + p->p_retval[0] = cred->cr_ngroups; return (0); } - if (ngrp < pc->pc_ucred->cr_ngroups) + if (ngrp < cred->cr_ngroups) return (EINVAL); - ngrp = pc->pc_ucred->cr_ngroups; - if ((error = 
copyout((caddr_t)pc->pc_ucred->cr_groups, + ngrp = cred->cr_ngroups; + if ((error = copyout((caddr_t)cred->cr_groups, (caddr_t)uap->gidset, ngrp * sizeof(gid_t)))) return (error); p->p_retval[0] = ngrp; @@ -427,7 +427,7 @@ struct proc *p; struct setuid_args *uap; { - register struct pcred *pc = p->p_cred; + struct ucred *oldcred = p->p_ucred, *newcred; register uid_t uid; int error; @@ -449,16 +449,17 @@ * 3: Change euid last. (after tests in #2 for "appropriate privs") */ uid = uap->uid; - if (uid != pc->p_ruid && /* allow setuid(getuid()) */ + if (uid != oldcred->cr_ruid && /* allow setuid(getuid()) */ #ifdef _POSIX_SAVED_IDS - uid != pc->p_svuid && /* allow setuid(saved gid) */ + uid != oldcred->cr_svuid && /* allow setuid(saved gid) */ #endif #ifdef POSIX_APPENDIX_B_4_2_2 /* Use BSD-compat clause from B.4.2.2 */ - uid != pc->pc_ucred->cr_uid && /* allow setuid(geteuid()) */ + uid != oldcred->cr_uid && /* allow setuid(geteuid()) */ #endif - (error = suser_xxx(0, p, PRISON_ROOT))) + (error = suser_xxx(oldcred, NULL, PRISON_ROOT))) return (error); + newcred = crdup(oldcred); #ifdef _POSIX_SAVED_IDS /* * Do we have "appropriate privileges" (are we root or uid == euid) @@ -466,16 +467,16 @@ */ if ( #ifdef POSIX_APPENDIX_B_4_2_2 /* Use the clause from B.4.2.2 */ - uid == pc->pc_ucred->cr_uid || + uid == oldcred->cr_uid || #endif - suser_xxx(0, p, PRISON_ROOT) == 0) /* we are using privs */ + suser_xxx(oldcred, NULL, PRISON_ROOT) == 0) /* we are using privs */ #endif { /* * Set the real uid and transfer proc count to new user. */ - if (uid != pc->p_ruid) { - change_ruid(p, uid); + if (uid != oldcred->cr_ruid) { + change_ruid(newcred, uid); setsugid(p); } /* @@ -485,8 +486,8 @@ * the security of seteuid() depends on it. B.4.2.2 says it * is important that we should do this. 
*/ - if (pc->p_svuid != uid) { - pc->p_svuid = uid; + if (uid != oldcred->cr_svuid) { + change_svuid(newcred, uid); setsugid(p); } } @@ -495,10 +496,12 @@ * In all permitted cases, we are changing the euid. * Copy credentials so other references do not see our changes. */ - if (pc->pc_ucred->cr_uid != uid) { - change_euid(p, uid); + if (uid != oldcred->cr_uid) { + change_euid(newcred, uid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -513,23 +516,31 @@ struct proc *p; struct seteuid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register uid_t euid; int error; euid = uap->euid; - if (euid != pc->p_ruid && /* allow seteuid(getuid()) */ - euid != pc->p_svuid && /* allow seteuid(saved uid) */ - (error = suser_xxx(0, p, PRISON_ROOT))) + /* + * The new effective uid must equal the current real or saved + * uid. Appropriate privilege may override this restriction. + */ + if (euid != oldcred->cr_ruid && /* allow seteuid(getuid()) */ + euid != oldcred->cr_svuid && /* allow seteuid(saved uid) */ + (error = suser_xxx(oldcred, NULL, PRISON_ROOT))) return (error); + /* * Everything's okay, do it. Copy credentials so other references do * not see our changes. */ - if (pc->pc_ucred->cr_uid != euid) { - change_euid(p, euid); + newcred = crdup(oldcred); + if (oldcred->cr_uid != euid) { + change_euid(newcred, euid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -544,7 +555,7 @@ struct proc *p; struct setgid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register gid_t gid; int error; @@ -560,16 +571,17 @@ * For notes on the logic here, see setuid() above. 
*/ gid = uap->gid; - if (gid != pc->p_rgid && /* allow setgid(getgid()) */ + if (gid != oldcred->cr_rgid && /* allow setgid(getgid()) */ #ifdef _POSIX_SAVED_IDS - gid != pc->p_svgid && /* allow setgid(saved gid) */ + gid != oldcred->cr_svgid && /* allow setgid(saved gid) */ #endif #ifdef POSIX_APPENDIX_B_4_2_2 /* Use BSD-compat clause from B.4.2.2 */ - gid != pc->pc_ucred->cr_groups[0] && /* allow setgid(getegid()) */ + gid != oldcred->cr_groups[0] && /* allow setgid(getegid()) */ #endif - (error = suser_xxx(0, p, PRISON_ROOT))) + (error = suser_xxx(oldcred, NULL, PRISON_ROOT))) return (error); + newcred = crdup(oldcred); #ifdef _POSIX_SAVED_IDS /* * Do we have "appropriate privileges" (are we root or gid == egid) @@ -577,16 +589,16 @@ */ if ( #ifdef POSIX_APPENDIX_B_4_2_2 /* use the clause from B.4.2.2 */ - gid == pc->pc_ucred->cr_groups[0] || + gid == oldcred->cr_groups[0] || #endif - suser_xxx(0, p, PRISON_ROOT) == 0) /* we are using privs */ + suser_xxx(oldcred, NULL, PRISON_ROOT) == 0) /* we are using privs */ #endif { /* * Set real gid */ - if (pc->p_rgid != gid) { - pc->p_rgid = gid; + if (oldcred->cr_rgid != gid) { + change_rgid(newcred, gid); setsugid(p); } /* @@ -596,8 +608,8 @@ * the security of setegid() depends on it. B.4.2.2 says it * is important that we should do this. */ - if (pc->p_svgid != gid) { - pc->p_svgid = gid; + if (oldcred->cr_svgid != gid) { + change_svgid(newcred, gid); setsugid(p); } } @@ -605,11 +617,12 @@ * In all cases permitted cases, we are changing the egid. * Copy credentials so other references do not see our changes. 
*/ - if (pc->pc_ucred->cr_groups[0] != gid) { - pc->pc_ucred = crcopy(pc->pc_ucred); - pc->pc_ucred->cr_groups[0] = gid; + if (oldcred->cr_groups[0] != gid) { + change_egid(newcred, gid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -624,20 +637,27 @@ struct proc *p; struct setegid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register gid_t egid; int error; egid = uap->egid; - if (egid != pc->p_rgid && /* allow setegid(getgid()) */ - egid != pc->p_svgid && /* allow setegid(saved gid) */ - (error = suser_xxx(0, p, PRISON_ROOT))) + /* + * The new effective gid must be equal to either the current real or + * saved gid. Appropriate privilege may override this restriction. + */ + if (egid != oldcred->cr_rgid && /* allow setegid(getgid()) */ + egid != oldcred->cr_svgid && /* allow setegid(saved gid) */ + (error = suser_xxx(oldcred, NULL, PRISON_ROOT))) return (error); - if (pc->pc_ucred->cr_groups[0] != egid) { - pc->pc_ucred = crcopy(pc->pc_ucred); - pc->pc_ucred->cr_groups[0] = egid; + + newcred = crdup(oldcred); + if (oldcred->cr_groups[0] != egid) { + change_egid(newcred, egid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -653,11 +673,11 @@ struct proc *p; struct setgroups_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register u_int ngrp; int error; - if ((error = suser_xxx(0, p, PRISON_ROOT))) + if ((error = suser_xxx(oldcred, NULL, PRISON_ROOT))) return (error); ngrp = uap->gidsetsize; if (ngrp > NGROUPS) @@ -666,7 +686,7 @@ * XXX A little bit lazy here. We could test if anything has * changed before crcopy() and setting P_SUGID. */ - pc->pc_ucred = crcopy(pc->pc_ucred); + newcred = crdup(oldcred); if (ngrp < 1) { /* * setgroups(0, NULL) is a legitimate way of clearing the @@ -674,14 +694,18 @@ * have the egid in the groups[0]). 
We risk security holes * when running non-BSD software if we do not do the same. */ - pc->pc_ucred->cr_ngroups = 1; + newcred->cr_ngroups = 1; } else { if ((error = copyin((caddr_t)uap->gidset, - (caddr_t)pc->pc_ucred->cr_groups, ngrp * sizeof(gid_t)))) + (caddr_t)newcred->cr_groups, ngrp * sizeof(gid_t)))) { + crfree(newcred); return (error); - pc->pc_ucred->cr_ngroups = ngrp; + } + newcred->cr_ngroups = ngrp; } setsugid(p); + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -697,31 +721,52 @@ register struct proc *p; struct setreuid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register uid_t ruid, euid; int error; ruid = uap->ruid; euid = uap->euid; - if (((ruid != (uid_t)-1 && ruid != pc->p_ruid && ruid != pc->p_svuid) || - (euid != (uid_t)-1 && euid != pc->pc_ucred->cr_uid && - euid != pc->p_ruid && euid != pc->p_svuid)) && - (error = suser_xxx(0, p, PRISON_ROOT)) != 0) + /* + * If an real uid update is requested, the requested real uid must + * be equal to the current real or saved uid. If an effective uid + * update is requested, the requested euid must be equal to the + * current effective uid, real uid, or saved uid. Appropriate + * privilege may override these restrictions. 
+ */ + if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid && + ruid != oldcred->cr_svuid) || + (euid != (uid_t)-1 && euid != oldcred->cr_uid && + euid != oldcred->cr_ruid && euid != oldcred->cr_svuid)) && + (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0) return (error); - if (euid != (uid_t)-1 && pc->pc_ucred->cr_uid != euid) { - change_euid(p, euid); + newcred = crdup(oldcred); + if (euid != (uid_t)-1 && oldcred->cr_uid != euid) { + change_euid(newcred, euid); setsugid(p); } - if (ruid != (uid_t)-1 && pc->p_ruid != ruid) { - change_ruid(p, ruid); + if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) { + change_ruid(newcred, ruid); setsugid(p); } - if ((ruid != (uid_t)-1 || pc->pc_ucred->cr_uid != pc->p_ruid) && - pc->p_svuid != pc->pc_ucred->cr_uid) { - pc->p_svuid = pc->pc_ucred->cr_uid; + /* + * XXX: What is this intended to accomplish? In which cases should + * it be looking at the old values, and in which, the new values? + * + * Note current behavior is: + * If the ruid update is requested (even if the ruid is not changed) + * or the euid is not equal to the value of the ruid, a difference + * in the svuid and the euid will result in the svuid being + * updated to the new value of the euid. 
+ */ + if ((ruid != (uid_t)-1 || newcred->cr_uid != newcred->cr_ruid) && + newcred->cr_svuid != newcred->cr_uid) { + change_svuid(newcred, newcred->cr_uid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -737,30 +782,49 @@ register struct proc *p; struct setregid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register gid_t rgid, egid; int error; rgid = uap->rgid; egid = uap->egid; - if (((rgid != (gid_t)-1 && rgid != pc->p_rgid && rgid != pc->p_svgid) || - (egid != (gid_t)-1 && egid != pc->pc_ucred->cr_groups[0] && - egid != pc->p_rgid && egid != pc->p_svgid)) && - (error = suser_xxx(0, p, PRISON_ROOT)) != 0) + /* + * If a real gid update is requested, the requested real gid must + * be equal to the current real or saved gid. If an effective gid + * update is requested, the requested effective gid must be equal + * to the current effective gid, the current real gid, or the + * current saved gid. Appropriate privilege may override this + * restriction. 
+ */ + if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid && + rgid != oldcred->cr_svgid) || + (egid != (gid_t)-1 && egid != oldcred->cr_groups[0] && + egid != oldcred->cr_rgid && egid != oldcred->cr_svgid)) && + (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0) return (error); - if (egid != (gid_t)-1 && pc->pc_ucred->cr_groups[0] != egid) { - pc->pc_ucred = crcopy(pc->pc_ucred); - pc->pc_ucred->cr_groups[0] = egid; + newcred = crdup(oldcred); + if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) { + change_egid(newcred, egid); setsugid(p); } - if (rgid != (gid_t)-1 && pc->p_rgid != rgid) { - pc->p_rgid = rgid; + if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) { + change_rgid(newcred, rgid); setsugid(p); } - if ((rgid != (gid_t)-1 || pc->pc_ucred->cr_groups[0] != pc->p_rgid) && - pc->p_svgid != pc->pc_ucred->cr_groups[0]) { - pc->p_svgid = pc->pc_ucred->cr_groups[0]; + /* + * XXX: What is this intended to accomplish? In which cases should + * it be looking at the old values, and in which, the new values? + * + * Note current behavior is: + * If the rgid update is requested (even if the rgid is not changed) + * or the egid is not equal to the value of the rgid, a difference + * in the svgid and the egid will result in the svgid being + * updated to the new value of the egid. 
+ */ + if ((rgid != (gid_t)-1 || newcred->cr_groups[0] != newcred->cr_rgid) && + newcred->cr_svgid != newcred->cr_groups[0]) { + change_svgid(newcred, newcred->cr_groups[0]); setsugid(p); } return (0); @@ -784,33 +848,40 @@ register struct proc *p; struct setresuid_args *uap; { - register struct pcred *pc = p->p_cred; + register struct ucred *oldcred = p->p_ucred, *newcred; register uid_t ruid, euid, suid; int error; ruid = uap->ruid; euid = uap->euid; suid = uap->suid; - if (((ruid != (uid_t)-1 && ruid != pc->p_ruid && ruid != pc->p_svuid && - ruid != pc->pc_ucred->cr_uid) || - (euid != (uid_t)-1 && euid != pc->p_ruid && euid != pc->p_svuid && - euid != pc->pc_ucred->cr_uid) || - (suid != (uid_t)-1 && suid != pc->p_ruid && suid != pc->p_svuid && - suid != pc->pc_ucred->cr_uid)) && - (error = suser_xxx(0, p, PRISON_ROOT)) != 0) + if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid && + ruid != oldcred->cr_svuid && + ruid != oldcred->cr_uid) || + (euid != (uid_t)-1 && euid != oldcred->cr_ruid && + euid != oldcred->cr_svuid && + euid != oldcred->cr_uid) || + (suid != (uid_t)-1 && suid != oldcred->cr_ruid && + suid != oldcred->cr_svuid && + suid != oldcred->cr_uid)) && + (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0) return (error); - if (euid != (uid_t)-1 && pc->pc_ucred->cr_uid != euid) { - change_euid(p, euid); + + newcred = crdup(oldcred); + if (euid != (uid_t)-1 && oldcred->cr_uid != euid) { + change_euid(newcred, euid); setsugid(p); } - if (ruid != (uid_t)-1 && pc->p_ruid != ruid) { - change_ruid(p, ruid); + if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) { + change_ruid(newcred, ruid); setsugid(p); } - if (suid != (uid_t)-1 && pc->p_svuid != suid) { - pc->p_svuid = suid; + if (suid != (uid_t)-1 && oldcred->cr_svuid != suid) { + change_svuid(newcred, suid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -832,35 +903,40 @@ register struct proc *p; struct setresgid_args *uap; { - register struct pcred *pc = p->p_cred; + 
register struct ucred *oldcred = p->p_ucred, *newcred; register gid_t rgid, egid, sgid; int error; rgid = uap->rgid; egid = uap->egid; sgid = uap->sgid; - if (((rgid != (gid_t)-1 && rgid != pc->p_rgid && rgid != pc->p_svgid && - rgid != pc->pc_ucred->cr_groups[0]) || - (egid != (gid_t)-1 && egid != pc->p_rgid && egid != pc->p_svgid && - egid != pc->pc_ucred->cr_groups[0]) || - (sgid != (gid_t)-1 && sgid != pc->p_rgid && sgid != pc->p_svgid && - sgid != pc->pc_ucred->cr_groups[0])) && - (error = suser_xxx(0, p, PRISON_ROOT)) != 0) + if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid && + rgid != oldcred->cr_svgid && + rgid != oldcred->cr_groups[0]) || + (egid != (gid_t)-1 && egid != oldcred->cr_rgid && + egid != oldcred->cr_svgid && + egid != oldcred->cr_groups[0]) || + (sgid != (gid_t)-1 && sgid != oldcred->cr_rgid && + sgid != oldcred->cr_svgid && + sgid != oldcred->cr_groups[0])) && + (error = suser_xxx(oldcred, NULL, PRISON_ROOT)) != 0) return (error); - if (egid != (gid_t)-1 && pc->pc_ucred->cr_groups[0] != egid) { - pc->pc_ucred = crcopy(pc->pc_ucred); - pc->pc_ucred->cr_groups[0] = egid; + newcred = crdup(oldcred); + if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) { + change_egid(newcred, egid); setsugid(p); } - if (rgid != (gid_t)-1 && pc->p_rgid != rgid) { - pc->p_rgid = rgid; + if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) { + change_rgid(newcred, rgid); setsugid(p); } - if (sgid != (gid_t)-1 && pc->p_svgid != sgid) { - pc->p_svgid = sgid; + if (sgid != (gid_t)-1 && oldcred->cr_svgid != sgid) { + change_svgid(newcred, sgid); setsugid(p); } + p->p_ucred = newcred; + crfree(oldcred); return (0); } @@ -877,18 +953,18 @@ register struct proc *p; struct getresuid_args *uap; { - struct pcred *pc = p->p_cred; + struct ucred *cred = p->p_ucred; int error1 = 0, error2 = 0, error3 = 0; if (uap->ruid) - error1 = copyout((caddr_t)&pc->p_ruid, - (caddr_t)uap->ruid, sizeof(pc->p_ruid)); + error1 = copyout((caddr_t)&cred->cr_ruid, + (caddr_t)uap->ruid, 
sizeof(cred->cr_ruid)); if (uap->euid) - error2 = copyout((caddr_t)&pc->pc_ucred->cr_uid, - (caddr_t)uap->euid, sizeof(pc->pc_ucred->cr_uid)); + error2 = copyout((caddr_t)&cred->cr_uid, + (caddr_t)uap->euid, sizeof(cred->cr_uid)); if (uap->suid) - error3 = copyout((caddr_t)&pc->p_svuid, - (caddr_t)uap->suid, sizeof(pc->p_svuid)); + error3 = copyout((caddr_t)&cred->cr_svuid, + (caddr_t)uap->suid, sizeof(cred->cr_svuid)); return error1 ? error1 : (error2 ? error2 : error3); } @@ -905,18 +981,18 @@ register struct proc *p; struct getresgid_args *uap; { - struct pcred *pc = p->p_cred; + struct ucred *cred = p->p_ucred; int error1 = 0, error2 = 0, error3 = 0; if (uap->rgid) - error1 = copyout((caddr_t)&pc->p_rgid, - (caddr_t)uap->rgid, sizeof(pc->p_rgid)); + error1 = copyout((caddr_t)&cred->cr_rgid, + (caddr_t)uap->rgid, sizeof(cred->cr_rgid)); if (uap->egid) - error2 = copyout((caddr_t)&pc->pc_ucred->cr_groups[0], - (caddr_t)uap->egid, sizeof(pc->pc_ucred->cr_groups[0])); + error2 = copyout((caddr_t)&cred->cr_groups[0], + (caddr_t)uap->egid, sizeof(cred->cr_groups[0])); if (uap->sgid) - error3 = copyout((caddr_t)&pc->p_svgid, - (caddr_t)uap->sgid, sizeof(pc->p_svgid)); + error3 = copyout((caddr_t)&cred->cr_svgid, + (caddr_t)uap->sgid, sizeof(cred->cr_svgid)); return error1 ? error1 : (error2 ? error2 : error3); } @@ -1113,10 +1189,10 @@ * Generally, the object credential's ruid or svuid must match the * subject credential's ruid or euid. */ - if (p1->p_cred->p_ruid != p2->p_cred->p_ruid && - p1->p_cred->p_ruid != p2->p_cred->p_svuid && - p1->p_ucred->cr_uid != p2->p_cred->p_ruid && - p1->p_ucred->cr_uid != p2->p_cred->p_svuid) { + if (p1->p_ucred->cr_ruid != p2->p_ucred->cr_ruid && + p1->p_ucred->cr_ruid != p2->p_ucred->cr_svuid && + p1->p_ucred->cr_uid != p2->p_ucred->cr_ruid && + p1->p_ucred->cr_uid != p2->p_ucred->cr_svuid) { /* Not permitted, try privilege. 
*/ error = suser_xxx(NULL, p1, PRISON_ROOT); if (error) @@ -1140,9 +1216,9 @@ if ((error = prison_check(p1->p_ucred, p2->p_ucred))) return (error); - if (p1->p_cred->p_ruid == p2->p_cred->p_ruid) + if (p1->p_ucred->cr_ruid == p2->p_ucred->cr_ruid) return (0); - if (p1->p_ucred->cr_uid == p2->p_cred->p_ruid) + if (p1->p_ucred->cr_uid == p2->p_ucred->cr_ruid) return (0); if (!suser_xxx(0, p1, PRISON_ROOT)) { @@ -1178,9 +1254,9 @@ /* not owned by you, has done setuid (unless you're root) */ /* add a CAP_SYS_PTRACE here? */ - if (p1->p_cred->pc_ucred->cr_uid != p2->p_cred->p_ruid || - p1->p_cred->p_ruid != p2->p_cred->p_ruid || - p1->p_cred->p_svuid != p2->p_cred->p_ruid || + if (p1->p_ucred->cr_uid != p2->p_ucred->cr_ruid || + p1->p_ucred->cr_ruid != p2->p_ucred->cr_ruid || + p1->p_ucred->cr_svuid != p2->p_ucred->cr_ruid || p2->p_flag & P_SUGID) { if ((error = suser_xxx(0, p1, PRISON_ROOT))) return (error); @@ -1308,6 +1384,7 @@ *newcr = *cr; mtx_init(&newcr->cr_mtx, "ucred", MTX_DEF); uihold(newcr->cr_uidinfo); + uihold(newcr->cr_ruidinfo); if (jailed(newcr)) prison_hold(newcr->cr_prison); newcr->cr_ref = 1; @@ -1375,48 +1452,123 @@ } /* - * Helper function to change the effective uid of a process + * change_euid(): Change a process's effective uid. + * Arguments: struct ucred *newcred, uid_t euid + * Returns: none + * Locks: none + * Side effects: newcred->cr_uid and newcred->cr_uidinfo will be modified. + * References: newcred must be an exclusive credential reference for the + * duration of the call. + * Notes: none */ void -change_euid(p, euid) - struct proc *p; - uid_t euid; +change_euid(newcred, euid) + struct ucred *newcred; + uid_t euid; { - struct pcred *pc; - struct uidinfo *uip; - pc = p->p_cred; - /* - * crcopy is essentially a NOP if ucred has a reference count - * of 1, which is true if it has already been copied. 
- */ - pc->pc_ucred = crcopy(pc->pc_ucred); - uip = pc->pc_ucred->cr_uidinfo; - pc->pc_ucred->cr_uid = euid; - pc->pc_ucred->cr_uidinfo = uifind(euid); - uifree(uip); + newcred->cr_uid = euid; + uifree(newcred->cr_uidinfo); + newcred->cr_uidinfo = uifind(euid); } /* - * Helper function to change the real uid of a process - * - * The per-uid process count for this process is transfered from - * the old uid to the new uid. + * change_egid(): Change a process's effective gid. + * Arguments: struct ucred *newcred, gid_t egid + * Returns: none + * Locks: none + * Side effects: newcred->cr_gid will be modified. + * References: newcred must be an exclusive credential reference for the + * duration of the call. + * Notes: none */ void -change_ruid(p, ruid) - struct proc *p; - uid_t ruid; +change_egid(newcred, egid) + struct ucred *newcred; + gid_t egid; +{ + + newcred->cr_groups[0] = egid; +} + +/* + * change_ruid(): Change a process's real uid. + * Arguments: struct ucred *newcred, uid_t ruid + * Returns: none + * Locks: none + * Side effects: newcred->cr_ruid will be updated, newcred->cr_ruidinfo + * will be updated, and the old and new cr_ruidinfo proc + * counts will be updated. + * References: newcred must be an exclusive credential reference for the + * duration of the call. + * Notes: none + */ +void +change_ruid(newcred, ruid) + struct ucred *newcred; + uid_t ruid; +{ + + (void)chgproccnt(newcred->cr_ruidinfo, -1, 0); + newcred->cr_ruid = ruid; + uifree(newcred->cr_ruidinfo); + newcred->cr_ruidinfo = uifind(ruid); + (void)chgproccnt(newcred->cr_ruidinfo, 1, 0); +} + +/* + * change_rgid(): Change a process's real gid. + * Arguments: struct ucred *newcred, gid_t rgid + * Returns: none + * Locks: none + * Side effects: newcred->cr_rgid will be updated. + * References: newcred must be an exclusive credential reference for the + * duration of the call. 
+ * Notes: none + */ +void +change_rgid(newcred, rgid) + struct ucred *newcred; + gid_t rgid; +{ + + newcred->cr_rgid = rgid; +} + +/* + * change_svuid(): Change a process's saved uid. + * Arguments: struct ucred *newcred, uid_t svuid + * Returns: none + * Locks: none + * Side effects: newcred->cr_svuid will be updated. + * References: newcred must be an exclusive credential reference for the + * duration of the call. + * Notes: none + */ +void +change_svuid(newcred, svuid) + struct ucred *newcred; + uid_t svuid; +{ + + newcred->cr_svuid = svuid; +} + +/* + * change_svgid(): Change a process's saved gid. + * Arguments: struct ucred *newcred, gid_t svgid + * Returns: none + * Locks: none + * Side effects: newcred->cr_svgid will be updated. + * References: newcred must be an exclusive credential reference for the + * duration of the call. + * Notes: none + */ +void +change_svgid(newcred, svgid) + struct ucred *newcred; + gid_t svgid; { - struct pcred *pc; - struct uidinfo *uip; - pc = p->p_cred; - (void)chgproccnt(pc->p_uidinfo, -1, 0); - uip = pc->p_uidinfo; - /* It is assumed that pcred is not shared between processes */ - pc->p_ruid = ruid; - pc->p_uidinfo = uifind(ruid); - (void)chgproccnt(pc->p_uidinfo, 1, 0); - uifree(uip); + newcred->cr_svgid = svgid; } Index: kern/kern_sig.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_sig.c,v retrieving revision 1.117 diff -u -r1.117 kern_sig.c --- kern/kern_sig.c 2001/04/27 19:28:23 1.117 +++ kern/kern_sig.c 2001/05/04 16:48:36 @@ -98,14 +98,14 @@ "Log processes quitting on abnormal signals to syslog(3)"); /* - * Policy -- Can real uid ruid with ucred uc send a signal to process q? + * Policy -- Can ucred cr1 send SIGIO to process cr2? 
*/ -#define CANSIGIO(ruid, uc, q) \ - ((uc)->cr_uid == 0 || \ - (ruid) == (q)->p_cred->p_ruid || \ - (uc)->cr_uid == (q)->p_cred->p_ruid || \ - (ruid) == (q)->p_ucred->cr_uid || \ - (uc)->cr_uid == (q)->p_ucred->cr_uid) +#define CANSIGIO(cr1, cr2) \ + ((cr1)->cr_uid == 0 || \ + (cr1)->cr_ruid == (cr2)->cr_ruid || \ + (cr1)->cr_uid == (cr2)->cr_ruid || \ + (cr1)->cr_ruid == (cr2)->cr_uid || \ + (cr1)->cr_uid == (cr2)->cr_uid) int sugid_coredump; SYSCTL_INT(_kern, OID_AUTO, sugid_coredump, CTLFLAG_RW, @@ -1609,8 +1609,8 @@ { CTR3(KTR_PROC, "killproc: proc %p (pid %d, %s)", p, p->p_pid, p->p_comm); - log(LOG_ERR, "pid %d (%s), uid %d, was killed: %s\n", p->p_pid, p->p_comm, - p->p_cred && p->p_ucred ? p->p_ucred->cr_uid : -1, why); + log(LOG_ERR, "pid %d (%s), uid %d, was killed: %s\n", p->p_pid, + p->p_comm, p->p_ucred ? p->p_ucred->cr_uid : -1, why); PROC_LOCK(p); psignal(p, SIGKILL); PROC_UNLOCK(p); @@ -1649,7 +1649,7 @@ log(LOG_INFO, "pid %d (%s), uid %d: exited on signal %d%s\n", p->p_pid, p->p_comm, - p->p_cred && p->p_ucred ? p->p_ucred->cr_uid : -1, + p->p_ucred ? p->p_ucred->cr_uid : -1, sig &~ WCOREFLAG, sig & WCOREFLAG ? 
" (core dumped)" : ""); } else { @@ -1869,8 +1869,7 @@ if (sigio->sio_pgid > 0) { PROC_LOCK(sigio->sio_proc); - if (CANSIGIO(sigio->sio_ruid, sigio->sio_ucred, - sigio->sio_proc)) + if (CANSIGIO(sigio->sio_ucred, sigio->sio_proc->p_ucred)) psignal(sigio->sio_proc, sig); PROC_UNLOCK(sigio->sio_proc); } else if (sigio->sio_pgid < 0) { @@ -1878,7 +1877,7 @@ LIST_FOREACH(p, &sigio->sio_pgrp->pg_members, p_pglist) { PROC_LOCK(p); - if (CANSIGIO(sigio->sio_ruid, sigio->sio_ucred, p) && + if (CANSIGIO(sigio->sio_ucred, p->p_ucred) && (checkctty == 0 || (p->p_flag & P_CONTROLT))) psignal(p, sig); PROC_UNLOCK(p); Index: kern/uipc_usrreq.c =================================================================== RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v retrieving revision 1.65 diff -u -r1.65 uipc_usrreq.c --- kern/uipc_usrreq.c 2001/05/01 08:12:59 1.65 +++ kern/uipc_usrreq.c 2001/05/06 00:45:37 @@ -988,8 +988,8 @@ if (cm->cmsg_type == SCM_CREDS) { cmcred = (struct cmsgcred *)(cm + 1); cmcred->cmcred_pid = p->p_pid; - cmcred->cmcred_uid = p->p_cred->p_ruid; - cmcred->cmcred_gid = p->p_cred->p_rgid; + cmcred->cmcred_uid = p->p_ucred->cr_ruid; + cmcred->cmcred_gid = p->p_ucred->cr_rgid; cmcred->cmcred_euid = p->p_ucred->cr_uid; cmcred->cmcred_ngroups = MIN(p->p_ucred->cr_ngroups, CMGROUP_MAX); Index: kern/vfs_syscalls.c =================================================================== RCS file: /home/ncvs/src/sys/kern/vfs_syscalls.c,v retrieving revision 1.189 diff -u -r1.189 vfs_syscalls.c --- kern/vfs_syscalls.c 2001/04/29 02:44:49 1.189 +++ kern/vfs_syscalls.c 2001/05/04 16:53:44 @@ -1711,8 +1711,8 @@ * rather than to modify the potentially shared process structure. 
*/ tmpcred = crdup(cred); - tmpcred->cr_uid = p->p_cred->p_ruid; - tmpcred->cr_groups[0] = p->p_cred->p_rgid; + tmpcred->cr_uid = cred->cr_ruid; + tmpcred->cr_groups[0] = cred->cr_rgid; p->p_ucred = tmpcred; NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | NOOBJ, UIO_USERSPACE, SCARG(uap, path), p); @@ -3799,7 +3799,7 @@ } cnt = auio.uio_resid; error = VOP_SETEXTATTR(vp, attrnamespace, attrname, &auio, - p->p_cred->pc_ucred, p); + p->p_ucred, p); cnt -= auio.uio_resid; p->p_retval[0] = cnt; done: @@ -3912,7 +3912,7 @@ } cnt = auio.uio_resid; error = VOP_GETEXTATTR(vp, attrnamespace, attrname, &auio, - p->p_cred->pc_ucred, p); + p->p_ucred, p); cnt -= auio.uio_resid; p->p_retval[0] = cnt; done: @@ -3995,7 +3995,7 @@ vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p); error = VOP_SETEXTATTR(vp, attrnamespace, attrname, NULL, - p->p_cred->pc_ucred, p); + p->p_ucred, p); VOP_UNLOCK(vp, 0, p); vn_finished_write(mp); Index: miscfs/procfs/procfs_status.c =================================================================== RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_status.c,v retrieving revision 1.29 diff -u -r1.29 procfs_status.c --- miscfs/procfs/procfs_status.c 2001/05/01 08:13:09 1.29 +++ miscfs/procfs/procfs_status.c 2001/05/06 00:45:44 @@ -153,11 +153,11 @@ ps += snprintf(ps, psbuf + sizeof(psbuf) - ps, " %lu %lu %lu", (u_long)cr->cr_uid, - (u_long)p->p_cred->p_ruid, - (u_long)p->p_cred->p_rgid); + (u_long)cr->cr_ruid, + (u_long)cr->cr_rgid); DOCHECK(); - /* egid (p->p_cred->p_svgid) is equal to cr_ngroups[0] + /* egid (cr->cr_svgid) is equal to cr_ngroups[0] see also getegid(2) in /sys/kern/kern_prot.c */ for (i = 0; i < cr->cr_ngroups; i++) { Index: miscfs/procfs/procfs_vnops.c =================================================================== RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_vnops.c,v retrieving revision 1.95 diff -u -r1.95 procfs_vnops.c --- miscfs/procfs/procfs_vnops.c 2001/05/01 08:13:09 1.95 +++ miscfs/procfs/procfs_vnops.c 2001/05/06 00:45:44 @@ -404,7 
+404,7 @@ procp = PFIND(pfs->pfs_pid); if (procp == NULL) return (ENOENT); - if (procp->p_cred == NULL || procp->p_ucred == NULL) { + if (procp->p_ucred == NULL) { PROC_UNLOCK(procp); return (ENOENT); } @@ -942,8 +942,7 @@ */ case Pfile: procp = PFIND(pfs->pfs_pid); - if (procp == NULL || procp->p_cred == NULL || - procp->p_ucred == NULL) { + if (procp == NULL || procp->p_ucred == NULL) { if (procp != NULL) PROC_UNLOCK(procp); printf("procfs_readlink: pid %d disappeared\n", Index: nfs/nfs_lock.c =================================================================== RCS file: /home/ncvs/src/sys/nfs/nfs_lock.c,v retrieving revision 1.4 diff -u -r1.4 nfs_lock.c --- nfs/nfs_lock.c 2001/05/01 08:13:14 1.4 +++ nfs/nfs_lock.c 2001/05/06 00:47:01 @@ -236,9 +236,11 @@ /* Let root, or someone who once was root (lockd generally * switches to the daemon uid once it is done setting up) make - * this call + * this call. + * + * XXX */ - if ((error = suser(p)) != 0 && p->p_cred->p_svuid != 0) + if ((error = suser(p)) != 0 && p->p_ucred->cr_svuid != 0) return (error); /* the version should match, or we're out of sync */ Index: posix4/p1003_1b.c =================================================================== RCS file: /home/ncvs/src/sys/posix4/p1003_1b.c,v retrieving revision 1.8 diff -u -r1.8 p1003_1b.c --- posix4/p1003_1b.c 2001/05/01 08:13:16 1.8 +++ posix4/p1003_1b.c 2001/05/06 00:47:11 @@ -68,16 +68,17 @@ /* * This is stolen from CANSIGNAL in kern_sig: * - * Can process p, with pcred pc, do "write flavor" operations to process q? + * Can process with credential cr1 do "write flavor" operations to credential + * cr2. This check needs to use generalized checks. 
*/ -#define CAN_AFFECT(p, pc, q) \ - ((pc)->pc_ucred->cr_uid == 0 || \ - (pc)->p_ruid == (q)->p_cred->p_ruid || \ - (pc)->pc_ucred->cr_uid == (q)->p_cred->p_ruid || \ - (pc)->p_ruid == (q)->p_ucred->cr_uid || \ - (pc)->pc_ucred->cr_uid == (q)->p_ucred->cr_uid) +#define CAN_AFFECT(cr1, cr2) \ + ((cr1)->cr_uid == 0 || \ + (cr1)->cr_ruid == (cr2)->cr_ruid || \ + (cr1)->cr_uid == (cr2)->cr_ruid || \ + (cr1)->cr_ruid == (cr2)->cr_uid || \ + (cr1)->cr_uid == (cr2)->cr_uid) #else -#define CAN_AFFECT(p, pc, q) ((pc)->pc_ucred->cr_uid == 0) +#define CAN_AFFECT(cr1, cr2) ((cr1)->cr_uid == 0) #endif /* @@ -99,7 +100,7 @@ { /* Enforce permission policy. */ - if (CAN_AFFECT(p, p->p_cred, other_proc)) + if (CAN_AFFECT(p->p_ucred, other_proc->p_ucred)) *pp = other_proc; else ret = EPERM; Index: sys/filedesc.h =================================================================== RCS file: /home/ncvs/src/sys/sys/filedesc.h,v retrieving revision 1.26 diff -u -r1.26 filedesc.h --- sys/filedesc.h 2000/11/18 21:01:04 1.26 +++ sys/filedesc.h 2001/05/04 15:52:27 @@ -117,7 +117,6 @@ struct sigio **sio_myref; /* location of the pointer that holds * the reference to this structure */ struct ucred *sio_ucred; /* current credentials */ - uid_t sio_ruid; /* real user id */ pid_t sio_pgid; /* pgid for signals */ }; #define sio_proc sio_u.siu_proc Index: sys/proc.h =================================================================== RCS file: /home/ncvs/src/sys/sys/proc.h,v retrieving revision 1.161 diff -u -r1.161 proc.h --- sys/proc.h 2001/04/27 19:28:25 1.161 +++ sys/proc.h 2001/05/03 19:55:27 @@ -156,7 +156,7 @@ LIST_ENTRY(proc) p_list; /* (d) List of all processes. */ /* substructures: */ - struct pcred *p_cred; /* (c + k) Process owner's identity. */ + struct ucred *p_ucred; /* (c + k) Process owner's identity. */ struct filedesc *p_fd; /* (b) Ptr to open files structure. */ struct pstats *p_stats; /* (b) Accounting/statistics (CPU). */ struct plimit *p_limit; /* (m) Process limits. 
*/ @@ -166,7 +166,6 @@ #define p_sigignore p_procsig->ps_sigignore #define p_sigcatch p_procsig->ps_sigcatch -#define p_ucred p_cred->pc_ucred #define p_rlimit p_limit->pl_rlimit int p_flag; /* (c) P_* flags. */ @@ -336,23 +335,6 @@ #define P_CAN_SEE 1 #define P_CAN_SCHED 3 #define P_CAN_DEBUG 4 - -/* - * MOVE TO ucred.h? - * - * Shareable process credentials (always resident). This includes a reference - * to the current user credentials as well as real and saved ids that may be - * used to change ids. - */ -struct pcred { - struct ucred *pc_ucred; /* Current credentials. */ - uid_t p_ruid; /* Real user id. */ - uid_t p_svuid; /* Saved effective user id. */ - gid_t p_rgid; /* Real group id. */ - gid_t p_svgid; /* Saved effective group id. */ - int p_refcnt; /* Number of references. */ - struct uidinfo *p_uidinfo; /* Per uid resource consumption. */ -}; #ifdef _KERNEL Index: sys/ucred.h =================================================================== RCS file: /home/ncvs/src/sys/sys/ucred.h,v retrieving revision 1.23 diff -u -r1.23 ucred.h --- sys/ucred.h 2001/05/01 08:13:18 1.23 +++ sys/ucred.h 2001/05/06 00:47:17 @@ -50,9 +50,14 @@ struct ucred { u_int cr_ref; /* reference count */ uid_t cr_uid; /* effective user id */ + uid_t cr_ruid; /* real user id */ + uid_t cr_svuid; /* saved user id */ short cr_ngroups; /* number of groups */ gid_t cr_groups[NGROUPS]; /* groups */ - struct uidinfo *cr_uidinfo; /* per uid resource consumption */ + gid_t cr_rgid; /* real group id */ + gid_t cr_svgid; /* saved group id */ + struct uidinfo *cr_uidinfo; /* per euid resource consumption */ + struct uidinfo *cr_ruidinfo; /* per ruid resource consumption */ struct prison *cr_prison; /* jail(4) */ struct mtx cr_mtx; /* protect refcount */ }; @@ -77,8 +82,12 @@ struct proc; -void change_euid __P((struct proc *p, uid_t euid)); -void change_ruid __P((struct proc *p, uid_t ruid)); +void change_euid __P((struct ucred *newcred, uid_t euid)); +void change_egid __P((struct ucred 
*newcred, 
gid_t egid)); +void change_ruid __P((struct ucred *newcred, uid_t ruid)); +void change_rgid __P((struct ucred *newcred, gid_t rgid)); +void change_svuid __P((struct ucred *newcred, uid_t svuid)); +void change_svgid __P((struct ucred *newcred, gid_t svgid)); struct ucred *crcopy __P((struct ucred *cr)); struct ucred *crdup __P((struct ucred *cr)); void crfree __P((struct ucred *cr)); Index: ufs/ufs/ufs_extattr.c =================================================================== RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_extattr.c,v retrieving revision 1.31 diff -u -r1.31 ufs_extattr.c --- ufs/ufs/ufs_extattr.c 2001/04/29 02:45:28 1.31 +++ ufs/ufs/ufs_extattr.c 2001/05/04 18:22:17 @@ -621,7 +621,7 @@ auio.uio_rw = UIO_READ; auio.uio_procp = (struct proc *) p; - VOP_LEASE(backing_vnode, p, p->p_cred->pc_ucred, LEASE_WRITE); + VOP_LEASE(backing_vnode, p, p->p_ucred, LEASE_WRITE); vn_lock(backing_vnode, LK_SHARED | LK_NOPAUSE | LK_RETRY, p); error = VOP_READ(backing_vnode, &auio, IO_NODELOCKED, ump->um_extattr.uepm_ucred); @@ -702,7 +702,7 @@ * Processes with privilege, but in jail, are not allowed to * configure extended attributes. 
*/ - if ((error = suser_xxx(p->p_cred->pc_ucred, p, 0))) { + if ((error = suser_xxx(p->p_ucred, p, 0))) { if (filename_vp != NULL) VOP_UNLOCK(filename_vp, 0, p); return (error); Index: ufs/ufs/ufs_vfsops.c =================================================================== RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vfsops.c,v retrieving revision 1.24 diff -u -r1.24 ufs_vfsops.c --- ufs/ufs/ufs_vfsops.c 2001/05/01 08:13:19 1.24 +++ ufs/ufs/ufs_vfsops.c 2001/05/06 00:47:20 @@ -108,14 +108,14 @@ int cmd, type, error; if (uid == -1) - uid = p->p_cred->p_ruid; + uid = p->p_ucred->cr_ruid; cmd = cmds >> SUBCMDSHIFT; switch (cmd) { case Q_SYNC: break; case Q_GETQUOTA: - if (uid == p->p_cred->p_ruid) + if (uid == p->p_ucred->cr_ruid) break; /* fall through */ default: To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon May 7 13: 8:14 2001 Delivered-To: freebsd-arch@freebsd.org Received: from meow.osd.bsdi.com (meow.osd.bsdi.com [204.216.28.88]) by hub.freebsd.org (Postfix) with ESMTP id 90F3B37B422; Mon, 7 May 2001 13:08:05 -0700 (PDT) (envelope-from jhb@FreeBSD.org) Received: from laptop.baldwin.cx (john@jhb-laptop.osd.bsdi.com [204.216.28.241]) by meow.osd.bsdi.com (8.11.2/8.11.2) with ESMTP id f47K7uG88251; Mon, 7 May 2001 13:07:57 -0700 (PDT) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 07 May 2001 13:03:40 -0700 (PDT) From: John Baldwin To: Robert Watson Subject: RE: Patch to eliminate struct pcred Cc: arch@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 07-May-01 Robert Watson wrote: > Index: compat/svr4/svr4_misc.c > =================================================================== > RCS file: /home/ncvs/src/sys/compat/svr4/svr4_misc.c,v > retrieving revision 1.30 > 
diff -u -r1.30 svr4_misc.c > --- compat/svr4/svr4_misc.c 2001/05/01 08:11:52 1.30 > +++ compat/svr4/svr4_misc.c 2001/05/06 00:43:54 > @@ -1294,13 +1294,8 @@ > /* > * Free up credentials. > */ > - PROC_LOCK(q); > - if (--q->p_cred->p_refcnt == 0) { > - crfree(q->p_ucred); > - uifree(q->p_cred->p_uidinfo); > - FREE(q->p_cred, M_SUBPROC); > - q->p_cred = NULL; > - } > + crfree(q->p_ucred); > + q->p_ucred = NULL; Removing the proc lock here looks suspicious, but I think it might mirror a change I just made to kern_exit.c in wait1(), in which case it is ok. > Index: kern/kern_exec.c > =================================================================== > RCS file: /home/ncvs/src/sys/kern/kern_exec.c,v > retrieving revision 1.126 > diff -u -r1.126 kern_exec.c > --- kern/kern_exec.c 2001/05/01 08:12:56 1.126 > +++ kern/kern_exec.c 2001/05/06 16:25:06 > @@ -104,8 +104,9 @@ > register struct execve_args *uap; > { > struct nameidata nd, *ndp; > + struct ucred *oldcred = p->p_ucred, *newcred; > register_t *stack_base; > - int error, len, i; > + int error, len, i, intrace; > struct image_params image_params, *imgp; > struct vattr attr; > int (*img_first) __P((struct image_params *)); > @@ -272,23 +273,31 @@ > p->p_flag &= ~P_PPWAIT; > wakeup((caddr_t)p->p_pptr); > } > + intrace = p->p_flag & P_TRACED; > + PROC_UNLOCK(p); This unlock is busted since we then try to unlock this lock again later on since you didn't remove the other unlocks. Also, this whole caching of the intrace flag is bogus too. If you read a value and release the lock, then you have now lost the ability to safely make decisions on the value you just read. You have to hold the lock over both reading the value and deciding what to do based on that value so that the entire thing is an "atomic" operation. For now, I would just revert the intrace changes to check the flag directly like the code does now and not add in this proc unlock. 
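A minimal userspace sketch of the locking point above, assuming a pthread mutex as a stand-in for the proc lock. `struct fake_proc`, `FLAG_TRACED`, and both helper functions are hypothetical names invented for illustration; none of this is kernel code. The first helper caches the flag and drops the lock before deciding, which is the pattern objected to; the second holds the lock across both the read and the decision.

```c
#include <pthread.h>

#define FLAG_TRACED 0x1		/* hypothetical stand-in for P_TRACED */

struct fake_proc {		/* hypothetical stand-in for struct proc */
	pthread_mutex_t lock;	/* plays the role of the proc lock */
	int flags;		/* protected by lock */
	int setid_honored;	/* protected by lock */
};

/*
 * Racy: the flag is read under the lock, but the decision is made after
 * the lock has been dropped, so the cached value may already be stale.
 */
void
honor_setid_racy(struct fake_proc *p)
{
	int traced;

	pthread_mutex_lock(&p->lock);
	traced = p->flags & FLAG_TRACED;
	pthread_mutex_unlock(&p->lock);
	/* Another thread may set FLAG_TRACED at this point. */
	if (!traced)
		p->setid_honored = 1;	/* also written without the lock */
}

/*
 * Safe: the read of the flag and the action based on it happen within a
 * single lock hold, so together they behave as one atomic operation.
 */
void
honor_setid_locked(struct fake_proc *p)
{
	pthread_mutex_lock(&p->lock);
	if ((p->flags & FLAG_TRACED) == 0)
		p->setid_honored = 1;
	pthread_mutex_unlock(&p->lock);
}
```

In a single-threaded run both helpers give the same result; the difference only shows up when another thread can change the flag between the unlock and the test.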
> /* > + * XXX: Note, the whole execve() is incredibly racey right now > + * with regards to debugging and privilege/credential management. > + * This MUST be fixed prior to any release. > + */ > + > + /* > * Implement image setuid/setgid. > * > * Don't honor setuid/setgid if the filesystem prohibits it or if > * the process is being traced. > */ > - if ((((attr.va_mode & VSUID) && p->p_ucred->cr_uid != attr.va_uid) || > - ((attr.va_mode & VSGID) && p->p_ucred->cr_gid != attr.va_gid)) && > - (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && > - (p->p_flag & P_TRACED) == 0) { > + newcred = NULL; > + if ((((attr.va_mode & VSUID) && oldcred->cr_uid != attr.va_uid) || > + ((attr.va_mode & VSGID) && oldcred->cr_gid != attr.va_gid)) && > + (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && intrace == 0) { > PROC_UNLOCK(p); > /* > * Turn off syscall tracing for set-id programs, except for > * root. > */ > - if (p->p_tracep && suser(p)) { > + if (p->p_tracep && suser_xxx(oldcred, NULL, PRISON_ROOT)) { > p->p_traceflag = 0; > vrele(p->p_tracep); > p->p_tracep = NULL; > @@ -296,25 +305,42 @@ > /* > * Set the new credentials. > */ > - p->p_ucred = crcopy(p->p_ucred); > + newcred = crdup(p->p_ucred); > if (attr.va_mode & VSUID) > - change_euid(p, attr.va_uid); > + change_euid(newcred, attr.va_uid); > if (attr.va_mode & VSGID) > - p->p_ucred->cr_gid = attr.va_gid; > + change_egid(newcred, attr.va_gid); > setsugid(p); > setugidsafety(p); > } else { > - if (p->p_ucred->cr_uid == p->p_cred->p_ruid && > - p->p_ucred->cr_gid == p->p_cred->p_rgid) > - p->p_flag &= ~P_SUGID; > + if (oldcred->cr_uid == oldcred->cr_ruid && > + oldcred->cr_gid == oldcred->cr_rgid) > + p->p_flag &= ~P_SUGID; /* XXX locking */ > PROC_UNLOCK(p); > } > > /* > * Implement correct POSIX saved-id behavior. > + * > + * XXX: determine whether tests and sets should occur on old or > + * new credentials. 
> */ > - p->p_cred->p_svuid = p->p_ucred->cr_uid; > - p->p_cred->p_svgid = p->p_ucred->cr_gid; > + if (p->p_ucred->cr_svuid != p->p_ucred->cr_uid || > + p->p_ucred->cr_svgid != p->p_ucred->cr_gid) { > + if (newcred != NULL) > + newcred = crdup(p->p_ucred); > + > + change_svuid(newcred, p->p_ucred->cr_uid); > + change_svgid(newcred, p->p_ucred->cr_gid); > + } > + > + if (newcred != NULL) { > + struct ucred *oldcred; > + > + oldcred = p->p_ucred; > + p->p_ucred = newcred; > + crfree(oldcred); > + } > > /* > * Store the vp for use in procfs -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon May 7 14:17:14 2001 Delivered-To: freebsd-arch@freebsd.org Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88]) by hub.freebsd.org (Postfix) with ESMTP id 41A0C37B424 for ; Mon, 7 May 2001 14:17:09 -0700 (PDT) (envelope-from riel@conectiva.com.br) Received: from surriel.ddts.net (unknown [200.181.137.248]) by netbank.com.br (Postfix) with ESMTP id 8CA724680C; Mon, 7 May 2001 18:17:59 -0300 (BRST) Received: from localhost (ekpitz@localhost [127.0.0.1]) by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f47LGvi17187; Mon, 7 May 2001 18:16:58 -0300 Date: Mon, 7 May 2001 18:16:57 -0300 (BRST) From: Rik van Riel X-Sender: riel@imladris.rielhome.conectiva To: arch@freebsd.org Cc: linux-mm@kvack.org, Matt Dillon , sfkaplan@cs.amherst.edu Subject: on load control / process swapping Message-ID: X-spambait: aardvark@kernelnewbies.org X-spammeplease: aardvark@nl.linux.org MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Hi, after staring at the code for a long long time, I finally figured out exactly why FreeBSD's load control code (the process 
swapping in vm_glue.c) can never work in many scenarios.

In short, the process suspension / wake up code only does
load control in the sense that system load is reduced, but
absolutely no effort is made to ensure that individual
programs can run without thrashing. This, of course, kind of
defeats the purpose of doing load control in the first place.

To see this situation in some more detail, let's first look
at how the current process suspension code has evolved over
time. Early paging Unixes, including earlier BSDs, had a
rate-limited clock algorithm for the pageout code, where
the VM subsystem would only scan (and page) memory out at
a rate of fastscan pages per second.

Whenever the paging system wasn't able to keep up, free
memory would get below a certain threshold and memory load
control (in the form of process suspension) kicked in. As
soon as free memory (averaged over a few seconds) got over
this threshold, processes were swapped in again. Because of
the exact "speed limit" for the paging code, this would give
a slow rotation of memory-resident processes at a paging rate
well below the thrashing threshold.

More modern Unixes, like FreeBSD, NetBSD or Linux, however,
don't have the artificial speed limit on pageout. This means
the pageout code can go on freeing memory until well beyond
the thrashing point of the system. It also means that the
amount of free memory is no longer any indication of whether
the system is thrashing or not.

Add to that the fact that the classical load control in BSD
resumes a suspended task whenever the system is above the
(now not very meaningful) free memory threshold, regardless
of whether the resident tasks have had the opportunity to
make any progress ... which of course only encourages more
thrashing instead of letting the system work itself out of
the overload situation.
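As a rough illustration of the classical scheme described above, here is a toy user-space model in C. It is only a sketch under stated assumptions: the names (`load_ctl`, `classic_tick`), the 3/4 + 1/4 smoothing of the free-page count and the numbers used are made up for the example and are not taken from any BSD kernel. What it shows is that the resume decision looks only at averaged free memory, never at whether the resident processes made progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state for a classical, free-memory-driven load controller. */
struct load_ctl {
	unsigned avg_free;	/* smoothed free-page count */
	unsigned threshold;	/* free-page threshold for suspension */
};

enum lc_action { LC_NONE, LC_SUSPEND, LC_RESUME };

/*
 * One periodic tick of the toy controller.
 * The smoothing (3/4 old + 1/4 new sample) is an assumption standing in
 * for "averaged over a few seconds".
 */
static enum lc_action
classic_tick(struct load_ctl *lc, unsigned free_now, int have_suspended)
{
	assert(lc != NULL);

	lc->avg_free = (3 * lc->avg_free + free_now) / 4;

	/* Below the threshold: suspend a process to relieve memory load. */
	if (lc->avg_free < lc->threshold)
		return LC_SUSPEND;

	/*
	 * At or above the threshold: resume a suspended process. Nothing
	 * here asks whether resident processes have made any progress.
	 */
	if (have_suspended)
		return LC_RESUME;

	return LC_NONE;
}
```

With a threshold of 64 pages and a smoothed count starting at 100, two samples of 0 free pages are enough to trigger a suspension, and a single sample of 200 afterwards triggers a resume.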
Any solution will have to address the following points:

1) allow the resident processes to stay resident long
   enough to make progress
2) make sure the resident processes aren't thrashing,
   that is, don't let new processes back in memory if
   none of the currently resident processes is "ready"
   to be suspended
3) have a mechanism to detect thrashing in a VM
   subsystem which isn't rate-limited (hard?)

and, for extra brownie points:
4) fairness, small processes can be paged in and out
   faster, so we can suspend&resume them faster; this
   has the side effect of leaving the proverbial root
   shell more usable
5) make sure already resident processes cannot create
   a situation that'll keep the swapped out tasks out
   of memory forever ... but don't kill performance either,
   since bad performance means we cannot get out of the
   bad situation we're in

Points 1), 2) and 4) are relatively easy to address by simply
keeping resident tasks unswappable for a long enough time that
they've been able to do real work in an environment where 3)
indicates we're not thrashing.

3) is the hard part. We know we're not thrashing when we don't
have ongoing page faults all the time, but (say) only 50% of the
time. However, I still have no idea how to determine when we
_are_ thrashing, since a system which always has 10 ongoing
page faults may still be functioning without thrashing...

This is the part where I cannot hand over a ready solution but
where we have to figure out a solution together.

(and it's also the reason I cannot "send a patch" ... I know the
current scheme cannot possibly work all the time, I understand
why, but I just don't have a solution to the problem ... yet)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon May 7 15:50:38 2001 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67]) by hub.freebsd.org (Postfix) with ESMTP id 1881037B422 for ; Mon, 7 May 2001 15:50:34 -0700 (PDT) (envelope-from dillon@earth.backplane.com) Received: (from dillon@localhost) by earth.backplane.com (8.11.2/8.11.2) id f47MoKe68863; Mon, 7 May 2001 15:50:20 -0700 (PDT) (envelope-from dillon) Date: Mon, 7 May 2001 15:50:20 -0700 (PDT) From: Matt Dillon Message-Id: <200105072250.f47MoKe68863@earth.backplane.com> To: Rik van Riel Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Subject: Re: on load control / process swapping References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :In short, the process suspension / wake up code only does :load control in the sense that system load is reduced, but :absolutely no effort is made to ensure that individual :programs can run without thrashing. This, of course, kind of :defeats the purpose of doing load control in the first place. : : :To see this situation in some more detail, lets first look :at how the current process suspension code has evolved over :time. Early paging Unixes, including earlier BSDs, had a :rate-limited clock algorithm for the pageout code, where :the VM subsystem would only scan (and page) memory out at :a rate of fastscan pages per second. : :Whenever the paging system wasn't able to keep up, free :memory would get below a certain threshold and memory load :control (in the form of process suspension) kicked in. As :soon as free memory (averaged over a few seconds) got over :this threshold, processes get swapped in again. 
Because of
:the exact "speed limit" for the paging code, this would give
:a slow rotation of memory-resident processes at a paging rate
:well below the thrashing threshold.
:
:More modern Unixes, like FreeBSD, NetBSD or Linux, however,
:don't have the artificial speed limit on pageout. This means
:the pageout code can go on freeing memory until well beyond
:the thrashing point of the system. It also means that the
:amount of free memory is no longer any indication of whether
:the system is thrashing or not.
:
:Add to that the fact that the classical load control in BSD
:resumes a suspended task whenever the system is above the
:(now not very meaningful) free memory threshold, regardless
:of whether the resident tasks have had the opportunity to
:make any progress ... which of course only encourages more
:thrashing instead of letting the system work itself out of
:the overload situation.
:
:
:Any solution will have to address the following points:
:
:1) allow the resident processes to stay resident long
:   enough to make progress

This is accomplished as a side effect of the way the page queues
are handled. A page placed in the active queue is not allowed
to be moved out of that queue for a minimum period of time based
on page aging. See line 500 or so of vm_pageout.c (in -stable).

Thus when a process wakes up and pages a bunch of pages in, those
pages are guaranteed to stay in-core for a period of time no matter
what level of memory stress is occurring.

:2) make sure the resident processes aren't thrashing,
:   that is, don't let new processes back in memory if
:   none of the currently resident processes is "ready"
:   to be suspended

When a process is swapped out, the process is removed from the run
queue and the P_INMEM flag is cleared. The process is only woken up
when faultin() is called (vm_glue.c line 312).
faultin() is only called from the scheduler() (line 340 of vm_glue.c)
and the scheduler only runs when the VM system indicates a minimum
number of free pages are available (vm_page_count_min()), which you
can adjust with the vm.v_free_min sysctl (usually represents 1-9
megabytes, depending on how much memory the system has).

So what occurs is that the system comes under extreme memory pressure
and starts to swap out blocked processes. This reduces memory pressure
over time. When memory pressure is sufficiently reduced the scheduler
wakes up a swapped-out process (one at a time).

There might be some fine tuning that we can do here, such as try to
choose a better process to swap out (right now it's priority based
which isn't the best way to do it).

:3) have a mechanism to detect thrashing in a VM
:   subsystem which isn't rate-limited (hard?)

In FreeBSD, rate-limiting is a function of a lightly loaded system.
We rate-limit page laundering (pageouts). However, if the rate-limited
laundering is not sufficient to reach our free + cache page targets,
we take another laundering loop and this time do not limit it at all.

Thus under heavy memory pressure, no real rate limiting occurs. The
system will happily pagein and pageout megabytes/sec. The reason we
do this is because David Greenman and John Dyson found a long time
ago that attempting to rate limit paging does not actually solve the
thrashing problem, it actually makes it worse... So they solved the
problem another way (see my answers for #1 and #2). It isn't the
paging operations themselves that cause thrashing.

:and, for extra brownie points:
:4) fairness, small processes can be paged in and out
:   faster, so we can suspend&resume them faster; this
:   has the side effect of leaving the proverbial root
:   shell more usable

Small processes can contribute to thrashing as easily as large
processes can under extreme memory pressure... for example,
take an overloaded shell machine.
*ALL* processes are 'small' processes in that case, or most of them
are, and in great numbers they can be the cause. So no test that
specifically checks the size of the process can be used to give it
any sort of priority.

Additionally, *idle* small processes are also great contributors to
the VM subsystem in regards to clearing out idle pages. For example,
on a heavily loaded shell machine more than 80% of the 'small
processes' have been idle for long periods of time and it is exactly
our ability to page them out that allows us to extend the machine's
operational life and move the thrashing threshold farther away. The
last thing we want to do is make a 'fix' that prevents us from paging
out idle small processes. It would kill the machine.

:5) make sure already resident processes cannot create
:   a situation that'll keep the swapped out tasks out
:   of memory forever ... but don't kill performance either,
:   since bad performance means we cannot get out of the
:   bad situation we're in

When the system starts swapping processes out, it continues to swap
them out until memory pressure goes down. With memory pressure down
processes are swapped back in again one at a time, typically in FIFO
order. So this situation will generally not occur.

Basically we have all the algorithms in place to deal with thrashing.
I'm sure that there are a few places where we can optimize things...
for example, we can certainly tune the swapout algorithm itself.
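A minimal sketch of the swap-in gating described for points 2) and 5): swapped-out processes wait in FIFO order, and at most one is brought back per scheduler pass, and only while free memory is at or above a minimum. This is a user-space toy, not the code in vm_glue.c; `swap_fifo`, `fifo_push`, `swapin_pass` and the page counts are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SWAPPED 8	/* arbitrary capacity for the toy queue */

/* Hypothetical FIFO of swapped-out process ids (a ring buffer). */
struct swap_fifo {
	int pids[MAX_SWAPPED];
	size_t head;	/* index of the oldest entry */
	size_t count;	/* number of queued entries */
};

/* Queue a process that has just been swapped out; -1 if the queue is full. */
static int
fifo_push(struct swap_fifo *q, int pid)
{
	assert(q != NULL);

	if (q->count == MAX_SWAPPED)
		return -1;
	q->pids[(q->head + q->count) % MAX_SWAPPED] = pid;
	q->count++;
	return 0;
}

/*
 * One scheduler pass: swap in at most one process, oldest first, and only
 * if free memory has recovered to at least free_min pages.
 * Returns the pid swapped in, or -1 if nothing was swapped in.
 */
static int
swapin_pass(struct swap_fifo *q, unsigned free_pages, unsigned free_min)
{
	int pid;

	assert(q != NULL);

	if (q->count == 0 || free_pages < free_min)
		return -1;	/* still under pressure, or nothing to do */

	pid = q->pids[q->head];
	q->head = (q->head + 1) % MAX_SWAPPED;
	q->count--;
	return pid;
}
```

Calling `swapin_pass()` repeatedly while free memory stays below `free_min` leaves the queue untouched; once memory recovers, each call releases exactly one process in the order it was queued.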
-Matt :regards, : :Rik :-- :Virtual memory is like a game you can't win; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon May 7 16:35:34 2001 Delivered-To: freebsd-arch@freebsd.org Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156]) by hub.freebsd.org (Postfix) with ESMTP id 54FE537B422 for ; Mon, 7 May 2001 16:35:27 -0700 (PDT) (envelope-from riel@conectiva.com.br) Received: from burns.conectiva (burns.conectiva [10.0.0.4]) by perninha.conectiva.com.br (Postfix) with SMTP id D180516B1C for ; Mon, 7 May 2001 20:35:25 -0300 (EST) Received: (qmail 13083 invoked by uid 0); 7 May 2001 23:33:57 -0000 Received: from duckman.distro.conectiva (HELO duckman.conectiva.com.br) (root@10.0.17.2) by burns.conectiva with SMTP; 7 May 2001 23:33:57 -0000 Received: from localhost (riel@localhost) by duckman.conectiva.com.br (8.11.3/8.11.3) with ESMTP id f47NZPF02739; Mon, 7 May 2001 20:35:25 -0300 X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs Date: Mon, 7 May 2001 20:35:25 -0300 (BRST) From: Rik van Riel X-X-Sender: To: Matt Dillon Cc: , , Subject: Re: on load control / process swapping In-Reply-To: <200105072250.f47MoKe68863@earth.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Mon, 7 May 2001, Matt Dillon wrote: > :1) allow the resident processes to stay resident long > : enough to make progess > > This is accomplished as a side effect to the way the page queues > are handled. A page placed in the active queue is not allowed > to be moved out of that queue for a minimum period of time based > on page aging. See line 500 or so of vm_pageout.c (in -stable) . 
> > Thus when a process wakes up and pages a bunch of pages in, those > pages are guarenteed to stay in-core for a period of time no matter > what level of memory stress is occuring. I don't see anything limiting the speed at which the active list is scanned over and over again. OTOH, you are right that a failure to deactivate enough pages will trigger the swapout code ..... This sure is a subtle interaction ;) > :2) make sure the resident processes aren't thrashing, > : that is, don't let new processes back in memory if > : none of the currently resident processes is "ready" > : to be suspended > > When a process is swapped out, the process is removed from the run > queue and the P_INMEM flag is cleared. The process is only woken up > when faultin() is called (vm_glue.c line 312). faultin() is only > called from the scheduler() (line 340 of vm_glue.c) and the scheduler > only runs when the VM system indicates a minimum number of free pages > are available (vm_page_count_min()), which you can adjust with > the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings > on how much memory the system has). But ... is this a good enough indication that the processes currently resident have enough memory available to make any progress ? Especially if all the currently resident processes are waiting in page faults, won't that make it easier for the system to find pages to swap out, etc... ? One thing I _am_ wondering though: the pageout and the pagein thresholds are different. Can't this lead to problems where we always hit both the pageout threshold -and- the pagein threshold and the system thrashes swapping processes in and out ? > :3) have a mechanism to detect thrashing in a VM > : subsystem which isn't rate-limited (hard?) > > In FreeBSD, rate-limiting is a function of a lightly loaded system. > We rate-limit page laundering (pageouts). 
However, if the rate-limited > laundering is not sufficient to reach our free + cache page targets, > we take another laundering loop and this time do not limit it at all. > > Thus under heavy memory pressure, no real rate limiting occurs. The > system will happily pagein and pageout megabytes/sec. The reason we > do this is because David Greenman and John Dyson found a long time > ago that attempting to rate limit paging does not actually solve the > thrashing problem, it actually makes it worse... So they solved the > problem another way (see my answers for #1 and #2). It isn't the > paging operations themselves that cause thrashing. Agreed on all points ... I'm just wondering how well 1) and 2) still work after all the changes that were made to the VM in the last few years. They sure are subtle ... > :and, for extra brownie points: > :4) fairness, small processes can be paged in and out > : faster, so we can suspend&resume them faster; this > : has the side effect of leaving the proverbial root > : shell more usable > > Small process can contribute to thrashing as easily as large > processes can under extreme memory pressure... for example, > take an overloaded shell machine. *ALL* processes are 'small' > processes in that case, or most of them are, and in great numbers > they can be the cause. So no test that specifically checks the > size of the process can be used to give it any sort of priority. There's a test related to 2) though ... A small process needs to be in memory less time than a big process in order to make progress, so it can be swapped out earlier. It can also be swapped back in earlier, giving small processes shorter "time slices" for swapping than what large processes have. I'm not quite sure how much this would matter, though... > :5) make sure already resident processes cannot create > : a situation that'll keep the swapped out tasks out > : of memory forever ... 
but don't kill performance either, > : since bad performance means we cannot get out of the > : bad situation we're in > > When the system starts swapping processes out, it continues to swap > them out until memory pressure goes down. With memory pressure down > processes are swapped back in again one at a time, typically in FIFO > order. So this situation will generally not occur. > > Basically we have all the algorithms in place to deal with thrashing. > I'm sure that there are a few places where we can optimize things... > for example, we can certainly tune the swapout algorithm itself. Interesting, FreeBSD indeed _does_ seem to have all of the things in place (though the interactions between the various parts seem to be carefully hidden ;)). They indeed should work for lots of scenarios, but things like the subtlety of some of the code and the fact that the swapin and swapout thresholds are fairly unrelated look a bit worrying... regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... 
http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Mon May 7 17:56:21 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67]) by hub.freebsd.org (Postfix) with ESMTP id 285E937B423 for ; Mon, 7 May 2001 17:56:16 -0700 (PDT) (envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost) by earth.backplane.com (8.11.2/8.11.2) id f480u1Q71866; Mon, 7 May 2001 17:56:01 -0700 (PDT) (envelope-from dillon)
Date: Mon, 7 May 2001 17:56:01 -0700 (PDT)
From: Matt Dillon
Message-Id: <200105080056.f480u1Q71866@earth.backplane.com>
To: Rik van Riel
Cc: , ,
Subject: Re: on load control / process swapping
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:> to be moved out of that queue for a minimum period of time based
:> on page aging. See line 500 or so of vm_pageout.c (in -stable).
:>
:> Thus when a process wakes up and pages a bunch of pages in, those
:> pages are guaranteed to stay in-core for a period of time no matter
:> what level of memory stress is occurring.
:
:I don't see anything limiting the speed at which the active list
:is scanned over and over again. OTOH, you are right that a failure
:to deactivate enough pages will trigger the swapout code .....
:
:This sure is a subtle interaction ;)

Look at the loop at line 1362 of vm_pageout.c. Note that it enforces
a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable to
clean sufficient pages in two loops. The tsleep is not woken up by
anyone while waiting that 1/2 second because vm_pages_needed has not
been cleared yet. This is what is limiting the page queue scan.

:> When a process is swapped out, the process is removed from the run
:> queue and the P_INMEM flag is cleared.
The process is only woken up
:> when faultin() is called (vm_glue.c line 312). faultin() is only
:> called from the scheduler() (line 340 of vm_glue.c) and the scheduler
:> only runs when the VM system indicates a minimum number of free pages
:> are available (vm_page_count_min()), which you can adjust with
:> the vm.v_free_min sysctl (usually represents 1-9 megabytes, depending
:> on how much memory the system has).
:
:But ... is this a good enough indication that the processes
:currently resident have enough memory available to make any
:progress ?

Yes. Consider detecting the difference between a large process
accessing its pages randomly, and a small process accessing a
relatively small set of pages over and over again. Now consider what
happens when the system gets overloaded. The small process will be
able to access its pages enough that they will get page priority over
the larger process. The larger process, due to the more random
accesses (or simply the fact that it is accessing a larger set of
pages) will tend to stall more on pagein I/O, which has the side
effect of reducing the large process's access rate on all of its
pages. The result: small processes get more priority just by being
small.

:Especially if all the currently resident processes are waiting
:in page faults, won't that make it easier for the system to find
:pages to swap out, etc... ?
:
:One thing I _am_ wondering though: the pageout and the pagein
:thresholds are different. Can't this lead to problems where we
:always hit both the pageout threshold -and- the pagein threshold
:and the system thrashes swapping processes in and out ?

The system will not page out a page it has just paged in due to the
center-of-the-road initialization of act_count (the page aging). My
experience at BEST was that both pagein and pageout activity occurred
simultaneously, but the fact had no detrimental effect on the system.
You have to treat the pagein and pageout operations independently
because, in fact, they are only weakly related to each other. The
only optimization you make, to reduce thrashing, is to not allow a
just-paged-in page to immediately turn around and be paged out.

I could probably make this work even better by setting the
vm_page_t's act_count to its max value when paging in from swap.
I'll think about doing that.

The pagein and pageout rates have nothing to do with thrashing, per
se, and should never be arbitrarily limited. Consider the difference
between a system that is paging heavily and a system with only two
small processes (like cp's) competing for disk I/O. Insofar as I/O
goes, there is no difference. You can have a perfectly running system
with high pagein and pageout rates. It's only when the paging I/O
starts to eat into pages that are in active use that thrashing begins
to occur. Think of a hotdog being eaten from both ends by two lovers.
Memory pressure (active VM pages) eats away at one end, pageout I/O
eats away at the other. You don't get fireworks until they meet.

:> ago that attempting to rate limit paging does not actually solve the
:> thrashing problem, it actually makes it worse... So they solved the
:> problem another way (see my answers for #1 and #2). It isn't the
:> paging operations themselves that cause thrashing.
:
:Agreed on all points ... I'm just wondering how well 1) and 2)
:still work after all the changes that were made to the VM in
:the last few years. They sure are subtle ...

The algorithms mostly stayed the same. Much of the work was to remove
artificial limitations that were reducing performance (due to the
existence of greater amounts of memory, faster disks, and so
forth...). I also spent a good deal of time removing 'restart' cases
from the code that were causing a lot of cpu-wastage in certain
cases. What few restart cases remain just don't occur all that often.
And I've done other things like extend the heuristics we already use
for read()/write() to the VM system and change heuristic variables
into per-vm-map elements rather than sharing them with read/write
within the vnode. Etc.

:> Small processes can contribute to thrashing as easily as large
:> processes can under extreme memory pressure... for example,
:> take an overloaded shell machine. *ALL* processes are 'small'
:> processes in that case, or most of them are, and in great numbers
:> they can be the cause. So no test that specifically checks the
:> size of the process can be used to give it any sort of priority.
:
:There's a test related to 2) though ... A small process needs
:to be in memory less time than a big process in order to make
:progress, so it can be swapped out earlier.

Not necessarily. It depends whether the small process is cpu-bound
or interactive. A cpu-bound small process should be allowed to run
and not be swapped out. An interactive small process can be safely
swapped if idle for a period of time, because it can be swapped back
in very quickly. It should not be swapped if it isn't idle (someone
is typing, for example), because that would just waste disk I/O
paging out and then paging right back in. You never want to swap out
a small process gratuitously simply because it is small.

:It can also be swapped back in earlier, giving small processes
:shorter "time slices" for swapping than what large processes
:have. I'm not quite sure how much this would matter, though...

Both swapin and swapout activities are demand paged, but will be
clustered if possible. I don't think there would be any point trying
to conditionalize the algorithm based on the size of the process.
The size has its own indirect positive effects which I think are
sufficient.

:Interesting, FreeBSD indeed _does_ seem to have all of the things in
:place (though the interactions between the various parts seem to be
:carefully hidden ;)).
: :They indeed should work for lots of scenarios, but things like the :subtlety of some of the code and the fact that the swapin and :swapout thresholds are fairly unrelated look a bit worrying... : :regards, : :Rik I don't think it's possible to write a nice neat thrash-handling algorithm. It's a bunch of algorithms all working together, all closely tied to the VM page cache. Each taken alone is fairly easy to describe and understand. All of them together result in complex interactions that are very easy to break if you make a mistake. It usually takes me a couple of tries to get a solution to a problem in place without breaking something else (performance-wise) in the process. For example, I fubar'd heavy load performance for a month in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon May 7 21:44:21 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id E94C437B423 for ; Mon, 7 May 2001 21:44:01 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.3/8.11.3) with SMTP id f484hwf62354 for ; Tue, 8 May 2001 00:43:58 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Tue, 8 May 2001 00:43:58 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: arch@FreeBSD.org Subject: securelevel -> securelevel_check() Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG One of the features requested for jailNG a number of times, most recently by Matt Dillon, has been to introduce support for per-jail securelevels. 
This would permit jail securelevels to float above the system
securelevel, and allow the jail securelevel to be lowered from outside
the jail. This would offer a number of benefits, largely in the form
of permitting more sane use of file system flags within the jail.

To do this, it is necessary to modify securelevel checks to attempt to
go to a process-local (well, credential-local) securelevel. The first
step in this process is to abstract out securelevel checks to a central
securelevel_check(cred, maxlevel) call. The attached patch does this
for most of the kernel, excluding ipfilter since that's contributed
code. In some cases, converting from global securelevel to credential
securelevel introduces ambiguities: should the process credential be
used, or the file descriptor credential, for example. These concerns
existed in a number of cases already. I may not have them all right,
but would welcome comments.

After this is in place, I will produce an updated jailNG patch that
incorporates a new managed per-jail securelevel variable. When a
securelevel check is performed, the global value is used if the process
is not in jail. If in jail, the greater of local and global
securelevels will be used. Securelevel modification using the normal
kern.securelevel mib will now point to global securelevel outside of
jail, and local securelevel within. kern.securelevel will only allow
the securelevel to be raised, never lowered. The
jail.instance.*.securelevel variable will allow the securelevel to be
lowered from outside the jail; however, due to the check semantics, in
effect per-jail securelevels will be at least the global level,
preventing jails from being used to circumvent the global securelevel.
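To make the intended check semantics concrete, here is a small user-space sketch. It is an assumption-laden illustration, not the attached patch: the struct layouts and the names `jail_sketch`, `ucred_sketch`, `effective_securelevel` and `securelevel_check_sketch` are hypothetical, and the "EPERM when the level is above maxlevel" behaviour is inferred from how the patch replaces `securelevel > N` tests.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; not the real definitions. */
struct jail_sketch {
	int js_securelevel;		/* per-jail securelevel */
};

struct ucred_sketch {
	struct jail_sketch *cr_jail;	/* NULL when not in a jail */
};

/* Stand-in for the global securelevel. */
static int global_securelevel = -1;

/*
 * Outside a jail the global level applies. Inside a jail the greater of
 * the local and global levels applies, so a jail can never end up below
 * the global securelevel.
 */
static int
effective_securelevel(const struct ucred_sketch *cred)
{
	int level = global_securelevel;

	assert(cred != NULL);
	if (cred->cr_jail != NULL && cred->cr_jail->js_securelevel > level)
		level = cred->cr_jail->js_securelevel;
	return level;
}

/* Returns 0 if the effective level is at most maxlevel, EPERM otherwise. */
static int
securelevel_check_sketch(const struct ucred_sketch *cred, int maxlevel)
{
	return (effective_securelevel(cred) > maxlevel ? EPERM : 0);
}
```

Under these assumptions, a caller that used to write `if (securelevel > 0) return (EPERM);` would instead propagate the return value of the check, which is the pattern the diff below follows.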
As I've indicated in the past, I'm not a great fan of securelevels, but
this seemed like a reasonable feature request to me, and it has
substantial utility, especially where the administrator may want to make
use of schg and related flags within the jail, but be able to disassemble
the jail (or modify it) without rebooting to lower the global securelevel.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services

? compile/GENERIC
Index: alpha/alpha/mem.c
===================================================================
RCS file: /home/ncvs/src/sys/alpha/alpha/mem.c,v
retrieving revision 1.34
diff -u -r1.34 mem.c
--- alpha/alpha/mem.c 2001/03/26 12:39:47 1.34
+++ alpha/alpha/mem.c 2001/05/08 04:31:08
@@ -114,12 +114,16 @@
 static int
 mmopen(dev_t dev, int flags, int fmt, struct proc *p)
 {
+	int error;

 	switch (minor(dev)) {
 	case 0:
 	case 1:
-		if ((flags & FWRITE) && securelevel > 0)
-			return (EPERM);
+		if (flags & FWRITE) {
+			error = securelevel_check(p->p_ucred, 0);
+			if (error)
+				return (error);
+		}
 		break;
 	case 32:
 #ifdef PERFMON
Index: alpha/alpha/sys_machdep.c
===================================================================
RCS file: /home/ncvs/src/sys/alpha/alpha/sys_machdep.c,v
retrieving revision 1.10
diff -u -r1.10 sys_machdep.c
--- alpha/alpha/sys_machdep.c 2001/05/01 08:11:48 1.10
+++ alpha/alpha/sys_machdep.c 2001/05/08 04:31:08
@@ -114,8 +114,9 @@
 	if (error)
 		return (error);

-	if (securelevel > 0)
-		return (EPERM);
+	error = securelevel_check(p->p_ucred, 0);
+	if (error)
+		return (error);

 	error = suser(p);
 	if (error)
Index: cam/scsi/scsi_pass.c
===================================================================
RCS file: /home/ncvs/src/sys/cam/scsi/scsi_pass.c,v
retrieving revision 1.28
diff -u -r1.28 scsi_pass.c
--- cam/scsi/scsi_pass.c 2001/03/27 05:45:11 1.28
+++ cam/scsi/scsi_pass.c 2001/05/08 04:31:13
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -368,9 +369,10
@@
 	/*
 	 * Don't allow access when we're running at a high securelvel.
 	 */
-	if (securelevel > 1) {
+	error = securelevel_check(p->p_ucred, 1);
+	if (error) {
 		splx(s);
-		return(EPERM);
+		return (error);
 	}

 	/*
Index: dev/pci/pci_user.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/pci/pci_user.c,v
retrieving revision 1.2
diff -u -r1.2 pci_user.c
--- dev/pci/pci_user.c 2001/03/26 12:40:30 1.2
+++ dev/pci/pci_user.c 2001/05/08 04:31:22
@@ -39,6 +39,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -87,8 +88,12 @@
 static int
 pci_open(dev_t dev, int oflags, int devtype, struct proc *p)
 {
-	if ((oflags & FWRITE) && securelevel > 0) {
-		return EPERM;
+	int error;
+
+	if (oflags & FWRITE) {
+		error = securelevel_check(p->p_ucred, 0);
+		if (error)
+			return (error);
 	}
 	return 0;
 }
Index: dev/random/randomdev.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/random/randomdev.c,v
retrieving revision 1.28
diff -u -r1.28 randomdev.c
--- dev/random/randomdev.c 2001/05/01 08:12:03 1.28
+++ dev/random/randomdev.c 2001/05/08 04:31:23
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -140,17 +141,29 @@
 static int
 random_open(dev_t dev, int flags, int fmt, struct proc *p)
 {
-	if ((flags & FWRITE) && (securelevel > 0 || suser(p)))
-		return EPERM;
-	else
+	int error;
+
+	if (flags & FWRITE) {
+		error = securelevel_check(p->p_ucred, 0);
+		if (error)
+			return error;
+
+		error = suser(p);
+		return error;
+	} else
 		return 0;
 }

 static int
 random_close(dev_t dev, int flags, int fmt, struct proc *p)
 {
-	if ((flags & FWRITE) && !(securelevel > 0 || suser(p)))
-		random_reseed();
+	int error;
+
+	if (flags & FWRITE) {
+		if (!(securelevel_check(p->p_ucred, 0) ||
+		    suser(p)))
+			random_reseed();
+	}
 	return 0;
 }
Index: dev/syscons/syscons.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/syscons/syscons.c,v
retrieving revision 1.357 diff -u -r1.357 syscons.c --- dev/syscons/syscons.c 2001/05/01 08:12:05 1.357 +++ dev/syscons/syscons.c 2001/05/08 04:31:26 @@ -995,8 +995,9 @@ error = suser(p); if (error != 0) return error; - if (securelevel > 0) - return EPERM; + error = securelevel_check(p->p_ucred, 0); + if (error != 0) + return error; #ifdef __i386__ p->p_md.md_regs->tf_eflags |= PSL_IOPL; #endif Index: i386/i386/mem.c =================================================================== RCS file: /home/ncvs/src/sys/i386/i386/mem.c,v retrieving revision 1.88 diff -u -r1.88 mem.c --- i386/i386/mem.c 2001/03/26 12:40:48 1.88 +++ i386/i386/mem.c 2001/05/08 04:31:27 @@ -113,15 +113,19 @@ switch (minor(dev)) { case 0: case 1: - if ((flags & FWRITE) && securelevel > 0) - return (EPERM); + if (flags & FWRITE) { + error = securelevel_check(p->p_ucred, 0); + if (error) + return (error); + } break; case 14: error = suser(p); if (error != 0) return (error); - if (securelevel > 0) - return (EPERM); + error = securelevel_check(p->p_ucred, 0); + if (error) + return (error); p->p_md.md_regs->tf_eflags |= PSL_IOPL; break; } Index: i386/i386/sys_machdep.c =================================================================== RCS file: /home/ncvs/src/sys/i386/i386/sys_machdep.c,v retrieving revision 1.55 diff -u -r1.55 sys_machdep.c --- i386/i386/sys_machdep.c 2001/05/01 08:12:47 1.55 +++ i386/i386/sys_machdep.c 2001/05/08 04:31:27 @@ -179,8 +179,9 @@ if ((error = suser(p)) != 0) return (error); - if (securelevel > 0) - return (EPERM); + error = securelevel_check(p->p_ucred, 0); + if (error) + return (error); /* * XXX * While this is restricted to root, we should probably figure out Index: i386/isa/spigot.c =================================================================== RCS file: /home/ncvs/src/sys/i386/isa/spigot.c,v retrieving revision 1.48 diff -u -r1.48 spigot.c --- i386/isa/spigot.c 2001/05/01 08:12:51 1.48 +++ i386/isa/spigot.c 2001/05/08 04:31:27 @@ -182,8 +182,9 @@ error = 
suser(p); if (error != 0) return error; - if (securelevel > 0) - return EPERM; + error = securelevel_check(p->p_ucred, 0); + if (error) + return error; #endif ss->flags |= OPEN; @@ -238,8 +239,9 @@ error = suser(p); if (error != 0) return error; - if (securelevel > 0) - return EPERM; + error = securelevel_check(p->p_ucred, 0); + if (error != 0) + return error; #endif p->p_md.md_regs->tf_eflags |= PSL_IOPL; break; Index: i386/linux/linux_machdep.c =================================================================== RCS file: /home/ncvs/src/sys/i386/linux/linux_machdep.c,v retrieving revision 1.16 diff -u -r1.16 linux_machdep.c --- i386/linux/linux_machdep.c 2001/05/01 08:12:52 1.16 +++ i386/linux/linux_machdep.c 2001/05/08 04:31:28 @@ -472,8 +472,8 @@ return (EINVAL); if ((error = suser(p)) != 0) return (error); - if (securelevel > 0) - return (EPERM); + if ((error = securelevel_check(p->p_ucred, 0)) != 0) + return (error); p->p_md.md_regs->tf_eflags = (p->p_md.md_regs->tf_eflags & ~PSL_IOPL) | (args->level * (PSL_IOPL / 3)); return (0); Index: ia64/ia64/mem.c =================================================================== RCS file: /home/ncvs/src/sys/ia64/ia64/mem.c,v retrieving revision 1.3 diff -u -r1.3 mem.c --- ia64/ia64/mem.c 2001/03/26 12:40:56 1.3 +++ ia64/ia64/mem.c 2001/05/08 04:31:32 @@ -113,12 +113,16 @@ static int mmopen(dev_t dev, int flags, int fmt, struct proc *p) { + int error; switch (minor(dev)) { case 0: case 1: - if ((flags & FWRITE) && securelevel > 0) - return (EPERM); + if (flags & FWRITE) { + error = securelevel_check(p->p_ucred, 0); + if (error) + return (error); + } break; case 32: #ifdef PERFMON Index: kern/kern_linker.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_linker.c,v retrieving revision 1.59 diff -u -r1.59 kern_linker.c --- kern/kern_linker.c 2001/03/22 08:58:45 1.59 +++ kern/kern_linker.c 2001/05/08 04:31:33 @@ -292,8 +292,9 @@ int foundfile, error = 0; /* Refuse to load
modules if securelevel raised */ - if (securelevel > 0) - return EPERM; + error = securelevel_check(curproc->p_ucred, 0); + if (error) + return error; lf = linker_find_file_by_name(filename); if (lf) { @@ -420,8 +421,9 @@ int i; /* Refuse to unload modules if securelevel raised */ - if (securelevel > 0) - return EPERM; + error = securelevel_check(curproc->p_ucred, 0); + if (error) + return error; KLD_DPF(FILE, ("linker_file_unload: lf->refs=%d\n", file->refs)); lockmgr(&lock, LK_EXCLUSIVE, 0, curproc); @@ -673,8 +675,9 @@ p->p_retval[0] = -1; - if (securelevel > 0) /* redundant, but that's OK */ - return EPERM; + error = securelevel_check(p->p_ucred, 0); /* redundant, but that's OK */ + if (error) + return error; if ((error = suser(p)) != 0) return error; @@ -716,8 +719,9 @@ linker_file_t lf; int error = 0; - if (securelevel > 0) /* redundant, but that's OK */ - return EPERM; + error = securelevel_check(p->p_ucred, 0); /* redundant, but that's OK */ + if (error) + return error; if ((error = suser(p)) != 0) return error; Index: kern/kern_prot.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_prot.c,v retrieving revision 1.89 diff -u -r1.89 kern_prot.c --- kern/kern_prot.c 2001/05/01 08:12:57 1.89 +++ kern/kern_prot.c 2001/05/08 04:31:34 @@ -984,6 +984,22 @@ return (0); } +/* + * Given a securelevel requirement, test whether securelevel state + * meets the requirement. + */ +int +securelevel_check(cred, maxlevel) + struct ucred *cred; + int maxlevel; +{ + + /* XXX: In the future, this will be protected by a mutex. 
*/ + if (securelevel > maxlevel) + return (EPERM); + return (0); +} + static int suser_permitted = 1; SYSCTL_INT(_kern, OID_AUTO, suser_permitted, CTLFLAG_RW, &suser_permitted, 0, @@ -1189,8 +1205,11 @@ } /* can't trace init when securelevel > 0 */ - if (securelevel > 0 && p2->p_pid == 1) - return (EPERM); + if (p2->p_pid == 1) { + error = securelevel_check(p1->p_ucred, 0); + if (error) + return (error); + } return (0); } Index: kern/kern_sysctl.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_sysctl.c,v retrieving revision 1.106 diff -u -r1.106 kern_sysctl.c --- kern/kern_sysctl.c 2001/03/08 01:20:43 1.106 +++ kern/kern_sysctl.c 2001/05/08 04:31:34 @@ -1013,9 +1013,15 @@ } /* If writing isn't allowed */ - if (req->newptr && (!(oid->oid_kind & CTLFLAG_WR) || - ((oid->oid_kind & CTLFLAG_SECURE) && securelevel > 0))) - return (EPERM); + if (req->newptr) { + if (!(oid->oid_kind & CTLFLAG_WR)) + return (EPERM); + if (oid->oid_kind & CTLFLAG_SECURE) { + error = securelevel_check(req->p->p_ucred, 0); + if (error) + return (error); + } + } /* Most likely only root can write */ if (!(oid->oid_kind & CTLFLAG_ANYBODY) && Index: kern/kern_time.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_time.c,v retrieving revision 1.73 diff -u -r1.73 kern_time.c --- kern/kern_time.c 2001/05/01 08:12:57 1.73 +++ kern/kern_time.c 2001/05/08 04:31:35 @@ -103,7 +103,7 @@ * than one second, nor more than once per second. This allows * a miscreant to make the clock march double-time, but no worse. */ - if (securelevel > 1) { + if (securelevel_check(curproc->p_ucred, 1)) { if (delta.tv_sec < 0 || delta.tv_usec < 0) { /* * Update maxtime to latest time we've seen. 
Index: miscfs/procfs/procfs_subr.c =================================================================== RCS file: /home/ncvs/src/sys/miscfs/procfs/procfs_subr.c,v retrieving revision 1.33 diff -u -r1.33 procfs_subr.c --- miscfs/procfs/procfs_subr.c 2001/05/01 08:13:09 1.33 +++ miscfs/procfs/procfs_subr.c 2001/05/08 04:31:35 @@ -250,14 +250,17 @@ struct proc *curp = uio->uio_procp; struct pfsnode *pfs = VTOPFS(vp); struct proc *p; - int rtval; + int rtval, error; p = PFIND(pfs->pfs_pid); if (p == NULL) return (EINVAL); PROC_UNLOCK(p); - if (p->p_pid == 1 && securelevel > 0 && uio->uio_rw == UIO_WRITE) - return (EACCES); + if (p->p_pid == 1 && uio->uio_rw == UIO_WRITE) { + error = securelevel_check(curp->p_ucred, 0); + if (error) + return (EACCES); + } mp_fixme("pfs_lockowner needs a lock"); while (pfs->pfs_lockowner) { Index: miscfs/specfs/spec_vnops.c =================================================================== RCS file: /home/ncvs/src/sys/miscfs/specfs/spec_vnops.c,v retrieving revision 1.157 diff -u -r1.157 spec_vnops.c --- miscfs/specfs/spec_vnops.c 2001/04/30 14:35:35 1.157 +++ miscfs/specfs/spec_vnops.c 2001/05/08 04:31:36 @@ -176,13 +176,16 @@ * When running in secure mode, do not allow opens * for writing if the device is mounted */ - if (securelevel >= 1 && vfs_mountedon(vp)) - return (EPERM); + error = securelevel_check(ap->a_cred, 0); + if (error && vfs_mountedon(vp)) + return (error); /* * When running in very secure mode, do not allow * opens for writing of any devices. 
*/ + if ((error = securelevel_check(ap->a_cred, 1)) != 0) + return (error); if (securelevel >= 2) return (EPERM); } Index: netinet/ip_dummynet.c =================================================================== RCS file: /home/ncvs/src/sys/netinet/ip_dummynet.c,v retrieving revision 1.39 diff -u -r1.39 ip_dummynet.c --- netinet/ip_dummynet.c 2001/02/10 00:10:18 1.39 +++ netinet/ip_dummynet.c 2001/05/08 04:31:53 @@ -1817,8 +1817,11 @@ struct dn_pipe *p, tmp_pipe; /* Disallow sets in really-really secure mode. */ - if (sopt->sopt_dir == SOPT_SET && securelevel >= 3) - return (EPERM); + if (sopt->sopt_dir == SOPT_SET) { + error = securelevel_check(curproc->p_ucred, 2); + if (error) + return (error); + } switch (sopt->sopt_name) { default : Index: netinet/ip_fw.c =================================================================== RCS file: /home/ncvs/src/sys/netinet/ip_fw.c,v retrieving revision 1.164 diff -u -r1.164 ip_fw.c --- netinet/ip_fw.c 2001/04/06 06:52:25 1.164 +++ netinet/ip_fw.c 2001/05/08 04:31:55 @@ -43,6 +43,7 @@ #include #include #include +#include #include #include #include @@ -1841,9 +1842,12 @@ * Disallow modifications in really-really secure mode, but still allow * the logging counters to be reset.
*/ - if (securelevel >= 3 && (sopt->sopt_name == IP_FW_ADD || - (sopt->sopt_dir == SOPT_SET && sopt->sopt_name != IP_FW_RESETLOG))) - return (EPERM); + if (sopt->sopt_name == IP_FW_ADD || (sopt->sopt_dir == SOPT_SET && + sopt->sopt_name != IP_FW_RESETLOG)) { + error = securelevel_check(curproc->p_ucred, 2); + if (error) + return (error); + } error = 0; switch (sopt->sopt_name) { Index: pc98/pc98/syscons.c =================================================================== RCS file: /home/ncvs/src/sys/pc98/pc98/syscons.c,v retrieving revision 1.159 diff -u -r1.159 syscons.c --- pc98/pc98/syscons.c 2001/05/01 08:13:15 1.159 +++ pc98/pc98/syscons.c 2001/05/08 04:31:58 @@ -997,8 +997,9 @@ error = suser(p); if (error != 0) return error; - if (securelevel > 0) - return EPERM; + error = securelevel_check(p->p_ucred, 0); + if (error != 0) + return error; #ifdef __i386__ p->p_md.md_regs->tf_eflags |= PSL_IOPL; #endif Index: sys/systm.h =================================================================== RCS file: /home/ncvs/src/sys/sys/systm.h,v retrieving revision 1.139 diff -u -r1.139 systm.h --- sys/systm.h 2001/04/27 19:28:25 1.139 +++ sys/systm.h 2001/05/08 04:31:59 @@ -164,6 +164,7 @@ /* flags for suser_xxx() */ #define PRISON_ROOT 1 +int securelevel_check __P((struct ucred *cred, int maxlevel)); int suser __P((struct proc *)); int suser_xxx __P((struct ucred *cred, struct proc *proc, int flag)); int u_cansee __P((struct ucred *u1, struct ucred *u2)); Index: ufs/ufs/ufs_vnops.c =================================================================== RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v retrieving revision 1.166 diff -u -r1.166 ufs_vnops.c --- ufs/ufs/ufs_vnops.c 2001/05/01 09:12:39 1.166 +++ ufs/ufs/ufs_vnops.c 2001/05/08 04:32:03 @@ -482,7 +482,7 @@ if (!suser_xxx(cred, NULL, 0)) { if ((ip->i_flags & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) && - securelevel > 0) + securelevel_check(p->p_ucred, 0)) return (EPERM); /* Snapshot flag cannot be set or cleared */ if
(((vap->va_flags & SF_SNAPSHOT) != 0 && Index: vm/vm_mmap.c =================================================================== RCS file: /home/ncvs/src/sys/vm/vm_mmap.c,v retrieving revision 1.118 diff -u -r1.118 vm_mmap.c --- vm/vm_mmap.c 2001/05/01 08:13:21 1.118 +++ vm/vm_mmap.c 2001/05/08 04:32:03 @@ -333,7 +333,8 @@ * other securelevel. * XXX this will have to go */ - if (securelevel >= 1) + error = securelevel_check(p->p_ucred, 0); + if (error) disablexworkaround = 1; else disablexworkaround = suser(p); To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 8:35:23 2001 Delivered-To: freebsd-arch@freebsd.org Received: from blount.mail.mindspring.net (blount.mail.mindspring.net [207.69.200.226]) by hub.freebsd.org (Postfix) with ESMTP id 954D637B423 for ; Tue, 8 May 2001 08:35:19 -0700 (PDT) (envelope-from tlambert2@mindspring.com) Received: from mindspring.com (pool0302.cvx21-bradley.dialup.earthlink.net [209.179.193.47]) by blount.mail.mindspring.net (8.9.3/8.8.5) with ESMTP id LAA04345; Tue, 8 May 2001 11:35:06 -0400 (EDT) Message-ID: <3AF8123F.632C02E6@mindspring.com> Date: Tue, 08 May 2001 08:35:27 -0700 From: Terry Lambert Reply-To: tlambert2@mindspring.com X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Matt Dillon Cc: Bosko Milekic , freebsd-arch@FreeBSD.ORG Subject: Re: Mbuf slab [new allocator] References: <20010503195904.A53281@technokratis.com> <200105051833.f45IXiW49096@earth.backplane.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Matt Dillon wrote: > Bosko Milekic wrote: > : Anyone interested in the mbuf subsystem code should > : probably read this. Others may still read it, but it > : is somewhat longer than your average Email, so consider > : this a warning. 
:-) Also, although I tried my best to > : cover most issues here, feel free to let me know if I > : should clarify some points. > : > : Not so long ago, as I'm sure some of you remember, > : Alfred committed a patch > : ... > > Sounds good. You know the motto - first make it work, > then make it fast. SLAB allocators are inherently pessimal for symmetry and kernel preemption, which is to say, this change would be inherently bad for SMP. I also personally think SLAB allocators are _not_ the way to go in the long run (or even in the short run). I would point you guys to: UNIX Internals: The New Frontiers Uresh Vahalia Chapter 12 Specifically, I suggest looking at the Dynix Allocator; the author likes the SLAB allocators, and when I was reviewing the book for Prentice Hall prior to its publication, we differed significantly on some aspects of Chapter 12. The Dynix allocator is still the best bet for optimal concurrency; a combination of the Dynix allocator and a zone allocator would probably be the best we could hope for in the near term, without a total rewrite taking cache coloring into account. Note that the _primary factor_, IMO, limiting the number of processors usable by SVR4 prior to degrading unacceptably, is the use of a SLAB allocator, which places all processors into the same contention zone. If you guys _insist_ on going to a SLAB allocator, _at least_ do it right -- one of the few benefits of a SLAB allocator is the ability to perform allocations at interrupt level, if it is correctly implemented.
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 8:46:58 2001 Delivered-To: freebsd-arch@freebsd.org Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178]) by hub.freebsd.org (Postfix) with ESMTP id 1BEFB37B422; Tue, 8 May 2001 08:46:54 -0700 (PDT) (envelope-from keichii@peorth.iteration.net) Received: by peorth.iteration.net (Postfix, from userid 1001) id B01A8595E8; Tue, 8 May 2001 10:46:51 -0500 (CDT) Date: Tue, 8 May 2001 10:46:51 -0500 From: "Michael C . Wu" To: Brian Dean Cc: freebsd-arch@freebsd.org, small@freebsd.org Subject: Re: rc.diskless* patches Message-ID: <20010508104651.B38957@peorth.iteration.net> Reply-To: "Michael C . Wu" Mail-Followup-To: "Michael C . Wu" , Brian Dean , freebsd-arch@freebsd.org, small@freebsd.org References: <20010502225656.A1173@vger.bsdhome.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010502225656.A1173@vger.bsdhome.com>; from bsd@bsdhome.com on Wed, May 02, 2001 at 10:56:56PM -0400 X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20 X-PGP-Key-ID: 0x8FA12E20 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Wed, May 02, 2001 at 10:56:56PM -0400, Brian Dean scribbled: | I've put together some patches to the diskless startup code that I'd | like to commit. I've made both -stable and -current versions of the | patches. I've tested the -stable patches, but I have not tested the | -current patches, hopefully someone can do that and get back to me. | My -current environment is not working at the moment. | | The patches do three things: [snip] | My patches are at: | http://people.freebsd.org/~bsd/diskless I think this is fine, there should be no difference to users. Perhaps -small will think so too. 
Michael, -- +-----------------------------------------------------------+ | keichii@iteration.net | keichii@freebsd.org | | http://iteration.net/~keichii | Yes, BSD is a conspiracy. | +-----------------------------------------------------------+ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 9:22: 5 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id 8308537B422 for ; Tue, 8 May 2001 09:21:59 -0700 (PDT) (envelope-from bright@fw.wintelcom.net) Received: (from bright@localhost) by fw.wintelcom.net (8.10.0/8.10.0) id f48GLlP06751; Tue, 8 May 2001 09:21:47 -0700 (PDT) Date: Tue, 8 May 2001 09:21:47 -0700 From: Alfred Perlstein To: Terry Lambert Cc: Matt Dillon , Bosko Milekic , freebsd-arch@FreeBSD.ORG Subject: Re: Mbuf slab [new allocator] Message-ID: <20010508092146.L18676@fw.wintelcom.net> References: <20010503195904.A53281@technokratis.com> <200105051833.f45IXiW49096@earth.backplane.com> <3AF8123F.632C02E6@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3AF8123F.632C02E6@mindspring.com>; from tlambert2@mindspring.com on Tue, May 08, 2001 at 08:35:27AM -0700 X-all-your-base: are belong to us. Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG * Terry Lambert [010508 08:35] wrote: > Matt Dillon wrote: > > Bosko Milekic wrote: > > : Anyone interested in the mbuf subsystem code should > > : probably read this. Others may still read it, but it > > : is somewhat longer than your average Email, so consider > > : this a warning. :-) Also, although I tried my best to > > : cover most issues here, feel free to let me know if I > > : should clarify some points. 
> > : > > : Not so long ago, as I'm sure some of you remember, > > : Alfred committed a patch > > : ... > > > > Sounds good. You know the motto - first make it work, > > then make it fast. > > SLAB allocators are inherently pessimal for symmetry and > kernel preemption, which is to say, this change would be > inherently bad for SMP. > > > I also personally think SLAB allocators are _not_ the way > to go in the long run (or even in the short run). > > I would point you guys to: > > UNIX Internals: The New Frontiers > Uresh Vahalia > Chapter 12 Terry, I know. :) http://people.freebsd.org/~alfred/memcache/ /* Slab and mp caching allocator. The concepts used here are a combination of the slab, Dynix and Horde allocators. ... Of course it still needs a lot of work. -Alfred To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 13:53:39 2001 Delivered-To: freebsd-arch@freebsd.org Received: from beastie.mckusick.com (beastie.mckusick.com [209.31.233.184]) by hub.freebsd.org (Postfix) with ESMTP id 3D32037B422 for ; Tue, 8 May 2001 13:53:35 -0700 (PDT) (envelope-from mckusick@mckusick.com) Received: from beastie.mckusick.com (localhost [127.0.0.1]) by beastie.mckusick.com (8.9.3/8.9.3) with ESMTP id NAA08757; Tue, 8 May 2001 13:52:58 -0700 (PDT) (envelope-from mckusick@beastie.mckusick.com) Message-Id: <200105082052.NAA08757@beastie.mckusick.com> To: Matt Dillon Subject: Re: on load control / process swapping Cc: Rik van Riel , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu In-Reply-To: Your message of "Mon, 07 May 2001 15:50:20 PDT." <200105072250.f47MoKe68863@earth.backplane.com> Date: Tue, 08 May 2001 13:52:58 -0700 From: Kirk McKusick Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG I know that FreeBSD will swap out sleeping processes, but will it ever swap out running processes? 
The old BSD VM system would do so (we called it hard swapping). It is possible to get a set of running processes that simply do not all fit in memory, and the only way for them to make forward progress is to cycle them through memory. As to the size issue, we used to be biased towards the processes with large resident set sizes in kicking things out. In general, swapping out small things does not buy you much memory and it annoys more users. To avoid picking on the biggest, each time we needed to kick something out, we would find the five biggest, and kick out the one that had been memory resident the longest. The effect is to go round-robin among the big processes. Note that this algorithm allows you to kick out shells, if they are the biggest processes. Also note that this is a last ditch algorithm used only after there are no more idle processes available to kick out. Our decision that we had had to kick out running processes was: (1) no idle processes available to swap, (2) load average over one (if there is just one process, kicking it out does not solve the problem :-), (3) paging rate above a specified threshold over the entire previous 30 seconds (e.g., been bad for a long time and not getting better in the short term), and (4) paging rate to/from swap area using more than half the available disk bandwidth (if your filesystems are on the same disk as your swap areas, you can get a false sense of success because all your processes stop paging while they are blocked waiting for their file data).
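[Archive annotation] The hard-swap policy described above can be sketched roughly as follows. This is only an illustration of the description in this message, not the historical BSD code: struct swap_cand, struct vm_load, and both function names are hypothetical, and the inputs (resident set size, time of last swap-in, paging rates) are assumed to be collected elsewhere.

```c
#include <stddef.h>

#define NBIGGEST 5	/* "find the five biggest" */

/* Hypothetical per-process summary; not the real struct proc. */
struct swap_cand {
	int	pid;
	size_t	rss_pages;	/* resident set size, in pages */
	long	resident_since;	/* time of last swap-in; smaller = longer resident */
};

/* Hypothetical system-wide load summary. */
struct vm_load {
	int	idle_procs;		/* idle processes still available to swap */
	double	loadavg;		/* load average */
	double	min_page_rate_30s;	/* lowest paging rate over the last 30 seconds */
	double	page_rate_threshold;	/* the "specified threshold" */
	double	swap_io_fraction;	/* share of disk bandwidth used by swap paging */
};

/* Conditions (1) through (4): is it time to hard-swap a running process? */
int
should_hard_swap(const struct vm_load *v)
{
	return (v->idle_procs == 0 &&
	    v->loadavg > 1.0 &&
	    v->min_page_rate_30s > v->page_rate_threshold &&
	    v->swap_io_fraction > 0.5);
}

/*
 * Return the index of the process to swap out, or -1 if n == 0.
 * Among the NBIGGEST largest resident sets, choose the one that has
 * been memory resident the longest.
 */
int
pick_hard_swap_victim(const struct swap_cand *c, size_t n)
{
	size_t top[NBIGGEST];	/* indices into c[], sorted by descending rss */
	size_t ntop = 0, i, j, k;
	int victim = -1;

	for (i = 0; i < n; i++) {
		/* Find where c[i] belongs among the current biggest. */
		j = ntop;
		while (j > 0 && c[top[j - 1]].rss_pages < c[i].rss_pages)
			j--;
		if (j >= NBIGGEST)
			continue;	/* smaller than all five already kept */
		/* Shift smaller entries down, dropping the last if full. */
		k = (ntop < NBIGGEST) ? ntop : NBIGGEST - 1;
		for (; k > j; k--)
			top[k] = top[k - 1];
		top[j] = i;
		if (ntop < NBIGGEST)
			ntop++;
	}
	for (j = 0; j < ntop; j++) {
		if (victim < 0 ||
		    c[top[j]].resident_since < c[victim].resident_since)
			victim = (int)top[j];
	}
	return (victim);
}
```

Because a swapped-out process gets a fresh resident_since when it comes back in, repeated calls would tend to rotate through the large processes, which matches the round-robin effect described above.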
Kirk To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 17:18:34 2001 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67]) by hub.freebsd.org (Postfix) with ESMTP id 2518537B422 for ; Tue, 8 May 2001 17:18:31 -0700 (PDT) (envelope-from dillon@earth.backplane.com) Received: (from dillon@localhost) by earth.backplane.com (8.11.2/8.11.2) id f490IGR87881; Tue, 8 May 2001 17:18:16 -0700 (PDT) (envelope-from dillon) Date: Tue, 8 May 2001 17:18:16 -0700 (PDT) From: Matt Dillon Message-Id: <200105090018.f490IGR87881@earth.backplane.com> To: Kirk McKusick Cc: Rik van Riel , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Subject: Re: on load control / process swapping References: <200105082052.NAA08757@beastie.mckusick.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :I know that FreeBSD will swap out sleeping processes, but will it :ever swap out running processes? The old BSD VM system would do so :(we called it hard swapping). It is possible to get a set of running :processes that simply do not all fit in memory, and the only way :for them to make forward progress is to cycle them through memory. I looked at the code fairly carefully last night... it doesn't swap out running processes and it also does not appear to swap out processes blocked in a page-fault (on I/O). Now, of course we can't swap a process out right then (it might be holding locks), but I think it would be beneficial to be able to mark the process as 'requesting a swapout on return to user mode' or something like that. At the moment what gets picked for swapping is hit-or-miss due to the wait states. :As to the size issue, we used to be biased towards the processes :with large resident set sizes in kicking things out. 
In general, :swapping out small things does not buy you much memory and it The VM system does enforce the 'memoryuse' resource limit when the memory load gets heavy. But once the load goes beyond that the VM system doesn't appear to care how big the process is. :... :biggest processes. Also note that this is a last ditch algorithm :used only after there are no more idle processes available to :kick out. Our decision that we had had to kick out running :processes was: (1) no idle processes available to swap, (2) load :average over one (if there is just one process, kicking it out :does not solve the problem :-), (3) paging rate above a specified :threshold over the entire previous 30 seconds (e.g., been bad :for a long time and not getting better in the short term), and :(4) paging rate to/from swap area using more than half the :available disk bandwidth (if your filesystems are on the same :disk as your swap areas, you can get a false sense of success :because all your processes stop paging while they are blocked :waiting for their file data). : : Kirk I don't think we want to kick out running processes. Thrashing by definition means that many of the processes are stuck in disk-wait, usually from a VM fault, and not running. The other effect of thrashing is, of course, that the cpu idle time goes way up due to all the process stalls. A process that is actually able to run under these circumstances probably has a small run-time footprint (at least for whatever operation it is currently doing), so it should definitely be allowed to continue to run.
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 19: 8:40 2001 Delivered-To: freebsd-arch@freebsd.org Received: from netau1.alcanet.com.au (ntp.alcanet.com.au [203.62.196.27]) by hub.freebsd.org (Postfix) with ESMTP id 4C3C437B423 for ; Tue, 8 May 2001 19:08:34 -0700 (PDT) (envelope-from jeremyp@gsmx07.alcatel.com.au) Received: from mfg1.cim.alcatel.com.au (mfg1.cim.alcatel.com.au [139.188.23.1]) by netau1.alcanet.com.au (8.9.3 (PHNE_22672)/8.9.3) with ESMTP id MAA04160; Wed, 9 May 2001 12:07:48 +1000 (EST) Received: from gsmx07.alcatel.com.au by cim.alcatel.com.au (PMDF V5.2-32 #37641) with ESMTP id <01K3CY9R4TGGRX79H5@cim.alcatel.com.au>; Wed, 9 May 2001 12:07:37 +1100 Received: (from jeremyp@localhost) by gsmx07.alcatel.com.au (8.11.1/8.11.1) id f4927hR25482; Wed, 09 May 2001 12:07:43 +1000 (EST envelope-from jeremyp) Content-return: prohibited Date: Wed, 09 May 2001 12:07:43 +1000 From: Peter Jeremy Subject: Re: on load control / process swapping In-reply-to: <200105090018.f490IGR87881@earth.backplane.com>; from dillon@earth.backplane.com on Tue, May 08, 2001 at 05:18:16PM -0700 To: Matt Dillon Cc: Kirk McKusick , Rik van Riel , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Mail-Followup-To: Matt Dillon , Kirk McKusick , Rik van Riel , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Message-id: <20010509120743.Y59150@gsmx07.alcatel.com.au> MIME-version: 1.0 Content-type: text/plain; charset=us-ascii Content-disposition: inline User-Agent: Mutt/1.2.5i References: <200105082052.NAA08757@beastie.mckusick.com> <200105090018.f490IGR87881@earth.backplane.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 2001-May-08 17:18:16 -0700, Matt Dillon wrote: > I don't think we want to kick out running processes. 
Thrashing > by definition means that many of the processes are stuck in > disk-wait, usually from a VM fault, and not running. The other > effect of thrashing is, of course, that the cpu idle time goes way > up due to all the process stalls. A process that is actually able > to run under these circumstances probably has a small run-time footprint > (at least for whatever operation it is currently doing), so it should > definitely be allowed to continue to run. I don't think this follows. A program that does something like: { extern char memory[BIG_NUMBER]; int i; for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) memory[i]++; } will thrash nicely (assuming BIG_NUMBER is large compared to the currently available physical memory). Occasionally, it will be runnable - at which stage it has a footprint of only two pages, but after executing a couple of instructions, it'll have another page fault. Old pages will remain resident for some time before they age enough to be paged out. If the VM system is stressed, swapping this process out completely would seem to be a win. Whilst this code is artificial, a process managing a very large hash table will have similar behaviour. Given that most (all?) recent CPUs have cheap high-resolution clocks, would it be worthwhile for the VM system to maintain a per-process page fault rate? (average clock cycles before a process faults). If you ignore spikes due to process initialisation etc, a process that faults very quickly after being given the CPU wants a working set size that is larger than the VM system currently allows. The fault rate would seem to be proportional to the ratio between the wanted WSS and allowed RSS. This would seem to be a useful parameter to help decide which process to swap out - in an ideal world the VM subsystem would swap processes to keep the WSS of all in-core processes at about the size of non-kernel RAM.
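[Archive annotation] One way the per-process fault-rate idea above might be tracked is sketched below. It is purely illustrative and not taken from any FreeBSD VM code: struct fault_stats and the fs_* functions are hypothetical names, the cycle counter value is assumed to be passed in by the caller, and the 1/8 smoothing weight is an arbitrary choice for damping the start-up spikes mentioned above.

```c
#include <stdint.h>

/* Hypothetical per-process bookkeeping for "cycles run before faulting". */
struct fault_stats {
	uint64_t run_start;	/* cycle counter when the process last got the CPU */
	uint64_t avg_cycles;	/* smoothed cycles executed before each fault */
	int	 primed;	/* nonzero once avg_cycles holds a real sample */
};

/* Call when the process is given the CPU (or resumes after a fault). */
void
fs_on_run(struct fault_stats *fs, uint64_t now)
{
	fs->run_start = now;
}

/* Call when the process takes a page fault that needs I/O. */
void
fs_on_fault(struct fault_stats *fs, uint64_t now)
{
	uint64_t ran = now - fs->run_start;

	if (!fs->primed) {
		fs->avg_cycles = ran;
		fs->primed = 1;
	} else {
		/* Exponential moving average with weight 1/8 on the new sample. */
		fs->avg_cycles = fs->avg_cycles - (fs->avg_cycles >> 3) + (ran >> 3);
	}
	fs->run_start = now;
}

/*
 * Nonzero if a faults sooner, on average, than b, i.e. a is the more
 * likely candidate for wanting a larger working set than it is allowed.
 */
int
fs_faults_sooner(const struct fault_stats *a, const struct fault_stats *b)
{
	return (a->avg_cycles < b->avg_cycles);
}
```

A swapper could then prefer the process with the smallest avg_cycles as its victim; whether that average really tracks the ratio of wanted WSS to allowed RSS is the open question raised in this message.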
Peter To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue May 8 22: 9:49 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id BCB7E37B422; Tue, 8 May 2001 22:09:46 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.3/8.11.3) with SMTP id f4959hf80730; Wed, 9 May 2001 01:09:43 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Wed, 9 May 2001 01:09:43 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: John Baldwin Cc: arch@FreeBSD.org Subject: RE: Patch to eliminate struct pcred In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG John, Thanks for your comments. As you point out, the svr4 exit change is replicated from your kern_exit change of similar ilk. It might be nice to revisit whatever rationale there was for breaking out the svr4 exit code, and see if we can just rely on a wrapped exit1(), which is the approach taken by the linuxulator. This would reduce code replication. I've likewise removed the intrace cached process flag, and increased the size of the "there's a race condition here" warning in the execve() code. As noted in the comment, and as you've indicated, we need to address the broader locking problems that result in security issues before we un-Giant this and a number of other calls (in particular, any operations involving inter-process activities such as tracing, debugging, and signalling). While modifying the code, I cleaned up the sv[ug]id modification code there -- I need to dig up a copy of POSIX.1 to verify that the new (and the old) behaviors are consistent with the requirements.
I've also added a comment indicating that we may want to set P_SUGID in the event that we do update the saved id's. I've also updated the patch to take into account my recent posix4 commits. The revised patch is available at: http://www.watson.org/~robert/pcred.2.diff Tomorrow I plan to run some more heavy-duty tests, and re-review the code. After that, I'd like to go ahead and commit, assuming no further reviews will be coming in. Thanks, Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 1: 3:19 2001 Delivered-To: freebsd-arch@freebsd.org Received: from Awfulhak.org (awfulhak.demon.co.uk [194.222.196.252]) by hub.freebsd.org (Postfix) with ESMTP id 8C72937B422; Wed, 9 May 2001 01:03:09 -0700 (PDT) (envelope-from brian@Awfulhak.org) Received: from hak.lan.Awfulhak.org (root@hak.lan.Awfulhak.org [172.16.0.12]) by Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4989YW15445; Wed, 9 May 2001 09:09:35 +0100 (BST) (envelope-from brian@lan.Awfulhak.org) Received: from hak.lan.Awfulhak.org (brian@localhost [127.0.0.1]) by hak.lan.Awfulhak.org (8.11.3/8.11.3) with ESMTP id f49833B84293; Wed, 9 May 2001 09:03:04 +0100 (BST) (envelope-from brian@hak.lan.Awfulhak.org) Message-Id: <200105090803.f49833B84293@hak.lan.Awfulhak.org> X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4 To: Brian Somers Cc: cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org, brian@Awfulhak.org, freebsd-arch@FreeBSD.org Subject: Re: cvs commit: src/etc rc In-Reply-To: Message from Brian Somers of "Wed, 09 May 2001 00:24:47 PDT." 
<200105090724.f497OlW22190@freefall.freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 09 May 2001 09:03:03 +0100 From: Brian Somers Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > brian 2001/05/09 00:24:47 PDT > > Modified files: (Branch: RELENG_4) > etc rc > Log: > Remove sockets as well as regular files in /var/run and /var/spool/lock > at boot time. This restores the pre-4.3 behaviour. > > Revision Changes Path > 1.212.2.25 +2 -2 src/etc/rc I think maybe this should just remove everything ? Comments ? -- Brian Don't _EVER_ lose your sense of humour ! Index: rc =================================================================== RCS file: /home/ncvs/src/etc/rc,v retrieving revision 1.261 diff -u -r1.261 rc --- rc 2001/04/15 13:44:05 1.261 +++ rc 2001/05/09 08:07:55 @@ -312,9 +312,12 @@ cd "$dir" && for file in .* * do [ ."$file" = .. -o ."$file" = ... ] && continue - [ -d "$file" -a ! -L "$file" ] && + if [ -d "$file" -a ! 
-L "$file" ] + then purgedir "$file" - [ -f "$file" -o -S "$file" ] && rm -f -- "$file" + else + rm -f -- "$file" + fi done ) done To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 9:22:10 2001 Delivered-To: freebsd-arch@freebsd.org Received: from meow.osd.bsdi.com (meow.osd.bsdi.com [204.216.28.88]) by hub.freebsd.org (Postfix) with ESMTP id 958CC37B423; Wed, 9 May 2001 09:22:06 -0700 (PDT) (envelope-from jhb@FreeBSD.org) Received: from laptop.baldwin.cx (john@jhb-laptop.osd.bsdi.com [204.216.28.241]) by meow.osd.bsdi.com (8.11.2/8.11.2) with ESMTP id f49GM2G52464; Wed, 9 May 2001 09:22:02 -0700 (PDT) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Wed, 09 May 2001 09:21:08 -0700 (PDT) From: John Baldwin To: Robert Watson Subject: RE: Patch to eliminate struct pcred Cc: arch@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 09-May-01 Robert Watson wrote: > > John, > > Thanks for your comments. As you point out, the srv4 exit change is > replicated from your kern_exit change of similar ilk. It might be nice to > revisit whatever rationale there was for breaking out the srv4 exit code, > and see if we can just rely on a wrapped exit1(), which is the approach > taken by the linuxulator. This would reduce code replication. Yes, it does need to be wrapped. I think it is unwrapped because we got it from NetBSD and that may be how they do things. *shrug* > I've likewise removed the intrace cached process flag, Thanks. Some comments: @@ -274,21 +275,31 @@ ... - (p->p_flag & P_TRACED) == 0) { + p->p_flag & P_TRACED) { ... It looks like you've inverted the sense of that test. 
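To make the inversion concrete, here is a small sketch of the two conditions from that hunk as standalone predicates. The function names are made up for this example, and EX_P_TRACED uses an illustrative bit value rather than the real P_TRACED definition from sys/proc.h, so treat both as assumptions.

```c
#include <stdbool.h>

/* Illustrative bit value only; the real P_TRACED lives in sys/proc.h. */
#define EX_P_TRACED 0x00800

/* The removed line: true only when the process is NOT being traced. */
bool
old_test(int p_flag)
{
	return ((p_flag & EX_P_TRACED) == 0);
}

/* The added line: true only when the process IS being traced. */
bool
new_test(int p_flag)
{
	return ((p_flag & EX_P_TRACED) != 0);
}
```

For any flag word the two predicates return opposite answers, so the body guarded by the condition would run in exactly the cases where it used to be skipped.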
What is the XXX: locking comment about here: @@ -296,25 +307,50 @@ + p->p_flag &= ~P_SUGID; /* XXX locking */ PROC_UNLOCK(p); The process is locked when that flag is cleared. Looks fine otherwise. -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 10:23:51 2001 Delivered-To: freebsd-arch@freebsd.org Received: from fledge.watson.org (fledge.watson.org [204.156.12.50]) by hub.freebsd.org (Postfix) with ESMTP id 61B5037B422; Wed, 9 May 2001 10:23:48 -0700 (PDT) (envelope-from robert@fledge.watson.org) Received: from fledge.watson.org (robert@fledge.pr.watson.org [192.0.2.3]) by fledge.watson.org (8.11.3/8.11.3) with SMTP id f49HNif90229; Wed, 9 May 2001 13:23:45 -0400 (EDT) (envelope-from robert@fledge.watson.org) Date: Wed, 9 May 2001 13:23:44 -0400 (EDT) From: Robert Watson X-Sender: robert@fledge.watson.org To: John Baldwin Cc: arch@FreeBSD.org Subject: RE: Patch to eliminate struct pcred In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Wed, 9 May 2001, John Baldwin wrote: > Some comments: > > @@ -274,21 +275,31 @@ > ... > - (p->p_flag & P_TRACED) == 0) { > + p->p_flag & P_TRACED) { > ... > > It looks like you've inverted the sense of that test. Oops, nice catch. I've now fixed that. > What is the XXX: locking comment about here: > > @@ -296,25 +307,50 @@ > + p->p_flag &= ~P_SUGID; /* XXX locking */ > PROC_UNLOCK(p); > > The process is locked when that flag is cleared. This is from an earlier incarnation where I had the locking rearranged some. It no longer applies, so I've removed it. 
A patch with those changes (only) is available at: http://www.watson.org/~robert/pcred.3.diff Thanks again, Robert N M Watson FreeBSD Core Team, TrustedBSD Project robert@fledge.watson.org NAI Labs, Safeport Network Services From owner-freebsd-arch Wed May 9 11:50:43 2001 Delivered-To: freebsd-arch@freebsd.org Received: from sax.sax.de (sax.sax.de [193.175.26.33]) by hub.freebsd.org (Postfix) with ESMTP id C9BD437B423; Wed, 9 May 2001 11:50:36 -0700 (PDT) (envelope-from j@uriah.heep.sax.de) Received: (from uucp@localhost) by sax.sax.de (8.9.3/8.9.3) with UUCP id UAA28359; Wed, 9 May 2001 20:50:34 +0200 (CEST) Received: (from j@localhost) by uriah.heep.sax.de (8.11.3/8.11.3) id f49IgFx28974; Wed, 9 May 2001 20:42:15 +0200 (MET DST) (envelope-from j) Date: Wed, 9 May 2001 20:42:15 +0200 From: J Wunsch To: cvs-all@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: cvs commit: src/etc rc Message-ID: <20010509204214.A28936@uriah.heep.sax.de> Reply-To: Joerg Wunsch References: <200105090803.f49833B84293@hak.lan.Awfulhak.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200105090803.f49833B84293@hak.lan.Awfulhak.org>; from brian@Awfulhak.org on Wed, May 09, 2001 at 09:03:03AM +0100 X-Phone: +49-351-2012 669 X-PGP-Fingerprint: DC 47 E6 E4 FF A6 E9 8F 93 21 E0 7D F9 12 D6 4E Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG As Brian Somers wrote: [/var/run at boottime] > I think maybe this should just remove everything ? Comments ? I think so. Solaris 8 is even using a tmpfs for /var/run. Anybody who stores something in /var/run and expects it to survive a reboot needs to change his mind. To quote hier(7): run/ system information files describing various info about system _since it was booted_ -- cheers, J"org .-.-. --... ...-- -.. .
DL8DTL http://www.sax.de/~joerg/ NIC: JW11-RIPE Never trust an operating system you don't have sources for. ;-) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 12:41:57 2001 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67]) by hub.freebsd.org (Postfix) with ESMTP id 278DC37B424 for ; Wed, 9 May 2001 12:41:55 -0700 (PDT) (envelope-from dillon@earth.backplane.com) Received: (from dillon@localhost) by earth.backplane.com (8.11.2/8.11.2) id f49JfdD98861; Wed, 9 May 2001 12:41:39 -0700 (PDT) (envelope-from dillon) Date: Wed, 9 May 2001 12:41:39 -0700 (PDT) From: Matt Dillon Message-Id: <200105091941.f49JfdD98861@earth.backplane.com> To: Peter Jeremy Cc: Kirk McKusick , Rik van Riel , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Subject: Re: on load control / process swapping References: <200105082052.NAA08757@beastie.mckusick.com> <200105090018.f490IGR87881@earth.backplane.com> <20010509120743.Y59150@gsmx07.alcatel.com.au> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :I don't think this follows. A program that does something like: :{ : extern char memory[BIG_NUMBER]; : int i; : : for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) : memory[i]++; :} :will thrash nicely (assuming BIG_NUMBER is large compared to the :currently available physical memory). Occasionally, it will be :runnable - at which stage it has a footprint of only two pages, but Why only two pages? It looks to me like the footprint is BIG_NUMBER bytes. :after executing a couple of instructions, it'll have another page :fault. Old pages will remain resident for some time before they age :enough to be paged out. If the VM system is stressed, swapping this :process out completely would seem to be a win. Not exactly. Page aging works both ways. 
Just accessing a page once does not give it priority over everything else in the page queues. :... :you ignore spikes due to process initialisation etc, a process that :faults very quickly after being given the CPU wants a working set size :that is larger than the VM system currently allows. The fault rate :would seem to be proportional to the ratio between the wanted WSS and :allowed RSS. This would seem to be a useful parameter to help decide :which process to swap out - in an ideal world the VM subsystem would :swap processes to keep the WSS of all in-core processes at about the :size of non-kernel RAM. : :Peter Fault rate isn't useful -- maybe faults that require large disk seeks would be useful, but just counting the faults themselves is not useful. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 13:21:46 2001 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67]) by hub.freebsd.org (Postfix) with ESMTP id A024237B424; Wed, 9 May 2001 13:21:41 -0700 (PDT) (envelope-from dillon@earth.backplane.com) Received: (from dillon@localhost) by earth.backplane.com (8.11.2/8.11.2) id f49KLdT99914; Wed, 9 May 2001 13:21:39 -0700 (PDT) (envelope-from dillon) Date: Wed, 9 May 2001 13:21:39 -0700 (PDT) From: Matt Dillon Message-Id: <200105092021.f49KLdT99914@earth.backplane.com> To: Brian Somers Cc: Brian Somers , cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG, brian@Awfulhak.org, freebsd-arch@FreeBSD.ORG Subject: Re: cvs commit: src/etc rc References: <200105090803.f49833B84293@hak.lan.Awfulhak.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :> brian 2001/05/09 00:24:47 PDT :> :> Modified files: (Branch: RELENG_4) :> etc rc :> Log: :> Remove sockets as well as regular files in /var/run and /var/spool/lock :> at boot time. This restores the pre-4.3 behaviour. 
:> :> Revision Changes Path :> 1.212.2.25 +2 -2 src/etc/rc : :I think maybe this should just remove everything ? Comments ? : :-- :Brian Yes. /var/run should be wiped completely. Programs needing persistent /var storage should use /var/db. That's why we have a /var/db separate from a /var/run. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed May 9 23:14: 5 2001 Delivered-To: freebsd-arch@freebsd.org Received: from dt051n37.san.rr.com (dt051n37.san.rr.com [204.210.32.55]) by hub.freebsd.org (Postfix) with ESMTP id 7F35D37B422; Wed, 9 May 2001 23:13:55 -0700 (PDT) (envelope-from DougB@DougBarton.net) Received: from DougBarton.net (master [10.0.0.2]) by dt051n37.san.rr.com (8.9.3/8.9.3) with ESMTP id XAA22907; Wed, 9 May 2001 23:13:44 -0700 (PDT) (envelope-from DougB@DougBarton.net) Message-ID: <3AFA3198.20314F94@DougBarton.net> Date: Wed, 09 May 2001 23:13:44 -0700 From: Doug Barton Organization: Triborough Bridge & Tunnel Authority X-Mailer: Mozilla 4.77 [en] (X11; U; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: Matt Dillon Cc: Brian Somers , cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: cvs commit: src/etc rc References: <200105090803.f49833B84293@hak.lan.Awfulhak.org> <200105092021.f49KLdT99914@earth.backplane.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Matt Dillon wrote: > > :> brian 2001/05/09 00:24:47 PDT > :> > :> Modified files: (Branch: RELENG_4) > :> etc rc > :> Log: > :> Remove sockets as well as regular files in /var/run and /var/spool/lock > :> at boot time. This restores the pre-4.3 behaviour. > :> > :> Revision Changes Path > :> 1.212.2.25 +2 -2 src/etc/rc > : > :I think maybe this should just remove everything ? Comments ? > : > :-- > :Brian > > Yes. /var/run should be wiped completely. 
Programs needing > persistent /var storage should use /var/db. That's why we have > a /var/db separate from a /var/run. If you need another vote, count me in. -- I need someone really bad. Are you really bad? To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu May 10 6:49:11 2001 Delivered-To: freebsd-arch@freebsd.org Received: from Awfulhak.org (awfulhak.demon.co.uk [194.222.196.252]) by hub.freebsd.org (Postfix) with ESMTP id 67A0D37B422; Thu, 10 May 2001 06:49:07 -0700 (PDT) (envelope-from brian@Awfulhak.org) Received: from hak.lan.Awfulhak.org (root@hak.lan.Awfulhak.org [172.16.0.12]) by Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4ADn1308700; Thu, 10 May 2001 14:49:02 +0100 (BST) (envelope-from brian@lan.Awfulhak.org) Received: from hak.lan.Awfulhak.org (brian@localhost [127.0.0.1]) by hak.lan.Awfulhak.org (8.11.3/8.11.3) with ESMTP id f4ADn0d32593; Thu, 10 May 2001 14:49:00 +0100 (BST) (envelope-from brian@hak.lan.Awfulhak.org) Message-Id: <200105101349.f4ADn0d32593@hak.lan.Awfulhak.org> X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4 To: Peter Wemm , freebsd-arch@FreeBSD.org Cc: Brian Somers Subject: linker_search_path() Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Thu, 10 May 2001 14:49:00 +0100 From: Brian Somers Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Hi, The digi driver that I re-wrote recently uses linker_load_file() to grab some throwaway data from another purpose-built digi_* module. At the moment, I use an almost-hard-coded filename of snprintf(modfile, MAXPATHLEN, "/boot/kernel/digi_%s.ko", sc->module); which isn't really very bright. Can anyone tell me what the plans are for linker_search_path() in kern/kern_linker.c ? There's a comment (written by peter): /* * There will be a system to look up or guess a file name from * a module name. * For now we just try to load a file with the same name. 
*/ pathname = linker_search_path(modname); I wouldn't mind implementing that ``system'' or even making linker_search_path() non-static so that I can use it from dev/digi/digi.c. Comments ? Cheers. -- Brian Don't _EVER_ lose your sense of humour ! From owner-freebsd-arch Fri May 11 16:42:22 2001 Delivered-To: freebsd-arch@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id D90D037B62D; Fri, 11 May 2001 16:42:13 -0700 (PDT) (envelope-from tlambert@usr08.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA12297; Fri, 11 May 2001 16:42:11 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp04.primenet.com, id smtpdAAAx6aq3x; Fri May 11 16:41:56 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id QAA04578; Fri, 11 May 2001 16:43:00 -0700 (MST) From: Terry Lambert Message-Id: <200105112343.QAA04578@usr08.primenet.com> Subject: FreeBSD breaks sockets two ways... To: freebsd-net@FreeBSD.ORG Date: Fri, 11 May 2001 23:43:00 +0000 (GMT) Cc: arch@FreeBSD.ORG X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG I have run into two issues that I find really, really annoying. This is in FreeBSD 4.3 and 5.x. Both machines are on a local (non-switched) segment (it works the same with a switch, but taking that out proves it is not the switch causing the problem).
Primus ------ The first is that when you create a socket, and bind it to a specific local IP address, and then connect, it fails to allocate an automatic port private to the socket; specifically: int s; struct sockaddr_in sockaddr; s = socket(AF_INET, SOCK_STREAM, 0); bzero(&sockaddr,sizeof(sockaddr)); sockaddr.sin_family = AF_INET; sockaddr.sin_addr.s_addr = s_addr2; /* the specific local address */ sockaddr.sin_port = 0; /* 0: let the kernel pick the port */ if (bind(s, (struct sockaddr *) &sockaddr, sizeof(sockaddr)) == -1) { perror("bind"); errx(1, "bind failed"); } ...in other words, the sockets are all hashed into the same (global) collision domain, even though they are _not_ global; they are specific to a particular IP address. Secondus -------- On an OS where the above actually works (e.g. _not_ FreeBSD), when I make connections from two ports which are the same, but with different IP addresses, it seems that the MAC address is used by FreeBSD to differentiate connections, and _not_ the IP/port pair. This means that on FreeBSD, incoming connections from two different source IPs with the same MAC address end up resetting the first connection when the second one comes in; instead of getting two total connections, I end up getting only a single connection. Both of these seem to be serious screwups in the routing code hash lookup algorithm, acting as if everything is in the INADDR_ANY domain, and as if it were keying off the MAC address rather than the IP address, as it should. Has anyone else seen this? Obviously, it's hard to reproduce FreeBSD-to-FreeBSD (at least without a BPF program on the client side to cause the problem)... I'm primarily interested in a fix for 4.3. Thanks, Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
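For anyone who wants to check the first report on their own machine, here is a self-contained variant of the fragment above that also reads back the port the kernel chose. It is a sketch under stated assumptions: bind_local_any_port() is a name invented for this example, the address is passed as a dotted-quad string instead of the s_addr2 variable from the report, and it only exercises bind(), not the later connect().

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Bind a TCP socket to one specific local IPv4 address with port 0 and
 * report the port the kernel picked. Returns the socket, or -1 on error.
 * (Hypothetical helper, not part of any library.)
 */
int
bind_local_any_port(const char *local_ip, unsigned short *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(0);		/* 0: let the kernel choose */
	if (inet_pton(AF_INET, local_ip, &sin.sin_addr) != 1 ||
	    bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (s);
}
```

Calling it once per local alias and comparing the reported ports is one way to see whether automatic port allocation is per address or shared across all addresses.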
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Fri May 11 16:54:44 2001 Delivered-To: freebsd-arch@freebsd.org Received: from mail.tgd.net (rand.tgd.net [64.81.67.117]) by hub.freebsd.org (Postfix) with SMTP id B879637B43F for ; Fri, 11 May 2001 16:54:38 -0700 (PDT) (envelope-from sean@mailhost.tgd.net) Received: (qmail 67706 invoked by uid 1001); 11 May 2001 23:54:32 -0000 Date: Fri, 11 May 2001 16:54:32 -0700 From: Sean Chittenden To: Terry Lambert Cc: freebsd-net@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: FreeBSD breaks sockets two ways... Message-ID: <20010511165432.A67648@rand.tgd.net> References: <200105112343.QAA04578@usr08.primenet.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="NzB8fVQJ5HfG6fxh" Content-Disposition: inline In-Reply-To: <200105112343.QAA04578@usr08.primenet.com>; from "tlambert@primenet.com" on Fri, May 11, 2001 at = 11:43:00PM X-PGP-Key: 0x1EDDFAAD X-PGP-Fingerprint: C665 A17F 9A56 286C 5CFB 1DEA 9F4F 5CEF 1EDD FAAD X-Web-Homepage: http://sean.chittenden.org/ X-All-your-base: are belong to us. Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG --NzB8fVQJ5HfG6fxh Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Are you sure it's failing to allocate the port? I had a similar problem in trying to connect to a service, but found out that aliasing an IP didn't add the arp entry in the routing table (local connections were failing). If I added the arp entry by hand, everything was happy (is IP aliasing a part of the scneario you're describing?). arp -s a.b.c.d 00:60:08:aa:aa:aa pub arp -s a.b.c.e 00:60:08:aa:aa:ab pub A tad annoying, but it seems to work (yeah, I know about the ethers file, but I refuse to use it). 
-sc On Fri, May 11, 2001 at 11:43:00PM +0000, Terry Lambert wrote: > I have run into two issues, that I find really, really annoying. This > is in FreeBSD 4.3 and 5.x. Bot machines are on a local (non-switched) > segment (it works the same with a switch, but taking that out proves > it is not the switch causing the problem). >=20 >=20 > Primus > ------ >=20 > The first is that when you create a socket, and bind it to a > specific local IP address, and then connect, it fails to > allocate an automatic port private to the socket; specifically: --=20 Sean Chittenden --NzB8fVQJ5HfG6fxh Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Comment: Sean Chittenden iEYEARECAAYFAjr8e7cACgkQn09c7x7d+q2tuQCaA6PwZyW5IG33AgevgaN+n5so pZkAnRjax8S0kGKdusPUWJ1/dv9si1FN =tJcl -----END PGP SIGNATURE----- --NzB8fVQJ5HfG6fxh-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Fri May 11 18:46:57 2001 Delivered-To: freebsd-arch@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id B904B37B424; Fri, 11 May 2001 18:46:51 -0700 (PDT) (envelope-from tlambert@usr06.primenet.com) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id SAA00722; Fri, 11 May 2001 18:46:44 -0700 (MST) Received: from usr06.primenet.com(206.165.6.206) via SMTP by smtp03.primenet.com, id smtpdAAAh3a4vb; Fri May 11 18:46:35 2001 Received: (from tlambert@localhost) by usr06.primenet.com (8.8.5/8.8.5) id SAA24408; Fri, 11 May 2001 18:52:27 -0700 (MST) From: Terry Lambert Message-Id: <200105120152.SAA24408@usr06.primenet.com> Subject: Re: FreeBSD breaks sockets two ways... 
To: freebsd-net@FreeBSD.ORG Date: Sat, 12 May 2001 01:52:17 +0000 (GMT) Cc: arch@FreeBSD.ORG X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG ] Are you sure it's failing to allocate the port? ] ] I had a similar problem in trying to connect to a service, but ] found out that aliasing an IP didn't add the arp entry in the routing ] table (local connections were failing). If I added the arp entry by ] hand, everything was happy (is IP aliasing a part of the scenario ] you're describing?). ] ] arp -s a.b.c.d 00:60:08:aa:aa:aa pub ] arp -s a.b.c.e 00:60:08:aa:aa:ab pub ] ] A tad annoying, but it seems to work (yeah, I know about the ] ethers file, but I refuse to use it). -sc Unfortunately, I'm very certain. I talked to Bill Paul about the gratuitous ARP problem last night; I was well aware of it; we added the ARP entries by hand to the target for the aliases on the source machine. I'm _positive_ on the outbound connection problem (the code fragment I attached should have done the job, and I've seen the FreeBSD code that's the problem, but am still pondering how to fix it; I think I'll have to do two lookups, or hang a chain off a hash bucket indexed by IP last (instead of port)). Hopefully, someone will get to this before I do. We also tried setting up the ARP table for the target machine, and then wrote the aforementioned BPF program to stage the connection attempts from a single client machine. We did the same thing from a second client on the same segment. The single-client, two-IP attempt failed, while the two-machine attempt succeeded. The only difference in the packets that was reported by tcpdump was the source MAC address -- otherwise, they were byte-for-byte identical. So there is definitely a problem there with the index being by MAC instead of IP.
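The behaviour being asked for can be stated as a lookup key. The sketch below assumes a connection is identified by the usual TCP four-tuple and nothing else; struct conn_key and both functions are hypothetical names for illustration, not the kernel's protocol control block code, and the hash mixing is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical connection key: the TCP four-tuple, with no link-layer address. */
struct conn_key {
	uint32_t laddr;		/* local IPv4 address */
	uint32_t faddr;		/* foreign (peer) IPv4 address */
	uint16_t lport;		/* local port */
	uint16_t fport;		/* foreign (peer) port */
};

/* Two connections are the same only if all four fields match. */
bool
conn_key_equal(const struct conn_key *a, const struct conn_key *b)
{
	return (a->laddr == b->laddr && a->faddr == b->faddr &&
	    a->lport == b->lport && a->fport == b->fport);
}

/* Pick a hash bucket from the four-tuple; nbuckets must be nonzero. */
uint32_t
conn_key_hash(const struct conn_key *k, uint32_t nbuckets)
{
	uint32_t h;

	h = k->laddr ^ k->faddr ^ (k->faddr >> 16);
	h ^= ((uint32_t)k->lport << 16) | k->fport;
	return (h % nbuckets);
}
```

With a key like this, two clients that share a source port and a MAC address but use different source IPs compare as different connections, which is the outcome the two-IP test above expected.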
Maybe this came in as part of the "aliased IP NFS client being seen as an attacker by the server" fix? Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Sat May 12 5: 6:30 2001 Delivered-To: freebsd-arch@freebsd.org Received: from pcnet1.pcnet.com (pcnet1.pcnet.com [204.213.232.3]) by hub.freebsd.org (Postfix) with ESMTP id 94A7937B423 for ; Sat, 12 May 2001 05:06:27 -0700 (PDT) (envelope-from eischen@vigrid.com) Received: (from eischen@localhost) by pcnet1.pcnet.com (8.8.7/PCNet) id IAA15937; Sat, 12 May 2001 08:05:38 -0400 (EDT) Date: Sat, 12 May 2001 08:05:38 -0400 (EDT) From: Daniel Eischen To: Bruce Evans Cc: arch@FreeBSD.org Subject: Re: cvs commit: src/sys/i386/linux linux_sysvec.c In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG [ Moved to -arch ] On Sat, 12 May 2001, Bruce Evans wrote: > On Fri, 11 May 2001, Daniel Eischen wrote: > > > deischen 2001/05/11 20:23:14 PDT > > > > Modified files: > > sys/i386/linux linux_sysvec.c > > Log: > > Preserve the state of the %gs register when setting up the signal > > handler in Linux emulation. According to bde, this is what Linux > > does. > > > > Recent versions of linuxthreads use %gs for thread-specific data, > > while FreeBSD uses %fs (mostly because WINE uses %gs). > > I think FreeBSD should use %gs too (except I think segment registers > should never be used). Are there different compatibility problems > with WINE? Using %gs is OK by me. I've never used WINE, so I'm not sure how it uses %gs. I think Terry raised the issue when we were discussing which register to use on -arch, and at the time nobody seemed to care if we used %fs or %gs. 
I suppose using %gs would conflict with WINE if it ever relied on our native threads libraries. But since Linux uses %gs, and WINE is more likely to run under Linux than anything else, it would seem safe for FreeBSD to use %gs also. -- Dan Eischen From owner-freebsd-arch Sat May 12 7:24: 1 2001 Delivered-To: freebsd-arch@freebsd.org Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88]) by hub.freebsd.org (Postfix) with ESMTP id B8AAB37B443 for ; Sat, 12 May 2001 07:23:54 -0700 (PDT) (envelope-from riel@conectiva.com.br) Received: from surriel.ddts.net (1-248.ctame701-1.telepar.net.br [200.181.137.248]) by netbank.com.br (Postfix) with ESMTP id 5D7CF46804; Sat, 12 May 2001 11:25:36 -0300 (BRST) Received: from localhost (mflznt@localhost [127.0.0.1]) by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f4CENhi04957; Sat, 12 May 2001 11:23:44 -0300 Date: Sat, 12 May 2001 11:23:43 -0300 (BRST) From: Rik van Riel X-Sender: riel@imladris.rielhome.conectiva To: Matt Dillon Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu Subject: Re: on load control / process swapping In-Reply-To: <200105080056.f480u1Q71866@earth.backplane.com> Message-ID: X-spambait: aardvark@kernelnewbies.org X-spammeplease: aardvark@nl.linux.org MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Mon, 7 May 2001, Matt Dillon wrote: > Look at the loop at line 1362 of vm_pageout.c. Note that it enforces > a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable > to clean sufficient pages in two loops. The tsleep is not woken up > by anyone while waiting that 1/2 second because vm_pages_needed has > not been cleared yet. This is what is limiting the page queue scan.
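The throttle described in that quote can be modelled in a few lines. This is a userland model built only from the description above, not the code at line 1362 of vm_pageout.c: struct pageout_model, pageout_next_delay(), and the freed/target parameters are all assumptions made for the example.

```c
/* Hypothetical model state: consecutive scan passes that missed their target. */
struct pageout_model {
	int failed_passes;
};

/*
 * Return how many clock ticks the scanner should sleep before its next
 * pass. A pass that frees enough pages resets the count; from the second
 * consecutive failed pass onward every pass is followed by a hz/2 sleep,
 * which caps the scan rate at two passes per second.
 */
int
pageout_next_delay(struct pageout_model *m, int freed, int target, int hz)
{
	if (freed >= target) {
		m->failed_passes = 0;
		return (0);
	}
	m->failed_passes++;
	return (m->failed_passes >= 2 ? hz / 2 : 0);
}
```

In this model the limit only engages under sustained shortage: one bad pass costs nothing, and the half-second pauses continue until a pass meets its target again.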
Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that only kicks in when the system is under very heavy memory pressure. That explains why FreeBSD's thrashing detection code works... ;) (I'm not convinced, though, that limiting the speed at which we scan the active list is a good thing. There are some arguments in favour of speed limiting, but it mostly seems to come down to a short-cut to thrashing detection...) > :But ... is this a good enough indication that the processes > :currently resident have enough memory available to make any > :progress ? > > Yes. Consider detecting the difference between a large process accessing > its pages randomly, and a small process accessing a relatively small > set of pages over and over again. Now consider what happens when the > system gets overloaded. The small process will be able to access its > pages enough that they will get page priority over the larger process. > The larger process, due to the more random accesses (or simply the fact > that it is accessing a larger set of pages) will tend to stall more on > pagein I/O which has the side effect of reducing the large process's > access rate on all of its pages. The result: small processes get more > priority just by being small. But if the larger processes never get a chance to make decent progress without thrashing, won't your system be slowed down forever by these (thrashing) large processes? It's nice to protect your small processes from the large ones, but if the large processes don't get to run to completion the system will never get out of thrashing... > :Especially if all the currently resident processes are waiting > :in page faults, won't that make it easier for the system to find > :pages to swap out, etc... ? > : > :One thing I _am_ wondering though: the pageout and the pagein > :thresholds are different. 
Can't this lead to problems where we > :always hit both the pageout threshold -and- the pagein threshold > :and the system thrashes swapping processes in and out ? > > The system will not page out a page it has just paged in due to the > center-of-the-road initialization of act_count (the page aging). Indeed, the speed limiting of the pageout scanning takes care of this. But still, having the swapout threshold defined as being short of inactive pages while the swapin threshold uses the number of free+cache pages as an indication could lead to the situation where you suspend and wake up processes when it isn't needed. Or worse, suspending one process which easily fits in memory and then waking up another process, which cannot be swapped in because the first process' memory is still sitting in RAM and cannot be removed yet due to the pageout scan speed limiting (and also cannot be used, because we suspended the process). The chance of this happening could be quite big in some situations because the swapout and swapin thresholds are measuring things that are only indirectly related... > The pagein and pageout rates have nothing to do with thrashing, per se, > and should never be arbitrarily limited. But they are, with the pageout daemon going to sleep for half a second if it doesn't succeed in freeing enough memory at once. It even does this if a large part of the memory on the active list belongs to a process which has just been suspended because of thrashing... > I don't think it's possible to write a nice neat thrash-handling > algorithm. It's a bunch of algorithms all working together, all > closely tied to the VM page cache. Each taken alone is fairly easy > to describe and understand. All of them together result in complex > interactions that are very easy to break if you make a mistake. Heheh, certainly true ;) cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sat May 12 7:28:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88])
        by hub.freebsd.org (Postfix) with ESMTP id BCAB337B423
        for ; Sat, 12 May 2001 07:28:28 -0700 (PDT)
        (envelope-from riel@conectiva.com.br)
Received: from surriel.ddts.net (1-248.ctame701-1.telepar.net.br [200.181.137.248])
        by netbank.com.br (Postfix) with ESMTP
        id 0CFB746804; Sat, 12 May 2001 11:30:17 -0300 (BRST)
Received: from localhost (svsumc@localhost [127.0.0.1])
        by surriel.ddts.net (8.11.3/8.11.2) with ESMTP id f4CESPi05036;
        Sat, 12 May 2001 11:28:26 -0300
Date: Sat, 12 May 2001 11:28:25 -0300 (BRST)
From: Rik van Riel
X-Sender: riel@imladris.rielhome.conectiva
To: Matt Dillon
Cc: Kirk McKusick , arch@FreeBSD.ORG, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
In-Reply-To: <200105090018.f490IGR87881@earth.backplane.com>
Message-ID:
X-spambait: aardvark@kernelnewbies.org
X-spammeplease: aardvark@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Tue, 8 May 2001, Matt Dillon wrote:

> :I know that FreeBSD will swap out sleeping processes, but will it
> :ever swap out running processes? The old BSD VM system would do so
> :(we called it hard swapping). It is possible to get a set of running
> :processes that simply do not all fit in memory, and the only way
> :for them to make forward progress is to cycle them through memory.
>
> I looked at the code fairly carefully last night... it doesn't
> swap out running processes and it also does not appear to swap
> out processes blocked in a page-fault (on I/O). Now, of course
> we can't swap a process out right then (it might be holding locks),
> but I think it would be beneficial to be able to mark the process
> as 'requesting a swapout on return to user mode' or something
> like that.

In the (still very rough) swapping code for Linux I simply do
this as "swapout on next pagefault". The idea behind that is:

1) it's easy, at a page fault we know we can suspend the process

2) if we're thrashing, we want every process to make as much
   progress as possible before it's suspended (swapped out),
   letting the process run until the next page fault means we
   will never suspend a process while it's still able to make
   progress

3) thrashing should be a rare situation, so you don't want to
   complicate fast-path code like "return to userspace"; instead
   we make sure to have as little impact on the rest of the kernel
   code as possible

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sat May 12 10:22: 2 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
        by hub.freebsd.org (Postfix) with ESMTP id EE99837B424
        for ; Sat, 12 May 2001 10:21:56 -0700 (PDT)
        (envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
        by earth.backplane.com (8.11.3/8.11.2) id f4CHLSS18553;
        Sat, 12 May 2001 10:21:28 -0700 (PDT)
        (envelope-from dillon)
Date: Sat, 12 May 2001 10:21:28 -0700 (PDT)
From: Matt Dillon
Message-Id: <200105121721.f4CHLSS18553@earth.backplane.com>
To: Rik van Riel
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:
:Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that
:only kicks in when the system is under very heavy memory pressure.
:
:That explains why FreeBSD's thrashing detection code works... ;)
:
:(I'm not convinced, though, that limiting the speed at which we
:scan the active list is a good thing. There are some arguments
:in favour of speed limiting, but it mostly seems to come down
:to a short-cut to thrashing detection...)

Note that there is a big distinction between limiting the page
queue scan rate (which we do not do), and sleeping between full
scans (which we do). Limiting the page queue scan rate on a
page-by-page basis does not scale. Sleeping in between full queue
scans (in an extreme case) does scale.

-Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sat May 12 14:17:28 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156])
        by hub.freebsd.org (Postfix) with ESMTP id 695FF37B423
        for ; Sat, 12 May 2001 14:17:24 -0700 (PDT)
        (envelope-from riel@conectiva.com.br)
Received: from burns.conectiva (burns.conectiva [10.0.0.4])
        by perninha.conectiva.com.br (Postfix) with SMTP id 2CE5B16C5C
        for ; Sat, 12 May 2001 18:17:17 -0300 (EST)
Received: (qmail 15923 invoked by uid 0); 12 May 2001 21:15:52 -0000
Received: from duckman.distro.conectiva (HELO duckman.conectiva.com.br) (root@10.0.17.2)
        by burns.conectiva with SMTP; 12 May 2001 21:15:52 -0000
Received: from localhost (riel@localhost)
        by duckman.conectiva.com.br (8.11.3/8.11.3) with ESMTP id f4CLHFK11130;
        Sat, 12 May 2001 18:17:16 -0300
X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs
Date: Sat, 12 May 2001 18:17:15 -0300 (BRST)
From: Rik van Riel
X-X-Sender:
To: Matt Dillon
Cc: , ,
Subject: Re: on load control / process swapping
In-Reply-To: <200105121721.f4CHLSS18553@earth.backplane.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Sat, 12 May 2001, Matt Dillon wrote:

> :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that
> :only kicks in when the system is under very heavy memory pressure.
> :
> :That explains why FreeBSD's thrashing detection code works... ;)
>
> Note that there is a big distinction between limiting the page
> queue scan rate (which we do not do), and sleeping between full
> scans (which we do). Limiting the page queue scan rate on a
> page-by-page basis does not scale. Sleeping in between full queue
> scans (in an extreme case) does scale.

I'm not convinced it's doing a very useful thing, though ;)

(see the rest of the email you replied to)

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message

From owner-freebsd-arch Sat May 12 16:58:31 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
        by hub.freebsd.org (Postfix) with ESMTP id C044937B43E
        for ; Sat, 12 May 2001 16:58:24 -0700 (PDT)
        (envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
        by earth.backplane.com (8.11.3/8.11.2) id f4CNwEr20137;
        Sat, 12 May 2001 16:58:14 -0700 (PDT)
        (envelope-from dillon)
Date: Sat, 12 May 2001 16:58:14 -0700 (PDT)
From: Matt Dillon
Message-Id: <200105122358.f4CNwEr20137@earth.backplane.com>
To: Rik van Riel
Cc: arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
References:
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:
:But if the larger processes never get a chance to make decent
:progress without thrashing, won't your system be slowed down
:forever by these (thrashing) large processes?
:
:It's nice to protect your small processes from the large ones,
:but if the large processes don't get to run to completion the
:system will never get out of thrashing...

Consider the case where you have one large process and many small
processes. If you were to skew things to allow the large process
to run at the cost of all the small processes, you have just
inconvenienced 98% of your users so one ozob can run a big job.
Not only that, but there is no guarantee that the 'big job' will
ever finish (a topic of many a paper on scheduling, BTW)... what if
it's been running for hours and still has hours to go? Do we blow
away the rest of the system to let it run?

What if there are several big jobs? If you skew things in favor of
one the others could take 60 seconds *just* to recover their RSS
when they are finally allowed to run. So much for timesharing...
you would have to run each job exclusively for 5-10 minutes at a
time to get any sort of efficiency, which is not practical in a
timeshare system. So there is really very little that you can do.

:Indeed, the speed limiting of the pageout scanning takes care of
:this. But still, having the swapout threshold defined as being
:short of inactive pages while the swapin threshold uses the number
:of free+cache pages as an indication could lead to the situation
:where you suspend and wake up processes while it isn't needed.
:
:Or worse, suspending one process which easily fit in memory and
:then waking up another process, which cannot be swapped in because
:the first process' memory is still sitting in RAM and cannot be
:removed yet due to the pageout scan speed limiting (and also cannot
:be used, because we suspended the process).

We don't suspend running processes, but I do believe FreeBSD is
still vulnerable to this issue. Suspending the marked process when
it hits the vm_fault code is a good idea and would solve the problem.
If the process never takes an allocation fault, it probably doesn't
have to be swapped out. The normal pageout would suffice for that
process.

:> The pagein and pageout rates have nothing to do with thrashing, per se,
:> and should never be arbitrarily limited.
:
:But they are, with the pageout daemon going to sleep for half a
:second if it doesn't succeed in freeing enough memory at once.
:It even does this if a large part of the memory on the active
:list belongs to a process which has just been suspended because
:of thrashing...

No. I did say the code was complex. A process which has been
suspended for thrashing gets all of its pages depressed in priority.
The page daemon would have no problem recovering the pages. See
line 1458 of vm_pageout.c. This code also enforces the 'memoryuse'
resource limit (which is perhaps even more important).

It is not necessary to try to launder the pages immediately. Simply
depressing their priority is sufficient and it allows for quicker
recovery when the thrashing goes away. It also allows us to implement
the vm.swap_idle_{threshold1,threshold2,enabled} sysctls trivially,
which results in proactive swapping that is extremely useful in
certain situations (like shell machines with lots of idle users).

The pagedaemon gets behind when there are too many active pages in
the system and the pagedaemon is unable to move them to the inactive
queue due to the pages still being very active... that is, when the
active resident set for all processes in the system exceeds available
memory. This is what triggers thrashing. Swapping has the side effect
of reducing the total active resident set for the system as a whole,
fixing the thrashing problem.

-Matt

:> I don't think it's possible to write a nice neat thrash-handling
:> algorithm. It's a bunch of algorithms all working together, all
:> closely tied to the VM page cache. Each taken alone is fairly easy
:> to describe and understand. All of them together result in complex
:> interactions that are very easy to break if you make a mistake.
:
:Heheh, certainly true ;)
:
:cheers,
:
:Rik

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
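
[Editorial sketch, not part of the original messages.] The "swapout on
next pagefault" idea discussed in this thread (load control only marks a
process, and the process is actually suspended when it next enters the
fault path) might be illustrated roughly as below. This is a minimal
user-space model under stated assumptions, not FreeBSD or Linux kernel
code: struct proc_sketch, request_swapout() and page_fault_sketch() are
hypothetical names invented for the example, and the real vm_fault path
is far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical process states for this sketch only. */
enum proc_state { PROC_RUNNABLE, PROC_SUSPENDED };

/* Hypothetical, heavily simplified process record. */
struct proc_sketch {
    int pid;
    enum proc_state state;
    bool swapout_requested;  /* set by load control, consumed at fault time */
    size_t resident_pages;   /* pages currently resident for this process */
};

/*
 * Load control picks a victim but does not suspend it on the spot
 * (it might be holding locks); it only records the request.
 */
void request_swapout(struct proc_sketch *p)
{
    assert(p != NULL);
    p->swapout_requested = true;
}

/*
 * Model of the fault path. A marked process is suspended here, where
 * suspension is known to be safe, instead of having its fault serviced.
 * Returns true if the process was suspended, false if the fault was
 * serviced (modelled as one more resident page).
 */
bool page_fault_sketch(struct proc_sketch *p)
{
    assert(p != NULL);
    if (p->swapout_requested) {
        p->swapout_requested = false;
        p->state = PROC_SUSPENDED;
        return true;
    }
    p->resident_pages++;
    return false;
}
```

In this model a process that never faults again is never suspended, which
matches the observation in the thread that such a process probably does
not need to be swapped out and that normal pageout would suffice for it.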