Date:      Sun, 6 Jan 2002 10:37:28 -0500 (EST)
From:      Robert Watson <rwatson@freebsd.org>
To:        Julian Elischer <julian@elischer.org>
Cc:        Terry Lambert <tlambert2@mindspring.com>, Alfred Perlstein <bright@mu.org>, arch@freebsd.org
Subject:   Re: freeing thread structures.
Message-ID:  <Pine.NEB.3.96L.1020106091145.90088A-100000@fledge.watson.org>
In-Reply-To: <Pine.BSF.4.21.0201060244220.35785-100000@InterJet.elischer.org>

On Sun, 6 Jan 2002, Julian Elischer wrote:

> > As Julian notes, having per thread credentials means locking
> > in thread_exit(), when it would not otherwise be necessary.
> 
> NOT having it means needing locking at every access..  take your pick.. 
> I prefer having a reference per thread. 

I admit to being a supporter of the model proposed by John Baldwin a few
months ago: when a 'thread' enters kernel (is allocated), it gains a
read-only reference to the process's "real" credential, and slaps it in
td_ucred.  When it returns (is released), that reference is freed.  When
the thread is performing read-only credential operations, it uses td_ucred
and no further locking is required.  When it needs to change the
credential, it must use p->p_ucred, requiring locking (and careful
handling of checks before modifications to prevent races).  This has the
effect of freezing the credential for the duration of a system call, so
that it doesn't have to be done manually (for example) on entry to a
system call, or require locking for all operations.  The downside is that
you always pay this cost, rather than when it's needed.  The upside is
that you never pay the cost multiple times (and having done a bit of
experimentation with dripping locks around to handle credentials, that can
happen a lot).
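
As a rough illustration of the model described above, here is a minimal
userspace sketch in C. The field names td_ucred and p_ucred come from the
discussion; everything else (the struct layouts, p_lock, cred_alloc,
syscall_enter, syscall_exit, proc_setuid) is hypothetical and greatly
simplified. In particular it assumes a credential belongs to exactly one
process, so that process's lock can also protect the reference count. It
is not the kernel's implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct ucred {
    int   cr_ref;              /* reference count, guarded by p_lock here */
    uid_t cr_uid;              /* effective user id */
};

struct proc {
    pthread_mutex_t p_lock;    /* guards p_ucred and the counts behind it */
    struct ucred   *p_ucred;   /* the process's "real" credential */
};

struct thread {
    struct proc  *td_proc;
    struct ucred *td_ucred;    /* read-only snapshot for one system call */
};

struct ucred *cred_alloc(uid_t uid)
{
    struct ucred *cr = malloc(sizeof(*cr));
    if (cr == NULL)
        abort();
    cr->cr_ref = 1;
    cr->cr_uid = uid;
    return cr;
}

/* Drop one reference; the caller must hold p_lock. */
void cred_release_locked(struct ucred *cr)
{
    if (--cr->cr_ref == 0)
        free(cr);
}

/* Kernel entry: take one reference, paid once per system call. */
void syscall_enter(struct thread *td)
{
    struct proc *p = td->td_proc;

    pthread_mutex_lock(&p->p_lock);
    td->td_ucred = p->p_ucred;
    td->td_ucred->cr_ref++;
    pthread_mutex_unlock(&p->p_lock);
}

/* Kernel exit: give the reference back. */
void syscall_exit(struct thread *td)
{
    struct proc *p = td->td_proc;

    pthread_mutex_lock(&p->p_lock);
    cred_release_locked(td->td_ucred);
    pthread_mutex_unlock(&p->p_lock);
    td->td_ucred = NULL;
}

/* Modification goes through p->p_ucred under the lock and installs a
 * fresh credential; threads already in the kernel keep their snapshot. */
void proc_setuid(struct proc *p, uid_t uid)
{
    struct ucred *newcr = cred_alloc(uid);

    pthread_mutex_lock(&p->p_lock);
    struct ucred *old = p->p_ucred;
    p->p_ucred = newcr;
    cred_release_locked(old);
    pthread_mutex_unlock(&p->p_lock);
}
```

Between syscall_enter() and syscall_exit(), reads of td->td_ucred need no
lock because nothing modifies that credential once it is published; a
concurrent proc_setuid() only swaps the pointer in struct proc.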

If there are a few characteristic system calls that will *never* require
access to a credential (and it's a very small number, believe me), it
might be worth adding a flag to the system call table preventing the
grabbing of a ucred. That might apply to gettimeofday(), getpid(), et al. 
Ones that are used to do things like measure the cost of system calls for
benchmarks :-). However, this is an optimization that should be considered
later.
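
To make the flag idea concrete, here is a small sketch in C of what such
a table entry and check might look like. The names SYF_NOCRED,
struct sysent_sketch, sysent_table and syscall_needs_cred are invented
for this example, and the table contents are illustrative only; none of
this is taken from the real system call table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: the handler never looks at a credential. */
#define SYF_NOCRED 0x1

struct sysent_sketch {
    const char *sy_name;
    int         sy_flags;
};

/* Illustrative entries only; indices are not real syscall numbers. */
static const struct sysent_sketch sysent_table[] = {
    { "getpid",       SYF_NOCRED },
    { "gettimeofday", SYF_NOCRED },
    { "open",         0 },
    { "setuid",       0 },
};

#define NSYSENT (sizeof(sysent_table) / sizeof(sysent_table[0]))

/* Returns 1 if the dispatcher should grab a ucred reference before
 * calling the handler, 0 if it may skip it.  Unknown codes are treated
 * conservatively as needing a credential. */
int syscall_needs_cred(size_t code)
{
    if (code >= NSYSENT)
        return 1;
    return (sysent_table[code].sy_flags & SYF_NOCRED) == 0;
}
```

Defaulting to "needs a credential" means a forgotten flag costs only a
little performance, whereas the opposite default could leave a handler
running without a valid td_ucred.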

A brief survey of system calls suggests that the following classes of
system calls will always require credential access for the current
process, or cached thread credential access:

(1) File descriptor calls which would include chroot, dup, dup2, fcntl,
    select, poll, kqueue, kevent.

(2) VFS/device calls which would include read, write, close, creat, link,
    unlink, chdir, fchdir, mknod, chmod, chown, getfsstat, lseek, mount,
    unmount, access, chflags, fchflags, sync, stat, lstat, ioctl, revoke,
    symlink, readlink, fstat, fsync, fchown, fchmod, readv, writev,
    rename, truncate, ftruncate, flock, mkfifo, mkdir, rmdir, utimes,
    getdirentries, statfs, fstatfs, getfh, pread, pwrite, pathconf,
    fpathconf, undelete, futimes, lchown, fhstatfs, fhopen, fhstat,
    aio_read, aio_write, __getcwd, utrace (which may be implemented wrong
    due to potential races involving the vnode it's using -- the vref
    used to protect the write does not prevent a close on the vnode),
    sendfile, __acl_get_file, __acl_set_file, __acl_get_fd, __acl_set_fd,
    __acl_delete_file, __acl_delete_fd, __acl_aclcheck_file,
    __acl_aclcheck_fd, extattrctl, extattr_set_file, extattr_get_file,
    extattr_delete_file, __cap_get_fd, __cap_get_file, __cap_set_fd,
    __cap_set_file, extattr_set_fd, extattr_get_fd, extattr_delete_fd,
    eaccess, nmount.

(3) Inter-process communication calls, which would provide access to
    sockets and pipes, including read, write, open, close, recvmsg,
    sendmsg, recvfrom, accept, getpeername, getsockname, pipe, socket,
    connect, send, recv, bind, setsockopt, listen, getsockopt,
    readv, writev, sendto, shutdown, socketpair, semsys, msgsys, shmsys,
    __semctl, semget, semop, msgctl, msgget, msgsnd, msgrcv, shmat,
    shmctl, shmdt, shmget.

(4) Inter-process services (signalling, debugging, visibility,
    scheduling), which would include wait4, ptrace, kill, ktrace, wait,
    setpriority, getpriority, getrusage, killpg, sched_setparam,
    sched_getparam, sched_setscheduler, sched_getscheduler,
    sched_rr_get_interval.

(5) VM calls such as mmap, mlock, munlock.

(6) Process lifetime calls, including sys_exit, fork, setlogin, execve,
    vfork, getpgid, setpgid, getpgrp, setrlimit, rtprio, rfork.

(7) Credential calls, including setuid, getuid, geteuid, getegid, getgid,
    getgroups, setgroups, setreuid, setregid, getresuid, setresuid, jail,
    getresgid, __cap_get_proc, __cap_set_proc.

(8) System management calls, including acct, reboot, getkerninfo (because
    it relies on sysctl), swapon, gethostname, sethostname, settimeofday,
    adjtime, sethostid, quotactl, setdomainname, uname, ntp_adjtime,
    clock_settime, kldload, kldunload.

(9) Misc, including sysarch, nfsclnt, __setsugid.

Calls that don't require credentials fall into similar categories:

(1) Per-process signal calls, including sigaction, sigprocmask,
    sigpending, sigvec, sigblock, sigsetmask, sigsuspend (these mask out
    cases that aren't permitted, rather than attempting to return an
    error).

(2) Misc struct proc entries, such as getppid, getpid, profil, getlogin,
    umask, setsid (which may require a credential in the future),
    issetugid (which may require a credential in the future so as to
    expand the definition to reflect new privilege models, MAC
    downgrades, etc), getsid.

(3) Static/global system settings retrieval, such as getpagesize,
    gethostid (might want to be per-jail in the future), getdomainname
    (might want to be per-jail in the future), modnext, modstat, modfnext,
    modfind, kldfind, kldnext, kldstat, kldfirstmod, kldsym.  Some
    interest in masking kld/module information from userland has been
    expressed, and if that were ever implemented, it would likely rely on
    the credential.

(4) System time retrieval, and per-process time, including setitimer,
    getitimer, gettimeofday, clock_gettime, clock_getres. 

(5) Resource operations, such as getdtablesize, getrlimit.

(6) Scheduling calls, such as nanosleep, yield, sched_yield,
    sched_get_priority_max, sched_get_priority_min.

(7) Unimplemented, including oquota, vadvise.

(8) AIO calls, including aio_return, aio_suspend, lio_listio (I have no
    idea what this is, so could be wrong), aio_waitcomplete.

Some calls, especially VM calls, cascade down into VFS, and therefore will
use either a cached credential, or rely on the credential generating the
system call or fault.  This is already the case for touching mmap'ings,
where you can actually already get faults based on mapping a file, then
trying to touch it in a manner that has been revoked (due to securelevels
and getting back EPERM from some write operations, or from NFS or other
file systems where revocation is supported, such as AFS), and with the
advent of MAC, this will occur more universally following a relabel of a
vnode.  Since I'm not very familiar with the VM code, I'll just group them
together here: 

  sigaltstack, msync, obreak, sbrk, sstk, mincore, sigreturn, sigstack,
  minherit, mlockall (which appears to be a noop?) 

There are enough calls that currently don't require credentials,
especially with regard to signals and timing (for example, the Apache
event loop does a lot of signal and timing stuff, as no doubt do thread
libraries), that it might be worth optimizing that case, but I'd argue
that optimizing that case should probably wait until td_ucred is fully
implemented in KSE-land.  Once the base case is implemented, we can do an
experimental implementation of the optimization, and see if the resulting
complexity is worth it in terms of cost: note that the period of time
where the reference count is manipulated is very small -- the mutex isn't
held long, or better yet, it's an atomic operation.  There will be other
things we can invest time in, such as Giant, where the payoff will be far
greater :-).
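
For a sense of how small the atomic variant of the reference count
manipulation can be, here is a sketch using C11 atomics. The names
struct cred_sketch, cred_new, cred_hold and cred_drop are made up for
this example. It assumes the caller of cred_hold() already has a valid
pointer to the credential (for instance, obtained while the process's
credential pointer cannot change), which the sketch does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical credential with an atomically updated reference count. */
struct cred_sketch {
    atomic_uint ref;
    unsigned    uid;
};

struct cred_sketch *cred_new(unsigned uid)
{
    struct cred_sketch *cr = malloc(sizeof(*cr));
    if (cr == NULL)
        abort();
    atomic_init(&cr->ref, 1);   /* the creator holds the first reference */
    cr->uid = uid;
    return cr;
}

/* Take an extra reference: a single atomic increment, no mutex. */
void cred_hold(struct cred_sketch *cr)
{
    atomic_fetch_add_explicit(&cr->ref, 1, memory_order_relaxed);
}

/* Drop a reference; returns 1 if this was the last one and the
 * credential was freed, 0 otherwise. */
int cred_drop(struct cred_sketch *cr)
{
    if (atomic_fetch_sub_explicit(&cr->ref, 1, memory_order_acq_rel) == 1) {
        free(cr);
        return 1;
    }
    return 0;
}
```

The per-system-call cost in this form is one atomic increment on entry
and one atomic decrement on exit.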

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services


