Date: Sun, 6 Jan 2002 10:37:28 -0500 (EST)
From: Robert Watson <rwatson@freebsd.org>
To: Julian Elischer <julian@elischer.org>
Cc: Terry Lambert <tlambert2@mindspring.com>, Alfred Perlstein <bright@mu.org>, arch@freebsd.org
Subject: Re: freeing thread structures.
Message-ID: <Pine.NEB.3.96L.1020106091145.90088A-100000@fledge.watson.org>
In-Reply-To: <Pine.BSF.4.21.0201060244220.35785-100000@InterJet.elischer.org>

On Sun, 6 Jan 2002, Julian Elischer wrote:

> > As Julian notes, having per thread credentials means locking
> > in thread_exit(), when it would not otherwise be necessary.
>
> NOT having it means needing locking at every access.. take your pick..
> I prefer having a reference per thread.

I admit to being a supporter of the model proposed by John Baldwin a few months ago: when a 'thread' enters the kernel (is allocated), it gains a read-only reference to the process's "real" credential, and slaps it in td_ucred. When it returns (is released), that reference is freed. When the thread is performing read-only credential operations, it uses td_ucred and no further locking is required. When it needs to change the credential, it must use p->p_ucred, requiring locking (and careful handling of checks before modifications to prevent races).

This has the effect of freezing the credential for the duration of a system call, so that it doesn't have to be done manually (for example) on entry to a system call, or require locking for all operations. The downside is that you always pay this cost, rather than only when it's needed. The upside is that you never pay the cost multiple times (and having done a bit of experimentation with dripping locks around to handle credentials, that can happen a lot).

If there are a few characteristic system calls that will *never* require access to a credential (and it's a very small number, believe me), it might be worth adding a flag to the system call table preventing the grabbing of a ucred. That might apply to gettimeofday(), getpid(), et al. Ones that are used to do things like measure the cost of system calls for benchmarks :-). However, this is an optimization that should be considered later.
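
The model above can be sketched in userland C as follows. This is only an illustration under stated assumptions, not the FreeBSD implementation: the structures are cut-down stand-ins for the kernel's, a pthread mutex stands in for the process lock, and cred_alloc(), cred_hold(), cred_free(), syscall_enter(), syscall_exit(), proc_setuid() and the SYF_NOCRED flag are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/types.h>

/* Simplified stand-ins; NOT the real FreeBSD structure layouts. */
struct ucred {
    atomic_uint cr_ref;          /* reference count */
    uid_t       cr_uid;          /* effective uid */
};

struct proc {
    pthread_mutex_t p_lock;      /* stand-in for the lock protecting p_ucred */
    struct ucred   *p_ucred;     /* the process's "real" credential */
};

struct thread {
    struct proc  *td_proc;
    struct ucred *td_ucred;      /* read-only snapshot for this syscall */
};

/* Hypothetical syscall-table flag: this call never needs a credential. */
#define SYF_NOCRED 0x1

static struct ucred *cred_alloc(uid_t uid)
{
    struct ucred *cr = malloc(sizeof(*cr));
    if (cr == NULL)
        abort();
    atomic_init(&cr->cr_ref, 1);
    cr->cr_uid = uid;
    return cr;
}

static struct ucred *cred_hold(struct ucred *cr)
{
    atomic_fetch_add(&cr->cr_ref, 1);
    return cr;
}

static void cred_free(struct ucred *cr)
{
    unsigned int old = atomic_fetch_sub(&cr->cr_ref, 1);
    assert(old > 0);
    if (old == 1)                /* last reference dropped */
        free(cr);
}

/* On kernel entry: take one reference, held until syscall_exit(). */
static void syscall_enter(struct thread *td, int flags)
{
    td->td_ucred = NULL;
    if (flags & SYF_NOCRED)      /* optional optimization: skip the grab */
        return;
    pthread_mutex_lock(&td->td_proc->p_lock);
    td->td_ucred = cred_hold(td->td_proc->p_ucred);
    pthread_mutex_unlock(&td->td_proc->p_lock);
}

/* On return to userland: drop the reference taken at entry. */
static void syscall_exit(struct thread *td)
{
    if (td->td_ucred != NULL) {
        cred_free(td->td_ucred);
        td->td_ucred = NULL;
    }
}

/* Changing the credential goes through p->p_ucred under the lock.
 * The old credential is replaced, never modified, so threads still
 * holding it in td_ucred keep a frozen view. */
static void proc_setuid(struct proc *p, uid_t uid)
{
    struct ucred *newcr = cred_alloc(uid);
    struct ucred *oldcr;

    pthread_mutex_lock(&p->p_lock);
    oldcr = p->p_ucred;
    p->p_ucred = newcr;
    pthread_mutex_unlock(&p->p_lock);
    cred_free(oldcr);
}
```

In this sketch a thread that entered before proc_setuid() ran keeps reading the old uid through td_ucred, without taking any lock, until it calls syscall_exit(). The cost per system call is one lock round trip and one atomic increment and decrement, which is the fixed cost the flag is meant to avoid for calls that never look at a credential.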

A brief survey of system calls suggests that the following classes of system calls will always require credential access for the current process, or cached thread credential access:

(1) File descriptor calls, which would include chroot, dup, dup2, fcntl, select, poll, kqueue, kevent.

(2) VFS/device calls, which would include read, write, close, creat, link, unlink, chdir, fchdir, mknod, chmod, chown, getfsstat, lseek, mount, unmount, access, chflags, fchflags, sync, stat, lstat, ioctl, revoke, symlink, readlink, fstat, fsync, fchown, fchmod, readv, writev, rename, truncate, ftruncate, flock, mkfifo, mkdir, rmdir, utimes, getdirentries, statfs, fstatfs, getfh, pread, pwrite, pathconf, fpathconf, undelete, futimes, lchown, fhstatfs, fhopen, fhstat, aio_read, aio_write, __getcwd, utrace (which may be implemented wrong due to potential races involving the vnode it's using -- the vref used to protect the write does not prevent a close on the vnode), sendfile, __acl_get_file, __acl_set_file, __acl_get_fd, __acl_set_fd, __acl_delete_file, __acl_delete_fd, __acl_aclcheck_file, __acl_aclcheck_fd, extattrctl, extattr_set_file, extattr_get_file, extattr_delete_file, __cap_get_fd, __cap_get_file, __cap_set_fd, __cap_set_file, extattr_set_fd, extattr_get_fd, extattr_delete_fd, eaccess, nmount.

(3) Inter-process communication calls, which would provide access to sockets and pipes, including read, write, open, close, recvmsg, sendmsg, recvfrom, accept, getpeername, getsockname, pipe, socket, connect, send, recv, bind, setsockopt, listen, getsockopt, readv, writev, sendto, shutdown, socketpair, semsys, msgsys, shmsys, __semctl, semget, semop, msgctl, msgget, msgsnd, msgrcv, shmat, shmctl, shmdt, shmget.

(4) Inter-process services (signalling, debugging, visibility, scheduling), which would include wait4, ptrace, kill, ktrace, wait, setpriority, getpriority, getrusage, killpg, sched_setparam, sched_getparam, sched_setscheduler, sched_getscheduler, sched_rr_get_interval.

(5) VM calls such as mmap, mlock, munlock.

(6) Process lifetime calls, including sys_exit, fork, setlogin, execve, vfork, getpgid, setpgid, getpgrp, setrlimit, rtprio, rfork.

(7) Credential calls, including setuid, getuid, geteuid, getegid, getgid, getgroups, setgroups, setreuid, setregid, getresuid, setresuid, jail, getresuid, getresgid, __cap_get_proc, __cap_set_proc.

(8) System management calls, including acct, reboot, getkerninfo (because it relies on sysctl), swapon, gethostname, sethostname, settimeofday, adjtime, sethostid, quotactl, setdomainname, uname, ntp_adjtime, clock_settime, kldload, kldunload.

(9) Misc, including sysarch, nfsclnt, __setsugid.

Calls that don't require credentials fall into similar categories:

(1) Per-process signal calls, including sigaction, sigprocmask, sigpending, sigvec, sigblock, sigsetmask, sigsuspend (these mask out cases that aren't permitted, rather than attempting to return an error).

(2) Misc struct proc entries, such as getppid, getpid, profil, getlogin, umask, setsid (which may require a credential in the future), issetugid (which may require a credential in the future so as to expand the definition to reflect new privilege models, MAC downgrades, etc), getsid.

(3) Static/global system settings retrieval, such as getpagesize, gethostid (might want to be per-jail in the future), getdomainname (might want to be per-jail in the future), modnext, modstat, modfnext, modfind, kldfind, kldnext, kldstat, kldfirstmod, kldsym. Some interest in masking kld/module information from userland has been expressed, and if that were ever implemented, it would likely rely on the credential.

(4) System time retrieval, and per-process time, including setitimer, getitimer, gettimeofday, clock_gettime, clock_getres.

(5) Resource operations, such as getdtablesize, getrlimit.

(6) Scheduling calls, such as nanosleep, yield, sched_yield, sched_get_priority_max, sched_get_priority_min.

(7) Unimplemented, including oquota, vadvise.

(8) aio calls, including aio_return, aio_suspend, lio_listio (I have no idea what this is, so could be wrong), aio_waitcomplete.

Some calls, especially VM calls, cascade down into VFS, and therefore will use either a cached credential, or rely on the credential generating the system call or fault. This is already the case for touching mmap'ings, where you can actually already get faults based on mapping a file, then trying to touch it in a manner that has been revoked (due to securelevels and getting back EPERM from some write operations, or from NFS or other file systems where revocation is supported, such as AFS), and with the advent of MAC, this will occur more universally following a relabel of a vnode. Since I'm not very familiar with the VM code, I'll just group them together here: sigaltstack, msync, obreak, sbrk, sstk, mincore, sigreturn, sigstack, minherit, mlockall (which appears to be a noop?).

There are enough calls that currently don't require credentials, especially with regard to signals and timing (for example, the Apache event loop does a lot of signal and timing stuff, as no doubt do thread libraries), that it might be worth optimizing that case, but I'd argue that optimizing that case should probably wait until td_ucred is fully implemented in KSE-land. Once the base case is implemented, we can do an experimental implementation of the optimization, and see if the resulting complexity is worth it in terms of cost: note that the period of time where the reference count is manipulated is very small -- the mutex isn't held long, or better yet, it's an atomic operation. There will be other things we can invest time in, such as Giant, where the payoff will be far greater :-).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
robert@fledge.watson.org      NAI Labs, Safeport Network Services