Date: Tue, 08 Jan 2013 10:02:22 -0500 From: Alfred Perlstein <bright@mu.org> To: Konstantin Belousov <kostikbel@gmail.com> Cc: arch@freebsd.org, David Xu <davidxu@freebsd.org>, toolchain@freebsd.org Subject: Re: Fast sigblock (AKA rtld speedup) Message-ID: <50EC34FE.4070804@mu.org> In-Reply-To: <20130108145631.GD82219@kib.kiev.ua> References: <20130107182235.GA65279@kib.kiev.ua> <50EBAA1F.9070303@freebsd.org> <20130108145631.GD82219@kib.kiev.ua>
On 1/8/13 9:56 AM, Konstantin Belousov wrote: > On Tue, Jan 08, 2013 at 01:09:51PM +0800, David Xu wrote: >> On 2013/01/08 02:22, Konstantin Belousov wrote: >>> Below is a forward of the patch for which I failed to obtain a private >>> review. Maybe the list will generate more responses. >>> >>> Our rtld has a performance bottleneck, typically exposed by images >>> with a lot of run-time relocation processing, and by C++ >>> exception handling. We block signal delivery while rtld >>> performs the relocations, as well as for dl_iterate_phdr(3) (the >>> latter is used for finding the dwarf unwinding tables). >>> >>> The signal blocking is needed to allow rtld to resolve the symbols >>> for the signal handlers in a safe way, but it also costs two syscalls >>> of overhead per rtld entry. >>> >>> The proposed approach shaves off those two syscalls, doubling >>> the FreeBSD performance of a (silly) throw/catch C++ microbenchmark. >>> >>> Date: Mon, 13 Aug 2012 15:26:00 +0300 >>> From: Konstantin Belousov <kostikbel@gmail.com> >>> >>> ... >>> >>> The basic idea is to implement sigprocmask() as a single write to a usermode >>> address. If the kernel needs to calculate the signal mask for the current thread, >>> it takes into consideration a non-zero value of the word at the agreed-upon >>> address. Usermode is also informed about signals that were put on hold >>> while fast sigblock was active. >>> >>> As I said, in my measurements of a microbenchmark that did throw/catch in >>> a loop, I see equal user and system time spent on an unpatched system, and >>> the same user time with zero system time on a patched system. >>> >>> The patch can be improved further; e.g., it would be nice to allow rtld to fall >>> back to sigprocmask(2) if the kernel does not support fast sigblock, to prevent >>> a flag day. Also, the mask enforced by fast sigblock could be made configurable.
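[Editor's note: the fast-path idea described above - blocking becomes an atomic add on a word the kernel consults, with a syscall paid only when a signal was actually held - can be sketched in plain C11. The kernel side is replaced by a stub here so the sketch is self-contained; the helper names and memory orderings are illustrative, not taken from the patch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Flag values as defined by the patch's sys/sys/signalvar.h hunk. */
#define FAST_SIGBLOCK_PEND 0x1   /* kernel noted a held signal */
#define FAST_SIGBLOCK_INC  0x10  /* one nesting level of blocking */

static _Atomic uint32_t fsigblock;  /* word shared with the "kernel" */
static int unblock_syscalls;        /* how often the stub syscall ran */

/*
 * Stub for __sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL).  The real
 * kernel clears the PEND bit with casuword32() and redelivers the held
 * signal; here we just clear the bit and count the call.
 */
static void
stub_unblock_syscall(void)
{
	atomic_fetch_and(&fsigblock, (uint32_t)~FAST_SIGBLOCK_PEND);
	unblock_syscalls++;
}

/* Enter a "signals blocked" region: one atomic add, no syscall. */
static void
sig_fastblock(void)
{
	atomic_fetch_add(&fsigblock, (uint32_t)FAST_SIGBLOCK_INC);
}

/*
 * Leave the region.  Only the last unblocker, and only when the kernel
 * set FAST_SIGBLOCK_PEND while we were blocked, pays for a syscall.
 */
static void
sig_fastunblock(void)
{
	uint32_t oldval;

	oldval = atomic_fetch_sub(&fsigblock, (uint32_t)FAST_SIGBLOCK_INC);
	if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC))
		stub_unblock_syscall();
}
```

In the common case (no signal arrives while blocked) both entry and exit are a single atomic op, which is where the two-syscalls-per-rtld-entry saving comes from.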
>>> >>> Note that libthr already blocks signals by catching them, and not using rtld >>> service in the first line handler. I tried to make the change in the spirit >>> of libthr interceptors, but handoff to libthr appears too complicated to >>> work. In fact, libthr can be changed to start using fast sigblock instead >>> of wrapping sigaction, but this is out of scope of the proposal right now. >>> >>> Please comment. >>> >>> diff --git a/lib/libc/sys/Symbol.map b/lib/libc/sys/Symbol.map >>> index 6888ea0..3b75539 100644 >>> --- a/lib/libc/sys/Symbol.map >>> +++ b/lib/libc/sys/Symbol.map >>> @@ -546,6 +546,7 @@ FBSDprivate_1.0 { >>> __sys_extattr_set_link; >>> _extattrctl; >>> __sys_extattrctl; >>> + __sys_fast_sigblock; >>> _fchdir; >>> __sys_fchdir; >>> _fchflags; >>> diff --git a/libexec/rtld-elf/rtld_lock.c b/libexec/rtld-elf/rtld_lock.c >>> index d1563e5..50c52c6 100644 >>> --- a/libexec/rtld-elf/rtld_lock.c >>> +++ b/libexec/rtld-elf/rtld_lock.c >>> @@ -43,6 +43,7 @@ >>> */ >>> >>> #include <sys/param.h> >>> +#include <sys/signalvar.h> >>> #include <signal.h> >>> #include <stdlib.h> >>> #include <time.h> >>> @@ -59,8 +60,8 @@ typedef struct Struct_Lock { >>> void *base; >>> } Lock; >>> >>> -static sigset_t fullsigmask, oldsigmask; >>> static int thread_flag; >>> +static uint32_t fsigblock; >>> >>> static void * >>> def_lock_create() >>> @@ -111,18 +112,26 @@ def_rlock_acquire(void *lock) >>> } >>> >>> static void >>> +sig_fastunblock(void) >>> +{ >>> + uint32_t oldval; >>> + >>> + oldval = atomic_fetchadd_32(&fsigblock, -FAST_SIGBLOCK_INC); >>> + if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC)) >>> + __sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL); >>> +} >>> + >>> +static void >>> def_wlock_acquire(void *lock) >>> { >>> Lock *l = (Lock *)lock; >>> - sigset_t tmp_oldsigmask; >>> >>> for ( ; ; ) { >>> - sigprocmask(SIG_BLOCK, &fullsigmask, &tmp_oldsigmask); >>> + atomic_add_rel_32(&fsigblock, FAST_SIGBLOCK_INC); >>> if (atomic_cmpset_acq_int(&l->lock, 
0, WAFLAG)) >>> break; >>> - sigprocmask(SIG_SETMASK, &tmp_oldsigmask, NULL); >>> + sig_fastunblock(); >>> } >>> - oldsigmask = tmp_oldsigmask; >>> } >>> >>> static void >>> @@ -134,7 +143,7 @@ def_lock_release(void *lock) >>> atomic_add_rel_int(&l->lock, -RC_INCR); >>> else { >>> atomic_add_rel_int(&l->lock, -WAFLAG); >>> - sigprocmask(SIG_SETMASK, &oldsigmask, NULL); >>> + sig_fastunblock(); >>> } >>> } >>> >>> @@ -286,19 +295,7 @@ lockdflt_init() >>> >>> memcpy(&lockinfo, &deflockinfo, sizeof(lockinfo)); >>> _rtld_thread_init(NULL); >>> - /* >>> - * Construct a mask to block all signals except traps which might >>> - * conceivably be generated within the dynamic linker itself. >>> - */ >>> - sigfillset(&fullsigmask); >>> - sigdelset(&fullsigmask, SIGILL); >>> - sigdelset(&fullsigmask, SIGTRAP); >>> - sigdelset(&fullsigmask, SIGABRT); >>> - sigdelset(&fullsigmask, SIGEMT); >>> - sigdelset(&fullsigmask, SIGFPE); >>> - sigdelset(&fullsigmask, SIGBUS); >>> - sigdelset(&fullsigmask, SIGSEGV); >>> - sigdelset(&fullsigmask, SIGSYS); >>> + __sys_fast_sigblock(FAST_SIGBLOCK_SETPTR, &fsigblock); >>> } >>> >>> /* >>> @@ -319,7 +316,10 @@ _rtld_thread_init(struct RtldLockInfo *pli) >>> >>> if (pli == NULL) >>> pli = &deflockinfo; >>> - >>> + else { >>> + fsigblock = 0; >>> + __sys_fast_sigblock(FAST_SIGBLOCK_UNSETPTR, NULL); >>> + } >>> >>> for (i = 0; i < RTLD_LOCK_CNT; i++) >>> if ((locks[i] = pli->lock_create()) == NULL) >>> diff --git a/sys/compat/freebsd32/syscalls.master b/sys/compat/freebsd32/syscalls.master >>> index 0891e41..f9e8b9e 100644 >>> --- a/sys/compat/freebsd32/syscalls.master >>> +++ b/sys/compat/freebsd32/syscalls.master >>> @@ -997,3 +997,4 @@ >>> uint32_t offset1, uint32_t offset2,\ >>> uint32_t len1, uint32_t len2, \ >>> int advice); } >>> +532 AUE_NULL NOPROTO { int fast_sigblock(int cmd, uint32_t *ptr); } >>> diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c >>> index 90f7311..8a3cd15 100644 >>> --- a/sys/kern/kern_exec.c >>> +++ 
b/sys/kern/kern_exec.c >>> @@ -1031,6 +1031,7 @@ exec_new_vmspace(imgp, sv) >>> int error; >>> struct proc *p = imgp->proc; >>> struct vmspace *vmspace = p->p_vmspace; >>> + struct thread *td = curthread; >>> vm_object_t obj; >>> vm_offset_t sv_minuser, stack_addr; >>> vm_map_t map; >>> @@ -1039,6 +1040,10 @@ exec_new_vmspace(imgp, sv) >>> imgp->vmspace_destroyed = 1; >>> imgp->sysent = sv; >>> >>> + td->td_pflags &= ~TDP_FAST_SIGBLOCK; >>> + td->td_sigblock_ptr = NULL; >>> + td->td_sigblock_val = 0; >>> + >>> /* May be called with Giant held */ >>> EVENTHANDLER_INVOKE(process_exec, p, imgp); >>> >>> diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c >>> index 2685a8b..3c8a2f7 100644 >>> --- a/sys/kern/kern_sig.c >>> +++ b/sys/kern/kern_sig.c >>> @@ -230,6 +230,7 @@ static int sigproptbl[NSIG] = { >>> }; >>> >>> static void reschedule_signals(struct proc *p, sigset_t block, int flags); >>> +static sigset_t fastblock_mask; >>> >>> static void >>> sigqueue_start(void) >>> @@ -240,6 +241,16 @@ sigqueue_start(void) >>> p31b_setcfg(CTL_P1003_1B_REALTIME_SIGNALS, _POSIX_REALTIME_SIGNALS); >>> p31b_setcfg(CTL_P1003_1B_RTSIG_MAX, SIGRTMAX - SIGRTMIN + 1); >>> p31b_setcfg(CTL_P1003_1B_SIGQUEUE_MAX, max_pending_per_proc); >>> + SIGFILLSET(fastblock_mask); >>> + SIG_CANTMASK(fastblock_mask); >>> + SIGDELSET(fastblock_mask, SIGILL); >>> + SIGDELSET(fastblock_mask, SIGTRAP); >>> + SIGDELSET(fastblock_mask, SIGABRT); >>> + SIGDELSET(fastblock_mask, SIGEMT); >>> + SIGDELSET(fastblock_mask, SIGFPE); >>> + SIGDELSET(fastblock_mask, SIGBUS); >>> + SIGDELSET(fastblock_mask, SIGSEGV); >>> + SIGDELSET(fastblock_mask, SIGSYS); >>> } >>> >>> ksiginfo_t * >>> @@ -2525,6 +2536,7 @@ issignal(struct thread *td, int stop_allowed) >>> struct sigqueue *queue; >>> sigset_t sigpending; >>> int sig, prop, newsig; >>> + uint32_t oldval; >>> >>> p = td->td_proc; >>> ps = p->p_sigacts; >>> @@ -2541,6 +2553,32 @@ issignal(struct thread *td, int stop_allowed) >>> SIG_STOPSIGMASK(sigpending); >>> if 
(SIGISEMPTY(sigpending)) /* no signal to send */ >>> return (0); >>> + >>> + /* >>> + * Do fast sigblock if requested by usermode. Since >>> + * we do know that there was a signal pending at this >>> + * point, set the FAST_SIGBLOCK_PEND as indicator for >>> + * usermode to perform a dummy call to >>> + * FAST_SIGBLOCK_UNBLOCK, which causes immediate >>> + * delivery of postponed pending signal. >>> + */ >>> + if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) { >>> + if (td->td_sigblock_val != 0) >>> + SIGSETNAND(sigpending, fastblock_mask); >>> + if (SIGISEMPTY(sigpending)) { >>> + oldval = fuword32(td->td_sigblock_ptr); >>> + if (oldval == -1) { >>> + fetch_fast_sigblock_failed(td, 0); >>> + return (0); >>> + } >>> + oldval |= FAST_SIGBLOCK_PEND; >>> + if (suword32(td->td_sigblock_ptr, oldval) == -1) >>> + fetch_fast_sigblock_failed(td, 1); >>> + td->td_sigblock_val = oldval; >>> + return (0); >>> + } >>> + } >>> + >>> sig = sig_ffs(&sigpending); >>> >>> if (p->p_stops & S_SIG) { >>> @@ -3456,3 +3494,92 @@ sigacts_shared(struct sigacts *ps) >>> mtx_unlock(&ps->ps_mtx); >>> return (shared); >>> } >>> + >>> +int >>> +sys_fast_sigblock(struct thread *td, struct fast_sigblock_args *uap) >>> +{ >>> + struct proc *p; >>> + int error; >>> + uint32_t oldval; >>> + >>> + error = 0; >>> + switch (uap->cmd) { >>> + case FAST_SIGBLOCK_SETPTR: >>> + if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) >>> + error = EBUSY; >>> + else if (((uintptr_t)(uap->ptr) & (sizeof(uint32_t) - 1)) != 0) >>> + error = EINVAL; >>> + else { >>> + td->td_pflags |= TDP_FAST_SIGBLOCK; >>> + td->td_sigblock_ptr = uap->ptr; >>> + } >>> + break; >>> + >>> + case FAST_SIGBLOCK_UNBLOCK: >>> + if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) { >>> + oldval = casuword32(td->td_sigblock_ptr, >>> + FAST_SIGBLOCK_PEND, 0); >>> + if (oldval == (uint32_t)-1) >>> + error = EFAULT; >>> + else if (oldval != FAST_SIGBLOCK_PEND) >>> + error = EBUSY; >>> + else >>> + td->td_sigblock_val = 0; >>> + } else >>> + error = 
EDOOFUS; >>> + >>> + /* >>> + * Rely on normal ast mechanism to deliver pending >>> + * signals to current thread. But notify others about >>> + * fake unblock. >>> + */ >>> + p = td->td_proc; >>> + if (error == 0 && p->p_numthreads != 1) { >>> + PROC_LOCK(p); >>> + reschedule_signals(p, td->td_sigmask, 0); >>> + PROC_UNLOCK(p); >>> + } >>> + break; >>> + >>> + case FAST_SIGBLOCK_UNSETPTR: >>> + if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) { >>> + error = copyin(td->td_sigblock_ptr, &oldval, >>> + sizeof(uint32_t)); >>> + if (error != 0) >>> + break; >>> + if (oldval != 0 && oldval != FAST_SIGBLOCK_PEND) >>> + error = EBUSY; >>> + else { >>> + td->td_pflags &= ~TDP_FAST_SIGBLOCK; >>> + td->td_sigblock_val = 0; >>> + } >>> + } else >>> + error = EDOOFUS; >>> + break; >>> + } >>> + return (error); >>> +} >>> + >>> +void >>> +fetch_fast_sigblock(struct thread *td) >>> +{ >>> + >>> + if ((td->td_pflags & TDP_FAST_SIGBLOCK) == 0) >>> + return; >>> + td->td_sigblock_val = fuword32(td->td_sigblock_ptr); >>> + if (td->td_sigblock_val == -1) >>> + fetch_fast_sigblock_failed(td, 0); >>> +} >>> + >>> +void >>> +fetch_fast_sigblock_failed(struct thread *td, int write) >>> +{ >>> + ksiginfo_t ksi; >>> + >>> + td->td_sigblock_val = 0; >>> + ksiginfo_init_trap(&ksi); >>> + ksi.ksi_signo = SIGSEGV; >>> + ksi.ksi_code = write ? SEGV_ACCERR : SEGV_MAPERR; >>> + ksi.ksi_addr = td->td_sigblock_ptr; >>> + trapsignal(td, &ksi); >>> +} >>> diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c >>> index 5aee684..77b250d 100644 >>> --- a/sys/kern/subr_syscall.c >>> +++ b/sys/kern/subr_syscall.c >>> @@ -131,6 +131,12 @@ syscallenter(struct thread *td, struct syscall_args *sa) >>> sa->callp, sa->args, 0); >>> #endif >>> >>> + /* >>> + * Fetch fast sigblock value at the time of syscall >>> + * entry because sleepqueue primitives might call >>> + * cursig(). 
>>> + */ >>> + fetch_fast_sigblock(td); >>> AUDIT_SYSCALL_ENTER(sa->code, td); >>> error = (sa->callp->sy_call)(td, sa->args); >>> AUDIT_SYSCALL_EXIT(error, td); >>> diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c >>> index 095bbdf..66e485b 100644 >>> --- a/sys/kern/subr_trap.c >>> +++ b/sys/kern/subr_trap.c >>> @@ -237,6 +237,7 @@ ast(struct trapframe *framep) >>> */ >>> if (flags & TDF_NEEDSIGCHK || p->p_pendingcnt > 0 || >>> !SIGISEMPTY(p->p_siglist)) { >>> + fetch_fast_sigblock(td); >>> PROC_LOCK(p); >>> mtx_lock(&p->p_sigacts->ps_mtx); >>> while ((sig = cursig(td, SIG_STOP_ALLOWED)) != 0) >>> diff --git a/sys/kern/syscalls.master b/sys/kern/syscalls.master >>> index f62dad7..28a9393 100644 >>> --- a/sys/kern/syscalls.master >>> +++ b/sys/kern/syscalls.master >>> @@ -951,5 +951,6 @@ >>> off_t offset, off_t len); } >>> 531 AUE_NULL STD { int posix_fadvise(int fd, off_t offset, \ >>> off_t len, int advice); } >>> +532 AUE_NULL STD { int fast_sigblock(int cmd, uint32_t *ptr); } >>> ; Please copy any additions and changes to the following compatability tables: >>> ; sys/compat/freebsd32/syscalls.master >>> diff --git a/sys/sys/proc.h b/sys/sys/proc.h >>> index 06df632..4899ca2 100644 >>> --- a/sys/sys/proc.h >>> +++ b/sys/sys/proc.h >>> @@ -272,6 +272,9 @@ struct thread { >>> struct osd td_osd; /* (k) Object specific data. */ >>> struct vm_map_entry *td_map_def_user; /* (k) Deferred entries. */ >>> pid_t td_dbg_forked; /* (c) Child pid for debugger. */ >>> + void *td_sigblock_ptr; /* (k) uptr for fast sigblock. */ >>> + uint32_t td_sigblock_val; /* (k) fast sigblock value at >>> + kernel entry. */ >>> #define td_endzero td_sigmask >>> >>> /* Copied during fork1() or create_thread(). */ >>> @@ -424,6 +427,7 @@ do { \ >>> #define TDP_RESETSPUR 0x04000000 /* Reset spurious page fault history. 
*/ >>> #define TDP_NERRNO 0x08000000 /* Last errno is already in td_errno */ >>> #define TDP_UIOHELD 0x10000000 /* Current uio has pages held in td_ma */ >>> +#define TDP_FAST_SIGBLOCK 0x20000000 /* Fast sigblock active */ >>> >>> /* >>> * Reasons that the current thread can not be run yet. >>> */ >>> diff --git a/sys/sys/signalvar.h b/sys/sys/signalvar.h >>> index 71685e7..68cca58 100644 >>> --- a/sys/sys/signalvar.h >>> +++ b/sys/sys/signalvar.h >>> @@ -250,6 +250,20 @@ typedef struct sigqueue { >>> /* Flags for ksi_flags */ >>> #define SQ_INIT 0x01 >>> >>> +/* >>> + * Fast_sigblock >>> + */ >>> +#define FAST_SIGBLOCK_SETPTR 1 >>> +#define FAST_SIGBLOCK_UNBLOCK 2 >>> +#define FAST_SIGBLOCK_UNSETPTR 3 >>> + >>> +#define FAST_SIGBLOCK_PEND 0x1 >>> +#define FAST_SIGBLOCK_INC 0x10 >>> + >>> +#ifndef _KERNEL >>> +int __sys_fast_sigblock(int cmd, void *ptr); >>> +#endif >>> + >>> #ifdef _KERNEL >>> >>> /* Return nonzero if process p has an unmasked pending signal. */ >>> @@ -329,6 +343,8 @@ extern struct mtx sigio_lock; >>> >>> int cursig(struct thread *td, int stop_allowed); >>> void execsigs(struct proc *p); >>> +void fetch_fast_sigblock(struct thread *td); >>> +void fetch_fast_sigblock_failed(struct thread *td, int write); >>> void gsignal(int pgid, int sig, ksiginfo_t *ksi); >>> void killproc(struct proc *p, char *why); >>> ksiginfo_t * ksiginfo_alloc(int wait); >>> >>> >>> >>> ----- End forwarded message ----- >>> >> So you want to make the kernel share data with userland. The only thing I >> don't like is that it adds overhead to every syscall; rumor has it that our syscalls >> are slower than others'. Might this be a problem?
>> >> A long time ago, I proposed a schedctl() syscall to make the kernel share >> data with userland; some old patches are still there: >> http://people.freebsd.org/~davidxu/schedctl/ >> >> Mine does not have such an overhead, but it has another one: >> a memory page is allocated in the kernel which cannot be swapped out >> and cannot be freed until the process exits. The page is doubly >> mapped into the kernel and userland; accessing the shared data >> in the kernel has zero overhead, though. > The initial version of the changes indeed used a remap of the user page into > the KVA. Besides the issues you described regarding the page being wired > for the whole life of the thread, as well as the one page frame in the KVA, > there were other non-trivial problems. They were mostly related to fork(2) > copying the address spaces, but not limited to that. The use of direct > user address access turned out to be much less intrusive. > > I completely agree with your note about the additional memory access in > the syscall entry path. I believe that it does not incur overhead > under real-world loads, but it might skew some microbenchmarks. > I did not notice any in the testing I did. > > Anyway, I am still not sure whether it is worth optimizing for the throw/catch > microbenchmark at all. If the general consensus is yes, the patch > needs some work before being committable anyway, mostly to allow a patched > rtld to work on pre-patch kernels. > > Thank you for the reply. I think it would be great if it works; I just haven't had time to fully understand your more recent patch. If you feel it's safe, maybe write a few test programs to see what happens; that would be best. -Alfred
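[Editor's note: Konstantin's point about avoiding a flag day amounts to probing for the new syscall once at rtld startup and falling back to sigprocmask(2) on older kernels. A hypothetical sketch of that probe follows; the syscall is replaced by a stub so the sketch runs anywhere, and all the helper names are illustrative, not from the patch.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FAST_SIGBLOCK_SETPTR 1
#define FAST_SIGBLOCK_INC    0x10

static int kernel_has_fast_sigblock; /* toggle to simulate kernel support */
static int use_fast_sigblock;        /* result of the one-time probe */
static int slow_path_blocks;         /* counts simulated sigprocmask() calls */
static uint32_t fsigblock;

/*
 * Stub for __sys_fast_sigblock().  A real probe would issue syscall 532
 * and see ENOSYS on a pre-patch kernel.
 */
static int
stub_fast_sigblock(int cmd, uint32_t *ptr)
{
	(void)cmd;
	(void)ptr;
	if (!kernel_has_fast_sigblock) {
		errno = ENOSYS;
		return (-1);
	}
	return (0);
}

/*
 * One-time init, as lockdflt_init() might do it: remember whether the
 * kernel accepted registration of the shared word.
 */
static void
sigblock_init_probe(void)
{
	use_fast_sigblock =
	    stub_fast_sigblock(FAST_SIGBLOCK_SETPTR, &fsigblock) == 0;
}

/*
 * Write-lock acquire path: the fast atomic word on new kernels, the old
 * sigprocmask(SIG_BLOCK, &fullsigmask, ...) call (simulated by a
 * counter) otherwise.
 */
static void
wlock_block_signals(void)
{
	if (use_fast_sigblock)
		fsigblock += FAST_SIGBLOCK_INC;
	else
		slow_path_blocks++;
}
```

This keeps a single rtld binary working on both kernel generations, at the cost of one extra branch per lock operation.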