From owner-freebsd-threads@FreeBSD.ORG Sat Jan 3 21:28:40 2015 Return-Path: Delivered-To: threads@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E6D75985; Sat, 3 Jan 2015 21:28:40 +0000 (UTC) Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "mailhost.stack.nl", Issuer "CA Cert Signing Authority" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 76E153139; Sat, 3 Jan 2015 21:28:40 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id D964E3592DF; Sat, 3 Jan 2015 22:28:37 +0100 (CET) Received: by snail.stack.nl (Postfix, from userid 1677) id BDAF028494; Sat, 3 Jan 2015 22:28:37 +0100 (CET) Date: Sat, 3 Jan 2015 22:28:37 +0100 From: Jilles Tjoelker To: Konstantin Belousov Subject: Re: Fixing dlopen("libpthread.so") Message-ID: <20150103212837.GC46373@stack.nl> References: <20141226165337.GJ1754@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141226165337.GJ1754@kib.kiev.ua> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: threads@freebsd.org, arch@freebsd.org X-BeenThere: freebsd-threads@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Threading on FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Jan 2015 21:28:41 -0000 On Fri, Dec 26, 2014 at 06:53:37PM +0200, Konstantin Belousov wrote: > It is somewhat well-known that our libthr.so cannot be loaded > dynamically into the process. Or rather, it can be, but the > consequences are catastrophic. We recommend to link any program which > may load modules, explicitely with -lpthread; the known workaround is > to do LD_PRELOAD=libthr.so.3 for binaries which were not. I > implemented support for ld -z nodlopen some time ago, but attempt to > mark libthr.so as non-loadable caused extreme roar. > A common opinion is that the proper way to fix the problem is > to merge the actual code from libthr into libc, leaving libthr as the > filter to preserve the current ABI. Unfortunately, there are some > non-trivial and undesirable consequences of doing this. > First, all pthread mutexes (and other kind of locks) would become > fully initialized and used even for single-threaded programs, at least > I do not see a way to work around this. Right now, libc shims for > pthread_mutex_init() and pthread_mutex_lock(3) are nop. After the > merge, init needs to allocate memory and lock/unlock operations, > although uncontested, will start costing one atomic each. In > particular, malloc(3) and stdio(3) are affected. > Another very delicate issue is introducing unwanted cancellation > points into libc functions after libthr wrappers become mandatory. > This is fixable, but requires lot of mundane work and probably a long > time to find missed places (i.e. bugs). > There are probably more problems, and this brings an obvious > alternative: fix the issues which make dlopen("libthr.so") so > destructive. > One known show-stopper is the broken errno after the load. The libthr > provides the interposer for the errno and all cancellable functions > from libc. If any interposed symbols have been resolved before the > libthr.so was loaded, or non-lazy binding mode is requested, the > bindings cannot be undonde. In particular, references to __error(), > which implements errno, are bound to return locate of the main thread > errno variable. Similarly, code referencing cancellable functions > still gets the uncancellable libc implementations of them. > Another issue is the recursion between malloc(3) and mutex_init(). > The statically initialized pthread_mutex_t needs some further > initialization before first use. Jemalloc calls pthread_mutex_init(3) > for internally-used mutexes, which is nop stub from libc until libthr > is loaded. After the load, first use of any mutex by malloc(3) leads > to the thr_mutex.c initialization code, which needs calloc(3). This > immediately leads to hang due to recursion on some internal libthr > umtx. Making the lock recursive does not solve the problem, which is > the infinite mutual recursion between malloc and pthread_mutex_lock() > for uninitialized malloc mutex. > Yet another issue is the signal handlers. The libthr routes signal > delivery through its internal signal handler, to avoid interrupting > critical sections. Any signal handler installed prior to libthr is > loaded misses the wrapper, potentially breaking cancellation and > critical sections. > Proposed patch does the following: > - Remove libthr interposers of the libc functions, including > __error(). Instead, functions calls are indirected through the > interposing table, similar to how pthread stubs in libc are already > done. Libc by default points either to syscall trampolines or to > existing libc implementations. On libthr load, it rewrites the > pointers to the cancellable implementations already in libthr. > - Postpone the malloc(3) internal mutexes initialization until libthr > is loaded. > - Reinstall signal handlers with wrapper on libthr load. > The signal handler reinstallation on libthr initialization is only > needed when libthr.so is dlopened. Performing 128*2 sigaction(2) > calls on the startup of the binary which is linked to libthr, and thus > libthr is guaranteed to install proper sighandler wrappers, is huge > waste. So, I perform the hand-over of signal handlers only for the > dlopen-ed libthr, which now needs to detect loading at startup > vs. dlopen. I was unable to distinguish the cases using existing > facilities, so new private rtld interface is implemented, > _rtld_is_dlopened(), to query the way library was brought into the > process address space. > Without some special measures, static binaries would pull in the whole > set of the interposed syscalls due to references from the > interposition table. To fix it, the references are made weak. Also, > to not pull in the pthread stubs, the interposition table is separate > from pthreads stubs indirection table. > The patch is available at > https://www.kib.kiev.ua/kib/libthr_dlopen.1.patch . > Among other things, I tested it with the program illustrating the > issues https://www.kib.kiev.ua/kib/threaded_errno.c . > Note that you must use matching versions of rtld, libc and libthr. > Using old ld-elf.so.1 or old libc.so.7 with new libthr.so.3 will > break the system. I think this can work, conceptually. Does this also mean that the order of libc and libthr on the link line no longer matters? (so SVN r265003 could be reverted?) It looks like this change may introduce more indirection in common paths which might harm performance. The effect is slight and on the other hand more single-threaded applications can benefit from non-libthr performance. I noticed a problem in lib/libc/gen/raise.c: it calls the INTERPOS_pause entry instead of the INTERPOS_raise entry. Alternatively, libc could call thr_self() and thr_kill() so libthr need not do anything (much like how lib/libc/gen/sem_new.c works). The number of interposers could be reduced slightly by doing simple work like wait(), waitpid() and wait3() before instead of after interposition. This would be an additional change and the current patch is more like the existing code. In lib/libc/sys/__error.c, __error_selector should be declared static or __hidden and an accessor function added for libthr, to avoid an additional pointer chase in a common path. One more PLT and GOT reference in some situations may be avoided by making a __hidden alias for __error_unthreaded and making that __error_selector's initial value. Although __libc_interposing is properly not exported (with __libc_interposing_slot accessor), the full benefit is not obtained because __libc_interposing is not declared __hidden in lib/libc/include/libc_private.h, so the compiler will generate an indirect access anyway (and ld will fix it up using a relative relocation). In lib/libthr/thread/thr_private.h, __error_threaded() can be declared __hidden. Sharing __sys_* code between libc and libthr is a net loss (the two copies of the name, GOT entry, PLT entry and longer instruction sequences are almost certainly larger than a syscall stub (32 bytes on amd64)), so I think the best fix for most indirection from __libc_interposing entries is to give libthr its own copy and mark both copies hidden, instead of adding hidden aliases now. -- Jilles Tjoelker