Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 3 Jan 2015 22:28:37 +0100
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        threads@freebsd.org, arch@freebsd.org
Subject:   Re: Fixing dlopen("libpthread.so")
Message-ID:  <20150103212837.GC46373@stack.nl>
In-Reply-To: <20141226165337.GJ1754@kib.kiev.ua>
References:  <20141226165337.GJ1754@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Dec 26, 2014 at 06:53:37PM +0200, Konstantin Belousov wrote:
> It is somewhat well-known that our libthr.so cannot be loaded
> dynamically into the process.  Or rather, it can be, but the
> consequences are catastrophic.  We recommend to link any program which
> may load modules, explicitely with -lpthread; the known workaround is
> to do LD_PRELOAD=libthr.so.3 for binaries which were not.  I
> implemented support for ld -z nodlopen some time ago, but attempt to
> mark libthr.so as non-loadable caused extreme roar.

> A common opinion is that the proper way to fix the problem is
> to merge the actual code from libthr into libc, leaving libthr as the
> filter to preserve the current ABI.  Unfortunately, there are some
> non-trivial and undesirable consequences of doing this.

> First, all pthread mutexes (and other kind of locks) would become
> fully initialized and used even for single-threaded programs, at least
> I do not see a way to work around this.  Right now, libc shims for
> pthread_mutex_init() and pthread_mutex_lock(3) are nop.  After the
> merge, init needs to allocate memory and lock/unlock operations,
> although uncontested, will start costing one atomic each.  In
> particular, malloc(3) and stdio(3) are affected.

> Another very delicate issue is introducing unwanted cancellation
> points into libc functions after libthr wrappers become mandatory.
> This is fixable, but requires lot of mundane work and probably a long
> time to find missed places (i.e. bugs).

> There are probably more problems, and this brings an obvious
> alternative: fix the issues which make dlopen("libthr.so") so
> destructive.

> One known show-stopper is the broken errno after the load.  The libthr
> provides the interposer for the errno and all cancellable functions
> from libc.  If any interposed symbols have been resolved before the
> libthr.so was loaded, or non-lazy binding mode is requested, the
> bindings cannot be undonde.  In particular, references to __error(),
> which implements errno, are bound to return locate of the main thread
> errno variable.  Similarly, code referencing cancellable functions
> still gets the uncancellable libc implementations of them.

> Another issue is the recursion between malloc(3) and mutex_init().
> The statically initialized pthread_mutex_t needs some further
> initialization before first use.  Jemalloc calls pthread_mutex_init(3)
> for internally-used mutexes, which is nop stub from libc until libthr
> is loaded.  After the load, first use of any mutex by malloc(3) leads
> to the thr_mutex.c initialization code, which needs calloc(3).  This
> immediately leads to hang due to recursion on some internal libthr
> umtx.  Making the lock recursive does not solve the problem, which is
> the infinite mutual recursion between malloc and pthread_mutex_lock()
> for uninitialized malloc mutex.

> Yet another issue is the signal handlers.  The libthr routes signal
> delivery through its internal signal handler, to avoid interrupting
> critical sections.  Any signal handler installed prior to libthr is
> loaded misses the wrapper, potentially breaking cancellation and
> critical sections.

> Proposed patch does the following:

> - Remove libthr interposers of the libc functions, including
>   __error(). Instead, functions calls are indirected through the
>   interposing table, similar to how pthread stubs in libc are already
>   done.  Libc by default points either to syscall trampolines or to
>   existing libc implementations.  On libthr load, it rewrites the
>   pointers to the cancellable implementations already in libthr.

> - Postpone the malloc(3) internal mutexes initialization until libthr
>   is loaded.

> - Reinstall signal handlers with wrapper on libthr load.

> The signal handler reinstallation on libthr initialization is only
> needed when libthr.so is dlopened.  Performing 128*2 sigaction(2)
> calls on the startup of the binary which is linked to libthr, and thus
> libthr is guaranteed to install proper sighandler wrappers, is huge
> waste.  So, I perform the hand-over of signal handlers only for the
> dlopen-ed libthr, which now needs to detect loading at startup
> vs. dlopen.  I was unable to distinguish the cases using existing
> facilities, so new private rtld interface is implemented,
> _rtld_is_dlopened(), to query the way library was brought into the
> process address space.

> Without some special measures, static binaries would pull in the whole
> set of the interposed syscalls due to references from the
> interposition table.  To fix it, the references are made weak.  Also,
> to not pull in the pthread stubs, the interposition table is separate
> from pthreads stubs indirection table.

> The patch is available at
> https://www.kib.kiev.ua/kib/libthr_dlopen.1.patch .
> Among other things, I tested it with the program illustrating the
> issues https://www.kib.kiev.ua/kib/threaded_errno.c .
> Note that you must use matching versions of rtld, libc and libthr.
> Using old ld-elf.so.1 or old libc.so.7 with new libthr.so.3 will
> break the system.

I think this can work, conceptually.

Does this also mean that the order of libc and libthr on the link line
no longer matters? (so SVN r265003 could be reverted?)

It looks like this change may introduce more indirection in common paths
which might harm performance. The effect is slight and on the other hand
more single-threaded applications can benefit from non-libthr
performance.

I noticed a problem in lib/libc/gen/raise.c: it calls the INTERPOS_pause
entry instead of the INTERPOS_raise entry. Alternatively, libc could
call thr_self() and thr_kill() so libthr need not do anything (much like
how lib/libc/gen/sem_new.c works).

The number of interposers could be reduced slightly by doing simple work
like wait(), waitpid() and wait3() before instead of after
interposition. This would be an additional change and the current patch
is more like the existing code.

In lib/libc/sys/__error.c, __error_selector should be declared static or
__hidden and an accessor function added for libthr, to avoid an
additional pointer chase in a common path. One more PLT and GOT
reference in some situations may be avoided by making a __hidden alias
for __error_unthreaded and making that __error_selector's initial value.

Although __libc_interposing is properly not exported (with
__libc_interposing_slot accessor), the full benefit is not obtained
because __libc_interposing is not declared __hidden in
lib/libc/include/libc_private.h, so the compiler will generate an
indirect access anyway (and ld will fix it up using a relative
relocation).

In lib/libthr/thread/thr_private.h, __error_threaded() can be declared
__hidden.

Sharing __sys_* code between libc and libthr is a net loss (the two
copies of the name, GOT entry, PLT entry and longer instruction
sequences are almost certainly larger than a syscall stub (32 bytes on
amd64)), so I think the best fix for most indirection from
__libc_interposing entries is to give libthr its own copy and mark both
copies hidden, instead of adding hidden aliases now.

-- 
Jilles Tjoelker



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150103212837.GC46373>