Date:      Wed, 8 Mar 2000 20:00:39 -0800 (PST)
From:      John Polstra <jdp@polstra.com>
To:        dmmiller@cvzoom.net
Cc:        current@freebsd.org
Subject:   Re: More "ld-elf.so.1: assert failed" messages
Message-ID:  <200003090400.UAA39926@vashon.polstra.com>
In-Reply-To: <38C68F72.4070405@cvzoom.net>
References:  <38BA5751.2396AE87@cvzoom.net> <38C5A975.957756C4@cvzoom.net> <200003081702.JAA39397@vashon.polstra.com> <38C68F72.4070405@cvzoom.net>

In article <38C68F72.4070405@cvzoom.net>,
Donn Miller  <dmmiller@cvzoom.net> wrote:

> I just reverted back to the "normal" version of ld-elf.so, the version 
> without the patch.  Mozilla doesn't have the problem with the 
> "non-patch" version.  So, maybe it isn't the application.  Or, maybe the 
> original, "non-patch" version wasn't doing something right.
> 
> Just wondering, in case the problem isn't with Mozilla.  I'm using 
> Mozilla right now, with the original ld-elf.so.1.  (The fonts are hard 
> on my eyes.)

Well, the whole situation is very complicated.  I'll append a
long-winded mail about it that I sent to Jordan and Peter.  (No answer
yet, but then Jordan was out of the country and Peter only answers 20%
of his mail as a rather crude form of flow control. ;-)

Briefly, we've been through 3 phases with the dynamic linker.

1. The olden days.  Resolving symbols was a read-only operation, so
it was re-entrant.  Dlopen was a "write" operation, so if any thread
was doing a dlopen then no other thread could be doing a dlopen or a
symbol lookup.  There weren't very many multi-threaded apps, and
there were even fewer that used dlopen.  So nobody saw any problems,
even though the potential was there.

2. What is in current today.  Resolving symbols is a "write"
operation, and so is dlopen.  So two threads trying to resolve
symbols simultaneously need mutual exclusion or problems can manifest
themselves.  (Resolving symbols being a "write" operation is basically
an implementation screw-up on my part, which is corrected in the
patch I posted.)  Meanwhile, there are a lot more multi-threaded
applications.  So we're seeing problems.  In an attempt to solve the
problems, I did two things: (a) provided the dllockinit(3) interface
so threads packages could tell the dynamic linker how to do locking,
and (b) implemented some default locking that would work in some cases
when threads packages didn't use dllockinit.  The default locking
works for userland threads packages, but not for kernel threads or
rfork threads.

3. The patch I posted.  It made symbol resolution read-only again,
but it also removed the default locking.  I was hoping that with
re-entrant symbol resolution, the default locking wouldn't be needed
any more.  The default locking is extremely bogus, because it makes
assumptions about how threads work which there's no real reason to
believe are correct.

John

[Here's the mail I sent that explains things in more detail.]

Sorry for this long mail, but I need your advice about a situation
that's kind of complicated.  I don't know whether you've been
following -current, but there's a new chapter in the continuing saga
of "The Dynamic Linker Breaks Some Random Multi-threaded Program
During a Code Freeze."  This time it's wine.  These breakages are all
connected, and I'll try to explain the whole story below, if only so
you won't think I just do shoddy work. :-) What I am looking for from
you is some guidance as to how much we want to mess with the rtld at
the 11th hour to fix wine.

There are two ways a program can enter the dynamic linker: either
(A) by innocently calling some external function for the first
time, causing lazy binding to be done for that function, or (B) by
deliberately calling dlopen() or one of its friends.  In long ago
olden times, A was effectively a read-only operation, while B mucked
with a bunch of data structures in the dynamic linker.  So a program
could safely have lots of threads doing A at the same time, or just
one thread doing B, but not both.
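
A minimal sketch of the two entry paths, for illustration only.  The
helper names call_via_lazy_binding and call_via_dlopen are hypothetical,
and the sketch assumes a system where dlopen(NULL, ...) returns a handle
that can see the C library's strlen.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <string.h>

/* Path A: an ordinary call to an external function.  The first such
 * call may enter the dynamic linker implicitly to bind the symbol
 * (unless it was bound at startup or inlined by the compiler). */
size_t call_via_lazy_binding(const char *s)
{
    return strlen(s);
}

/* Path B: a deliberate call into the dynamic linker through dlopen()
 * and dlsym().  Returns (size_t)-1 if the lookup fails. */
size_t call_via_dlopen(const char *s)
{
    size_t (*fn)(const char *) = NULL;
    size_t n = (size_t)-1;
    void *handle = dlopen(NULL, RTLD_LAZY);

    if (handle == NULL)
        return n;
    /* POSIX-style idiom for storing a dlsym() result in a function pointer. */
    *(void **)(&fn) = dlsym(handle, "strlen");
    if (fn != NULL)
        n = fn(s);
    dlclose(handle);
    return n;
}
```

Both helpers compute the same result; they differ only in how control
reaches the dynamic linker.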

Then the need for better-scoped symbol lookups began to arise, and I
finally bit the bullet and made the lookup algorithm quite a bit more
complicated, a la Solaris and Linux.  This was the change we decided
to merge into 3.4 just before release, because it fixed some important
application -- which one I can't remember any more.  The changes I
put in to do this unfortunately changed operation "A" above into a
write operation too.  I.e., the lazy binding was no longer reentrant.
Soon I started receiving reports of strange failures in multi-threaded
programs.

I attacked this problem by introducing calls to reader/writer locking
functions in key places.  Since locking depends on the underlying
threads package, I also created the dllockinit() API through which
each threads package could tell the dynamic linker how to do locking.
At the same time I added default locking methods which would make
their best effort to lock (basically by masking a bunch of signals) in
case the threads package hadn't been modified to call dllockinit().
That's where things stand in -current today.

Unfortunately, wine uses rfork() to make threads, so each thread is
in its own process.  Consequently, the signal masking in the default
locking methods is ineffective.  Occasionally the dynamic linker gets
reentered and an assert fails.

I have changes in my local tree which fix the lookup algorithm so it
is reentrant again.  I didn't think it was possible before, but then
I found a way to do it using alloca().  This takes us back to the
traditional state of things, where lazy binding (the common case)
is reentrant but dlopen() is not.  I could make the default locking
methods no-ops and we'd be back to the situation we traditionally had.
That's what I think we should probably do.
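
The mail does not say how alloca() is used, so the following is only
one plausible shape, under stated assumptions: the per-lookup "already
visited" bookkeeping lives in stack memory owned by the calling thread
instead of in the shared object list, which leaves the shared data
read-only during a lookup.  The struct and function names are
hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>   /* declares alloca() on FreeBSD */
#if defined(__linux__)
#include <alloca.h>   /* declares alloca() on glibc systems */
#endif

/* Hypothetical stand-in for a loaded object.  Never written during lookup. */
struct obj {
    int defines_sym;      /* nonzero if this object defines the symbol */
    size_t ndeps;         /* number of entries in deps */
    struct obj **deps;    /* objects this one depends on (may be cyclic) */
};

/* Per-call list of objects already searched. */
struct done_list {
    const struct obj **objs;
    size_t used;
    size_t cap;
};

/* Return 1 if o was already searched; otherwise record it and return 0.
 * A full list is treated as "already searched" so the walk terminates. */
static int done_check_and_add(struct done_list *dl, const struct obj *o)
{
    for (size_t i = 0; i < dl->used; i++)
        if (dl->objs[i] == o)
            return 1;
    if (dl->used == dl->cap)
        return 1;
    dl->objs[dl->used++] = o;
    return 0;
}

/* Depth-first search that writes only to the caller's done list. */
static const struct obj *search(const struct obj *o, struct done_list *dl)
{
    if (o == NULL || done_check_and_add(dl, o))
        return NULL;
    if (o->defines_sym)
        return o;
    for (size_t i = 0; i < o->ndeps; i++) {
        const struct obj *hit = search(o->deps[i], dl);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}

/* total_objs must be an upper bound on the number of reachable objects.
 * The done list is released automatically when lookup() returns. */
const struct obj *lookup(const struct obj *root, size_t total_objs)
{
    struct done_list dl;

    dl.objs = alloca((total_objs ? total_objs : 1) * sizeof dl.objs[0]);
    dl.used = 0;
    dl.cap = total_objs;
    return search(root, &dl);
}
```

Because each call has its own done list, two threads can run lookup()
at the same time without touching any common writable state.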

For multi-threaded applications this would work almost all the
time.  It would fail only if one thread was in dlopen() when another
thread either called dlopen() itself or did something that made the
lazy binding code kick in.  Applications are smart enough to avoid
reentering dlopen, but short of suspending all other threads they
really don't have a way to prevent lazy binding from happening while a
separate thread is inside dlopen.  So applications could still fail,
but it would be very rare and it would be nothing new -- that risk has
been with us all along, even before I started messing with this stuff.
Applications whose thread packages called dllockinit() wouldn't have
problems under any circumstances, of course.
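
A minimal sketch of how an application might keep its own threads from
reentering dlopen(), assuming POSIX threads.  The serialized_dlopen name
is hypothetical.  As the paragraph above notes, a wrapper like this
cannot stop lazy binding from happening in another thread while the
lock is held.

```c
#include <dlfcn.h>
#include <pthread.h>

/* One application-wide mutex guarding explicit dlopen() calls. */
static pthread_mutex_t dl_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Serialize dlopen() across threads that agree to use this wrapper.
 * It does not protect against lazy binding in other threads. */
void *serialized_dlopen(const char *path, int mode)
{
    void *handle;

    pthread_mutex_lock(&dl_mutex);
    handle = dlopen(path, mode);
    pthread_mutex_unlock(&dl_mutex);
    return handle;
}
```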

I think this would be a decent solution.  But past experience shows
that the N threads packages out there tend to surprise me.  Do
you think I should just leave it alone until after 4.0 is released,
or is wine important enough that I should try to fix it even though
it's a bit risky?

The other possibility would be to fix the wine port so it calls
dllockinit() to set up locking.  I don't know for sure how hard that
would be, but it's probably a feasible solution.

John

