From: John Polstra
Date: Wed, 8 Mar 2000 20:00:39 -0800 (PST)
Message-Id: <200003090400.UAA39926@vashon.polstra.com>
To: dmmiller@cvzoom.net
Cc: current@freebsd.org
Subject: Re: More "ld-elf.so.1: assert failed" messages
In-Reply-To: <38C68F72.4070405@cvzoom.net>
References: <38BA5751.2396AE87@cvzoom.net> <38C5A975.957756C4@cvzoom.net>
    <200003081702.JAA39397@vashon.polstra.com> <38C68F72.4070405@cvzoom.net>
Organization: Polstra & Co., Seattle, WA

In article <38C68F72.4070405@cvzoom.net>, Donn Miller wrote:

> I just reverted back to the "normal" version of ld-elf.so, the version
> without the patch.  Mozilla doesn't have the problem with the
> "non-patch" version.  So, maybe it isn't the application.  Or, maybe the
> original, "non-patch" version wasn't doing something right.
>
> Just wondering, in case the problem isn't with Mozilla.  I'm using
> Mozilla right now, with the original ld-elf.so.1.  (The fonts are hard
> on my eyes.)

Well, the whole situation is very complicated.  I'll append a
long-winded mail about it that I sent to Jordan and Peter.  (No answer
yet, but then Jordan was out of the country and Peter only answers 20%
of his mail as a rather crude form of flow control. ;-)

Briefly, we've been through 3 phases with the dynamic linker.

1. The olden days.  Resolving symbols was a read-only operation, so it
   was re-entrant.  Dlopen was a "write" operation, so if any thread
   was doing a dlopen then no other thread could be doing a dlopen or
   a symbol lookup.  There weren't very many multi-threaded apps, and
   there were even fewer that used dlopen.  So nobody saw any
   problems, even though the potential was there.

2. What is in current today.  Resolving symbols is a "write"
   operation, and so is dlopen.  So two threads trying to resolve
   symbols simultaneously need mutual exclusion or problems can
   manifest themselves.  (Resolving symbols being a "write" operation
   is basically an implementation screw-up on my part, which is
   corrected in the patch I posted.)  Meanwhile, there are a lot more
   multi-threaded applications.  So we're seeing problems.

   In an attempt to solve the problems, I did two things: (a) provided
   the dllockinit(3) interface so threads packages could tell the
   dynamic linker how to do locking, and (b) implemented some default
   locking that would work in some cases when threads packages didn't
   use dllockinit.  The default locking works for userland threads
   packages, but not for kernel threads or rfork threads.

3. The patch I posted.  It made symbol resolution read-only again, but
   it also removed the default locking.  I was hoping that with
   re-entrant symbol resolution, the default locking wouldn't be
   needed any more.  The default locking is extremely bogus, because
   it makes assumptions about how threads work which there's no real
   reason to believe are correct.

John

[Here's the mail I sent that explains things in more detail.]

Sorry for this long mail, but I need your advice about a situation
that's kind of complicated.  I don't know whether you've been
following -current, but there's a new chapter in the continuing saga
of "The Dynamic Linker Breaks Some Random Multi-threaded Program
During a Code Freeze."  This time it's wine.
These breakages are all connected, and I'll try to explain the whole
story below, if only so you won't think I just do shoddy work. :-)
What I am looking for from you is some guidance as to how much we want
to mess with the rtld at the 11th hour to fix wine.

There are two ways a program can enter the dynamic linker: either (A)
by innocently calling some external function for the first time,
causing lazy binding to be done for that function, or (B) by
deliberately calling dlopen() or one of its friends.

In long ago olden times, A was effectively a read-only operation,
while B mucked with a bunch of data structures in the dynamic linker.
So a program could safely have lots of threads doing A at the same
time, or just one thread doing B, but not both.

Then the need for better-scoped symbol lookups began to arise, and I
finally bit the bullet and made the lookup algorithm quite a bit more
complicated, a la Solaris and Linux.  This was the change we decided
to merge into 3.4 just before release, because it fixed some important
application -- which one I can't remember any more.

The changes I put in to do this unfortunately changed operation "A"
above into a write operation too.  I.e., the lazy binding was no
longer reentrant.  Soon I started receiving reports of strange
failures in multi-threaded programs.

I attacked this problem by introducing calls to reader/writer locking
functions in key places.  Since locking depends on the underlying
threads package, I also created the dllockinit() API through which
each threads package could tell the dynamic linker how to do locking.
At the same time I added default locking methods which would make
their best effort to lock (basically by masking a bunch of signals) in
case the threads package hadn't been modified to call dllockinit().
That's where things stand in -current today.

Unfortunately, wine uses rfork() to make threads, so each thread is in
its own process.
Consequently, the signal masking in the default locking methods is
ineffective.  Occasionally the dynamic linker gets reentered and an
assert fails.

I have changes in my local tree which fix the lookup algorithm so it
is reentrant again.  I didn't think it was possible before, but then I
found a way to do it using alloca().  This takes us back to the
traditional state of things, where lazy binding (the common case) is
reentrant but dlopen() is not.

I could make the default locking methods no-ops and we'd be back to
the situation we traditionally had.  That's what I think we should
probably do.  For multi-threaded applications this would work almost
all the time.  It would fail only if one thread was in dlopen() when
another thread either called dlopen() itself or did something that
made the lazy binding code kick in.  Applications are smart enough to
avoid reentering dlopen, but short of suspending all other threads
they really don't have a way to prevent lazy binding from happening
while a separate thread is inside dlopen.  So applications could still
fail, but it would be very rare and it would be nothing new -- that
risk has been with us all along, even before I started messing with
this stuff.  Applications whose thread packages called dllockinit()
wouldn't have problems under any circumstances, of course.

I think this would be a decent solution.  But past experience shows
that the N threads packages out there can tend to surprise me.  Do you
think I should just leave it alone until after 4.0 is released, or is
wine important enough that I should try to fix it even though it's a
bit risky?

The other possibility would be to fix the wine port so it calls
dllockinit() to set up locking.  I don't know for sure how hard that
would be, but it's probably a feasible solution.

John
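
The mail says only that the lookup was made reentrant again "using
alloca()"; it does not show how.  One plausible reading, and it is an
assumption here, is that the bookkeeping a lookup needs (which objects
have already been searched) moves out of the shared object records and
into scratch space on the calling thread's stack.  The C sketch below
illustrates that idea only.  The names struct object, struct donelist,
and lookup() are hypothetical and are not taken from rtld.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#if defined(__linux__)
#include <alloca.h>     /* glibc declares alloca() here; the BSDs use <stdlib.h> */
#endif

/* Hypothetical stand-in for a loaded shared object (not rtld's real type). */
struct object {
    const char *name;
    const char **symbols;       /* NULL-terminated list of defined symbols */
    struct object **needed;     /* NULL-terminated list of dependencies */
};

/* Per-call list of objects already searched.  It lives on the stack of the
 * thread doing the lookup, so no shared object is written to. */
struct donelist {
    const struct object **objs;
    size_t used;
    size_t cap;
};

/* Return 1 if obj was already visited in this lookup, else record it. */
static int donelist_check(struct donelist *dl, const struct object *obj)
{
    for (size_t i = 0; i < dl->used; i++)
        if (dl->objs[i] == obj)
            return 1;
    assert(dl->used < dl->cap); /* caller promised an upper bound */
    dl->objs[dl->used++] = obj;
    return 0;
}

/* Depth-first search of obj and its dependencies for sym. */
static const struct object *search(const struct object *obj, const char *sym,
                                   struct donelist *dl)
{
    if (donelist_check(dl, obj))
        return NULL;            /* breaks dependency cycles */
    for (const char **s = obj->symbols; *s != NULL; s++)
        if (strcmp(*s, sym) == 0)
            return obj;
    for (struct object **n = obj->needed; *n != NULL; n++) {
        const struct object *hit = search(*n, sym, dl);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}

/* Find the object defining sym.  nobjs must be at least the number of
 * objects reachable from root (an assumption of this sketch). */
const struct object *lookup(const struct object *root, const char *sym,
                            size_t nobjs)
{
    struct donelist dl;

    /* Scratch space is released automatically when lookup() returns. */
    dl.objs = alloca(nobjs * sizeof dl.objs[0]);
    dl.used = 0;
    dl.cap = nobjs;
    return search(root, sym, &dl);
}
```

Because lookup() only reads the object graph and keeps its visited list
in its own stack frame, two threads can run it at the same time without
locking.  Anything that changes the graph, as dlopen() does, would still
need to exclude concurrent lookups.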