Date: Wed, 15 Sep 1999 18:18:18 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: gsstark@mit.edu (Greg Stark)
Cc: tlambert@primenet.com, gsstark@mit.edu, denis@acacia.cts.ucla.edu, mozilla@FreeBSD.ORG
Subject: Re: Communicator 4.5: "Xlib: Unexpected async reply" msg flood!
Message-ID: <199909151818.LAA23542@usr09.primenet.com>
In-Reply-To: <877lls682x.fsf@x2-513.mtl.Generation.NET> from "Greg Stark" at Sep 15, 99 01:45:58 am
> > The serialization solution was arrived at by me, after observing
> > the problem and non-problem platforms, and taking into account my
> > detailed knowledge of threads implementations on Solaris, Linux,
> > Windows 98, Windows NT, Macintosh, and FreeBSD.
>
> So in your case you're fairly certain it was the GIF decoder that was buggy?

The GIF decoder _at least_ is buggy, in that it assumes that the context switch will be back to the thread that was context switched out after an involuntary preemption.

I suspect that it is not the _only_ code which is buggy, just the most visible to me, in my application.

> Did you have a particular test page that could reliably crash communicator?

Yes, if I move the mouse over the image map while it is being loaded, and the image map handling code gets time slices.

> This would be especially good if it reliably triggered the sequence errors.

It crashes the program. The crash may not be in the same place each time. I don't have a copy of Communicator with all symbols intact to be able to tell.

> I'm certain the Linux version and fairly certain that the other versions of
> communicator do _not_ use the native OS thread implementation. They use the
> built in user-space NSPR thread implementation. Which I think is supposed to
> use a simple FIFO scheduler like you describe.

What about Macintosh and FreeBSD? Don't they use the same scheduler? Or are they trying to use native pthreads on these platforms?

> My hunch on the bug is that Java's run-time sets up the signal handlers one
> way and the rest of Netscape expects them to be set up a different way. And
> the net result is that some call that doesn't expect to be interrupted gets a
> SIGALRM and some X library call is preempted when it shouldn't be.

Interesting hypothesis.
This somewhat conflicts with the observed behaviour, however, in that the FreeBSD X library is not multithreaded, and the preemption should not be an issue for the code, since it (Netscape) runs without problems in a Windows environment.

Hmmm. Have there been any crashes using the "-remote OpenURL("xxx")" reported on Windows? It seems to me that another access to the same image with a different thread may result in a crash without an explicit call to CoCreateFreeThreadedMarshaler(), since Windows threads instance per-thread data onto thread local storage which is not accessible in the address space of a different kernel thread. The purpose of the Marshaller is to reinstantiate objects between these address spaces.

Back to FreeBSD:

If it is using a "sigsched" type threading mechanism, there are indeed differences in the signal handling mechanism between the OSes. It seems to me that in one case, the alarms are being delivered async (assuming it's the alarms), and in the other case, they are causing a preemption.

This appears to be either missing mutex protection or missing signal masking, either of which is really a coding error resulting from assuming too much about the underlying threads behaviour.

I am unfamiliar with NSPR internals; is it perhaps the case that what gets scheduled by the alarm is a scheduler activation rather than a context switch? This would allow the thread to run to completion, if the signal masking specified system call restart for the signal being delivered.

It's possible that system call restart is failing on FreeBSD, or on the Macintosh, especially if POSIX and non-POSIX signal access functions are being utilized simultaneously.

A fast test in the BSD case would be to call siginterrupt(3), which was introduced in BSD 4.2b (via DEC Ultrix), to obtain traditional BSD signal behaviour (which was to restart all system calls).
The POSIX behaviour of aborting the system call instead of restarting seems to me to be a SVR4 kludge to do things in signal handlers which ought not to be done there (i.e. anything other than setting a volatile flag to be examined in the main loop of the event-driven application).

I know that the FreeBSD user space threads code does not very robustly encapsulate signal interruption of "system calls" (really pthreads wrappers in libc_r), and that scheduler activations with a restart on system calls for all signals, with a scheduler activation on exit (similar to a trampoline), would probably be a better approach.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-mozilla" in the body of the message