Date:      Wed, 15 Sep 1999 18:18:18 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        gsstark@mit.edu (Greg Stark)
Cc:        tlambert@primenet.com, gsstark@mit.edu, denis@acacia.cts.ucla.edu, mozilla@FreeBSD.ORG
Subject:   Re: Communicator 4.5: "Xlib: Unexpected async reply" msg flood!
Message-ID:  <199909151818.LAA23542@usr09.primenet.com>
In-Reply-To: <877lls682x.fsf@x2-513.mtl.Generation.NET> from "Greg Stark" at Sep 15, 99 01:45:58 am

> > The serialization solution was arrived at by me, after observing
> > the problem and non-problem platforms, and taking into account my
> > detailed knowledge of threads implementations on Solaris, Linux,
> > Windows 98, Windows NT, Macintosh, and FreeBSD.
> 
> So in your case you're fairly certain it was the GIF decoder that was buggy?

The GIF decoder _at least_ is buggy, in that it assumes that the
context switch will be back to the thread that was context switched
out after an involuntary preemption.

I suspect that it is not the _only_ code which is buggy, just the
most visible to me, in my application.


> Did you have a particular test page that could reliably crash communicator?

Yes, if I move the mouse over the image map while it is being loaded,
and the image map handling code gets time slices.


> This would be especially good if it reliably triggered the sequence errors.

It crashes the program.  The crash may not be in the same place each
time.  I don't have a copy of Communicator with all symbols intact to
be able to tell.


> I'm certain the Linux version and fairly certain that the other versions of
> communicator do _not_ use the native OS thread implementation. They use the
> built in user-space NSPR thread implementation. Which I think is supposed to
> use a simple FIFO scheduler like you describe.

What about Macintosh and FreeBSD?  Don't they use the same scheduler?  Or
are they trying to use native pthreads on these platforms?


> My hunch on the bug is that Java's run-time sets up the signal handlers one
> way and the rest of Netscape expects them to be set up a different way. And
> the net result is that some call that doesn't expect to be interrupted gets a
> SIGALRM and some X library call is preempted when it shouldn't be.

Interesting hypothesis.  This somewhat conflicts with the observed
behaviour, however, in that the FreeBSD X library is not multithreaded,
and the preemption should not be an issue for the code, since it
(Netscape) runs without problems in a Windows environment.

Hmmm.  Have there been any crashes reported on Windows when using
'-remote OpenURL("xxx")'?  It seems to me that another access to the
same image from a different thread may result in a crash without
an explicit call to CoCreateFreeThreadedMarshaler(), since Windows
threads instance per-thread data onto thread local storage, which is
not accessible in the address space of a different kernel thread.
The purpose of the marshaler is to reinstantiate objects between
these address spaces.

Back to FreeBSD:

If it is using a "sigsched" type threading mechanism, there are indeed
differences in the signal handling mechanism between the OS's.

It seems to me that in one case, the alarms are being delivered async
(assuming it's the alarms), and in the other case, they are causing
a preemption.  This appears to be either missing mutex protection
or missing signal masking, either of which is really a coding error
resulting from assuming too much about the underlying threads' behaviour.
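
To make "signal masking" concrete, here is a minimal sketch of blocking
SIGALRM around a critical section.  It is an illustration only, not
Netscape or NSPR code: with_sigalrm_blocked(), bump_counter() and
shared_counter are hypothetical names, and it assumes a single kernel
thread (a truly multithreaded process would use pthread_sigmask()
instead of sigprocmask()).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical shared state that a SIGALRM-driven scheduler must
 * never observe half-updated. */
static int shared_counter;

/* Hypothetical critical section. */
static void bump_counter(void)
{
    shared_counter++;
}

/* Run update() with SIGALRM blocked, then restore the previous mask.
 * An alarm that arrives meanwhile stays pending and is delivered once
 * the mask is restored.  Returns 0 on success, -1 on failure. */
int with_sigalrm_blocked(void (*update)(void))
{
    sigset_t block, saved;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &saved) == -1)
        return -1;

    update();

    return sigprocmask(SIG_SETMASK, &saved, NULL);
}
```

Calling with_sigalrm_blocked(bump_counter) defers any preemption caused
by the alarm until after the update has completed.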


I am unfamiliar with NSPR internals; is it perhaps the case that
what gets scheduled by the alarm is a scheduler activation rather
than a context switch?  This would allow the thread to run to
completion, if the signal masking specified system call restart
for the signal being delivered.

It's possible that system call restart is failing on FreeBSD, or
on the Macintosh, especially if POSIX and non-POSIX signal access
functions are being utilized simultaneously.

A fast test in the BSD case would be to call siginterrupt(3) with a
flag of 0.  It was introduced in BSD 4.2b (via DEC Ultrix) to obtain
traditional BSD signal behaviour (which was to restart all system calls).
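
A rough sketch of that test, under assumptions: it uses sigaction(2)
with SA_RESTART, which has the same effect as siginterrupt(SIGALRM, 0)
and is the portable spelling on current systems.  The handler and
function names are hypothetical, and nothing here is taken from the
Netscape sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical handler; a real one would drive the thread scheduler. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Install a SIGALRM handler so that system calls interrupted by the
 * alarm are restarted rather than failing with EINTR.
 * Equivalent in effect to siginterrupt(SIGALRM, 0).
 * Returns 0 on success, -1 on failure. */
int install_restarting_alarm_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGALRM, &sa, NULL);
}
```

If the crashes go away with restart enabled, that would point at code
that does not check for interrupted system calls.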

The POSIX behaviour of aborting the system call instead of restarting
seems to me to be an SVR4 kludge to do things in signal handlers
which ought not to be done there (i.e. anything other than setting a
volatile flag to be examined in the main loop of the event driven
application).
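
The "set a volatile flag" pattern looks roughly like the sketch below.
All names (alarm_pending, note_alarm(), install_alarm_flag_handler(),
poll_alarm()) are hypothetical, and the event loop that would call
poll_alarm() is assumed rather than shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler, examined and cleared by the main loop. */
static volatile sig_atomic_t alarm_pending;

/* The handler does nothing except record that the alarm fired. */
static void note_alarm(int signo)
{
    (void)signo;
    alarm_pending = 1;
}

/* Install the flag-setting handler for SIGALRM.
 * Returns 0 on success, -1 on failure. */
int install_alarm_flag_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = note_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Called from the main event loop.  Returns 1 if an alarm has arrived
 * since the previous call, 0 otherwise. */
int poll_alarm(void)
{
    if (!alarm_pending)
        return 0;
    alarm_pending = 0;
    return 1;
}
```

The real work (rescheduling, redrawing, and so on) then happens in the
main loop after poll_alarm() returns 1, not inside the handler.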

I know that the FreeBSD user space threads code does not encapsulate
signal interruption of "system calls" (really pthreads wrappers in
libc_r) very robustly.  Scheduler activations, with a restart on
system calls for all signals and a scheduler activation on exit
(similar to a trampoline), would probably be a better approach.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

