Date:      Tue, 23 Jul 2002 04:27:57 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Sheldon Hearn <sheldonh@starjuice.net>
Cc:        Yann Berthier <yb@sainte-barbe.org>, current@freebsd.org
Subject:   Re: Is it just me or has -current suddenly got massively unstable?
Message-ID:  <3D3D3DBD.D5199F28@mindspring.com>
References:  <20020722101211.GA442@hsc.fr> <20020723070704.7B4CB3925@overcee.wemm.org> <20020723100853.GA433@hsc.fr> <20020723102747.GR32782@starjuice.net>

Sheldon Hearn wrote:
> On (2002/07/23 12:08), Yann Berthier wrote:
> >    Thanks a lot, patch applied, and all is going fine. Peter: I knew you
> >    would come up with a solution :)
> >    (well, feel free to call it bandaid, but it solves the problem BTW)
> 
> To quote Terry Lambert on what he calls Occam's Corollary:
> 
>         Anything that works is better than anything that doesn't.
> 
> :-)

Be really, really careful here.

The reason it works is that it makes the memory type stable.
If the structure has not been reused, the code gets the previous
values and signals a selwakeup() where there is no one waiting.
If the structure *has* been reused, it issues a selwakeup() to a
potentially unrelated thread.  In most cases, this is a harmless
event that's not even being checked for; in other cases, it's
being checked for, and it looks like a bogus return.
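
To make the two cases concrete, here is a small userland sketch.
It is not the kernel code and not the patch: struct waiter, the
pool, and wakeup_target() are all hypothetical names, standing in
for a type-stable allocation and for selwakeup().  It only shows
what a stale pointer sees before and after the slot is reused.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel structure; names are illustrative. */
struct waiter {
    int owner;                  /* id of the thread that would be woken */
    struct waiter *next_free;   /* free-list linkage */
};

#define POOL_SIZE 4

static struct waiter pool[POOL_SIZE];
static struct waiter *free_list;

/* Put every slot on the free list; slot 0 ends up at the head. */
static void
pool_init(void)
{
    free_list = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        pool[i].owner = 0;
        pool[i].next_free = free_list;
        free_list = &pool[i];
    }
}

/* Take a slot from the free list and record its new owner. */
static struct waiter *
waiter_alloc(int owner)
{
    struct waiter *w = free_list;

    if (w == NULL)
        return NULL;
    free_list = w->next_free;
    w->owner = owner;
    return w;
}

/*
 * "Type stable" free: the slot stays a struct waiter forever and the
 * owner field is not scrubbed, so a stale pointer still reads a
 * plausible value.
 */
static void
waiter_free(struct waiter *w)
{
    assert(w != NULL);
    w->next_free = free_list;
    free_list = w;
}

/* Stand-in for selwakeup(): report who a (possibly stale) pointer wakes. */
static int
wakeup_target(const struct waiter *w)
{
    return w->owner;
}
```

Before the slot is reused, a stale pointer still names the old
owner (a wakeup with no one waiting); after reuse, the same
pointer names whoever got the slot next (a wakeup delivered to an
unrelated thread).  Neither case crashes, which is why the patch
appears to work.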

Most code that sits in a select loop will only trigger if a bit
is set.  However, it's perfectly valid to assume that you won't
get spurious returns -- and to write code that *depends* on not
getting spurious returns.
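
As a rough illustration of the defensive style, the following
userland sketch treats a select() return as a hint only.  The
helper name read_when_ready() is made up for this example, and it
assumes the caller has already set O_NONBLOCK on the descriptor,
so that a spurious wakeup turns into EAGAIN rather than a blocked
read.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then try to read from it.
 *
 * Returns the number of bytes read, 0 if nothing was read (timeout,
 * spurious wakeup, interrupted call, or EOF), and -1 on a real error.
 * Assumes fd was put in non-blocking mode by the caller.
 */
static ssize_t
read_when_ready(int fd, void *buf, size_t len, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    ssize_t got;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return (errno == EINTR) ? 0 : -1;

    /* Only act if our bit is actually set. */
    if (n == 0 || !FD_ISSET(fd, &rfds))
        return 0;

    /* The bit is a hint: a bogus wakeup shows up here as EAGAIN. */
    got = read(fd, buf, len);
    if (got < 0 &&
        (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 0;
    return got;
}
```

Code written without the FD_ISSET() check and the EAGAIN handling,
on a blocking descriptor, is the kind that depends on never seeing
a spurious return.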

Since I've only been following this in -current by reading the
source code rather than running it, and by reading the patches
rather than applying them, this is just my initial reaction to
the patch.

So take the following with a grain of salt...

On the other hand: there is a *real* problem here; again, from
just reading the code, it looks like a pretty deep one having to
do with events being things which happen *on* descriptors, rather
than *to* processes (or threads).

I expect that the problem is that a thread has been terminated,
and it is the thread which opened a socket, and then did the
listen on it, but isn't around to do the accept, or receive the
connection event.

It's a deep problem because descriptors belong to processes, not
threads, and events belong to the descriptors, not to the callers;
before KSEs, it was OK to treat it as a commutative property.

I rather expect that there is a similar panic that will show up
during stress testing, which will occur at NETISR on incoming
connections, in the bottom half of the "accept" code, which has
a similar looking selwakeup() call.

Probably, the only way to fix this is to make it a process event
rather than a thread event, which would avoid the list removal
and subsequent dereference.  Kind of an ugly kludge.  8-(.

It would not surprise me if the kevent() resulting from signals
is near the heart of the signal problem, as well, and has a
parallel basis.

-- Terry




