Date: Tue, 23 Jul 2002 04:27:57 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Sheldon Hearn <sheldonh@starjuice.net>
Cc: Yann Berthier <yb@sainte-barbe.org>, current@freebsd.org
Subject: Re: Is it just me or has -current suddenly got massively unstable?
Message-ID: <3D3D3DBD.D5199F28@mindspring.com>
References: <20020722101211.GA442@hsc.fr> <20020723070704.7B4CB3925@overcee.wemm.org> <20020723100853.GA433@hsc.fr> <20020723102747.GR32782@starjuice.net>

Sheldon Hearn wrote:
> On (2002/07/23 12:08), Yann Berthier wrote:
>
> >    Thanks a lot, patch applied, and all is going fine. Peter: I knew you
> >    would come up with a solution :)
> >    (well, feel free to call it bandaid, but it solves the problem BTW)
>
> To quote Terry Lambert on what he calls Occam's Corollary:
>
>       Anything that works is better than anything that doesn't.
>
> :-)

Be really, really careful here.

The patch works because it changes the memory to be type stable, so
the code gets the previous values. If the structure has not been
reused, it signals a selwakeup() where there is no one waiting. If
the structure *has* been reused, then it issues a selwakeup() to a
potentially unrelated thread.

In most cases, this is a harmless event that's not even being checked
for; in other cases, it's being checked for, and it looks like a bogus
return.

Most code that sits in a select loop will only trigger if a bit is
set. However, it's a perfectly valid thing to think that you won't
get spurious returns -- and write code that *depends* on not getting
spurious returns.

Since I've only been following this vs. -current by reading, rather
than running, source code, and reading, rather than applying, patches,
this is just my initial reaction to the patch. So take the following
with a grain of salt...

On the other hand: there is a *real* problem here; again, from just
reading the code, it looks like a pretty deep one, having to do with
events being things which happen *on* descriptors, rather than *to*
processes (or threads).

I expect that the problem is that a thread has been terminated, and
it is the thread which opened a socket and then did the listen on it,
but it isn't around to do the accept or receive the connection event.

It's a deep problem because descriptors belong to processes, not
threads, and events belong to the descriptors, not to the callers;
before KSEs, it was OK to treat it as a commutative property.
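
To make the point above about spurious returns concrete, here is a minimal userland sketch of a select() loop that does *not* depend on wakeups being genuine. It is only an illustration, not code from the patch or from the tree: the helper names set_nonblocking() and read_when_ready() are hypothetical, and the sketch assumes the descriptor can be put into non-blocking mode, so that a wakeup with nothing behind it shows up as EAGAIN rather than as a blocked read().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: wait in select() until fd really has data (or EOF),
 * then read it. Returns bytes read, 0 on EOF, -1 on a real error.
 * A wakeup that turns out to have nothing behind it is simply retried.
 */
ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    assert(fd >= 0 && fd < FD_SETSIZE);   /* select() limit */
    if (set_nonblocking(fd) == -1)
        return -1;

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        int n = select(fd + 1, &rfds, NULL, NULL, NULL);
        if (n == -1) {
            if (errno == EINTR)
                continue;                 /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0 || !FD_ISSET(fd, &rfds))
            continue;                     /* woke up, but our bit is not set */

        ssize_t got = read(fd, buf, len);
        if (got == -1 &&
            (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
            continue;                     /* bit set, nothing there: spurious */
        return got;
    }
}
```

Code written this way treats a stray selwakeup() as harmless. Code that does a blocking read() as soon as select() returns, without the non-blocking mode and the retry, is the kind that depends on never seeing a spurious return.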

I rather expect that there is a similar panic that will show up
during stress testing, which will occur at NETISR on incoming
connections, in the bottom half of the "accept" code, which has a
similar looking selwakeup() call.

Probably, the only way to fix this is to make it a process event
rather than a thread event, which would avoid the list removal and
subsequent dereference. Kind of an ugly kludge. 8-(.

It would not surprise me if the kevent() resulting from signals is
near the heart of the signal problem, as well, and has a parallel
basis.

-- Terry