Date:      Wed, 31 May 2000 10:23:07 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        "David E. Cross" <crossd@cs.rpi.edu>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: PR #10971, not dead yet.
Message-ID:  <200005311723.KAA30252@apollo.backplane.com>
References:   <200005311706.NAA78305@cs.rpi.edu>

:We still have a problem with PR #10971 here running a -STABLE as of last
:week.  (Long since 10971 should have been dead).  It is a difficult problem
:to track down as stack corruption makes debugging files less than useless.
:I do, however, have a ktrace of an entire transaction that causes ypserv
:to die.  I am in the process of trying to track down why it is dying; it
:appears to be a bug in the rpc library itself.  Normally what happens is
:the following:
:...
:
:# parent is interrupted in its select() call, dispatches to signal handler for
:# SIGCHLD
:...
:
:I believe the problem code is "next to" the section of the code where it
:selects(), and then accepts() if it is a TCP connection... but I cannot find
:where this code is.  A grep for 'accept' in both the ypserv and rpc code
:returns no useful matches.  Also, it would certainly appear that there
:is another select loop besides the one in the canonical ypserver.
:
:Below is the dying moments for the parent process as reported by ktrace,
:ideas?

    If you can reproduce the problem regularly then I recommend putting
    a signal guard in to see if the corruption is being caused by the
    signal interrupting at an inauspicious moment.

    In main(), block SIGHUP, SIGINT, SIGTERM, and SIGCHLD using sigsetmask().

    Just prior to the select call, unblock the signals.

    Just after the select call, reblock the signals.

    And see if the corruption still occurs.  If this fixes the problem, 
    then there is probably something in the reaper() (in yp_main.c) 
    that is causing corruption, probably by ripping a structure out from
    under whatever piece of code the signal happens to interrupt.

    I took a quick look at the code and as far as I can tell it implements
    no guards whatsoever.  The inetd code had similar problems in the past.

:Oh, this is true of all systems, not just 4.0-STABLE.  I was hoping the move
:to 4.0 might solve the problem, so I wasn't actively trying to debug it before.
:--
:David Cross                               | email: crossd@cs.rpi.edu 

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

