Date: Wed, 31 May 2000 17:51:38 -0400
From: "David E. Cross" <crossd@cs.rpi.edu>
To: Guy Helmer <ghelmer@cs.iastate.edu>
Cc: "David E. Cross" <crossd@cs.rpi.edu>, Matthew Dillon <dillon@apollo.backplane.com>, freebsd-hackers@FreeBSD.ORG, crossd@cs.rpi.edu
Subject: Re: PR #10971, not dead yet.
Message-ID: <200005312151.RAA86135@cs.rpi.edu>
In-Reply-To: Message from Guy Helmer <ghelmer@cs.iastate.edu> of "Wed, 31 May 2000 15:12:34 CDT." <Pine.HPX.4.05.10005311509440.9820-100000@popeye.cs.iastate.edu>
> > Alas, this is not something I have been able to reliably reproduce, it seems
> > to trigger itself every so often (and at inconvenient times).  But no
> > matter what I do by myself it will not trip.
>
> Is it possibly related to a low-memory situation?  I'm trying to solve a
> problem in cron that sounds similar, and seems to be triggered when the
> machine goes into swapping.  I'm unable to duplicate it myself :-(
>
> Guy
>
> Guy Helmer, Ph.D. Candidate, Iowa State University   Dept. of Computer Science
> Research Assistant, Dept. of Computer Science   ---   ghelmer@cs.iastate.edu
> http://www.cs.iastate.edu/~ghelmer

This does not appear to be memory related at all.  In fact, I *think* I just
found it... (bear with me y'all)

In the case of a TCP connect that requests a yp_all transfer we fork() off,
and then try to do a very good job of not allowing the child to handle any
requests other than the yp_all.  However, the following comment from
readtcp() suggests that it does an end run around us:

/*
 * reads data from the tcp connection.
 * any error is fatal and the connection is closed.
 * (And a read of zero bytes is a half closed stream => error.)
 *
 * Note: we have to be careful here not to allow ourselves to become
 * blocked too long in this routine. While we're waiting for data from one
 * client, another client may be trying to connect. To avoid this situation,
 * some code from svc_run() is transplanted here: the select() loop checks
 * all RPC descriptors including the one we want and calls svc_getreqset2()
 * to handle new requests if any are detected.
 */

This is the code that I noted gets run sometimes instead of the main select
loop.  Would it be a good idea to close not only all of the DB fds, but also
all network fds, except for the one carrying the request it is specifically
being asked to handle, in the child ypserv?  Would it be as easy as stepping
through the fd_set and closing anything that != the designated connection?
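For concreteness, the "step through the fd_set" idea could look roughly like the sketch below. This is only an illustration, not ypserv code: close_all_but() and its parameters are hypothetical names, and it assumes the caller passes in whatever set the RPC library is tracking (presumably svc_fdset) plus the descriptor of the connection the child was forked to serve.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ypserv or the RPC library):
 * close every descriptor that is set in *fds, except keep_fd, and
 * clear it from the set.  nfds is the highest descriptor number + 1.
 * Returns the number of descriptors actually closed.
 */
static int
close_all_but(fd_set *fds, int nfds, int keep_fd)
{
	int fd, closed = 0;

	assert(fds != NULL && nfds >= 0 && nfds <= FD_SETSIZE);

	for (fd = 0; fd < nfds; fd++) {
		if (fd == keep_fd || !FD_ISSET(fd, fds))
			continue;		/* leave the designated connection alone */
		if (close(fd) == 0)
			closed++;
		FD_CLR(fd, fds);		/* the child should no longer select() on it */
	}
	return (closed);
}
```

The child would call something like this right after fork() returns 0, before it starts the yp_all transfer, so that any select() loop it later enters only sees its own connection.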
I am still not sure this is the cause, as all of the database FDs should
already be closed, so even if a child did answer the request it shouldn't
cause trouble for the parent (and I do not see any evidence in the ktrace
output that the child is responding outside of its yp_all request).

Indeed, I have just verified this is the code that causes the segfault in
this case (as indicated by the tell-tale gettimeofday calls that I could not
previously track).  I still have no idea what is causing the trouble though.
Especially confusing is the following sequence of events:

 41096 ypserv   CALL  fork
 41096 ypserv   RET   fork 62356/0xf394
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  select(0x10,0x8051040,0,0,0xbfbff518)
 41096 ypserv   PSIG  SIGCHLD caught handler=0x804c75c mask=0x0 code=0x0
 41096 ypserv   RET   select -1 errno 4 Interrupted system call
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 62356/0xf394
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 -1 errno 10 No child processes
 41096 ypserv   CALL  sigreturn(0xbfbff328)
 41096 ypserv   RET   sigreturn JUSTRETURN
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  read(0x1c,0x80f3fa0,0xfa0)
 41096 ypserv   GIO   fd 28 read 4000 bytes

Note that the select returned -1 with errno set to 4 (EINTR), and the code
did not re-enter the select loop, but just started to read data.

Also note that the 'CALL/RET fork' is followed directly by a gettimeofday().
This says that, since readtcp() is acting as a bit of svc_run(), *it*
dispatched to the yp_all() handler, which then forked there, without the
special handling that is done in the normal yp_svc_run().

Does this give anyone else any ideas?  This is proving to be a very slow
battle.
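To make the "did not re-enter the select loop" point concrete, here is a minimal sketch of a wait that does go back into select() after EINTR and only reports the descriptor as readable when select() itself said so. It is not the readtcp() code; wait_readable() is a hypothetical name, and for simplicity it restarts the full timeout after each signal instead of subtracting the time already spent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd is readable.
 * Returns 1 if readable, 0 on timeout, -1 on a real error.
 */
static int
wait_readable(int fd, struct timeval timeout)
{
	fd_set readfds;
	struct timeval tv;
	int n;

	assert(fd >= 0 && fd < FD_SETSIZE);

	for (;;) {
		/* Rebuild the set every pass; never trust it after a failure. */
		FD_ZERO(&readfds);
		FD_SET(fd, &readfds);
		tv = timeout;	/* simplification: timeout restarts per pass */

		n = select(fd + 1, &readfds, NULL, NULL, &tv);
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* e.g. SIGCHLD: go back into select() */
			return (-1);
		}
		if (n == 0)
			return (0);
		if (FD_ISSET(fd, &readfds))
			return (1);
	}
}
```

One general C detail that may be worth checking against the real loop: in a do { ... } while (cond) loop, continue jumps to the test of cond, and after a failed select() the descriptor sets are left as the caller filled them in, so a condition of the form !FD_ISSET(fd, &readfds) would see the bit still set and fall out of the loop. I have not confirmed that this is what happens here; it is only a pattern that would match the trace.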
--
David Cross                               | email: crossd@cs.rpi.edu
Lab Director                              | Rm: 308 Lally Hall
Rensselaer Polytechnic Institute,         | Ph: 518.276.2860
Department of Computer Science            | Fax: 518.276.4033
I speak only for myself.                  | WinNT:Linux::Linux:FreeBSD

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message