From owner-freebsd-hackers Wed May 31 10: 6:54 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from cs.rpi.edu (mumble.cs.rpi.edu [128.213.8.16])
    by hub.freebsd.org (Postfix) with ESMTP id 2431037BE0B
    for ; Wed, 31 May 2000 10:06:43 -0700 (PDT)
    (envelope-from crossd@cs.rpi.edu)
Received: from cs.rpi.edu (phoenix.cs.rpi.edu [128.113.96.153])
    by cs.rpi.edu (8.9.3/8.9.3) with ESMTP id NAA78305
    for ; Wed, 31 May 2000 13:06:40 -0400 (EDT)
Message-Id: <200005311706.NAA78305@cs.rpi.edu>
To: freebsd-hackers@freebsd.org
Subject: PR #10971, not dead yet.
Date: Wed, 31 May 2000 13:06:19 -0400
From: "David E. Cross"
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

We still have a problem with PR #10971 here, running a -STABLE as of
last week. (10971 should have been dead long since.) It is a difficult
problem to track down, as stack corruption makes debugging files less
than useless. I do, however, have a ktrace of an entire transaction
that causes ypserv to die. I am in the process of trying to track down
why it is dying; it appears to be a bug in the rpc library itself.

Normally what happens is the following:

# TCP request comes in, accept().
# yp_all request issued, parent forks.
# child handles request, quits.
# parent is interrupted in its select() call, dispatches to signal
#   handler for SIGCHLD
# handler returns.
# parent issues a read?!? (this is odd, since it doesn't re-enter the
#   select loop as the code I have read suggests it should).
# read fails (0 bytes returned)
# it does that a couple of times (probably falling out of loops), and
#   FD is closed
# ypserv re-enters the select loop

Under a failure condition the following happens:

# Upon child return, parent reads from a DB file to a nonexistent
#   buffer.
# parent seg-faults.

I believe the problem code is "next to" the section of the code where
it select()s, and then accept()s if it is a TCP connection... but I
cannot find where this code is.
A grep for 'accept' in both the ypserv and rpc code returns no useful
matches. Also, it certainly appears that there is another select loop
besides the one in the canonical ypserv. Below are the dying moments of
the parent process as reported by ktrace. Ideas?

 41096 ypserv   CALL  fork
 41096 ypserv   RET   fork 62356/0xf394
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  select(0x10,0x8051040,0,0,0xbfbff518)
 41096 ypserv   PSIG  SIGCHLD caught handler=0x804c75c mask=0x0 code=0x0
 41096 ypserv   RET   select -1 errno 4 Interrupted system call
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 62356/0xf394
 41096 ypserv   CALL  wait4(0xffffffff,0xbfbff308,0x1,0)
 41096 ypserv   RET   wait4 -1 errno 10 No child processes
 41096 ypserv   CALL  sigreturn(0xbfbff328)
 41096 ypserv   RET   sigreturn JUSTRETURN
 41096 ypserv   CALL  gettimeofday(0xbfbff510,0)
 41096 ypserv   RET   gettimeofday 0
 41096 ypserv   CALL  read(0x1c,0x80f3fa0,0xfa0)
 41096 ypserv   GIO   fd 28 read 4000 bytes
 41096 ypserv   RET   read 4000/0xfa0
 41096 ypserv   PSIG  SIGSEGV SIG_DFL
 41096 ypserv   NAMI  "ypserv.core"

Oh, this is true of all systems, not just 4.0-STABLE. I was hoping the
move to 4.0 might solve the problem, so I wasn't actively trying to
debug it before.

--
David Cross                               | email: crossd@cs.rpi.edu
Lab Director                              | Rm: 308 Lally Hall
Rensselaer Polytechnic Institute,         | Ph: 518.276.2860
Department of Computer Science            | Fax: 518.276.4033
I speak only for myself.                  | WinNT:Linux::Linux:FreeBSD

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message