From owner-freebsd-current Wed Nov  6 20:56:55 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3)
	id UAA27207 for current-outgoing; Wed, 6 Nov 1996 20:56:55 -0800 (PST)
Received: from skynet.ctr.columbia.edu (skynet.ctr.columbia.edu [128.59.64.70])
	by freefall.freebsd.org (8.7.5/8.7.3) with SMTP
	id UAA27194; Wed, 6 Nov 1996 20:56:43 -0800 (PST)
Received: (from wpaul@localhost) by skynet.ctr.columbia.edu (8.6.12/8.6.9)
	id XAA06495; Wed, 6 Nov 1996 23:56:18 -0500
From: Bill Paul
Message-Id: <199611070456.XAA06495@skynet.ctr.columbia.edu>
Subject: Re: yp_next failure
To: asami@freebsd.org (Satoshi Asami)
Date: Wed, 6 Nov 1996 23:56:17 -0500 (EST)
Cc: current@freebsd.org
In-Reply-To: <199611070310.TAA24792@silvia.HIP.Berkeley.EDU> from "Satoshi Asami" at Nov 6, 96 07:10:23 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Of all the gin joints in all the towns in all the world, Satoshi Asami
had to walk into mine and say:

> I noticed that a lot of user processes started dying due to high
> network load. For instance, I get things like this:
>
> ===
> >> rlogin vader
> Last login: Wed Nov 6 18:41:44 from 136.152.64.181
> yp_next: clnt_call: RPC: Timed out
>
> rlogin: connection closed.
> ===
>
> and this is in /var/log/messages:
>
> ===
> Nov 6 18:43:16 vader /kernel: pid 22470 (login), uid 0: exited on signal 11
> ===
>
> Other processes that died today were (this is just a small part of
> the log):
>
> ===
> Nov 6 15:40:31 vader /kernel: pid 852 (sendmail), uid 0: exited on signal 11
> Nov 6 15:40:48 vader /kernel: pid 856 (from), uid 5531: exited on signal 11
> Nov 6 15:41:05 vader /kernel: pid 863 (ssh), uid 0: exited on signal 11
> Nov 6 15:48:26 vader /kernel: pid 994 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:39 vader /kernel: pid 1019 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:40 vader /kernel: pid 1021 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1029 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1025 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1027 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1026 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1024 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:49 vader /kernel: pid 1017 (xterm), uid 0: exited on signal 11
> Nov 6 15:51:27 vader /kernel: pid 1436 (ssh), uid 0: exited on signal 11
> Nov 6 15:53:41 vader /kernel: pid 1653 (xterm), uid 0: exited on signal 11
> Nov 6 15:55:26 vader /kernel: pid 1791 (cron), uid 0: exited on signal 11 (core dumped)
> Nov 6 16:03:42 vader /kernel: pid 1888 (mailq), uid 0: exited on signal 11
> Nov 6 16:55:11 vader /kernel: pid 3804 (cron), uid 0: exited on signal 11 (core dumped)
> Nov 6 16:55:41 vader /kernel: pid 3805 (sendmail), uid 0: exited on signal 11
> ===
>
> Are these many programs supposed to die when a YP lookup has failed?

How do you know that it really failed? If you look at yplib.c, the
yp_next() function retries connections until it succeeds, or until
_yp_dobind() fails (and if _yp_dobind() fails, you're likely to get
another error message). I'm a little leery of generating error messages
from inside the NIS library code in the first place -- I never see the
Sun code do it -- but I'm not sure if this could really cause the
problem.

If an NIS call fails, then the higher-level libc function that called
it should interpret the failure as an end of file. So if it barfs in
the middle of a getpwent() sequence, for instance, it would look as
though you reached the end of the passwd database a little early (see
the first sketch below). How the application reacts in this case sort
of depends on... well, on the application.

One thing that might cause a problem is a bug in the file descriptor
handling in _yp_dobind(). In order to save on some syscall and RPC
overhead, I made the code create an RPC client handle just once when a
binding is set up, rather than setting it up and tearing it down every
time an NIS function is called. The problem with doing this is that
_yp_dobind() can get confused if the following sequence happens:

- An NIS function is called, either directly or as part of another
  libc function like getpwent(3) or getgrent(3). As part of this call,
  _yp_dobind() does a clnt_create(3) and gets a socket descriptor,
  which it caches.

- After the call, the application decides to close all of its file
  descriptors by doing something like:

	for (i = 0; i < 256; i++)
		close(i);

- The application then starts using descriptors on its own, say for
  opening files, and it ends up reusing the descriptor number that
  _yp_dobind() had previously gotten for its socket.

- An NIS function is called again. Now the RPC socket descriptor is a
  file descriptor, and hijinks ensue.

To avoid this problem, _yp_dobind() actually binds its end of the
socket descriptor, which associates a port number with it. This port
number is then saved as part of the dom_binding structure; when
_yp_dobind() is called again later, it can do a getsockname() on the
socket and check that the port number matches the one that was saved.
(And it also tests that getsockname() succeeds at all; if it barfs,
then the socket may have been closed or turned into Something
Completely Different (tm).) The second sketch below shows the idea.
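To make that end-of-file behavior concrete, here is a minimal sketch
(my own illustration, not actual libc or yplib.c code) of an ordinary
getpwent() enumeration. If yp_next() times out halfway through the
map, the loop below simply ends early; the caller gets no error
indication at all:

	/*
	 * Sketch only: enumerate the passwd database. A mid-stream
	 * NIS failure is indistinguishable from the real end of the
	 * map, since getpwent() returns NULL in both cases.
	 */
	#include <stdio.h>
	#include <pwd.h>

	int
	main(void)
	{
		struct passwd *pw;
		int count = 0;

		setpwent();
		while ((pw = getpwent()) != NULL) {
			printf("%s:%d\n", pw->pw_name, (int)pw->pw_uid);
			count++;
		}
		endpwent();

		/* If the RPC call timed out, 'count' is just smaller
		   than it should be -- no errno, no warning. */
		printf("%d entries\n", count);
		return (0);
	}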
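And here is a rough sketch of the bind-then-getsockname() check just
described. Again, this is my own illustration, not the real
_yp_dobind() code: the 'ypbinding' structure and function names are
made up, and the real code gets its socket from clnt_create() rather
than calling socket() itself. The idea is just to remember the local
port assigned at bind time and verify it before trusting the cached
descriptor again:

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <unistd.h>

	struct ypbinding {		/* stand-in for dom_binding */
		int	dom_sock;	/* cached RPC socket descriptor */
		u_short	dom_lport;	/* local port saved at bind time */
	};

	/* Create and bind the socket, saving the kernel-assigned port. */
	static int
	binding_setup(struct ypbinding *b)
	{
		struct sockaddr_in sin;
		socklen_t len = sizeof(sin);

		if ((b->dom_sock = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
			return (-1);
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;	/* port 0: kernel picks one */
		if (bind(b->dom_sock, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			return (-1);
		if (getsockname(b->dom_sock, (struct sockaddr *)&sin, &len) < 0)
			return (-1);
		b->dom_lport = sin.sin_port;	/* remember for later checks */
		return (0);
	}

	/* Return 1 if the cached descriptor still looks like our socket. */
	static int
	binding_still_valid(struct ypbinding *b)
	{
		struct sockaddr_in sin;
		socklen_t len = sizeof(sin);

		if (getsockname(b->dom_sock, (struct sockaddr *)&sin, &len) < 0)
			return (0);	/* closed, or not a socket anymore */
		if (sin.sin_family != AF_INET || sin.sin_port != b->dom_lport)
			return (0);	/* descriptor reused by the application */
		return (1);
	}

If binding_still_valid() returned 0, the library would then allocate a
fresh socket instead of stomping on whatever the application put in
that descriptor slot.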
If it discovers that the socket descriptor is invalid, _yp_dobind()
very carefully tries to allocate a new descriptor without bothering
the one that was changed.

So there are two things I can think of that might be causing this
problem:

1) Somehow I goofed up the part of _yp_dobind() that replaces
   invalidated socket descriptors. (Note that this only applies to
   2.2.x.)

2) When the yplib code generates error messages to what it thinks is
   stderr, it's actually generating messages somewhere else and hosing
   applications somehow.

> The network has been congested lately, but I haven't seen this kind
> of mass suicide until just recently.

Unfortunately, I haven't run into this sort of thing much myself.
Without being able to reliably duplicate the problem, I can't easily
debug it.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "If you're ever in trouble, go to the CTR. Ask for Bill. He will help you."
=============================================================================