From owner-freebsd-current Wed Nov  6 20:56:55 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3)
	id UAA27207 for current-outgoing; Wed, 6 Nov 1996 20:56:55 -0800 (PST)
Received: from skynet.ctr.columbia.edu (skynet.ctr.columbia.edu [128.59.64.70])
	by freefall.freebsd.org (8.7.5/8.7.3) with SMTP
	id UAA27194; Wed, 6 Nov 1996 20:56:43 -0800 (PST)
Received: (from wpaul@localhost) by skynet.ctr.columbia.edu (8.6.12/8.6.9)
	id XAA06495; Wed, 6 Nov 1996 23:56:18 -0500
From: Bill Paul
Message-Id: <199611070456.XAA06495@skynet.ctr.columbia.edu>
Subject: Re: yp_next failure
To: asami@freebsd.org (Satoshi Asami)
Date: Wed, 6 Nov 1996 23:56:17 -0500 (EST)
Cc: current@freebsd.org
In-Reply-To: <199611070310.TAA24792@silvia.HIP.Berkeley.EDU> from "Satoshi Asami" at Nov 6, 96 07:10:23 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Of all the gin joints in all the towns in all the world, Satoshi Asami
had to walk into mine and say:

> I noticed that a lot of user processes started dying due to high
> network load. For instance, I get things like this:
>
> ===
> >> rlogin vader
> Last login: Wed Nov 6 18:41:44 from 136.152.64.181
> yp_next: clnt_call: RPC: Timed out
>
> rlogin: connection closed.
> ===
>
> and this is in /var/log/messages:
>
> ===
> Nov 6 18:43:16 vader /kernel: pid 22470 (login), uid 0: exited on signal 11
> ===
>
> Other processes that died today were (this is just a small part of
> the log):
>
> ===
> Nov 6 15:40:31 vader /kernel: pid 852 (sendmail), uid 0: exited on signal 11
> Nov 6 15:40:48 vader /kernel: pid 856 (from), uid 5531: exited on signal 11
> Nov 6 15:41:05 vader /kernel: pid 863 (ssh), uid 0: exited on signal 11
> Nov 6 15:48:26 vader /kernel: pid 994 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:39 vader /kernel: pid 1019 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:40 vader /kernel: pid 1021 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1029 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1025 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1027 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1026 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:41 vader /kernel: pid 1024 (xterm), uid 0: exited on signal 11
> Nov 6 15:48:49 vader /kernel: pid 1017 (xterm), uid 0: exited on signal 11
> Nov 6 15:51:27 vader /kernel: pid 1436 (ssh), uid 0: exited on signal 11
> Nov 6 15:53:41 vader /kernel: pid 1653 (xterm), uid 0: exited on signal 11
> Nov 6 15:55:26 vader /kernel: pid 1791 (cron), uid 0: exited on signal 11 (core dumped)
> Nov 6 16:03:42 vader /kernel: pid 1888 (mailq), uid 0: exited on signal 11
> Nov 6 16:55:11 vader /kernel: pid 3804 (cron), uid 0: exited on signal 11 (core dumped)
> Nov 6 16:55:41 vader /kernel: pid 3805 (sendmail), uid 0: exited on signal 11
> ===
>
> Are these many programs supposed to die when a YP lookup has failed?

How do you know that it really failed? If you look at yplib.c, the
yp_next() function retries connections until it succeeds, or until
_yp_dobind() fails (and if _yp_dobind() fails, you're likely to get
another error message). I'm a little leery of generating error messages
from inside the NIS library code in the first place -- I never see the
Sun code do it -- but I'm not sure if this could really cause the
problem.

If an NIS call fails, then the higher-level libc function that called
it should interpret the failure as an end of file. So if it barfs in
the middle of a getpwent() sequence, for instance, it would look as
though you reached the end of the passwd database a little early (see
the first sketch below). How the application reacts in this case sort
of depends on... well, on the application.

One thing that might cause a problem is a bug in the file descriptor
handling in _yp_dobind(). In order to save on some syscall and RPC
overhead, I made the code create an RPC client handle just once when a
binding is set up, rather than setting it up and tearing it down every
time an NIS function is called. The problem with doing this is that
_yp_dobind() can get confused if the following sequence happens:

- An NIS function is called, either directly or as part of another
  libc function like getpwent(3) or getgrent(3). As part of this call,
  _yp_dobind() does a clnt_create(3) and gets a socket descriptor,
  which it caches.

- After the call, the application decides to close all of its file
  descriptors by doing something like:

	for (i = 0; i < 256; i++)
		close(i);

- The application then starts using descriptors on its own, say for
  opening files, and it ends up reusing the descriptor number that
  _yp_dobind() had previously gotten for its socket.

- An NIS function is called again. Now the RPC socket descriptor is a
  file descriptor, and hijinks ensue.

To avoid this problem, _yp_dobind() actually binds its end of the
socket descriptor, which associates a port number with it. This port
number is then saved as part of the dom_binding structure; when
_yp_dobind() is called again later, it can do a getsockname() on the
socket and check that the port number matches the one that was saved.
(And it also tests that getsockname() succeeds at all; if it barfs,
then the socket may have been closed or turned into Something
Completely Different (tm).) The second sketch below shows the idea.
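To make that end-of-file behavior concrete, here is a minimal sketch
(my own illustration, not actual libc or yplib.c code) of an ordinary
getpwent() enumeration. If yp_next() times out halfway through the
map, the loop below simply ends early; the caller gets no error
indication at all:

	/*
	 * Sketch only: enumerate the passwd database. A mid-stream
	 * NIS failure is indistinguishable from the real end of the
	 * map, since getpwent() returns NULL in both cases.
	 */
	#include <stdio.h>
	#include <pwd.h>

	int
	main(void)
	{
		struct passwd *pw;
		int count = 0;

		setpwent();
		while ((pw = getpwent()) != NULL) {
			printf("%s:%d\n", pw->pw_name, (int)pw->pw_uid);
			count++;
		}
		endpwent();

		/* If the RPC call timed out, 'count' is just smaller
		   than it should be -- no errno, no warning. */
		printf("%d entries\n", count);
		return (0);
	}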
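And here is a rough sketch of the bind-then-getsockname() check just
described. Again, this is my own illustration, not the real
_yp_dobind() code: the 'ypbinding' structure and function names are
made up, and the real code gets its socket from clnt_create() rather
than calling socket() itself. The idea is just to remember the local
port assigned at bind time and verify it before trusting the cached
descriptor again:

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <unistd.h>

	struct ypbinding {		/* stand-in for dom_binding */
		int	dom_sock;	/* cached RPC socket descriptor */
		u_short	dom_lport;	/* local port saved at bind time */
	};

	/* Create and bind the socket, saving the kernel-assigned port. */
	static int
	binding_setup(struct ypbinding *b)
	{
		struct sockaddr_in sin;
		socklen_t len = sizeof(sin);

		if ((b->dom_sock = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
			return (-1);
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;	/* port 0: kernel picks one */
		if (bind(b->dom_sock, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			return (-1);
		if (getsockname(b->dom_sock, (struct sockaddr *)&sin, &len) < 0)
			return (-1);
		b->dom_lport = sin.sin_port;	/* remember for later checks */
		return (0);
	}

	/* Return 1 if the cached descriptor still looks like our socket. */
	static int
	binding_still_valid(struct ypbinding *b)
	{
		struct sockaddr_in sin;
		socklen_t len = sizeof(sin);

		if (getsockname(b->dom_sock, (struct sockaddr *)&sin, &len) < 0)
			return (0);	/* closed, or not a socket anymore */
		if (sin.sin_family != AF_INET || sin.sin_port != b->dom_lport)
			return (0);	/* descriptor reused by the application */
		return (1);
	}

If binding_still_valid() returned 0, the library would then allocate a
fresh socket instead of stomping on whatever the application put in
that descriptor slot.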
If it discovers that the socket descriptor is invalid, _yp_dobind()
very carefully tries to allocate a new descriptor without bothering
the one that was changed.

So there are two things I can think of that might be causing this
problem:

1) Somehow I goofed up the part of _yp_dobind() that replaces
   invalidated socket descriptors. (Note that this only applies to
   2.2.x.)

2) When the yplib code generates error messages to what it thinks is
   stderr, it's actually generating messages somewhere else and hosing
   applications somehow.

> The network has been congested lately, but I haven't seen this kind
> of mass suicide until just recently.

Unfortunately, I haven't run into this sort of thing much myself.
Without being able to reliably duplicate the problem, I can't easily
debug it.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "If you're ever in trouble, go to the CTR. Ask for Bill. He will help you."
=============================================================================