From: Brian Tao
To: FREEBSD-HACKERS-L
cc: FREEBSD-ISP-L
Date: Sat, 6 Jan 1996 15:07:19 -0500 (EST)
Subject: A few other concerns from a FreeBSD ISP

I am starting to phase in FreeBSD boxes in favour of BSD/OS systems at Internex Online, the ISP that employs me. The first machines are being used as customer login servers, so they are running your typical mix of Internet client software. Dialup access is provided via Livingston PM-2e terminal servers. There are a few items that concern me, none of which were brought up in the "ISP's state their FreeBSD concerns" thread from a couple of months back.

For reference, the machines are Intel P133's on ASUS P/I-P55TP4XEG (note the extra 'G') motherboards with 512K pipeline burst cache, 4x32MB 60ns FPM SIMM's, a generic PCI VGA card, an SMC 9332 EtherPower 10/100Mbps Ethernet NIC, an NCR53c810 SCSI controller and a Quantum Fireball 1080S hard drive. 2.1.0-RELEASE is installed on all of them.

1. Rlogin problem

The most troubling is an rlogin bug that has been around at least since January 1995. On seemingly random occasions, an rlogin to the FreeBSD host will fail. After the rlogin command is issued on the other system, there is a period of inactivity that lasts about one minute. Then I get a "Connection refused" error.

I've had this problem since 2.0-RELEASE, whether the other system is running FreeBSD, BSD/OS, NetBSD, SunOS, AIX or IRIX. I have tcp_wrappers installed on the FreeBSD machine. When an rlogin fails, no connection is registered by tcpd and rlogind on the destination host doesn't even start. Running inetd in debug mode indicates that not even inetd is aware there is a connection attempt on port 513.

Running netstat around the time of the rlogin attempt suggests that the rlogin hang may have to do with the kernel assigning the connection a port number that is still currently in TIME_WAIT from a previous rlogin. Once the TIME_WAIT status is cleared, the rlogin is completed. If this is the problem, would it be possible to get the kernel to use incremental port numbers instead of trying to "reuse" old ones? The TIME_WAIT's hang around for a few seconds after a connection is lost, and this becomes an issue when you have login/logout events once every few seconds.

Our terminal servers use the rlogin service to transfer a user to a FreeBSD machine. If that connection times out, the line is dropped and they need to redial. Needless to say, this won't score any points with customers. I haven't noticed this problem with BSD/OS 2.0 yet, nor any other flavour of UNIX I've used. I haven't used NetBSD systems enough to know if they have the same problem in their socket code.

Has anyone else seen this behaviour with 2.1.0-RELEASE (or with earlier or later kernels)? Better yet, does anyone have a solution?

2. Temporary loss of NFS service

FreeBSD's NFS client code seems to be very sensitive to an unresponsive server. If our NFS server (a P100 BSD/OS 2.0 machine) needs to be taken offline, clients of that server will naturally get a lot of processes hanging in disk wait. The problem is that FreeBSD clients do not seem to ever recover from that state, while the BSD/OS clients take a few minutes to realize NFS is once again available, and continue on their merry way.
The only way out of this is to reboot the FreeBSD machines (again, not scoring any points with the paying customers who were online).

I am running "nfsiod -n 4" on the clients, and "nfsd -t -u -n 6" on the BSD/OS server. About 24 gigabytes of disk over 7 filesystems are exported to the clients. Is there any way to "kickstart" processes on the client so they know that the NFS server is alive again? Or is there a tunable parameter in the kernel source that decreases the timeout or increases the frequency of retrying the NFS server?

3. Unrecoverable "mb_map full" condition

I've noticed that once the kernel reports "mb_map full", networking is completely hosed. Is it possible for the kernel to release unused mbufs into a free pool of some sort instead of forcing me to reboot the machine? I've had this happen even with NMBCLUSTERS=2048, but I haven't seen it yet with 4096. 'netstat -m' typically reports:

    226 mbufs in use:
        100 mbufs allocated to data
        81 mbufs allocated to packet headers
        33 mbufs allocated to protocol control blocks
        12 mbufs allocated to socket names and addresses
    81/454 mbuf clusters in use
    936 Kbytes allocated to network (20% in use)
    0 requests for memory denied
    0 requests for memory delayed
    0 calls to protocol drain routines

In the "mbuf clusters in use" line, is 81 the current number allocated, and 454 the high-water mark? Those numbers aren't anywhere near 2048, let alone 4096. Does the NMBCLUSTERS option in the kernel config refer to some other number?

***

The last two problems aren't immediate concerns, but the first one will cause headaches once I switch the Livingstons to use the FreeBSD machines as login hosts instead of the BSD/OS machines. Any insight or advice appreciated. Thanks.

--
Brian Tao (BT300, taob@io.org)
Systems Administrator, Internex Online Inc.
"Though this be madness, yet there is method in't"
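
On the 'netstat -m' question in item 3, the quoted figures can be cross-checked with a little arithmetic. The sketch below is only that, a sketch: it assumes 128-byte mbufs and 2048-byte mbuf clusters (the MSIZE and MCLBYTES constants here are assumptions, not values taken from the report), and it assumes the "Kbytes allocated" line is computed from the mbuf count plus the second cluster number. The function names are made up for illustration.

```python
import re

# Assumed sizes; neither value appears in the netstat output itself.
MSIZE = 128       # assumed bytes per mbuf
MCLBYTES = 2048   # assumed bytes per mbuf cluster

# The lines of the quoted 'netstat -m' report that the check needs.
SAMPLE = """\
226 mbufs in use:
81/454 mbuf clusters in use
936 Kbytes allocated to network (20% in use)
"""

def parse_netstat_m(text):
    """Return (mbufs, clusters_in_use, clusters_total) from the report."""
    mbufs = int(re.search(r"(\d+) mbufs in use", text).group(1))
    m = re.search(r"(\d+)/(\d+) mbuf clusters in use", text)
    return mbufs, int(m.group(1)), int(m.group(2))

def memory_estimate(mbufs, cl_used, cl_total):
    """Return (Kbytes allocated, percent in use) under the assumed sizes."""
    total = mbufs * MSIZE + cl_total * MCLBYTES
    used = mbufs * MSIZE + cl_used * MCLBYTES
    return total // 1024, used * 100 // total

mbufs, cl_used, cl_total = parse_netstat_m(SAMPLE)
kbytes, pct = memory_estimate(mbufs, cl_used, cl_total)
print(f"{kbytes} Kbytes allocated to network ({pct}% in use)")
```

Under those assumptions the computed line matches the reported "936 Kbytes allocated to network (20% in use)". That would be consistent with 81 being the clusters currently in use and 454 being the clusters currently allocated, rather than a figure tied directly to NMBCLUSTERS, but it is a consistency check on one sample, not a confirmation of how the kernel counts.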