From owner-freebsd-hackers  Sat Jan  6 12:09:20 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id MAA13294
          for hackers-outgoing; Sat, 6 Jan 1996 12:09:20 -0800 (PST)
Received: from cabal.io.org (cabal.io.org [198.133.36.103])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id MAA13271
          Sat, 6 Jan 1996 12:09:04 -0800 (PST)
Received: (from taob@localhost) by cabal.io.org (8.6.12/8.6.12) id PAA00784; Sat, 6 Jan 1996 15:07:19 -0500
Date: Sat, 6 Jan 1996 15:07:19 -0500 (EST)
From: Brian Tao <taob@io.org>
To: FREEBSD-HACKERS-L <freebsd-hackers@freebsd.org>
cc: FREEBSD-ISP-L <freebsd-isp@freebsd.org>
Subject: A few other concerns from a FreeBSD ISP
Message-ID: <Pine.BSF.3.91.960106150533.209B-100000@cabal.io.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@freebsd.org
Precedence: bulk

    I am starting to phase in FreeBSD boxes in favour of BSD/OS
systems at Internex Online, the ISP that employs me.  The first
machines are being used as customer login servers, so they are running
your typical mix of Internet client software.  Dialup access is
provided via Livingston PM-2e terminal servers.

    There are a few items that concern me, none of which were brought
up in the "ISP's state their FreeBSD concerns" thread from a couple of
months back.  For reference, the machines are Intel P133s on ASUS
P/I-P55TP4XEG (note the extra 'G') motherboards with 512K pipeline
burst cache, 4x32MB 60ns FPM SIMMs, a generic PCI VGA card, an SMC
9332 EtherPower 10/100Mbps Ethernet NIC, an NCR53c810 SCSI controller
and a Quantum Fireball 1080S hard drive.  2.1.0-RELEASE is installed
on all of them.


1. Rlogin problem

    The most troubling is an rlogin bug that has been around at least
since January 1995.  On seemingly random occasions, an rlogin to the
FreeBSD host will fail.  After the rlogin command is issued on the
other system, there is a period of inactivity that lasts about one
minute.  Then I get a "Connection refused" error.  I've had this
problem since 2.0-RELEASE, whether the other system is running
FreeBSD, BSD/OS, NetBSD, SunOS, AIX or IRIX.

    I have tcp_wrappers installed on the FreeBSD machine.  When an
rlogin fails, no connection is registered by tcpd and rlogind on the
destination host doesn't even start.  Running inetd in debug mode
indicates that not even inetd is aware there is a connection attempt
on port 513.

    Running netstat around the time of the rlogin attempt suggests
that the rlogin hang may have to do with the kernel assigning the
connection a port number that is still currently in TIME_WAIT from a
previous rlogin.  Once the TIME_WAIT status is cleared, the rlogin is
completed.
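
    The netstat check described above can be sketched in Python as
follows.  This is only an illustration: the sample lines and the
addresses in them are made up (documentation-range addresses), the
function name is hypothetical, and the parser assumes the BSD-style
"address.port" columns printed by "netstat -an".

```python
# Hypothetical sample of "netstat -an" output (BSD style, address.port).
# These lines are invented for illustration, not taken from a real host.
SAMPLE = """\
tcp        0      0  192.0.2.10.513         192.0.2.20.1023        TIME_WAIT
tcp        0      0  192.0.2.10.513         192.0.2.20.1022        ESTABLISHED
tcp        0      0  192.0.2.10.23          192.0.2.30.1045        TIME_WAIT
"""


def time_wait_peers(netstat_text, service_port=513):
    """Return (host, port) pairs of peers whose connection to the
    given local service port is still in TIME_WAIT."""
    peers = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect: proto recv-q send-q local foreign state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # The port is whatever follows the last dot in "a.b.c.d.port".
        local_port = local.rpartition(".")[2]
        if state == "TIME_WAIT" and local_port == str(service_port):
            host, _, port = foreign.rpartition(".")
            peers.append((host, int(port)))
    return peers


print(time_wait_peers(SAMPLE))  # [('192.0.2.20', 1023)]
```

    Comparing the peer ports reported here with the source port of a
stalled rlogin attempt would show whether the two actually collide.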

    If this is the problem, would it be possible to get the kernel to
use incremental port numbers instead of trying to "reuse" old ones?
The TIME_WAIT's hang around for a few seconds after a connection is
lost, and this becomes an issue when you have login/logout events once
every few seconds.  Our terminal servers use the rlogin service to
transfer a user to a FreeBSD machine.  If that connection times out,
the line is dropped and they need to redial.  Needless to say, this
won't score any points with customers.

    I haven't noticed this problem with BSD/OS 2.0 yet, nor any other
flavour of UNIX I've used.  I haven't used NetBSD systems enough to
know if they have the same problem in their socket code.  Has anyone
else seen this behaviour with 2.1.0-RELEASE (or with earlier or later
kernels)?  Better yet, does anyone have a solution?


2. Temporary loss of NFS service

    FreeBSD's NFS client code seems to be very sensitive to an
unresponsive server.  If our NFS server (a P100 BSD/OS 2.0 machine)
needs to be taken offline, clients of that server will naturally get a
lot of processes hanging in disk wait.  The problem is that FreeBSD
clients do not seem to ever recover from that state, while the BSD/OS
clients take a few minutes to realize NFS is once again available, and
continue on their merry way.  The only way out of this is to reboot
the FreeBSD machines (again, not scoring any points with the paying
customers who were online).

    I am running "nfsiod -n 4" on the clients, and "nfsd -t -u -n 6"
on the BSD/OS server.  About 24 gigabytes of disk over 7 filesystems
are exported to the clients.  Is there any way to "kickstart"
processes on the client so they know that the NFS server is alive
again?  Or is there a tunable parameter in the kernel source that
decreases the timeout or increases the frequency of retrying the NFS
server?
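
    This does not answer the question above, but a monitoring script
could at least detect a stuck mount without hanging itself.  A
minimal sketch in Python, assuming that a stat() on the mount point
blocks while the server is unreachable: the stat() runs in a child
process, so only the child ends up in disk wait.  The function name
and the five-second default are hypothetical.

```python
import subprocess
import sys
import time


def mount_responds(path, timeout=5.0):
    """Return True if a stat() of `path` completes within `timeout`
    seconds.  A missing path also yields False, so this only makes
    sense for a path that is known to exist when NFS is healthy."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            return status == 0
        time.sleep(0.05)
    # Best effort only: a child stuck in disk wait may not die until
    # the server returns, so do not wait for it here.
    child.kill()
    return False
```

    Polling instead of waiting on the child is deliberate: waiting
for a process that is blocked in the kernel would hang the monitor
as well.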


3. Unrecoverable "mb_map full" condition

    I've noticed that once the kernel reports "mb_map full",
networking is completely hosed.  Is it possible for the kernel to
release unused mbufs into a free pool of some sort instead of forcing
me to reboot the machine?  I've had this happen even with
NMBCLUSTERS=2048, but I haven't seen it yet with 4096.  'netstat -m'
typically reports:

226 mbufs in use:
	100 mbufs allocated to data
	81 mbufs allocated to packet headers
	33 mbufs allocated to protocol control blocks
	12 mbufs allocated to socket names and addresses
81/454 mbuf clusters in use
936 Kbytes allocated to network (20% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

    In the "mbuf clusters in use" line, is 81 the current number
allocated, and 454 the high-water mark?  Those numbers aren't anywhere
near 2048, let alone 4096.  Does the NMBCLUSTERS option in the kernel
config refer to some other number?
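
    As a rough sanity check on those figures, the arithmetic can be
sketched in Python.  The sizes below (128 bytes per mbuf, 2048 bytes
per cluster) are assumptions and have not been checked against the
2.1.0 kernel source; the sketch also assumes that all 226 mbufs
count toward the allocated total.  Under those assumptions the
numbers reproduce both the "936 Kbytes" and the "20% in use" lines,
which would suggest that 454 is the number of clusters allocated so
far rather than a limit derived from NMBCLUSTERS.

```python
import re

MSIZE = 128      # assumed bytes per mbuf (not verified against the source)
MCLBYTES = 2048  # assumed bytes per mbuf cluster (not verified either)

# The relevant lines of the 'netstat -m' output quoted above.
SAMPLE = """\
226 mbufs in use:
81/454 mbuf clusters in use
936 Kbytes allocated to network (20% in use)
"""


def summarize(text):
    """Recompute the memory totals from the mbuf and cluster counts."""
    mbufs = int(re.search(r"(\d+) mbufs in use", text).group(1))
    match = re.search(r"(\d+)/(\d+) mbuf clusters in use", text)
    used, allocated = int(match.group(1)), int(match.group(2))
    total_bytes = mbufs * MSIZE + allocated * MCLBYTES
    in_use_bytes = mbufs * MSIZE + used * MCLBYTES
    return {
        "clusters_in_use": used,
        "clusters_allocated": allocated,
        "total_kbytes": total_bytes // 1024,
        "percent_in_use": in_use_bytes * 100 // total_bytes,
    }


result = summarize(SAMPLE)
print(result["total_kbytes"], result["percent_in_use"])  # 936 20
```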

***

    The last two problems aren't immediate concerns, but the first one
will cause headaches once I switch the Livingstons to use the FreeBSD
machines as login hosts instead of the BSD/OS machines.  Any insight
or advice appreciated.  Thanks.

--
Brian Tao (BT300, taob@io.org)
Systems Administrator, Internex Online Inc.
"Though this be madness, yet there is method in't"