Date: Thu, 27 Apr 2000 17:03:53 EDT
From: "Eric Withabee" <ericwithabee@hotmail.com>
To: freebsd-questions@freebsd.org
Subject: Network interface hanging on 3.3-RELEASE system
Message-ID: <20000427210353.79863.qmail@hotmail.com>
Hello.
I'm experiencing some strange problems with a 3.3-RELEASE system. It runs
fine for a few days, then it starts getting a continually increasing number
of TCP connections stuck in the TIME_WAIT state. The number of connections
keeps building until it reaches a total of about 4000 TCP connections, then
the server simply stops responding to any requests from the network. The
time from when the connections start building up until the server hangs
varies from under half an hour to a few hours. Again, once the buildup
starts, the number of connections in the TIME_WAIT state only increases.
I've been trying to diagnose the problem, but haven't had much luck. I'm
not sure whether it's due to a bug or not, so I'm posting the question here
instead of to freebsd-bugs.
The problem started as soon as I took the system live. It replaced another
FreeBSD system, and took over all its duties. It's primarily acting as a
mail server (Sendmail 8.9.3 and QPopper 2.53) and a web server (Apache
1.3.9). It's also running MySQL 9.33. The server it replaced was a 133MHz
Pentium, and the new server is a 233MHz Pentium II. The old server did not
experience this problem -- in fact, it was extremely stable.
I originally thought that it might be the NIC card, a 3Com 3C905B, or the
"xl" driver, so I replaced it with a Linksys LNE100TX ("mx" driver). This
seemed to help somewhat, as the duration between occurrences increased from
a few hours to a few days. However, it continues to occur, and I'm
wondering whether the improvement when I switched the NIC card was just a
coincidence. That said, since I made the switch, the problem has never
occurred as quickly as it did with the 3Com card. We've had very good luck
with 3Com NICs in the past, but this was the first time we'd used a 3C905B
and the "xl" driver.
The time between occurrences varies significantly. Sometimes, the system
will run for over a week, while other times it will run for less than a day.
Just in case the problem was related to the number of mbufs, I bumped up the
default settings so that it has a maximum of 4096 mbuf clusters. It didn't
help. The system seems to peak at around 300 mbufs until the problem
occurs.
I decided to see whether it might be a DoS attack, even though that doesn't
really make sense, because the problem started as soon as I took the system
live. At the time the problem is occurring, the connections in the
TIME_WAIT state don't originate primarily from one IP address. I suppose
this doesn't rule out a distributed DoS attack, but I think that's pretty
unlikely.
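For anyone who wants to repeat the per-address check, here is a minimal sketch in Python. It is not the method actually used on this server. It assumes the "netstat -n" output has six whitespace-separated columns on each TCP line, with the port appended to the address after a final dot; the function name time_wait_by_host is made up for illustration.

```python
from collections import Counter


def time_wait_by_host(netstat_output: str) -> Counter:
    """Count TIME_WAIT connections per remote IP address.

    Assumes BSD-style lines: proto, recv-q, send-q, local, foreign, state,
    where addresses look like 192.0.2.7.1045 (IP, then a dot, then the port).
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP entry.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # Strip the trailing ".port" to keep only the remote IP address.
        host, _, _port = fields[4].rpartition(".")
        counts[host] += 1
    return counts
```

Calling time_wait_by_host(text).most_common(10) on the saved output would list the ten remote addresses holding the most TIME_WAIT connections. If no single address dominates, that matches what is described above.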
Here are some specifics about the system:
ASUS P3B-F motherboard
Intel 233MHz PII
128MB RAM
2 Western Digital Expert 9.1GB 7200 RPM drives
Mirrored via an Arco DupliDisk (Bay Mount)
Linksys EtherFast 10/100 NIC (LNE100TX)
Adaptec 2940UW SCSI Adapter
HP SureStore T20i Travan Tape Drive
Full-tower case with lots of fans
In the meantime, while I've been trying to figure this out, I've got a
cron'ed script that checks the number of connections and reboots the
server if it gets to a stage that indicates that the server has passed the
point of no return. Before rebooting, it sends me an e-mail message
giving the output from a "netstat -n", a "netstat -m" (I just added this
today), and a "ps -ax". It's an ugly hack, but it's keeping me from getting
paged at 3:00AM.
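The decision logic of such a watchdog could be sketched roughly as follows, in Python. This is not the actual script. The threshold of 3000 is an assumed value chosen to sit below the roughly 4000 connections at which the server hangs, and the names count_time_wait, should_reboot and build_report are hypothetical. Running the commands, mailing the report and rebooting are deliberately left out.

```python
def count_time_wait(netstat_output: str) -> int:
    """Count TCP lines in netstat -n output whose last column is TIME_WAIT."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("tcp") and fields[-1] == "TIME_WAIT":
            count += 1
    return count


def should_reboot(netstat_output: str, threshold: int = 3000) -> bool:
    """Return True once the TIME_WAIT count reaches the (assumed) threshold."""
    return count_time_wait(netstat_output) >= threshold


def build_report(outputs: dict) -> str:
    """Join captured command outputs into one message body.

    `outputs` maps a command string (e.g. "netstat -m") to its captured text.
    """
    parts = []
    for command, text in outputs.items():
        parts.append(f"===== {command} =====\n{text.rstrip()}\n")
    return "\n".join(parts)
```

A cron job would capture the output of the three commands, pass the "netstat -n" text to should_reboot, and mail the result of build_report before rebooting only when it returns True.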
Does anyone have any thoughts? Thanks for taking the time to read all this.
Eric
