Date: Thu, 12 May 2005 17:13:48 -0600 (MDT)
From: Matt Ruzicka <matt@frii.com>
To: freebsd-net@freebsd.org
Subject: Outbound TCP issue, potentially related to 'FreeBSD-SA-05:08.kmem [REVISED]'
Message-ID: <Pine.BSF.4.58.0505121627400.66727@elara.frii.com>

A couple of days after we patched our systems, we started to receive a number of reports of MySQL connection errors when our patched FreeBSD 4.9 web servers were trying to connect to our MySQL server, which lives on a separate FreeBSD machine. Initially we thought this was a networking error related to our server load balancer (which has been a troublemaker in the past) or some other networking device, but testing has proven otherwise.

* Problem description:

Outbound TCP connections are randomly failing to connect. They receive a "Can't assign requested address" error from the connect() call. The error has been demonstrated against multiple machines on multiple different ports. It only impacts outgoing connections from our web servers; no inbound connections have failed or dropped. Also, we have not seen this problem on any of our other servers, which have also been patched.

The errors are sporadic. The most frequent pattern we've seen is a 5 to 10 minute period of success, followed by a couple of seconds of frequent failures. When we start getting errors connecting to one port/machine, we see concurrent errors to other ports/machines.

* What we've tried:

The impacted machines are in a server-load-balanced environment, so we spent quite a bit of time convincing ourselves that this was not an external network error.

We created a perl test script that tries to connect to a given machine and port once per second and logs its success or failure (the script is included below). We then aimed it at machines both inside and outside the SLB environment. We originally tried it against multiple different ports, but after finding that the failures were not port-specific, we simplified the methodology to make all connections to port 5666 (a monitoring app).

Reverse tests were also run to see if the failures impacted incoming connections. No failures were ever logged in this direction.
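
For reference, the "Can't assign requested address" text mentioned above is the strerror() message for EADDRNOTAVAIL on FreeBSD. Below is a minimal sketch, not part of the script we actually ran, of how a test loop could log the numeric errno along with a rough category, so that local failures are easy to separate from refusals or timeouts. The helper name classify_connect_errno and the 'local'/'remote'/'other' labels are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EADDRNOTAVAIL ECONNREFUSED ETIMEDOUT);

# Hypothetical helper (not from the original test script): map the numeric
# errno of a failed connect() to a rough category for the log.
sub classify_connect_errno {
    my ($errno) = @_;
    # EADDRNOTAVAIL: the local side could not supply an address/port
    return 'local'  if $errno == EADDRNOTAVAIL;
    # Refused or timed out: the attempt got as far as the network/peer
    return 'remote' if $errno == ECONNREFUSED || $errno == ETIMEDOUT;
    return 'other';
}

# Simulate the state right after a failed connect() by setting errno by hand.
$! = EADDRNOTAVAIL;
my $code = $! + 0;    # numeric errno
my $text = "$!";      # strerror() text for the same errno
printf "%s errno=%d (%s)\n", classify_connect_errno($code), $code, $text;
```

In a loop like ours, the same call would go right after the failed IO::Socket::INET->new(), using $! before anything else has a chance to reset it.
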

The tests established that we reliably saw failures from the two impacted machines to any other server, including each other. (The two boxes are separated by a switch, but not the SLB.) It did not matter if the remote machine was on the same network, or was in front of or behind the SLB switch. Connections between other machines behind the same switch showed no failures.

We next set up tcpdump on one impacted machine and started logging the test connections. When a failure occurred, the dumps showed no packets leaving the box for the target machine.

At that point we felt reasonably confident that the problem was not an external network issue, so we moved on to systems troubleshooting.

Since this machine was running a few revisions behind, we felt it would be prudent to upgrade to the latest release of FreeBSD. Both web servers have since been upgraded to the latest version of 4.11 to rule out an issue with the old versions we were running. After the upgrade, errors returned to their previous levels following a lull of a few hours.

Apache, PHP and related modules were all reinstalled on the boxes after the FreeBSD upgrade to ensure they were using the correct libraries and such.

The only error we have found in the logs was right after boot. It is related to PMAP_SHPGPERPROC and is discussed here:

http://lists.freebsd.org/pipermail/freebsd-hackers/2003-May/000695.html

If I understand this correctly we should have plenty of PV entries available.

-----
Message Queues:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME

Shared Memory:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NATTCH  SEGSZ  CPID  LPID    ATIME    DTIME    CTIME
m 262144       0 --rw-------     root    wheel     root    wheel     21 524288 81250 81250 14:03:40 17:02:37 14:03:40
m 458754       0 --rw-------     root    wheel     root    wheel     42 524288 74667 74667 16:06:03 17:02:39 16:06:03

Semaphores:
T     ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NSEMS    OTIME    CTIME

ITEM            SIZE     LIMIT     USED     FREE   REQUESTS
PV ENTRY:         28,  2281326,  545883, 1036172, 589082427
-----

* Test script:

Note that we also tried a similar script using raw socket calls, rather than using IO::Socket. The results were identical.

-----
#!/usr/bin/perl

use strict;
use warnings;

use Sys::Hostname qw(hostname);
use IO::Socket;

use constant LOG_FILE => '/tmp/';

# host to connect to
my $host = shift(@ARGV) || 'xxx.xxx.xxx.xxx';

# open our log file
my $log_file = LOG_FILE . hostname() . '_to_' . $host . '.nrpe';
open(LOG, '>>', $log_file) or die "Can't open log: $log_file $!";

while(1){
    my $start_time = time();

    # try a connection
    eval {
        my $socket = IO::Socket::INET->new($host . ':5666') or die "Can't connect: $!";
        $socket->close();
    };

    my $result = "ok";
    $result = "failed ($@)" if $@;

    print LOG hostname() . ' ' . scalar(localtime($start_time)) . ' ' . $result . "\n";

    sleep 1;
}
-----

* Summary:

Since this is not affecting any of our other servers, which have been patched, I do not feel it is a direct result of the patch, but suspect the patch may have accentuated an existing issue.

Any suggestions as to what could be causing this would be greatly appreciated. Please let me know what additional information about the system I can gather if it will be of assistance.

Thank you very much in advance.

Matthew Ruzicka - Systems Administrator
Front Range Internet, Inc.
matt@frii.net - (970) 212-0728