Date: Tue, 22 Nov 2011 22:01:45 +0000
From: Matthew Seaman <m.seaman@infracaninophile.co.uk>
To: freebsd-questions@freebsd.org
Subject: Re: Diagnosing packet loss
Message-ID: <4ECC1BC9.7060007@infracaninophile.co.uk>
In-Reply-To: <B0BE38BD-CE86-4D42-9215-933150BA07D9@gmail.com>
References: <B0BE38BD-CE86-4D42-9215-933150BA07D9@gmail.com>

On 22/11/2011 20:33, Kees Jan Koster wrote:
> I am stuck with a machine that shows serious packet loss (about 1% of
> all traffic is dropped). I tried the obvious (new network cable,
> different switch port, different ethernet interface on the machine),
> but the problems remain.
>
> Another machine that sits in the same rack and is hooked up to the
> same switch shows no such packet loss issues. The problematic machine
> is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47
> CEST 2011.
>
> The machine is lightly loaded. A MySQL slave is running, but the
> machine is not serving queries. Plus a Munin server process.
>
> I am at a loss where to start diagnosing this. Can you advise me
> where to look? Are there network buffers that may be overflowing?

You say "lightly loaded," but how much does that actually equate to in kb/s or Mb/s? I'd call anything less than about 1Mb/s on a gigabit ethernet link pretty light, but other people have different ideas.

Check for duplex mismatch -- normally everything just works if you let the NIC and the switch autonegotiate, but every so often some bright spark decides that wiring down the speed setting is a good idea. Trouble is, you have to set *both* ends of the ethernet link to the same settings -- if one end is trying to autonegotiate and the other is fixed, the auto end will typically fall back to half-duplex (100baseTX half-duplex being the classic case) and performance will suck, and suck increasingly hard as network load goes up. Amazing how often that 'set both ends the same' thing leads to grief.

Another hideously embarrassing error would be to spend ages debugging before finding out you had a duplicate IP address on your network. Can you definitely rule that out?
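
Going back to the duplex check: on the FreeBSD side, the negotiated media shows up on the "media:" line of ifconfig(8) output. Here is a minimal Python sketch that reads that line and flags an autonegotiating interface that has not come up full-duplex. The interface name, MAC address and sample text are made up for illustration, and parse_media() and worth_a_look() are hypothetical helper names. It only covers the host end; what the switch port is set to has to be checked on the switch itself.

```python
import re

# Made-up sample of FreeBSD ifconfig(8) output, for illustration only.
SAMPLE_IFCONFIG = (
    "em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tether 00:11:22:33:44:55\n"
    "\tmedia: Ethernet autoselect (100baseTX <half-duplex>)\n"
    "\tstatus: active\n"
)


def parse_media(ifconfig_output):
    """Return requested/active media info from the first 'media:' line.

    Returns None if no 'media:' line is present.
    """
    for raw in ifconfig_output.splitlines():
        line = raw.strip()
        if not line.startswith("media:"):
            continue
        fields = line.split()
        # fields look like: ["media:", "Ethernet", "<requested media>", ...]
        requested = fields[2] if len(fields) > 2 else ""
        # The active media, when it differs from the requested one,
        # is printed in parentheses; otherwise use the whole line.
        match = re.search(r"\((.*)\)", line)
        active = match.group(1) if match else line
        if "full-duplex" in active:
            duplex = "full"
        elif "half-duplex" in active:
            duplex = "half"
        else:
            # Some drivers print no duplex option at all.
            duplex = "unspecified"
        return {"autoselect": requested == "autoselect", "duplex": duplex}
    return None


def worth_a_look(info):
    """Hint only: an autonegotiating port that is not confirmed full-duplex."""
    return info is not None and info["autoselect"] and info["duplex"] != "full"


if __name__ == "__main__":
    info = parse_media(SAMPLE_IFCONFIG)
    print(info, "-> check the switch port settings" if worth_a_look(info) else "")
```

A flagged interface is a reason to go and look at the switch port configuration, not proof of a mismatch: a link with no carrier, for instance, would also be flagged by this sketch.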

A third networking problem that also has the potential to make you the butt of a few jokes is if your network cables are kinked, crushed, overstretched or simply cable-tied too tightly. Anything like that can cause signal leakage between the pairs of conductors in the cable, which can be enough to disrupt packet transmission. Simply snipping through a too-tight cable tie can have a magical-seeming effect.

What sort of NICs are there in your machine? re(4) interfaces have a reputation for not keeping up with the throughput of a good server NIC like em(4) or bge(4). [But re(4)'s are cheap and good enough for most home systems...] If you can, try swapping in a reasonably good NIC -- beg, borrow or steal one from another machine just for a few hours of testing -- and see if that cures the problem.

Other considerations: are you doing anything beyond just plain ethernet networking? Any VLANs? What about IPsec or other tunnelled/encapsulated traffic? Are you using RSTP or lagg(4) to make your networking resilient to failures? If the answer to any of these is "yes" -- does temporarily disabling that feature and keeping it simple and stupid help with the packet loss?

Do you get the same sort of packet loss if you take the switch away and just run a cable directly between two machines? (Nb. If your NICs don't support Auto-MDIX you'll need a crossover cable.)

On another host on your net, can you use wireshark to capture and examine the traffic from your failing machine? For best results, either wire the two machines directly together or configure the switch port your wireshark box is connected to as a /monitor/ port so it sees all the traffic coming out of your problem box.

Does your NIC have hardware checksumming? If so, does disabling that help with the error rate? (See ifconfig(8) and the man page for your NIC in section 4 for how.) There have been a number of instances of buggy checksumming causing problems in the past.

Nb. with hardware checksumming, the checksum field is calculated and inserted into packets very late, after the last point at which you can examine the packets as they leave your machine. That makes it look as if the checksums are all wrong if you sample the traffic on the originating machine. This is why you need to use another, external machine to watch for this sort of error.

	Cheers,

	Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matthew@infracaninophile.co.uk               Kent, CT11 9PW
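
A footnote on the checksum point above: if you capture on the external machine and want to confirm a suspect packet by hand, the IPv4 header checksum is the 16-bit ones'-complement sum described in RFC 1071. The Python sketch below is only an illustration of that arithmetic; the header bytes are a commonly used textbook example rather than anything from the problem machine, and internet_checksum() and ipv4_header_ok() are hypothetical helper names. TCP and UDP checksums use the same arithmetic but also cover a pseudo-header and the payload, which this sketch does not attempt.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header with a correct checksum field sums to zero."""
    return internet_checksum(header) == 0


# Illustrative 20-byte IPv4 header (assumption: not real captured traffic).
EXAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")

if __name__ == "__main__":
    # Recompute the checksum with the checksum field (bytes 10-11) zeroed.
    zeroed = EXAMPLE_HEADER[:10] + b"\x00\x00" + EXAMPLE_HEADER[12:]
    print(hex(internet_checksum(zeroed)), ipv4_header_ok(EXAMPLE_HEADER))
```

Run against headers captured on the sending host with checksum offload enabled, a check like this would be expected to fail for the reason given above, so it is only meaningful on packets captured elsewhere on the wire.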