From owner-freebsd-net@FreeBSD.ORG Tue Jun 12 14:29:53 2007
Return-Path:
X-Original-To: freebsd-net@freebsd.org
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 0260B16A400
	for ; Tue, 12 Jun 2007 14:29:53 +0000 (UTC)
	(envelope-from wmoran@collaborativefusion.com)
Received: from mx00.pub.collaborativefusion.com (mx00.pub.collaborativefusion.com [206.210.89.199])
	by mx1.freebsd.org (Postfix) with ESMTP id A4A0A13C48A
	for ; Tue, 12 Jun 2007 14:29:52 +0000 (UTC)
	(envelope-from wmoran@collaborativefusion.com)
Received: from vanquish.pgh.priv.collaborativefusion.com (vanquish.pgh.priv.collaborativefusion.com [192.168.2.61])
	(SSL: TLSv1/SSLv3,256bits,AES256-SHA)
	by wingspan with esmtp; Tue, 12 Jun 2007 10:19:50 -0400
	id 00056415.466EAB86.000099FE
Date: Tue, 12 Jun 2007 10:19:49 -0400
From: Bill Moran
To: freebsd-net@freebsd.org
Message-Id: <20070612101949.646dcaa5.wmoran@collaborativefusion.com>
Organization: Collaborative Fusion
X-Mailer: Sylpheed 2.3.1 (GTK+ 2.10.11; i386-portbld-freebsd6.1)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Weird "ignoring syn" problem
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 12 Jun 2007 14:29:53 -0000

This one has got me pretty befuddled.

We're seeing some really odd behaviour with FreeBSD ignoring SYN
packets. I've been trying to diagnose this for a couple of weeks now,
and my current guess is that there's something wrong with the em
driver.

Here's a narrowed-down list of what I've ruled out:

*) I've done my best to eliminate other network components as the
   problem. My theory at this point is that it can't possibly be any
   other network hardware, based on the tcpdump shown below.
*) The problem occurred on both FreeBSD 6.1 and FreeBSD 6.2-p3.
*) The problem does not appear to be tied to CPU usage -- the CPU is
   nearly idle when the problem occurs.
*) I can now reproduce it pretty easily, so I'll know when it's fixed.
*) The system exhibiting the problem is running 15 jails, but they are
   idle 95% of the time. The problem initially occurred inside one of
   the jails, but I just recreated it outside the jail (on the host)
   and it's _easier_ to reproduce outside the jail.
*) The problem occurred with both the GENERIC and the SMP kernels (this
   is a dual-CPU, hyperthreaded system).
*) I've tested, and the behavior occurs both with a dynamically
   generated file (from PHP) and with a static file.

The nature of the beast is that we've got a SOAP application running
under Apache and PHP. This application is subject to many requests in
rapid succession, such that load can be simulated by the following
loop:

while true; do fetch http://192.168.121.250/test.php; done

The problem is that occasionally, the Apache server machine just
ignores SYN packets. Take the following tcpdump output for example:

13:34:17.312296 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535
13:34:20.312398 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535
13:34:23.512626 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535

This is the _only_ traffic on port 80 during the test. It looks like
the kernel has ignored the initial SYN packet and two duplicates. I've
seen it take as long as 45 seconds to establish a connection, and this
causes ugly performance problems, as well as frequent timeouts on the
client end.
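
For anyone who wants to put numbers on the delays in a capture like the one above, here is a rough sketch of a filter that prints the gap between repeated copies of the same SYN. The function name syn_gaps is made up for illustration, and it assumes the default tcpdump text layout shown above (timestamp, "IP", source, ">", destination, flags, sequence range); other tcpdump versions or options may print different fields.

```shell
#!/bin/sh
# syn_gaps (hypothetical helper): read tcpdump text output on stdin and,
# for every SYN that repeats an earlier one (same source, destination and
# sequence range), print how long after the previous copy it was sent.
# Assumes the field layout of the capture quoted above:
#   $1 = HH:MM:SS.ffffff  $2 = "IP"  $3 = src  $4 = ">"  $5 = dst
#   $6 = flags ("S" for a bare SYN)  $7 = sequence range
# Timestamps are treated as time of day, so a capture that crosses
# midnight would need extra handling.
syn_gaps() {
    awk '
    $2 == "IP" && $6 == "S" {
        split($1, t, ":")                      # split HH, MM, SS.ffffff
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        key = $3 " " $5 " " $7                 # identifies one SYN
        if (key in last)
            printf "%s retransmitted after %.3f s\n", $3, now - last[key]
        last[key] = now
    }'
}

# Example use (the capture file name is an assumption):
#   tcpdump -r capture.pcap 'tcp port 80' | syn_gaps
```

Fed the three lines quoted above, this reports gaps of about 3.0 and 3.2 seconds for the same sequence number, which is consistent with the client retransmitting one unanswered SYN rather than opening new connections.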

The only clue I've found so far is this output from netstat -s.

153099 syncache entries added
        6184 retransmitted
        6491 dupsyn
        0 dropped
        150923 completed
        0 bucket overflow
        0 cache overflow
        235 reset
        1941 stale
        0 aborted
        0 badack
        0 unreach
        0 zone failures

Unfortunately, I've been unable to determine how to fix the problem.
Any advice is welcome.

Details:

Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
  The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE-p3 #2: Thu Jun 7 21:37:54 UTC 2007
  root@is00:/usr/obj/usr/src/sys/SMP
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.00GHz (2992.71-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf43 Stepping = 3
  Features=0xbfebfbff
  Features2=0x641d>
  AMD Features=0x20100000
  Logical CPUs per core: 2
real memory = 2147221504 (2047 MB)
avail memory = 2096107520 (1999 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 6
 cpu3 (AP): APIC ID: 7
ioapic0: Changing APIC ID to 8
ioapic1: Changing APIC ID to 9
ioapic1: WARNING: intbase 32 != expected base 24
ioapic2: Changing APIC ID to 10
ioapic2: WARNING: intbase 64 != expected base 56
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 32-55 on motherboard
ioapic2 irqs 64-87 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: on acpi0
cpu1: on acpi0
cpu2: on acpi0
cpu3: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 2.0 on pci0
pci1: on pcib1
pcib2: at device 0.0 on pci1
pci2: on pcib2
amr0: mem 0xd80f0000-0xd80fffff,0xdfde0000-0xdfdfffff irq 46 at device 14.0 on pci2
amr0: delete logical drives supported by controller
amr0: Firmware 521X, BIOS H430, 256MB RAM
pcib3: at device 0.2 on pci1
pci3: on pcib3
em0: port 0xecc0-0xecff mem 0xdfbe0000-0xdfbfffff irq 37 at device 11.0 on pci3
em0: Ethernet address: 00:04:23:c8:ff:f4
em1: port 0xec80-0xecbf mem 0xdfbc0000-0xdfbdffff irq 38 at device 11.1 on pci3
em1: Ethernet address: 00:04:23:c8:ff:f5
pcib4: at device 4.0 on pci0
pci4: on pcib4
pcib5: at device 5.0 on pci0
pci5: on pcib5
pcib6: at device 0.0 on pci5
pci6: on pcib6
em2: port 0xdcc0-0xdcff mem 0xdf8e0000-0xdf8fffff irq 64 at device 7.0 on pci6
em2: Ethernet address: 00:13:72:4f:71:23
pcib7: at device 0.2 on pci5
pci7: on pcib7
em3: port 0xccc0-0xccff mem 0xdf6e0000-0xdf6fffff irq 65 at device 8.0 on pci7
em3: Ethernet address: 00:13:72:4f:71:24
pcib8: at device 6.0 on pci0
pci8: on pcib8
uhci0: port 0xace0-0xacff irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0xacc0-0xacdf irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: port 0xaca0-0xacbf irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: mem 0xdff00000-0xdff003ff irq 23 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb3: EHCI version 1.0
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
uhub4: vendor 0x413c product 0xa001, class 9/0, rev 2.00/0.00, addr 2
uhub4: multiple transaction translators
uhub4: 2 ports with 2 removable, self powered
pcib9: at device 30.0 on pci0
pci9: on pcib9
pci9: at device 5.0 (no driver attached)
pci9: at device 5.1 (no driver attached)
pci9: at device 5.2 (no driver attached)
atapci0: port 0xbcf0-0xbcf7,0xbce4-0xbce7,0xbcd8-0xbcdf,0xbcd0-0xbcd3,0xbc70-0xbc7f mem 0xdf3fec00-0xdf3fecff irq 23 at device 6.0 on pci9
ata2: on atapci0
ata3: on atapci0
pci9: at device 13.0 (no driver attached)
isab0: at device 31.0 on pci0
isa0: on isab0
atapci1: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: on atapci1
ata1: on atapci1
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FAST]
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
pmtimer0 on isa0
orm0: at iomem 0xc0000-0xcafff,0xec000-0xeffff on isa0
atkbdc0: at port 0x60,0x64 on isa0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
ppc0: parallel port not found.
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ukbd0: Dell DRAC4, rev 1.10/0.00, addr 2, iclass 3/1
kbd2 at ukbd0
ums0: Dell DRAC4, rev 1.10/0.00, addr 2, iclass 3/1
ums0: 3 buttons and Z dir.
Timecounters tick every 1.000 msec
acd0: CDROM at ata0-master UDMA33
device_attach: afd0 attach returned 6
acd1: CDROM at ata2-slave PIO3
amr0: delete logical drives supported by controller
amrd0: on amr0
amrd0: 34680MB (71024640 sectors) RAID 1 (optimal)
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
Trying to mount root from ufs:/dev/amrd0s1a

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023