From owner-freebsd-performance@FreeBSD.ORG Thu Dec 14 22:40:09 2006
Message-ID: <4581D185.7020702@umn.edu>
Date: Thu, 14 Dec 2006 16:34:45 -0600
From: Alan Amesbury
User-Agent: Thunderbird 1.5.0.7 (X11/20060915)
MIME-Version: 1.0
To: freebsd-performance@freebsd.org
Content-Type: multipart/mixed; boundary="------------050007070804010403090909"
Subject: Polling tuning and performance

This is a multi-part message in MIME format.
--------------050007070804010403090909
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

This is a long one, but mainly because I've tried to include notes about
what I've already looked at.  Thanks in advance for taking the time to
read this.

I have a FreeBSD 6.1-RELEASE/amd64 system which routinely needs to
accept traffic at fairly high speeds.
The system is accepting traffic at fairly high rates; 'systat -if'
suggests 428551GB (not a typo, but possibly a display bug in 'systat')
over the past 63 days, or an average rate of a bit over 600Mb/sec.
However, 'time tcpdump ...' tends to back up this assertion:

    amesbury@host % sudo time tcpdump -i bge1 -n -w /dev/null -c 1000000
    tcpdump: WARNING: bge1: no IPv4 address assigned
    tcpdump: listening on bge1, link-type EN10MB (Ethernet), capture size 96 bytes
    1000000 packets captured
    1000395 packets received by filter
    167 packets dropped by kernel
            0.268u 0.153s 0:06.84 5.9% 901+3236k 0+0io 0pf+0w

What I'm aiming for, of course, is zero packet loss.  Realizing that's
probably impossible for this system given its load, I'm trying to do
what I can to minimize loss.

The system is running a somewhat leaner kernel than GENERIC.  Notable
changes include:

  * PREEMPTION disabled - /sys/conf/NOTES says this helps with
    interactivity.  I don't care about interactive performance on
    this host.

  * COMPAT_FREEBSD4, COMPAT_LINUX32, and COMPAT_43 are removed.  They
    appear to be unneeded.

  * SMP is enabled, as this is a dual-core box (not HTT!).

  * Many devices are removed, e.g., ncr(4), sym(4), adv(4), and other
    unnecessary block devices; anything relating to cardbus; de(4),
    bce(4), ti(4), wb(4), ed(4), ex(4), lnc(4), and a number of other
    network devices that aren't going to ever be used; etc.

  * All wlan(4) and related drivers are gone.

  * pf(4), pflog(4), and some of the ALTQ stuff have been added in,
    but are not actively used on this host (at the moment).

  * ZERO_COPY_SOCKETS, MAC_BSDEXTENDED, MAC_PARTITION, and MAC are
    enabled.

  * Most importantly, HZ=1000, and DEVICE_POLLING and AUTO_EOI_1 are
    included.  (AUTO_EOI_1 was added because /sys/amd64/conf/NOTES
    says this can save a few microseconds on some interrupts.  I'm
    not worried about suspend/resume, but definitely want speed, so
    it got added.)
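As a rough sanity check on the 'systat -if' and tcpdump figures quoted before the kernel notes, the arithmetic can be sketched as below. This is only an illustration: it assumes 'systat' means decimal gigabytes, and it treats the wall-clock time reported by time(1) as the capture duration, which is approximate. All variable names are made up for the example.

```python
# Figures copied from this message; unit interpretation is an assumption.
TOTAL_GB = 428551          # total shown by 'systat -if'
DAYS = 63                  # period the total covers
BYTES_PER_GB = 10**9       # assumption: decimal gigabytes

seconds = DAYS * 86400
avg_mbit_per_sec = TOTAL_GB * BYTES_PER_GB * 8 / seconds / 1e6

# Summary lines from the tcpdump run
received = 1000395         # "packets received by filter"
dropped = 167              # "packets dropped by kernel"
elapsed_sec = 6.84         # wall-clock time from time(1), used as an estimate

drop_fraction = dropped / received
packets_per_sec = received / elapsed_sec

print(f"average rate:  {avg_mbit_per_sec:.0f} Mb/sec")
print(f"drop fraction: {drop_fraction:.4%}")
print(f"capture rate:  {packets_per_sec:.0f} packets/sec")
```

Under those assumptions the average lands a bit over 600Mb/sec, in line with the estimate above, and the drops in this one capture are a small fraction of a percent of the packets seen.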
As mentioned above, this host is running FreeBSD/amd64, so there's no
need to remove support for I586_CPU, et al; that stuff was never there
in the first place.

Since kern.polling.enable is marked as deprecated in
/sys/kern/kern_poll.c, I'm enabling polling specifically for the
interface receiving the high-volume traffic.  (It is NOT enabled for
the other interface on this system, but traffic loads there are orders
of magnitude lower, so I didn't think it was necessary.)

As mentioned above, I've got HZ set to 1000.  Per
/sys/amd64/conf/NOTES, I'd considered setting it to 2000, but have
discovered previously that FreeBSD's RFC1323 support breaks at that
setting.  I documented this on -hackers last year:

    http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html

Since I've not seen word on a correction for this being added to
FreeBSD, I've limited HZ to 1000.

After reading polling(4) a couple times, I set kern.polling.burst_max
to 1000.  The manpage says that "each interface can receive at most
(HZ * burst_max) packets per second", and the default setting is 150,
which is described as "adequate for 100Mbit network and HZ=1000."  I
figured, "Hey, gigabit, how about ten times the default?" (that is,
1500), but that's prevented by "#define MAX_POLL_BURST_MAX 1000" in
/sys/kern/kern_poll.c.

In theory 1000 might've been good enough, but polling(4) says that
kern.polling.burst is "[the] [m]aximum number of packets grabbed from
each network interface in each timer tick.  This number is dynamically
adjusted by the kernel, according to the programmed user_frac,
burst_max, CPU speed, and system load."  I keep seeing
kern.polling.burst hit a thousand, which leads me to believe that
kern.polling.burst_max needs to be higher.
For example:

    secs since epoch   kern.polling.burst
    ----------------   ------------------
    1166133997         1000
    1166134006          550
    1166134015          877
    1166134024         1000
    1166134033         1000
    1166134042         1000
    1166134051         1000
    1166134060         1000
    1166134069         1000
    1166134078         1000

Unfortunately, raising it appears to be possible only by a) patching
/sys/kern/kern_poll.c to allow larger values; or b) setting HZ to 2000,
as indicated in one of the NOTES, which will effectively hose certain
TCP connectivity because of the RFC1323 breakage.  Looked at another
way, both essentially require changes to source code, the former being
fairly obvious, and the latter requiring fixes to the RFC1323 support.
Either way, I think that's a bit beyond my abilities; I have NO
illusions about my kernel h4cking sk1llz.

Other possibly relevant data points:

  * System load hovers right around 1.

  * The system has almost zero disk activity.

  * With polling off:
      - 'vmstat 5' consistently shows about 13K context switches and
        ~6800 interrupts
      - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
        for bge1, and near zero for everything else
      - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

  * With polling on, kern.polling.burst_max=150:
      - kern.polling.burst holds at 150
      - 'vmstat 5' shows context switches hold around 2600, with
        interrupts holding around 30K
      - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
        doesn't increase!), other rates stay the same (looks like
        possible display bugs in 'vmstat -i' here!)
      - CPU load holds at 1, but CPU idle time usually stays >95%

  * With polling on, kern.polling.burst_max=1000:
      - kern.polling.burst is frequently 1000 and almost always >850
      - 'vmstat 5' shows context switches unchanged, but interrupts
        are 150K-190K
      - 'vmstat -i' unchanged from burst_max=150
      - CPU load and CPU idle time very similar to burst_max=150

So, with all that in mind.....  Any ideas for improvement?  Apologies
in advance for missing the obvious.

'dmesg' and kernel config are attached.
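For reference, the polling(4) ceiling quoted earlier (HZ * burst_max packets per second per interface) and the kern.polling.burst samples from the table can be worked through as follows. This is a sketch only: polling_ceiling() and the other names are hypothetical helpers for the illustration, not anything from the kernel, and the samples are simply the ten readings listed in this message.

```python
HZ = 1000  # "options HZ=1000" from the kernel config in this message

def polling_ceiling(hz, burst_max):
    """Per-interface packets/sec bound as stated in polling(4): HZ * burst_max."""
    return hz * burst_max

# (seconds since epoch, kern.polling.burst) pairs copied from the table above
samples = [
    (1166133997, 1000), (1166134006, 550), (1166134015, 877),
    (1166134024, 1000), (1166134033, 1000), (1166134042, 1000),
    (1166134051, 1000), (1166134060, 1000), (1166134069, 1000),
    (1166134078, 1000),
]

BURST_MAX = 1000  # kern.polling.burst_max as set here (the compiled-in limit)

ceiling_default = polling_ceiling(HZ, 150)        # default burst_max
ceiling_current = polling_ceiling(HZ, BURST_MAX)  # burst_max=1000

# How many of the readings sit at the configured maximum?
at_cap = sum(1 for _, burst in samples if burst >= BURST_MAX)
interval = samples[1][0] - samples[0][0]  # seconds between readings

print(f"ceiling at burst_max=150:  {ceiling_default} packets/sec")
print(f"ceiling at burst_max=1000: {ceiling_current} packets/sec")
print(f"{at_cap} of {len(samples)} readings at the cap, "
      f"sampled every {interval} seconds")
```

Note that kern.polling.burst is described by polling(4) as the per-tick maximum the kernel currently allows, so readings pinned at the cap show the allowance is maxed out, not necessarily that 1000 packets are being pulled on every tick.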
-- 
Alan Amesbury
OIT Security and Assurance
University of Minnesota

--------------050007070804010403090909
Content-Type: text/plain; name="SPECIALIZED"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="SPECIALIZED"

machine         amd64
cpu             HAMMER
ident           SPECIALIZED

# To statically compile in device wiring instead of /boot/device.hints
#hints          "GENERIC.hints"  # Default places to look for devices.

makeoptions     DEBUG=-g  # Build kernel with gdb(1) debug symbols

#options        SCHED_ULE  # ULE scheduler
options         SCHED_4BSD  # 4BSD scheduler
#options        PREEMPTION  # Enable kernel thread preemption
options         INET  # InterNETworking
options         INET6  # IPv6 communications protocols
options         FFS  # Berkeley Fast Filesystem
options         SOFTUPDATES  # Enable FFS soft updates support
options         UFS_ACL  # Support for access control lists
options         UFS_DIRHASH  # Improve performance on big directories
options         MD_ROOT  # MD is a potential root device
options         NFSCLIENT  # Network Filesystem Client
options         NFSSERVER  # Network Filesystem Server
options         NFS_ROOT  # NFS usable as /, requires NFSCLIENT
options         MSDOSFS  # MSDOS Filesystem
options         CD9660  # ISO 9660 Filesystem
options         PROCFS  # Process filesystem (requires PSEUDOFS)
options         PSEUDOFS  # Pseudo-filesystem framework
options         GEOM_GPT  # GUID Partition Tables.
options         COMPAT_IA32  # Compatible with i386 binaries
options         COMPAT_FREEBSD5  # Compatible with FreeBSD5
options         SCSI_DELAY=5000  # Delay (in ms) before probing SCSI
options         KTRACE  # ktrace(1) support
options         SYSVSHM  # SYSV-style shared memory
options         SYSVMSG  # SYSV-style message queues
options         SYSVSEM  # SYSV-style semaphores
options         _KPOSIX_PRIORITY_SCHEDULING  # POSIX P1003_1B real-time extensions
options         KBD_INSTALL_CDEV  # install a CDEV entry in /dev
options         AHC_REG_PRETTY_PRINT  # Print register bitfields in debug
                                      # output.  Adds ~128k to driver.
options         AHD_REG_PRETTY_PRINT  # Print register bitfields in debug
                                      # output.  Adds ~215k to driver.
options         ADAPTIVE_GIANT  # Giant mutex is adaptive.

options         SMP  # Symmetric MultiProcessor Kernel

# Workarounds for some known-to-be-broken chipsets (nVidia nForce3-Pro150)
device          atpic  # 8259A compatability

# Bus support.
device          acpi
device          isa
device          pci
device          mem
device          io

# Floppy drives
device          fdc

# ATA and ATAPI devices
device          ata
device          atadisk  # ATA disk drives
device          ataraid  # ATA RAID drives
device          atapicd  # ATAPI CDROM drives
device          atapifd  # ATAPI floppy drives
device          atapist  # ATAPI tape drives
options         ATA_STATIC_ID  # Static device numbering

# SCSI Controllers
device          ahc  # AHA2940 and onboard AIC7xxx devices
device          ahd  # AHA39320/29320 and onboard AIC79xx devices
device          amd  # AMD 53C974 (Tekram DC-390(T))
device          isp  # Qlogic family
device          mpt  # LSI-Logic MPT-Fusion

# SCSI peripherals
device          scbus  # SCSI bus (required for SCSI)
device          ch  # SCSI media changers
device          da  # Direct Access (disks)
device          sa  # Sequential Access (tape etc)
device          cd  # CD
device          pass  # Passthrough device (direct SCSI access)
device          ses  # SCSI Environmental Services (and SAF-TE)

# RAID controllers interfaced to the SCSI subsystem
device          amr  # AMI MegaRAID
device          ciss  # Compaq Smart RAID 5*
device          dpt  # DPT Smartcache III, IV - See NOTES for options
device          hptmv  # Highpoint RocketRAID 182x
device          iir  # Intel Integrated RAID
device          ips  # IBM (Adaptec) ServeRAID
device          mly  # Mylex AcceleRAID/eXtremeRAID
device          twa  # 3ware 9000 series PATA/SATA RAID

# RAID controllers
device          aac  # Adaptec FSA RAID
device          aacp  # SCSI passthrough for aac (requires CAM)
device          ida  # Compaq Smart RAID
device          twe  # 3ware ATA RAID

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc  # AT keyboard controller
device          atkbd  # AT keyboard
device          psm  # PS/2 mouse

device          vga  # VGA video card driver

device          splash  # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device          sc

device          agp  # support several AGP chipsets

# Serial (COM) ports
device          sio  # 8250, 16[45]50 based serial ports

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to the sio and/or ppc drivers):
#device         puc

# PCI Ethernet NICs.
device          em  # Intel PRO/1000 adapter Gigabit Ethernet Card
device          ixgb  # Intel PRO/10GbE Ethernet Card
device          txp  # 3Com 3cR990 (``Typhoon'')
device          vx  # 3Com 3c590, 3c595 (``Vortex'')

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device          miibus  # MII bus support
device          bfe  # Broadcom BCM440x 10/100 Ethernet
device          bge  # Broadcom BCM570xx Gigabit Ethernet
device          dc  # DEC/Intel 21143 and various workalikes
device          fxp  # Intel EtherExpress PRO/100B (82557, 82558)
device          lge  # Level 1 LXT1001 gigabit Ethernet
device          nge  # NatSemi DP83820 gigabit Ethernet
device          re  # RealTek 8139C+/8169/8169S/8110S
device          rl  # RealTek 8129/8139
device          sis  # Silicon Integrated Systems SiS 900/SiS 7016
device          sk  # SysKonnect SK-984x & SK-982x gigabit Ethernet
device          tx  # SMC EtherPower II (83c170 ``EPIC'')
device          xl  # 3Com 3c90x (``Boomerang'', ``Cyclone'')

# Pseudo devices.
device          loop  # Network loopback
device          random  # Entropy device
device          ether  # Ethernet support
device          tun  # Packet tunnel.
device          pty  # Pseudo-ttys (telnet etc)
device          md  # Memory "disks"
device          gif  # IPv6 and IPv4 tunneling
device          faith  # IPv6-to-IPv4 relaying (translation)

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device          bpf  # Berkeley packet filter

# USB support
device          uhci  # UHCI PCI->USB interface
device          ohci  # OHCI PCI->USB interface
device          ehci  # EHCI PCI->USB interface (USB 2.0)
device          usb  # USB Bus (required)
#device         udbp  # USB Double Bulk Pipe devices
device          ugen  # Generic
device          uhid  # "Human Interface Devices"
device          ukbd  # Keyboard
device          ulpt  # Printer
device          umass  # Disks/Mass storage - Requires scbus and da
device          ums  # Mouse

# FireWire support
device          firewire  # FireWire bus code
device          sbp  # SCSI over FireWire (Requires scbus and da)
device          fwe  # Ethernet over FireWire (non-standard!)

options         ALTQ
options         ALTQ_CBQ
options         ALTQ_HFSC
options         ALTQ_PRIQ
options         ALTQ_NOPCC
device          pf
device          pflog
options         BRIDGE
options         ZERO_COPY_SOCKETS
options         MAC
options         MAC_BSDEXTENDED
options         MAC_PARTITION
options         HZ=1000
options         SC_HISTORY_SIZE=1000
options         SC_KERNEL_CONS_ATTR=(FG_YELLOW|BG_BLACK)
options         SC_KERNEL_CONS_REV_ATTR=(FG_BLACK|BG_RED)
options         DEVICE_POLLING
options         AUTO_EOI_1
options         INCLUDE_CONFIG_FILE

--------------050007070804010403090909
Content-Type: text/plain; name="specialized_dmesg.boot"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="specialized_dmesg.boot"

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Thu Oct 12 14:14:54 CDT 2006
    root@specialized:/usr/obj/usr/src/sys/SPECIALIZED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) D CPU 2.80GHz (2800.11-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0xf44  Stepping = 4
  Features=0xbfebfbff
  Features2=0x641d>
  AMD Features=0x20100800
  Cores per package: 2
real memory  = 4563402752 (4352 MB)
avail memory = 4140404736 (3948 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
Security policy loaded: TrustedBSD MAC/BSD Extended (mac_bsdextended)
Security policy loaded: TrustedBSD MAC/Partition (mac_partition)
ioapic0: Changing APIC ID to 2
ioapic1: Changing APIC ID to 3
ioapic1: WARNING: intbase 32 != expected base 24
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 32-55 on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: on acpi0
cpu1: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pcib2: at device 28.0 on pci0
pci2: on pcib2
pcib3: at device 0.0 on pci2
pci3: on pcib3
pcib4: at device 28.4 on pci0
pci4: on pcib4
bge0: mem 0xfe8f0000-0xfe8fffff irq 16 at device 0.0 on pci4
miibus0: on bge0
brgphy0: on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:15:c5:60:1b:dc
pcib5: at device 28.5 on pci0
pci5: on pcib5
bge1: mem 0xfe6f0000-0xfe6fffff irq 17 at device 0.0 on pci5
miibus1: on bge1
brgphy1: on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:15:c5:60:1b:dd
uhci0: port 0xbce0-0xbcff irq 20 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0xbcc0-0xbcdf irq 21 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: port 0xbca0-0xbcbf irq 22 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: mem 0xfeb00400-0xfeb007ff irq 20 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb3: EHCI version 1.0
usb3: wrong number of companions (7 != 3)
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
pcib6: at device 30.0 on pci0
pci6: on pcib6
pci6: at device 5.0 (no driver attached)
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: on atapci0
ata1: on atapci0
atapci1: port 0xbc98-0xbc9f,0xbc90-0xbc93,0xbc80-0xbc87,0xbc78-0xbc7b,0xbc60-0xbc6f mem 0xfeb00000-0xfeb003ff irq 20 at device 31.2 on pci0
ata2: on atapci1
ata3: on atapci1
pci0: at device 31.3 (no driver attached)
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: at iomem 0xc0000-0xc7fff,0xec000-0xeffff on isa0
atkbdc0: at port 0x60,0x64 on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
acd0: CDRW at ata0-master UDMA33
ad4: 152587MB at ata2-master SATA150
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad4s1a
bge0: link state changed to UP
bge1: link state changed to UP

--------------050007070804010403090909--