Date: Mon, 8 Apr 2002 18:18:59 -0400 (EDT)
From: Andrew Gallatin <gallatin@cs.duke.edu>
To: freebsd-hackers@freebsd.org
Subject: performance of mbufs vs contig buffers?
Message-ID: <15538.5971.620626.548508@grasshopper.cs.duke.edu>

After updating the firmware on our 2 gigabit NIC to allow enough scatter entries per packet to stock the 9K (jumbo frame) receive rings with cluster mbufs rather than contigmalloc'ed buffers(*), I noticed a dramatic performance decrease: netperf TCP_STREAM performance dropped from 1.6Gb/sec to 1.2Gb/sec.

(*) By "contigmalloc'ed buffers", I mean a few megs of memory, carved up into 9K chunks and managed via slists, like is done in most of the in-tree gigabit ethernet drivers.

My first thought was that the firmware and/or processor on the NIC was somehow overwhelmed by the extra work of doing 5 2K DMAs rather than one 9K DMA. So I rebuilt my kernel & driver using 4K cluster mbufs and added an option to the driver so that when it stocks the receive rings with contig buffers which are larger than PAGE_SIZE, it breaks them up at page (4K) boundaries.

After making these changes, I'm roughly comparing apples to apples. Each packet is received into 3 DMA descriptors. However, I'm still seeing the same performance - 1.6Gb/sec receives into contigmalloc'ed buffers whose DMA descriptors are broken up into PAGE_SIZE'ed chunks, and 1.2Gb/sec into 4K mbufs.

Is it possible that my problems are being caused by cache misses on cluster mbufs occurring when copying out to userspace as another packet is being DMA'ed up? I'd thought that since the cache line size is 32 bytes, I'd be pretty much equally screwed either way.

Also, UDP_STREAM performance goes from 1.75Gb/sec -> 1.25Gb/sec, so it's not some weird TCP quirk. All the UDP drops are from the socket buffer being full (the host is receiving data at 1.9Gb/sec into main memory in both cases), so it's as if I have less memory bandwidth when using normal cluster mbufs.

I've been trying to use perfmon to compare cache misses, but I'm not sure what options I should be using.

Does anybody have any ideas why contigmalloc'ed buffers are so much quicker?

Thanks!

Drew

PS: Here's the dmesg from the machine in question.
Serverworks LE 3.0, 1GHz PIII (256K cache). I've got page coloring enabled in the kernel; it doesn't seem to make much difference.

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.5-STABLE #1: Mon Apr 8 17:33:51 EDT 2002
gallatin@ugly:/usr/src/sys/compile/PERFMON
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (999.53-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x68a Stepping = 10
Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
real memory = 536805376 (524224K bytes)
avail memory = 517902336 (505764K bytes)
Preloaded elf kernel "kernel.perfmon" at 0xc044f000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 9 entries at 0xc00f5250
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <ServerWorks NB6635 3.0LE host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
atapci0: <Promise ATA66 controller> port 0xdf00-0xdf3f,0xdfe0-0xdfe3,0xdfa8-0xdfaf,0xdfe4-0xdfe7,0xdff0-0xdff7 mem 0xfc9e0000-0xfc9fffff irq 10 at device 2.0 on pci0
ata2: at 0xdff0 on atapci0
ata3: at 0xdfa8 on atapci0
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xd800-0xd83f mem 0xfc800000-0xfc8fffff,0xfc9ce000-0xfc9cefff irq 9 at device 6.0 on pci0
fxp0: Ethernet address 00:30:48:21:e4:47
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isab0: <ServerWorks IB6566 PCI to ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <ServerWorks ROSB4 ATA33 controller> port 0xffa0-0xffaf at device 15.1 on pci0
ata0: at 0x1f0 irq 14 on atapci1
ata1: at 0x170 irq 15 on atapci1
pci0: <OHCI USB controller> at 15.2 irq 10
pcib1: <ServerWorks NB6635 3.0LE host to PCI bridge> on motherboard
pci1: <PCI bus> on pcib1
pci1: <ATI Mach64-GO graphics accelerator> at 1.0 irq 11
pci1: <unknown card> (vendor=0x14c1, dev=0x8043) at 2.0 irq 5
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc97ff,0xc9800-0xca7ff on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ad4: 19092MB <ST320414A> [38792/16/63] at ata2-master UDMA66
acd0: CDROM <CDU5211> at ata1-master PIO4
Mounting root from ufs:/dev/ad4s2a
gm0: <Myrinet PCI interface> mem 0xfb000000-0xfbffffff irq 5 at device 2.0 on pci1
GM: driver version 2.0e gallatin@big Mon Apr 8 16:58:43 EDT 2002
MCP for unit 0: L9 4K
LANai rate set to 199 MHz (max = 202 MHz)
Board 0 page hash cache has 8192 bins.
GM: gm_register_memory will be able to lock 84014 pages (328 MBytes)
GM: IP interface attach ok

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
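
The descriptor counts quoted in the message (five 2K DMAs, or three page-sized pieces, per 9K frame) can be checked with a small userland sketch. This is not code from the driver: `struct dma_seg` and `split_at_boundary` are hypothetical names, and the sketch assumes a 9K buffer is exactly 9216 bytes and that the boundary is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One DMA segment: start address and length in bytes (hypothetical type). */
struct dma_seg {
	uintptr_t addr;
	size_t    len;
};

/*
 * Split the buffer [addr, addr + len) into segments that never cross a
 * boundary-aligned address.  boundary must be a power of two.
 * At most max_segs entries are written; the return value is the number
 * of segments the buffer needs, which may exceed max_segs.
 */
static size_t
split_at_boundary(uintptr_t addr, size_t len, size_t boundary,
    struct dma_seg *segs, size_t max_segs)
{
	size_t n = 0;

	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	while (len > 0) {
		/* Bytes left before the next boundary-aligned address. */
		size_t room = boundary - (size_t)(addr & (boundary - 1));
		size_t chunk = len < room ? len : room;

		if (n < max_segs) {
			segs[n].addr = addr;
			segs[n].len = chunk;
		}
		n++;
		addr += chunk;
		len -= chunk;
	}
	return (n);
}
```

With a buffer that starts on a page boundary, a 4096-byte boundary gives three segments (4096, 4096 and 1024 bytes) and a 2048-byte boundary gives five. A buffer that starts at an awkward offset within a page can need one more segment, so the ring has to be sized for the worst case.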