From: Andrew Gallatin <gallatin@cs.duke.edu>
Date: Mon, 8 Apr 2002 18:18:59 -0400 (EDT)
To: freebsd-hackers@freebsd.org
Subject: performance of mbufs vs contig buffers?
Message-ID: <15538.5971.620626.548508@grasshopper.cs.duke.edu>

After updating the firmware on our 2 gigabit NIC to allow enough scatter entries per packet to stock the 9K (jumbo frame) receive rings with cluster mbufs rather than contigmalloc'ed buffers(*), I noticed a dramatic performance decrease: netperf TCP_STREAM performance dropped from 1.6Gb/sec to 1.2Gb/sec.

(*) By "contigmalloc'ed buffers", I mean a few megs of memory, carved up into 9K chunks and managed via slists, as is done in most of the in-tree gigabit ethernet drivers.

My first thought was that the firmware and/or processor on the NIC was somehow overwhelmed by the extra work of doing five 2K DMAs rather than one 9K DMA.  So I rebuilt my kernel & driver using 4K cluster mbufs and added an option to the driver so that when it stocks the receive rings with contig buffers larger than PAGE_SIZE, it breaks them up at page (4K) boundaries.  After making these changes, I'm roughly comparing apples to apples: each packet is received into 3 DMA descriptors.  However, I'm still seeing the same performance: 1.6Gb/sec for receives into contigmalloc'ed buffers whose DMA descriptors are broken up into PAGE_SIZE'd chunks, and 1.2Gb/sec into 4K cluster mbufs.

Is it possible that my problems are being caused by cache misses on cluster mbufs occurring when copying out to userspace while another packet is being DMA'ed up?  I'd thought that since the cache line size is 32 bytes, I'd be pretty much equally screwed either way.

Also, UDP_STREAM performance goes from 1.75Gb/sec to 1.25Gb/sec, so it's not some weird TCP quirk.  All the UDP drops are from the socket buffer being full (the host is receiving data at 1.9Gb/sec into main memory in both cases), so it's as if I have less memory bandwidth when using normal cluster mbufs.

I've been trying to use perfmon to compare cache misses, but I'm not sure what options I should be using.

Does anybody have any ideas why contigmalloc'ed buffers are so much quicker?
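
For concreteness, here is roughly what I mean by the two receive-ring stocking schemes.  This is a minimal sketch only: the DMA descriptor layout (struct rx_desc), the jumbo-pool bookkeeping, and all of the names are made up for illustration, and locking, error unwinding, and rx-completion bookkeeping are left out.  Only contigmalloc(), vtophys(), the sys/queue.h SLIST macros, and the mbuf cluster macros are real 4.x interfaces.

/*
 * Minimal sketch of the two rx-ring stocking schemes being compared.
 * Descriptor layout and all names here are hypothetical; no locking,
 * no error unwinding, no rx-completion bookkeeping.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/queue.h>

#include <vm/vm.h>
#include <vm/vm_extern.h>		/* contigmalloc() */
#include <vm/pmap.h>			/* vtophys() */

#define JUMBO_LEN	9018		/* 9K jumbo buffer */
#define JUMBO_STRIDE	((JUMBO_LEN + PAGE_MASK) & ~PAGE_MASK)

struct rx_desc {			/* made-up DMA descriptor */
	u_int32_t	rd_paddr;
	u_int16_t	rd_len;
};

struct jpool_entry {
	caddr_t				 je_buf;
	SLIST_ENTRY(jpool_entry)	 je_link;
};
SLIST_HEAD(jpool, jpool_entry);

/* Scheme (a): carve one contigmalloc'ed region into 9K chunks. */
static int
jpool_init(struct jpool *pool, struct jpool_entry *slots, int n)
{
	caddr_t base;
	int i;

	base = contigmalloc(n * JUMBO_STRIDE, M_DEVBUF, M_NOWAIT,
	    0, 0xffffffff, PAGE_SIZE, 0);
	if (base == NULL)
		return (ENOMEM);
	SLIST_INIT(pool);
	for (i = 0; i < n; i++) {
		slots[i].je_buf = base + i * JUMBO_STRIDE;
		SLIST_INSERT_HEAD(pool, &slots[i], je_link);
	}
	return (0);
}

/* Stock one packet's worth of descriptors, split at page boundaries. */
static int
rx_stock_contig(struct jpool *pool, struct rx_desc *ring, int idx)
{
	struct jpool_entry *je;
	caddr_t p;
	int resid, seg;

	if ((je = SLIST_FIRST(pool)) == NULL)
		return (ENOBUFS);
	SLIST_REMOVE_HEAD(pool, je_link);

	for (p = je->je_buf, resid = JUMBO_LEN; resid > 0;
	    p += seg, resid -= seg) {
		seg = PAGE_SIZE - ((vm_offset_t)p & PAGE_MASK);
		if (seg > resid)
			seg = resid;
		ring[idx].rd_paddr = vtophys(p);
		ring[idx].rd_len = seg;
		idx++;
	}
	return (0);
}

/* Scheme (b): one cluster mbuf (2K, or 4K with a bigger MCLSHIFT). */
static int
rx_stock_mbufs(struct rx_desc *ring, int idx, int nsegs)
{
	struct mbuf *m;
	int i;

	for (i = 0; i < nsegs; i++) {
		MGETHDR(m, M_DONTWAIT, MT_DATA);
		if (m == NULL)
			return (ENOBUFS);
		MCLGET(m, M_DONTWAIT);
		if ((m->m_flags & M_EXT) == 0) {
			m_freem(m);
			return (ENOBUFS);
		}
		ring[idx + i].rd_paddr = vtophys(mtod(m, caddr_t));
		ring[idx + i].rd_len = MCLBYTES;
		/* a real driver would save m for the rx completion path */
	}
	return (0);
}

Functionally, the only difference between the two paths is where the payload memory comes from; once the contig buffers are split at page boundaries, the per-packet descriptor count is the same in both cases.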
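
On the perfmon side, the sort of userland harness I have in mind is below.  Treat every name in it as an assumption to be checked: the ioctl names, the struct pmc / pmc_data fields, and the PMCF_* flags should be verified against <machine/perfmon.h>, and the P6 event code 0x24 (L2_LINES_IN) against Intel's performance-monitoring event appendix.

/*
 * Bare-bones perfmon(4) harness: program counter 0 to count a P6
 * cache-miss event in both user and kernel mode, run for a while,
 * then read it back.  Struct, ioctl, and flag names are from memory
 * and must be checked against <machine/perfmon.h>.
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <machine/perfmon.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct pmc pmc;
	struct pmc_data pd;
	int fd, num = 0;

	fd = open("/dev/perfmon", O_RDWR);
	if (fd < 0)
		err(1, "/dev/perfmon");

	pmc.pmc_num = 0;
	pmc.pmc_event = 0x24;		/* P6 L2_LINES_IN (verify) */
	pmc.pmc_unit = 0;
	pmc.pmc_flags = PMCF_OS | PMCF_USR;	/* kernel + user */
	pmc.pmc_mask = 0;

	if (ioctl(fd, PMIOSETUP, &pmc) < 0)
		err(1, "PMIOSETUP");
	if (ioctl(fd, PMIOSTART, &num) < 0)
		err(1, "PMIOSTART");

	sleep(10);			/* run netperf in parallel */

	if (ioctl(fd, PMIOSTOP, &num) < 0)
		err(1, "PMIOSTOP");
	pd.pmcd_num = 0;
	if (ioctl(fd, PMIOREAD, &pd) < 0)
		err(1, "PMIOREAD");
	printf("L2 lines in: %qd\n", pd.pmcd_value);

	close(fd);
	return (0);
}

Swapping in event 0x45 (DCU_LINES_IN, again to be verified) should give the L1 D-cache side of the picture; running the same harness once with the contig-buffer driver and once with the cluster-mbuf driver is the comparison I'm after.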
Thanks!

Drew

PS: Here's the dmesg from the machine in question.  Serverworks LE 3.0, 1GHz PIII (256K cache).  I've got page coloring enabled in the kernel; it doesn't seem to make much difference.

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.5-STABLE #1: Mon Apr 8 17:33:51 EDT 2002
    gallatin@ugly:/usr/src/sys/compile/PERFMON
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (999.53-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x68a Stepping = 10
  Features=0x383fbff
real memory = 536805376 (524224K bytes)
avail memory = 517902336 (505764K bytes)
Preloaded elf kernel "kernel.perfmon" at 0xc044f000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 9 entries at 0xc00f5250
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
atapci0: port 0xdf00-0xdf3f,0xdfe0-0xdfe3,0xdfa8-0xdfaf,0xdfe4-0xdfe7,0xdff0-0xdff7 mem 0xfc9e0000-0xfc9fffff irq 10 at device 2.0 on pci0
ata2: at 0xdff0 on atapci0
ata3: at 0xdfa8 on atapci0
fxp0: port 0xd800-0xd83f mem 0xfc800000-0xfc8fffff,0xfc9ce000-0xfc9cefff irq 9 at device 6.0 on pci0
fxp0: Ethernet address 00:30:48:21:e4:47
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isab0: at device 15.0 on pci0
isa0: on isab0
atapci1: port 0xffa0-0xffaf at device 15.1 on pci0
ata0: at 0x1f0 irq 14 on atapci1
ata1: at 0x170 irq 15 on atapci1
pci0: at 15.2 irq 10
pcib1: on motherboard
pci1: on pcib1
pci1: at 1.0 irq 11
pci1: (vendor=0x14c1, dev=0x8043) at 2.0 irq 5
orm0: