Date:      Mon, 8 Apr 2002 18:18:59 -0400 (EDT)
From:      Andrew Gallatin <gallatin@cs.duke.edu>
To:        freebsd-hackers@freebsd.org
Subject:   performance of mbufs vs contig buffers?
Message-ID:  <15538.5971.620626.548508@grasshopper.cs.duke.edu>

After updating the firmware on our 2 gigabit NIC to allow enough
scatter entries per packet to stock the 9K (jumbo frame) receive
rings with cluster mbufs rather than contigmalloc'ed buffers(*), I
noticed a dramatic performance decrease: netperf TCP_STREAM
performance dropped from 1.6Gb/sec to 1.2Gb/sec.

(*) By "contigmalloc'ed buffers", I mean a few megs of memory, carved
up into 9K chunks and managed via slists, as is done in most of the
in-tree gigabit ethernet drivers.

My first thought was that the firmware and/or processor on the NIC was
somehow overwhelmed by the extra work of doing 5 2K DMAs rather than
one 9K DMA. So I rebuilt my kernel & driver using 4K cluster mbufs and
added an option to the driver so that when it stocks the receive rings
with contig buffers which are greater than a PAGE_SIZE, it breaks them
up at page (4K) boundaries.

After making these changes, I'm roughly comparing apples to apples.  Each
packet is received into 3 DMA descriptors.  However, I'm still
seeing the same difference: 1.6Gb/sec when receiving into contigmalloc'ed
buffers whose DMA descriptors are broken up into PAGE_SIZE'ed chunks,
and 1.2Gb/sec into 4K mbufs.
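
For reference, the descriptor counts above follow from simple arithmetic.
Here is a minimal Python sketch, assuming "9K" means 9216 bytes and
PAGE_SIZE is 4096. The function names and example addresses are made up
for illustration; this is not the driver's code.

```python
PAGE_SIZE = 4096  # assumed page size (i386)


def split_at_page_boundaries(addr, length, page_size=PAGE_SIZE):
    """Split [addr, addr + length) into (addr, len) segments,
    none of which crosses a page boundary."""
    segments = []
    while length > 0:
        # Bytes left before the next page boundary.
        room = page_size - (addr % page_size)
        seg_len = min(room, length)
        segments.append((addr, seg_len))
        addr += seg_len
        length -= seg_len
    return segments


def descriptors_needed(length, chunk_size):
    """Number of fixed-size chunks needed to hold `length` bytes
    (ceiling division)."""
    return -(-length // chunk_size)


if __name__ == "__main__":
    jumbo = 9 * 1024  # assumed 9K receive buffer size
    # Hypothetical page-aligned buffer address.
    print(split_at_page_boundaries(0x100000, jumbo))
    # 2K clusters versus 4K clusters.
    print(descriptors_needed(jumbo, 2048), descriptors_needed(jumbo, 4096))
```

Under these assumptions a page-aligned 9K buffer splits into segments of
4096, 4096 and 1024 bytes, i.e. 3 descriptors, while 2K clusters need 5.
A buffer that starts less than 1K before a page boundary would need a
fourth segment.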

Is it possible that my problems are being caused by cache misses
on cluster mbufs occurring when copying out to userspace as another
packet is being DMA'ed up?  I'd thought that since the cache line size
is 32 bytes, I'd be pretty much equally screwed either way.

Also, UDP_STREAM performance goes from 1.75Gb/sec -> 1.25Gb/sec, so
it's not some weird TCP quirk.  All the UDP drops are from the
socket buffer being full (the host is receiving data at 1.9Gb/sec into
main memory in both cases), so it's as if I have less memory bandwidth
when using normal cluster mbufs.  I've been trying to use perfmon to
compare cache misses, but I'm not sure what options I should be
using.
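
To put the UDP numbers in perspective, here is a rough calculation of how
much of the incoming data is lost at the socket buffer. It takes the
figures quoted above at face value and assumes the 1.9Gb/sec DMA rate and
the netperf rates count the same bytes, which ignores header overhead.
The function name is made up for illustration.

```python
def drop_fraction(received_gbps, delivered_gbps):
    """Fraction of data that reached main memory but never reached
    the application."""
    return 1.0 - delivered_gbps / received_gbps


if __name__ == "__main__":
    received = 1.9  # Gb/sec DMA'ed into main memory in both cases
    for label, delivered in (("contigmalloc'ed buffers", 1.75),
                             ("cluster mbufs", 1.25)):
        pct = 100.0 * drop_fraction(received, delivered)
        print("%s: %.1f%% dropped" % (label, pct))
```

By this estimate roughly 8% of the received data is dropped with
contigmalloc'ed buffers, versus roughly 34% with cluster mbufs.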

Does anybody have any ideas why contigmalloc'ed buffers are so much
quicker?

Thanks!

Drew

PS: Here's the dmesg from the machine in question.  Serverworks LE
3.0, 1GHz PIII (256K cache).  I've got page coloring enabled in the
kernel; it doesn't seem to make much difference.

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.5-STABLE #1: Mon Apr  8 17:33:51 EDT 2002
    gallatin@ugly:/usr/src/sys/compile/PERFMON
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (999.53-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x68a  Stepping = 10
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
real memory  = 536805376 (524224K bytes)
avail memory = 517902336 (505764K bytes)
Preloaded elf kernel "kernel.perfmon" at 0xc044f000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 9 entries at 0xc00f5250
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <ServerWorks NB6635 3.0LE host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
atapci0: <Promise ATA66 controller> port 0xdf00-0xdf3f,0xdfe0-0xdfe3,0xdfa8-0xdfaf,0xdfe4-0xdfe7,0xdff0-0xdff7 mem 0xfc9e0000-0xfc9fffff irq 10 at device 2.0 on pci0
ata2: at 0xdff0 on atapci0
ata3: at 0xdfa8 on atapci0
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xd800-0xd83f mem 0xfc800000-0xfc8fffff,0xfc9ce000-0xfc9cefff irq 9 at device 6.0 on pci0
fxp0: Ethernet address 00:30:48:21:e4:47
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isab0: <ServerWorks IB6566 PCI to ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <ServerWorks ROSB4 ATA33 controller> port 0xffa0-0xffaf at device 15.1 on pci0
ata0: at 0x1f0 irq 14 on atapci1
ata1: at 0x170 irq 15 on atapci1
pci0: <OHCI USB controller> at 15.2 irq 10
pcib1: <ServerWorks NB6635 3.0LE host to PCI bridge> on motherboard
pci1: <PCI bus> on pcib1
pci1: <ATI Mach64-GO graphics accelerator> at 1.0 irq 11
pci1: <unknown card> (vendor=0x14c1, dev=0x8043) at 2.0 irq 5
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc97ff,0xc9800-0xca7ff on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ad4: 19092MB <ST320414A> [38792/16/63] at ata2-master UDMA66
acd0: CDROM <CDU5211> at ata1-master PIO4
Mounting root from ufs:/dev/ad4s2a
gm0: <Myrinet PCI interface> mem 0xfb000000-0xfbffffff irq 5 at device 2.0 on pci1
GM: driver version 2.0e gallatin@big Mon Apr  8 16:58:43 EDT 2002
MCP for unit 0: L9 4K
LANai rate set to 199 MHz (max = 202 MHz)
Board 0 page hash cache has 8192 bins.
GM: gm_register_memory will be able to lock 84014 pages (328 MBytes)
GM: IP interface attach ok

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message