Date: Fri, 17 Feb 2006 11:14:16 -0500 (EST)
From: Andrew Gallatin <gallatin@cs.duke.edu>
To: Joseph Koshy <joseph.koshy@gmail.com>
Cc: freebsd-amd64@freebsd.org
Subject: Re: non-temporal copyin/copyout?
Message-ID: <17397.63064.242130.484086@grasshopper.cs.duke.edu>
In-Reply-To: <84dead720602170750j119080c9g32ec9f1ac0e3944d@mail.gmail.com>
References: <17397.58669.457047.277510@grasshopper.cs.duke.edu>
 <84dead720602170750j119080c9g32ec9f1ac0e3944d@mail.gmail.com>

Joseph Koshy writes:
 > > I'm bringing this up because I've noticed that FreeBSD 10GbE
 > > performance is far below Solaris/amd64 and linux/x86_64 when
 > > using the PCI-e 10GbE adaptor that I'm doing drivers for.
 > > For example, Solaris can receive a netperf TCP stream at
 >
 > There was a bug in my port of netperf; I had left the
 > `HISTOGRAM' option turned on, which causes it to slow
 > down significantly.
 >
 > v2.3.1,1 is the latest & bugfixed version of the port.

I don't use the port, specifically because of the HISTOGRAM
(mis)feature :).  I have my own copy of netperf that I use
on all platforms I support (linux, solaris, macosx, freebsd, aix)
with various bugs fixed (sendfile support for solaris, cpu time
for macosx & aix, etc).

 > > 9.75Gb/sec while using only 47% CPU as measured by vmstat.
 > > (e.g., it is using a little less than a single core).  In
 > > contrast, FreeBSD is limited to 7.7Gb/sec, and uses nearly
 > > 90% CPU.  When profiling with hwpmc, I see a profile which
 > > shows up to 70% of the time is spent in copyout.
 >
 > You could use the following events to probe the system:

OK.  I did these probes while a netperf was running at ~7.7Gb/s.
I did each for roughly 10-20 seconds, not very scientifically :)
Here is everything above 1% for all of them:

 > "k8-dc-miss" : data cache misses

 91.5   6466.00   6466.00   0  100.00%  copyout [1]
  2.8   6666.00    200.00   0  100.00%  soreceive [2]
  1.5   6774.00    108.00   0  100.00%  uiomoveco [3]
  1.0   6846.00     72.00   0  100.00%  mb_free_ext [4]

 > "k8-bu-fill-request-l2-miss,mask=dc-fill" : L2 fills for the
 > data cache

 88.2   3866.00   3866.00   0  100.00%  copyout [1]
  4.0   4041.00    175.00   0  100.00%  soreceive [2]
  1.9   4125.00     84.00   0  100.00%  uiomoveco [3]
  1.9   4207.00     82.00   0  100.00%  mb_free_ext [4]
  1.5   4273.00     66.00   0  100.00%  mb_dtor_clust [5]

 > "k8-dc-misaligned-data-reference": in case there are any

 99.5  66763.00  66763.00   0  100.00%  copyout [1]

 > "k8-fr-interrupts-masked-while-pending-cycles": for
 > finding spots in the code where spin-locks are being
 > held for long.

I had to tweak the sample rate to 512 for this one.

 52.5    330.00    330.00   0  100.00%  acpi_cpu_idle [1]
 10.4    395.00     65.00   0  100.00%  spinlock_exit [2]
  9.1    452.00     57.00   0  100.00%  acpi_cpu_c1 [3]
  6.1    490.00     38.00   0  100.00%  _mtx_lock_sleep [4]
  4.0    515.00     25.00   0  100.00%  runq_remove [5]
  2.4    530.00     15.00   0  100.00%  ast [6]
  2.2    544.00     14.00   0  100.00%  _mtx_unlock_sleep [7]
  2.1    557.00     13.00   0  100.00%  turnstile_lock [8]
  1.9    569.00     12.00   0  100.00%  choosethread [9]
  1.6    579.00     10.00   0  100.00%  cpu_switch [10]
  1.3    587.00      8.00   0  100.00%  turnstile_release [11]
  1.1    594.00      7.00   0  100.00%  sched_switch [12]
  1.0    600.00      6.00   0  100.00%  sched_add [13]

Drew
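
For anyone re-reading the listings above, here is a minimal sketch of one way to sanity-check a row set. It assumes the columns follow the usual gprof flat-profile order (percent, cumulative, self, calls, a percentage column, name, index); that layout is an assumption, not something stated in the mail. The helper names `parse_rows` and `cumulative_is_running_sum` are hypothetical and only illustrate the arithmetic: under that assumption, each cumulative value should equal the running sum of the self values.

```python
import re

# Rows copied from the "k8-dc-miss" listing in the mail.
ROWS = """\
91.5 6466.00 6466.00 0 100.00% copyout [1]
2.8 6666.00 200.00 0 100.00% soreceive [2]
1.5 6774.00 108.00 0 100.00% uiomoveco [3]
1.0 6846.00 72.00 0 100.00% mb_free_ext [4]
"""

# Assumed column order: percent, cumulative, self, calls, pct, name, [index].
ROW_RE = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+([\d.]+)%\s+(\S+)\s*\[(\d+)\]\s*$"
)


def parse_rows(text):
    """Turn each matching line into a dict; silently skip anything else."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue
        pct, cumulative, self_value, _calls, _pct2, name, _index = m.groups()
        rows.append(
            {
                "name": name,
                "pct": float(pct),
                "cumulative": float(cumulative),
                "self": float(self_value),
            }
        )
    return rows


def cumulative_is_running_sum(rows, tolerance=0.005):
    """Check that every cumulative value is the sum of the self values so far."""
    running = 0.0
    for row in rows:
        running += row["self"]
        if abs(running - row["cumulative"]) > tolerance:
            return False
    return True


if __name__ == "__main__":
    rows = parse_rows(ROWS)
    print("consistent:", cumulative_is_running_sum(rows))
    for row in rows:
        print(f'{row["name"]:>12}  {row["pct"]:5.1f}%  self={row["self"]:.2f}')
```

The same check can be run on the other listings by pasting their rows into `ROWS`; it only works when the rows shown start at the top of the profile and are contiguous, as they are here.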