Date: Fri, 14 Jun 2013 12:50:22 +0200
From: Remy Nonnenmacher <remy.nonnenmacher@activnetworks.com>
To: David Xu <davidxu@freebsd.org>
Cc: "freebsd-performance@freebsd.org" <freebsd-performance@freebsd.org>
Subject: Re: Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket systems
Message-ID: <51BAF56E.7030700@activnetworks.com>
In-Reply-To: <51BA7A78.7010904@freebsd.org>
References: <20130612225849.GA2858@dragon.NUXI.org> <op.wyl7oryz34t2sn@markf.office.supranet.net> <51B9B497.70800@activnetworks.com> <51BA7A78.7010904@freebsd.org>

On 06/14/13 04:05, David Xu wrote:
> On 2013/06/13 20:01, Remy Nonnenmacher wrote:
>>
>> On 06/13/13 13:32, Mark Felder wrote:
>>> On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien <obrien@freebsd.org>
>>> wrote:
>>>
>>>> We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
>>>> considerably better than both on the same machine.
>>>
>>> http://svnweb.freebsd.org/base?view=revision&revision=241246
>>>
>>> The above link is likely why 8.4 is better than 9.1 on the same machine.
>>>
>>>> We've tried various things and haven't been able to explain why FreeBSD
>>>> isn't scaling on the new hardware. Nor why it performs so much worse
>>>> than FreeBSD on the older "M2" machines.
>>>
>>> The CPUs between those machines are quite different. I'm sure we're
>>> looking at different cache sizes, different behavior for the
>>> hyperthreading, etc. I'm sure others would be greatly interested in you
>>> providing the same benchmark results for a recent snapshot of HEAD as
>>> well.
>>> _______________________________________________
>>> freebsd-performance@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-performance
>>> To unsubscribe, send any mail to
>>> "freebsd-performance-unsubscribe@freebsd.org"
>>
>> We had the same problem on 4x12-core (AMD) machines. After investigating
>> with hwpmc, it appeared that performance was killed by a scheduler
>> function trying to find the "least used cpu", which unfortunately works on
>> contended structures (i.e. lots of cores are fighting to get work). A
>> solution was found by using an artificially long queue of stuck processes
>> (steal_thresh bumped to over 8) and by crafting cpu affinity.
>>
>> This was a year ago and from memory. I guess you may give it a try to
>> see if it helps.
>>
>> Disregard if a scheduler specialist contradicts.
>>
>> Thanks.
>>
>
> AMD's cache is very different from Intel's. AFAIK, before Bulldozer,
> AMD's L3 was an exclusive cache; with Bulldozer, AMD describes the L3 cache
> as a “non-inclusive victim cache”. It is still different from Intel's,
> which is inclusive.
>
> "- In sched_pickcpu() change general logic of CPU selection. First
> look for idle CPU, sharing last level cache with previously used one,
> skipping SMT CPU groups. If none found, search all CPUs for the least
> loaded one, where the thread with its priority can run now. If none
> found, search just for the least loaded CPU."
>
> With an exclusive cache, the L3 holds second-hand data, not hot data, so
> migrating a thread has a negative effect: its hot data is lost.
> I'd prefer to search for an idle CPU sharing L2 first, then L3.
>
>

The problem was not really the excellent job done on cache locality via
cpu detection. It was more a scaling problem with the number of cores,
which exacerbates contention when trying to steal work from other
queues. Basically, what happened (I say happened because I've not
retested recently) is that you may have 1 core running and 47 others
fighting in a loop where there is one winner and 46 losers, all of them
playing with locks and O(N=48) loops.

All in all, you see degraded performance with little indication of a
cause. This is where hwpmc is a wonderful tool...

Bumping steal_thresh up changes the pattern. If it works for you, then
the cause is probably the same.
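
To make the "one winner and 46 losers" pattern concrete, here is a toy model in Python. It is only a sketch of the idea, not the scheduler's code: the function name, the load figures, the threshold values and the rule that a victim keeps its running thread are all assumptions made for illustration. Each idle CPU treats a queue as worth stealing from only when its load reaches steal_thresh.

```python
def simulate_round(loads, steal_thresh):
    """Toy model of one work-stealing round (hypothetical, not ULE code).

    loads[i] is the number of threads on CPU i's run queue, including the
    one currently running. Every idle CPU (load 0) looks for a victim
    whose load is at least steal_thresh.

    Returns (winners, losers): idle CPUs that obtained a thread, and idle
    CPUs that competed for a victim and came away empty-handed.
    """
    idle = [cpu for cpu, load in enumerate(loads) if load == 0]
    victims = [cpu for cpu, load in enumerate(loads) if load >= steal_thresh]

    # Assumption: a victim keeps its running thread, so only load - 1
    # threads per victim can actually be stolen.
    stealable = sum(loads[cpu] - 1 for cpu in victims)

    # If nobody qualifies as a victim, no idle CPU competes at all.
    contenders = len(idle) if victims else 0
    winners = min(contenders, stealable)
    losers = contenders - winners
    return winners, losers


# 48 CPUs: one core with a running thread plus one waiting, 47 idle.
loads = [2] + [0] * 47

print(simulate_round(loads, steal_thresh=2))  # low threshold: everyone fights
print(simulate_round(loads, steal_thresh=8))  # high threshold: nobody tries
```

With the low threshold all 47 idle CPUs chase the single spare thread; with the higher one the short queue is simply left alone, which is the change of pattern described above. The model says nothing about the cost of the locks themselves, so measuring with hwpmc remains the way to confirm the cause on a real machine.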