From owner-freebsd-performance@FreeBSD.ORG Fri Jun 14 11:02:25 2013
From: Remy Nonnenmacher <remy@activnetworks.com>
Date: Fri, 14 Jun 2013 12:50:22 +0200
To: David Xu
Cc: "freebsd-performance@freebsd.org"
Subject: Re: Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket systems
Message-ID: <51BAF56E.7030700@activnetworks.com>
In-Reply-To: <51BA7A78.7010904@freebsd.org>
References: <20130612225849.GA2858@dragon.NUXI.org> <51B9B497.70800@activnetworks.com> <51BA7A78.7010904@freebsd.org>
List-Id: Performance/tuning

On 06/14/13 04:05, David Xu wrote:
> On 2013/06/13 20:01, Remy Nonnenmacher wrote:
>>
>> On 06/13/13 13:32, Mark Felder wrote:
>>> On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien wrote:
>>>
>>>> We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
>>>> considerably better than both on the same machine.
>>>
>>> http://svnweb.freebsd.org/base?view=revision&revision=241246
>>>
>>> The above link is likely why 8.4 is better than 9.1 on the same machine.
>>>
>>>> We've tried various things and haven't been able to explain why FreeBSD
>>>> isn't scaling on the new hardware. Nor why it performs so much worse
>>>> than FreeBSD on the older "M2" machines.
>>>
>>> The CPUs between those machines are quite different. I'm sure we're
>>> looking at different cache sizes, different behavior for the
>>> hyperthreading, etc. I'm sure others would be greatly interested in you
>>> providing the same benchmark results for a recent snapshot of HEAD as
>>> well.
>>
>> We had the same problem on 4x12-core (AMD) machines. After investigating
>> with hwpmc, it turned out that performance was being killed by a
>> scheduler function trying to find the "least used CPU", which
>> unfortunately works on contended structures (i.e. lots of cores are
>> fighting to steal work). A workaround was to use an artificially long
>> queue of stuck processes (steal_thresh bumped to over 8) and to craft
>> CPU affinities by hand.
>>
>> That was a year ago and is from memory. You may want to give it a try
>> and see if it helps.
>>
>> Disregard if a scheduler specialist contradicts this.
>>
>> Thanks.
>>
>
> AMD's caches are very different from Intel's. AFAIK, earlier than
> Bulldozer, AMD's L3 was an exclusive cache; since Bulldozer, AMD
> describes the L3 cache as a "non-inclusive victim cache". It is still
> different from Intel's, which is inclusive.
>
> "- In sched_pickcpu() change general logic of CPU selection. First
> look for idle CPU, sharing last level cache with previously used one,
> skipping SMT CPU groups.
> If none found, search all CPUs for the least loaded
> one, where the thread with its priority can run now. If none found,
> search just for the least loaded CPU."
>
> For an exclusive cache, the L3 holds second-hand data, not hot data, so
> migrating a thread has a negative effect: its hot data is lost. I'd
> prefer to search for an idle CPU from L2 first, then L3.
>

The problem was not really the excellent job done on cache locality via
CPU topology detection. It was more a scaling problem with the number of
cores that exacerbates a contention when cores try to steal work from
each other's queues. Basically, what happened (I say "happened" because I
have not retested recently) is that you may have 1 core running and 47
others fighting in a loop where there is one winner and 46 losers, all of
them playing with locks, in O(N=48) loops. All in all, you see degraded
performance with little indication of the cause. This is where hwpmc is a
wonderful tool...

Bumping steal_thresh up changes the pattern. If it works for you, then
the cause is probably the same.
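For readers wanting to try the workaround described above, a minimal sketch follows. The threshold value and the cpuset arguments are illustrative (the thread only says "over 8"), and the right numbers depend on your core count and workload:

```shell
# Inspect the current ULE work-stealing threshold (the sysctl exists
# under kern.sched on ULE kernels; the default is small).
sysctl kern.sched.steal_thresh

# Raise it so that idle cores only attempt to steal from run queues
# holding many threads, reducing the lock contention described above.
# The value 9 is illustrative ("bumped to over 8"); experiment.
sysctl kern.sched.steal_thresh=9

# "CPU affinity crafting": pin a process to a subset of cores so the
# scheduler stops migrating it across sockets.  PID 1234 and the CPU
# list 0-11 (one 12-core socket) are placeholders for your setup.
cpuset -l 0-11 -p 1234
```

The sysctl change takes effect immediately and can be made persistent via /etc/sysctl.conf once a good value has been found.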
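Since hwpmc keeps coming up as the diagnostic tool here, a rough sketch of how one would use it to spot a scheduler spinning in its CPU-search loop. Event names are CPU-dependent, so "instructions" below is a generic alias and the file paths are placeholders:

```shell
# Load the hwpmc kernel module if it is not compiled into the kernel.
kldload hwpmc

# System-wide top-mode sampling: shows which kernel/user functions are
# consuming the samples.  A contended scheduler loop shows up
# prominently here.  Check `pmccontrol -L` for the events your CPU
# actually offers.
pmcstat -T -S instructions

# Alternatively, sample to a file for 30 seconds and render a
# callgraph afterwards.
pmcstat -S instructions -O /tmp/sample.out sleep 30
pmcstat -R /tmp/sample.out -G /tmp/callgraph.txt
```

The callgraph output makes it easy to see whether the samples concentrate under the scheduler's pick/steal paths rather than in the application itself.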