From owner-freebsd-performance@FreeBSD.ORG Fri Jun 14 11:02:25 2013
From: Remy Nonnenmacher <remy@activnetworks.com>
Date: Fri, 14 Jun 2013 12:50:22 +0200
To: David Xu
Cc: "freebsd-performance@freebsd.org"
Subject: Re: Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket systems
Message-ID: <51BAF56E.7030700@activnetworks.com>
In-Reply-To: <51BA7A78.7010904@freebsd.org>
References: <20130612225849.GA2858@dragon.NUXI.org> <51B9B497.70800@activnetworks.com> <51BA7A78.7010904@freebsd.org>
List-Id: Performance/tuning

On 06/14/13 04:05, David Xu wrote:
> On 2013/06/13 20:01, Remy Nonnenmacher wrote:
>>
>> On 06/13/13 13:32, Mark Felder wrote:
>>> On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien wrote:
>>>
>>>> We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
>>>> considerably better than both on the same machine.
>>>
>>> http://svnweb.freebsd.org/base?view=revision&revision=241246
>>>
>>> The above link is likely why 8.4 is better than 9.1 on the same machine.
>>>
>>>> We've tried various things and haven't been able to explain why FreeBSD
>>>> isn't scaling on the new hardware. Nor why it performs so much worse
>>>> than FreeBSD on the older "M2" machines.
>>>
>>> The CPUs between those machines are quite different. I'm sure we're
>>> looking at different cache sizes, different behavior for the
>>> hyperthreading, etc. I'm sure others would be greatly interested in you
>>> providing the same benchmark results for a recent snapshot of HEAD as
>>> well.
>>
>> We had the same problem on 4x12-core (AMD) machines. After investigating
>> with hwpmc, it turned out that performance was being killed by a
>> scheduler function trying to find the "least used CPU", which
>> unfortunately works on contended structures (i.e. lots of cores are
>> fighting to steal work). A workaround was to use an artificially long
>> queue of stuck processes (steal_thresh bumped to over 8) and to craft
>> CPU affinities by hand.
>>
>> That was a year ago and is from memory. You may want to give it a try
>> and see if it helps.
>>
>> Disregard if a scheduler specialist contradicts this.
>>
>> Thanks.
>>
>
> AMD's caches are very different from Intel's. AFAIK, earlier than
> Bulldozer, AMD's L3 was an exclusive cache; since Bulldozer, AMD
> describes the L3 cache as a "non-inclusive victim cache". It is still
> different from Intel's, which is inclusive.
>
> "- In sched_pickcpu() change general logic of CPU selection. First
> look for idle CPU, sharing last level cache with previously used one,
> skipping SMT CPU groups.
> If none found, search all CPUs for the least loaded
> one, where the thread with its priority can run now. If none found,
> search just for the least loaded CPU."
>
> For an exclusive cache, the L3 holds second-hand data, not hot data, so
> migrating a thread has a negative effect: its hot data is lost. I'd
> prefer to search for an idle CPU from L2 first, then L3.
>

The problem was not really the excellent job done on cache locality via
CPU topology detection. It was more a scaling problem with the number of
cores that exacerbates a contention when cores try to steal work from
each other's queues. Basically, what happened (I say "happened" because I
have not retested recently) is that you may have 1 core running and 47
others fighting in a loop where there is one winner and 46 losers, all of
them playing with locks, in O(N=48) loops. All in all, you see degraded
performance with little indication of the cause. This is where hwpmc is a
wonderful tool...

Bumping steal_thresh up changes the pattern. If it works for you, then
the cause is probably the same.
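For readers wanting to try the workaround described above, a minimal sketch follows. The threshold value and the cpuset arguments are illustrative (the thread only says "over 8"), and the right numbers depend on your core count and workload:

```shell
# Inspect the current ULE work-stealing threshold (the sysctl exists
# under kern.sched on ULE kernels; the default is small).
sysctl kern.sched.steal_thresh

# Raise it so that idle cores only attempt to steal from run queues
# holding many threads, reducing the lock contention described above.
# The value 9 is illustrative ("bumped to over 8"); experiment.
sysctl kern.sched.steal_thresh=9

# "CPU affinity crafting": pin a process to a subset of cores so the
# scheduler stops migrating it across sockets.  PID 1234 and the CPU
# list 0-11 (one 12-core socket) are placeholders for your setup.
cpuset -l 0-11 -p 1234
```

The sysctl change takes effect immediately and can be made persistent via /etc/sysctl.conf once a good value has been found.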
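Since hwpmc keeps coming up as the diagnostic tool here, a rough sketch of how one would use it to spot a scheduler spinning in its CPU-search loop. Event names are CPU-dependent, so "instructions" below is a generic alias and the file paths are placeholders:

```shell
# Load the hwpmc kernel module if it is not compiled into the kernel.
kldload hwpmc

# System-wide top-mode sampling: shows which kernel/user functions are
# consuming the samples.  A contended scheduler loop shows up
# prominently here.  Check `pmccontrol -L` for the events your CPU
# actually offers.
pmcstat -T -S instructions

# Alternatively, sample to a file for 30 seconds and render a
# callgraph afterwards.
pmcstat -S instructions -O /tmp/sample.out sleep 30
pmcstat -R /tmp/sample.out -G /tmp/callgraph.txt
```

The callgraph output makes it easy to see whether the samples concentrate under the scheduler's pick/steal paths rather than in the application itself.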