From owner-freebsd-amd64@freebsd.org Thu Mar 23 19:14:09 2017
From: Stefan Esser <se@freebsd.org>
To: freebsd-amd64@freebsd.org
Subject: Re: FreeBSD on Ryzen
Date: Thu, 23 Mar 2017 20:14:03 +0100
Message-ID: <51b6c5d5-fc66-f371-ef54-c3d85a6f2c2d@freebsd.org>
In-Reply-To: <201703222030.v2MKUJJs026400@gw.catspoiler.org>
List-Id: Porting FreeBSD to the AMD64 platform

On 22.03.17 at 21:30, Don Lewis wrote:
> I put together a Ryzen 1700X machine over the weekend and installed the
> 12.0-CURRENT r315413 snapshot on it a couple of days ago. The RAM is
> DDR4 2400.
>
> First impression is that it's pretty zippy. Compared to my previous
> fastest machine:
> CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
> make -j8 buildworld using tmpfs is a bit more than 2x faster. Since the
> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
> almost a 2.6x speedup with -j16 as compared to my old machine.
>
> I do see that the reported total CPU time increases quite a bit at -j16
> (~19900u) as compared to -j8 (~13600u), so it is running into some
> hardware bottlenecks that are slowing down instruction execution. It
> could be the resources shared by both SMT threads that share each core,

It is the resources shared by the two SMT threads within each core. Under
full CPU load, SMT makes a 3.3 GHz 8-core CPU "simulate" a ~2 GHz 16-core
CPU. The throughput is (to first order) proportional to cores * CPU clock
and comes out as 8 * 3.3 = 26.4 vs. 16 * ~2 = ~32 (estimated).

I'm positively surprised by the observed gain of +30% due to SMT. This
seems to match the reported user times:

  13,600 / 8  = 1,700 seconds user time per physical core (on average)
  19,900 / 16 = 1,244 seconds per virtual (SMT) core

vs. an estimate of the user time for a CPU with SMT but without any gain
in throughput:

  27,200 / 16 = 1,700 seconds per virtual core with ineffective SMT
  (i.e. SMT that does not increase the effective IPC, resulting in
  identical real time compared to the non-SMT case)

This result seems to match the increased performance when going from
-j 8 to -j 16:

  27,200 / 19,900 = 1.37  ~  2.6 / 2.0 = 1.30
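For reference, here is the arithmetic above as a tiny stand-alone C
program (a minimal sketch; the 3.3 GHz clock and the ~2 GHz "effective"
per-thread clock are the same rough estimates as above, not measured
values):

#include <stdio.h>

int
main(void)
{
	/*
	 * Rough assumptions: 8 physical cores at 3.3 GHz, and ~2 GHz
	 * effective clock per hardware thread when both SMT threads
	 * of a core are busy.
	 */
	double cores = 8.0, clk = 3.3, smt_clk = 2.0;

	/* First-order throughput estimate: cores * clock. */
	printf("throughput without SMT: %.1f\n", cores * clk);            /* 26.4 */
	printf("throughput with SMT:    %.1f\n", 2.0 * cores * smt_clk);  /* 32.0 */

	/* Reported user times of the two builds, in seconds. */
	double u_j8 = 13600.0, u_j16 = 19900.0;

	printf("per physical core (-j8):  %.0f s\n", u_j8 / 8.0);
	printf("per virtual core (-j16):  %.0f s\n", u_j16 / 16.0);
	/* Ineffective-SMT baseline: the -j8 work spread over 16 threads. */
	printf("ineffective-SMT baseline: %.0f s\n", 2.0 * u_j8 / 16.0);

	/* Speedup from -j8 to -j16, computed both ways. */
	printf("2*13600/19900 = %.2f   vs.   2.6/2.0 = %.2f\n",
	    2.0 * u_j8 / u_j16, 2.6 / 2.0);
	return (0);
}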
> or it could be cache or memory bandwidth related. The Ryzen topology is
> a bit complicated. There are two groups of four cores, where each group
> of four cores shares half of the L3 cache, with a slowish interconnect
> bus between the groups. This probably causes some NUMA-like issues. I
> wonder if the ULE scheduler could be tweaked to handle this better.

I've been wondering whether it is possible to teach the scheduler about
the effect mentioned above, i.e. by distinguishing an SMT core that
executes only 1 runnable thread from one that executes 2. The latter
should be assumed to run at an estimated 60% clock (which makes the two
threads combined proceed at 120% of the non-SMT speed).

OTOH, the lower "effective clock rate" should be irrelevant under high
load (when all cores are executing 2 threads) or under low load, when
some cores are idle (assuming that the scheduler prefers to assign only
1 thread per core until there are more runnable threads than cores).

If you assume that user time accounting should be a raw measure of the
instructions executed, then assuming a reduced clock rate would lead to
"fairer" results.
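To sketch what I have in mind (purely hypothetical C, not the existing
ULE or accounting code; the 60% factor, the function name and the
sibling_busy flag are invented for illustration): the time charged to a
thread per statclock tick could be scaled down whenever the sibling
hardware thread of the same core was busy during that tick.

#include <stdio.h>

#define SMT_SHARED_PCT	60	/* assumed per-thread speed when the sibling is busy */

/*
 * Hypothetical helper: return the CPU time (in microseconds) to charge
 * for one statclock tick, depending on whether the sibling hardware
 * thread of the same core was busy during that tick.
 */
static int
smt_adjusted_ticklen(int ticklen_us, int sibling_busy)
{
	if (!sibling_busy)
		return (ticklen_us);	/* the thread had the core to itself */
	/* Both hardware threads were busy: charge only ~60% of the tick. */
	return (ticklen_us * SMT_SHARED_PCT / 100);
}

int
main(void)
{
	int tick_us = 7812;	/* example statclock tick length in microseconds */

	printf("charged, sibling idle: %d us\n", smt_adjusted_ticklen(tick_us, 0));
	printf("charged, sibling busy: %d us\n", smt_adjusted_ticklen(tick_us, 1));
	return (0);
}

Something along these lines would make the user time reported for the
-j 16 build comparable to the -j 8 numbers, at the cost of no longer
reflecting the wall-clock occupancy of the core.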