Date: Thu, 23 Mar 2017 20:14:03 +0100
From: Stefan Esser <se@freebsd.org>
To: freebsd-amd64@freebsd.org
Subject: Re: FreeBSD on Ryzen
Message-ID: <51b6c5d5-fc66-f371-ef54-c3d85a6f2c2d@freebsd.org>
In-Reply-To: <201703222030.v2MKUJJs026400@gw.catspoiler.org>
References: <201703222030.v2MKUJJs026400@gw.catspoiler.org>

On 22.03.17 at 21:30, Don Lewis wrote:
> I put together a Ryzen 1700X machine over the weekend and installed the
> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
> DDR4 2400.
>
> First impression is that it's pretty zippy.  Compared to my previous
> fastest machine:
>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, I get
> almost a 2.6x speedup with -j16 as compared to my old machine.
>
> I do see that the reported total CPU time increases quite a bit at -j16
> (~19900u) as compared to -j8 (~13600u) so it is running into some
> hardware bottlenecks that are slowing down instruction execution.  It
> could be the resources shared by both SMT threads that share each core,

It is the resources shared by the two SMT threads of each core.

Under full CPU load, SMT makes a 3.3 GHz 8-core CPU "simulate" a ~2 GHz
16-core CPU.

The throughput is (to first order) proportional to cores * CPU clock,
and comes out as

    8 * 3.3 = 26.4   vs.   16 * ~2 = ~32 (estimated)

I'm positively surprised by the observed gain of +30% due to SMT.

This seems to match the reported user times:

    13,600 /  8 = 1,700 seconds user time per physical core (on average)
    19,900 / 16 = 1,244 seconds per virtual (SMT) core

vs. an estimate for a CPU with SMT but without any gain in throughput:

    27,200 / 16 = 1,700 seconds per virtual core with ineffective SMT

(i.e. assuming SMT that does not increase effective IPC, resulting in
identical real time compared to the non-SMT case)

This result seems to match the increased performance when going from
-j 8 to -j 16:

    27,200 / 19,900 = 1.37 ~ 2.6 / 2.0 = 1.3

> or it could be cache or memory bandwidth related.  The Ryzen topology is
> a bit complicated.  There are two groups of four cores, where each group
> of four cores shares half of the L3 cache, with a slowish interconnect
> bus between the groups.  This probably causes some NUMA-like issues.  I
> wonder if the ULE scheduler could be tweaked to handle this better.

I've been wondering whether it is possible to teach the scheduler about
the above-mentioned effect, i.e. by distinguishing an SMT core that
executes only 1 runnable thread from one that executes 2. The latter
should be assumed to run at an estimated 60% clock (which makes the two
threads together proceed at 120% of the non-SMT speed).

OTOH, the lower "effective clock rate" should be irrelevant under high
load (when all cores are executing 2 threads), or under low load, when
some cores are idle (assuming that the scheduler prefers to assign only
1 thread to each core until there are more runnable threads than cores).

If you assume that user time accounting is a raw measure of instructions
executed, then assuming a reduced clock rate would lead to "fairer"
results.
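
The arithmetic above can be re-checked with a few lines of Python. This is only a sketch: the user times and speedups are the approximate figures quoted in this thread, all function names are made up for illustration, and the 60% factor in `charged_user_time` is the rough estimate from this message, not anything the ULE scheduler actually implements.

```python
# Approximate figures quoted in the thread (not new measurements).
USER_J8 = 13_600.0    # total user seconds, make -j8 buildworld
USER_J16 = 19_900.0   # total user seconds, make -j16 buildworld
SPEEDUP_J8 = 2.0      # vs. the FX-8320E machine ("a bit more than 2x")
SPEEDUP_J16 = 2.6     # vs. the FX-8320E machine ("almost 2.6x")


def per_cpu_user_time(total_user_seconds, cpus):
    """Average user time charged to each (physical or virtual) CPU."""
    return total_user_seconds / cpus


def smt_gain_from_user_time(user_j8, user_j16, threads_per_core=2):
    """Throughput gain implied by the user times.

    With ineffective SMT, both threads of a core would be charged for the
    full non-SMT run, i.e. threads_per_core * user_j8 in total.
    """
    return threads_per_core * user_j8 / user_j16


def smt_gain_from_speedups(speedup_j8, speedup_j16):
    """Throughput gain implied by the wall-clock speedups."""
    return speedup_j16 / speedup_j8


def effective_clock_fraction(gain, threads_per_core=2):
    """Per-thread 'effective clock' under the cores * clock model."""
    return gain / threads_per_core


def charged_user_time(seconds, sibling_busy, smt_clock_fraction=0.6):
    """Hypothetical 'fairer' accounting: scale the time a thread ran while
    its SMT sibling was busy by the assumed effective clock fraction."""
    return seconds * (smt_clock_fraction if sibling_busy else 1.0)


if __name__ == "__main__":
    print(per_cpu_user_time(USER_J8, 8))
    print(per_cpu_user_time(USER_J16, 16))
    print(smt_gain_from_user_time(USER_J8, USER_J16))
    print(smt_gain_from_speedups(SPEEDUP_J8, SPEEDUP_J16))
```

Both estimates of the SMT gain land between 1.3 and 1.4, which under the cores * clock model corresponds to an effective per-thread clock of roughly 65% to 70% of the non-SMT clock, in the same range as the ~2 GHz and 60% guesses above.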