Date:      Wed, 22 Mar 2017 13:50:35 -0700
From:      Freddie Cash <fjwcash@gmail.com>
To:        Don Lewis <truckman@freebsd.org>
Cc:        freebsd-amd64@freebsd.org
Subject:   Re: FreeBSD on Ryzen
Message-ID:  <CAOjFWZ4T1Z89gjMXF0pLtvwa1c=gWJxyhdvwNDOmnfQRgE1hqQ@mail.gmail.com>
In-Reply-To: <201703222030.v2MKUJJs026400@gw.catspoiler.org>
References:  <201703222030.v2MKUJJs026400@gw.catspoiler.org>

On Wed, Mar 22, 2017 at 1:30 PM, Don Lewis <truckman@freebsd.org> wrote:

> I put together a Ryzen 1700X machine over the weekend and installed the
> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
> DDR4 2400.
>
> First impression is that it's pretty zippy.  Compared to my previous
> fastest machine:
>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
> almost a 2.6x speedup with -j16 as compared to my old machine.
>
> I do see that the reported total CPU time increases quite a bit at -j16
> (~19900u) as compared to -j8 (~13600u) so it is running into some
> hardware bottlenecks that are slowing down instruction execution.  It
> could be the resources shared by both SMT threads that share each core,
> or it could be cache or memory bandwidth related.  The Ryzen topology is
> a bit complicated. There are two groups of four cores, where each group
> of four cores shares half of the L3 cache, with a slowish interconnect
> bus between the groups.  This probably causes some NUMA-like issues.  I
> wonder if the ULE scheduler could be tweaked to handle this better.
>
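
Rough arithmetic on those numbers (reading "a bit more than 2x" as roughly
2.1x): going from -j8 to -j16 costs about 19900/13600 = ~1.46x the
CPU-seconds for about 2.6/2.1 = ~1.24x the throughput.  In other words,
SMT buys roughly 24% more work per wall-clock second while each thread
retires instructions noticeably slower, which is what you'd expect from
two threads contending for each core's shared front-end and execution
resources, before cache or memory bandwidth even enter into it.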

The interconnect, aka Infinity Fabric, runs at the speed of the memory
controller, so if you put faster RAM into the system, the fabric runs
faster, and inter-CCX latency should drop to match.
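
As a rough illustration of how that scales (the exact divider is an
assumption on my part, but on Zen the fabric clock tracks MEMCLK, which
is half the DDR4 transfer rate):

/* fabric_clock.c -- sketch: fabric clock assumed equal to MEMCLK,
 * i.e. half the DDR4 transfer rate (DDR = two transfers per clock). */
#include <stdio.h>

int main(void)
{
    int rates[] = { 2133, 2400, 2666, 2933, 3200 };    /* MT/s */
    size_t i;

    for (i = 0; i < sizeof(rates) / sizeof(rates[0]); i++) {
        int memclk = rates[i] / 2;
        printf("DDR4-%d -> MEMCLK %d MHz -> fabric ~%d MHz\n",
            rates[i], memclk, memclk);
    }
    return (0);
}

So DDR4-2400 gives you a ~1200 MHz fabric, and bumping to DDR4-3200 would
push it to ~1600 MHz, which is where the latency win comes from.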

There's 8 MB of L3 cache shared among the four cores in each CCX (a 2 MB
slice per core), but any core can access data in the L3 cache of any other
core.  Latency for those requests depends on whether it's within the same
CCX (4-core cluster) or in the other CCX (going across the Infinity
Fabric).
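
If you want to see how the kernel actually groups the cores, ULE exports
its view of the topology.  A minimal sketch (assuming a stock ULE kernel,
where kern.sched.topology_spec exists; "sysctl kern.sched.topology_spec"
from the shell shows the same thing):

/* topo.c -- print ULE's CPU topology (SMT groups, cache sharing). */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buf;
    size_t len = 0;

    if (sysctlbyname("kern.sched.topology_spec", NULL, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        return (1);
    }
    if ((buf = malloc(len)) == NULL)
        return (1);
    if (sysctlbyname("kern.sched.topology_spec", buf, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        free(buf);
        return (1);
    }
    printf("%s\n", buf);
    free(buf);
    return (0);
}

Whether the snapshot kernel already knows about the CCX split on Ryzen is
another question; if it only reports the SMT pairs, the scheduler can't do
anything smarter than that.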

There are a lot of finicky timing issues with L3 cache accesses, and with
thread migration (in-CCX vs. across the fabric).
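
For benchmarking, one crude way to take the fabric out of the picture is
to pin a workload to a single CCX.  A sketch using cpuset(2); the mapping
of logical CPUs 0-7 to the first CCX is an assumption and worth checking
against the topology output above (from the shell, "cpuset -l 0-7 <cmd>"
does the same thing):

/* pin_ccx0.c -- restrict the current process to logical CPUs 0-7,
 * assumed (not verified) to be the four cores plus SMT siblings of
 * CCX 0, so its threads never migrate across the Infinity Fabric. */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdio.h>

int main(void)
{
    cpuset_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &mask);

    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(mask), &mask) == -1) {
        perror("cpuset_setaffinity");
        return (1);
    }

    /* exec or fork the workload here; it inherits the affinity mask */
    return (0);
}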

This is a whole other level of NUMA fun.  And it'll get even more fun when
the server version ships, with multiple CCXes in a single CPU, multiple
sockets on a motherboard, and Infinity Fabric joining everything
together.  :)

I feel sorry for the scheduler devs who get to figure all this out.  :D
Supposedly, the Linux folks have this mostly figured out in kernel 4.10,
but I'll wait for the benchmarks to believe it.  There's a bunch up on
Phoronix ... but, well, it's Phoronix.  :)


-- 
Freddie Cash
fjwcash@gmail.com


