Date: Fri, 19 Jan 2024 00:22:01 -0800
From: Mark Millard <marklmi@yahoo.com>
To: freebsd-amd64@freebsd.org
Subject: 7950X3D: using 1 hardware thread per core vs. 2 hardware threads per core: a fairly large difference
Message-ID: <8001F567-7E02-43FB-8D08-D42B560369D8@yahoo.com>
References: <8001F567-7E02-43FB-8D08-D42B560369D8.ref@yahoo.com>

I do not know how much the below generalizes, as I do not have access to other rather modern FreeBSD amd64 systems to test, just the 7950X3D system with 192 GiBytes of RAM.

The gist:

https://gist.github.com/markmi/193423c6fd6f534a72725d7d5cd0236a

is an image showing performance curves for a benchmark. Each curve is for 8 hardware threads in use. The x axis is for the problem size (Bytes, logarithmic scaling). The y axis is performance (linear). (It is a mathematical definition in a mathematical approximation problem that is handled a specific way in the benchmark.)

As the problem size grows significantly larger than a RAM cache, the access pattern makes the RAM cache notably less effective.

The benchmark variant restricts each software thread to a specific hardware thread (singleton cpuset) after the thread starts, generally avoiding losing structural information to thread migration variability in the structures used.

The major performance difference ends up being tied to:

1 hardware thread per core vs. 2 hardware threads per core

A quick textual summary giving a clue is:

1 per core, 8 cores: around 800*(10^6) to 850*(10^6) peak.
2 per core, 4 cores: around 500*(10^6) to 550*(10^6) peak.
(same units)

But far more than just the peaks show large differences, in the same orientation, generally for the same caching. Think of an area under a curve for a size range being important for that size range.

Each hardware thread does independent processing. (But the threads' results are combined to get the overall result for a problem size.) So more RAM cache sharing and other resource sharing is involved for 2 threads per core, and it has non-trivial performance consequences from the competition for shared resources.

The far right of each curve [around 150*(10^6)] vs. the peaks of the curve suggest how much the RAM caching helps the performance (or how much the processor waits for RAM when RAM caching is not very effective vs. when RAM caching is more effective).
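
For anyone wanting to turn the rough figures above into ratios, here is a minimal Python sketch. It uses only the approximate numbers quoted in this message (the peak ranges and the roughly 150*(10^6) far-right value). The names ratio_range and area_over_log_size are my own hypothetical helpers, not part of the benchmark. The area helper assumes a simple trapezoid rule over log10 of the problem size, matching the logarithmic x axis. It is meant for curve data points that someone reads off the image themselves; no such points are supplied here.

```python
import math

# Approximate figures quoted in the message (same performance units throughout).
PEAK_1_PER_CORE = (800e6, 850e6)   # 1 hardware thread per core, 8 cores
PEAK_2_PER_CORE = (500e6, 550e6)   # 2 hardware threads per core, 4 cores
FAR_RIGHT = 150e6                  # rough far-right value of each curve


def ratio_range(numer, denom):
    """Smallest and largest ratio obtainable from two (low, high) ranges."""
    return numer[0] / denom[1], numer[1] / denom[0]


def area_over_log_size(sizes_bytes, perf):
    """Trapezoid-rule area under perf plotted against log10(problem size).

    Hypothetical helper: sizes_bytes and perf would be points read off a
    curve; sizes_bytes must be positive and in increasing order.
    """
    xs = [math.log10(s) for s in sizes_bytes]
    return sum((xs[i + 1] - xs[i]) * (perf[i] + perf[i + 1]) / 2
               for i in range(len(xs) - 1))


# How much higher the 1-per-core peak is than the 2-per-core peak.
peak_ratio = ratio_range(PEAK_1_PER_CORE, PEAK_2_PER_CORE)

# Peak performance relative to the far-right (mostly waiting on RAM) value.
cache_gain_1 = tuple(p / FAR_RIGHT for p in PEAK_1_PER_CORE)
cache_gain_2 = tuple(p / FAR_RIGHT for p in PEAK_2_PER_CORE)

print("peak, 1-per-core over 2-per-core: %.2fx to %.2fx" % peak_ratio)
print("1-per-core peak over far right:   %.2fx to %.2fx" % cache_gain_1)
print("2-per-core peak over far right:   %.2fx to %.2fx" % cache_gain_2)
```

Comparing area_over_log_size results for two curves over the same size range is one way to compare "area under a curve for a size range" rather than just the peaks.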

The RAM is DDR5-5200, 2 DIMMS per channel, 2 channels, 48 GiBytes per DIMM.

Note: The benchmark can also be built to not have the CPU LockDown used, allowing general migration of software threads across the hardware threads in a cpuset. Seeing the CPU LockDown results first can help interpret the messier with-migration results.

===
Mark Millard
marklmi at yahoo.com