Date:      Mon, 1 Aug 2016 00:30:14 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Slawa Olhovchenkov <slw@zxy.spb.ru>
Cc:        Bruce Evans <brde@optusnet.com.au>, Mateusz Guzik <mjg@freebsd.org>,  svn-src-head@freebsd.org, svn-src-all@freebsd.org,  src-committers@freebsd.org
Subject:   Re: svn commit: r303583 - head/sys/amd64/amd64
Message-ID:  <20160801000046.X3364@besplex.bde.org>
In-Reply-To: <20160731135129.GA22212@zxy.spb.ru>
References:  <201607311134.u6VBY81j031059@repo.freebsd.org> <20160731220407.Q3033@besplex.bde.org> <20160731135129.GA22212@zxy.spb.ru>

On Sun, 31 Jul 2016, Slawa Olhovchenkov wrote:

> On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:
>
>> Misalignment of this loop made it almost twice as slow on old Turion2 with
>> slow DDR2 memory.  It made no difference on Haswell.  I added an extra
>> movnti, but that makes little or no difference.  2 more movnti's wouldn't
>> fit in a 16-byte cache line, so they are slower unless even more care is
>> taken with alignment (or, with less care, a loop with 4 misaligned
>> movnti's is at least twice as slow as one with 1 aligned movnti).
>>
>> I thought that alignment and unrolling didn't matter here, because movnti
>> has to wait for memory and almost any loop runs fast enough to keep up.
>> The timing on my old system is something like: CPUs at 2 GHz; main memory
>> at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
>> only affects i386, at least with slow memory).  So sustaining 4 GB/sec
>> requires 1 G movnti's/sec, and at 2 GHz the loop needs to run at 2
>> cycles/iteration to keep up.  But when it is misaligned, it runs at 3-4
>> cycles/iteration.
>> Alignment makes it take about 2, and the extra movnti is for safety and
>> to work with faster memory.
>>
>> On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
>> i386 and 16 GB/sec on amd64 with wider movnti.  IIRC, 16 GB/sec is about
>> the main memory speed, so nothing better is possible, but just 1 extra
>> movnti gives more headroom with faster memory.  This is just worse than bzero()
>
> What about modern system with 120 GB/sec main memory speed?

Is there such a system?  It would have main memory almost twice as fast
as Haswell L2 and almost half as fast as Haswell L1.
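
The throughput arithmetic in this thread can be checked with a small C
sketch.  The helper names are hypothetical; it assumes one movnti per
loop iteration and ignores store latency and write-combining effects.

```c
/* Cycles available per non-temporal store if the loop must issue one
 * store of store_bytes each time memory can absorb that many bytes. */
static double cycles_per_store(double cpu_hz, double mem_bytes_per_sec,
                               double store_bytes)
{
    return cpu_hz * store_bytes / mem_bytes_per_sec;
}

/* Bandwidth (bytes/sec) sustained by a loop that issues one store of
 * store_bytes every cycles_per_iter CPU cycles. */
static double loop_bandwidth(double cpu_hz, double cycles_per_iter,
                             double store_bytes)
{
    return cpu_hz / cycles_per_iter * store_bytes;
}
```

With the figures above, cycles_per_store(2e9, 4e9, 4) is 2.0 (the Turion2
case), loop_bandwidth(4e9, 2, 4) and loop_bandwidth(4e9, 2, 8) are 8e9 and
16e9 bytes/sec (Haswell, i386 and amd64), and cycles_per_store(4e9, 120e9, 8)
is about 0.27, so 120 GB/sec memory would need nearly 4 eight-byte stores
per cycle at 4 GHz.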

My fastest memory actually does 20001 MB/s according to an old memtest,
and that is about right according to other tests.
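
For readers unfamiliar with movnti, a loop of the kind discussed here
(non-temporal 4-byte stores, two per iteration) can be approximated in
user-space C with the SSE2 intrinsic _mm_stream_si32, which compiles to
movnti.  This is a hypothetical sketch, not the pagezero code from r303583,
and plain C gives no control over the loop alignment that mattered on the
Turion2.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (movnti), _mm_sfence */
#include <stddef.h>

/* Zero a buffer with non-temporal 4-byte stores, two per iteration.
 * Assumes buf is 4-byte aligned and len is a multiple of 8 bytes. */
static void zero_nt(void *buf, size_t len)
{
    int *p = (int *)buf;
    int *end = (int *)((char *)buf + len);

    for (; p < end; p += 2) {
        _mm_stream_si32(p, 0);      /* movnti */
        _mm_stream_si32(p + 1, 0);  /* the "extra" movnti */
    }
    /* Non-temporal stores are weakly ordered; fence before anything
     * else relies on seeing the zeros. */
    _mm_sfence();
}
```

Calling zero_nt(buf, 4096) on a 4-byte-aligned page-sized buffer zeroes it
while bypassing the caches, which is the point of using movnti for page
zeroing.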

Bruce


