Date:      Mon, 29 Aug 2016 20:37:26 +0300
From:      Slawa Olhovchenkov <slw@zxy.spb.ru>
To:        Bruce Evans <bde@FreeBSD.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r305004 - in head/sys: amd64/amd64 amd64/include i386/i386
Message-ID:  <20160829173726.GX22212@zxy.spb.ru>
In-Reply-To: <201608291307.u7TD7L6H025649@repo.freebsd.org>
References:  <201608291307.u7TD7L6H025649@repo.freebsd.org>

On Mon, Aug 29, 2016 at 01:07:21PM +0000, Bruce Evans wrote:

> Author: bde
> Date: Mon Aug 29 13:07:21 2016
> New Revision: 305004
> URL: https://svnweb.freebsd.org/changeset/base/305004
> 
> Log:
>   On amd64, declare sse2_pagezero() and start using it again, but only
>   for zeroing pages in idle where nontemporal writes are clearly best.
>   This is almost a no-op since zeroing in idle works does nothing good
>   and is off by default.  Fix END() statement forgotten in previous
>   commit.
>   
>   Align the loop in sse2_pagezero().  Since it writes to main memory,
>   the loop doesn't have to be very carefully written to keep up.
>   Unrolling it was considered useless or harmful and was not done on
>   i386, but that was too careless.
>   
>   Timing for i386: the loop was not unrolled at all, and moved only 4
>   bytes/iteration.  So on a 2GHz CPU, it needed to run at 2 cycles/
>   iteration to keep up with a memory speed of just 4GB/sec.  But when
>   it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/
>   iteration so it gave a maximum speed of 2.67GB/sec and couldn't even
>   keep up with PC3200 memory.  Fix the alignment so that it keep up with
>   4GB/sec memory, and unroll once to get nearer to 8GB/sec.  Further
>   unrolling might be useless or harmful since it would prevent the loop
>   fitting in 16-bytes.  My test system with an old CPU and old DDR1 only
>   needed 5+ GB/sec.  My test system with a new CPU and DDR3 doesn't need
>   any changes to keep up ~16GB/sec.
>   
>   Timing for amd64: with 8-byte accesses and newer faster CPUs it is
>   easy to reach 16GB/sec but not so easy to go much faster.  The
>   alignment doesn't matter much if the CPU is not very old.  The loop
>   was already unrolled 4 times, but needs 32 bytes and uses a fancy
>   method that doesn't work for 2-way unrolling in 16 bytes.  Just
>   align it to 32-bytes.
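
The i386 timing figures in the log follow from simple arithmetic: bandwidth is clock rate times bytes per iteration divided by cycles per iteration. A minimal sketch to check them is below. The function name loop_bandwidth_gbs is hypothetical and exists only for this illustration, and the 8 bytes/iteration figure for the once-unrolled loop is my reading of the log, not something it states directly.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 * GHz * bytes/iteration / cycles/iteration gives GB/sec.
 */
double
loop_bandwidth_gbs(double cpu_ghz, double bytes_per_iter,
    double cycles_per_iter)
{
	assert(cycles_per_iter > 0.0);
	return (cpu_ghz * bytes_per_iter / cycles_per_iter);
}

/*
 * Figures from the log, for a 2GHz CPU:
 *   loop_bandwidth_gbs(2.0, 4.0, 2.0)  aligned loop, 4GB/sec
 *   loop_bandwidth_gbs(2.0, 4.0, 3.0)  loop crossing a 16-byte
 *                                      boundary, about 2.67GB/sec
 *   loop_bandwidth_gbs(2.0, 8.0, 2.0)  unrolled once (assumed 8
 *                                      bytes/iteration), 8GB/sec
 */
```

The 2.67GB/sec case is below the 3.2GB/sec peak of PC3200 memory, which is why the misaligned loop could not keep up.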

Have you considered using nontemporal writes for copying from user
space to kernel space? In many cases the copied data needs no parsing
and just goes to DMA, so nontemporal writes would avoid polluting the cache.
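
To make the question concrete, here is a rough userland sketch of such a copy, using the same kind of nontemporal integer store (movnti) through the compiler intrinsic _mm_stream_si64(). The name nt_copy is hypothetical and is not an existing kernel routine. A real copyin() variant would also need fault handling for user addresses and handling of unaligned heads and tails, all of which this omits. On targets other than amd64 with SSE2, the sketch just falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, for illustration only.  Copies len bytes to
 * dst with nontemporal stores so that the destination does not
 * displace useful data from the cache.  Assumes dst is 8-byte
 * aligned and len is a multiple of 8.
 */
void
nt_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst & 7) == 0 && (len & 7) == 0);
#if defined(__SSE2__) && defined(__x86_64__)
	long long *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < len / 8; i++) {
		long long v;

		/* Ordinary load; src may have any alignment. */
		memcpy(&v, s + 8 * i, sizeof(v));
		/* movnti: store that bypasses the cache. */
		_mm_stream_si64(&d[i], v);
	}
	/* Order the nontemporal stores before later stores. */
	_mm_sfence();
#else
	/* No movnti available here; plain copy. */
	memcpy(dst, src, len);
#endif
}
```

Whether this wins would depend on the consumer: if the kernel reads the data soon after the copy, the cache misses would likely cost more than the pollution that was avoided.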
