Date: Sun, 22 Jan 2017 19:10:26 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: Mateusz Guzik, Mateusz Guzik, src-committers@freebsd.org,
	svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312600 - head/sys/kern
Message-ID: <20170122171026.GH2349@kib.kiev.ua>
References: <201701211838.v0LIcHIv072626@repo.freebsd.org>
	<20170121195114.GA2349@kib.kiev.ua>
	<20170122115716.T953@besplex.bde.org>
	<20170122092228.GC20930@dft-labs.eu>
	<20170122224849.E897@besplex.bde.org>
	<20170122125716.GD2349@kib.kiev.ua>
	<20170123015806.E1391@besplex.bde.org>
In-Reply-To: <20170123015806.E1391@besplex.bde.org>

On Mon, Jan 23, 2017 at 04:00:19AM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Konstantin Belousov wrote:
>
> > On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> >> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> >>> ...
> >>> I have to disagree about the usefulness remark.  If you check the
> >>> generated assembly for amd64 (included below), you will see that the
> >>> uncommon code is moved out of the way and, in particular, there are
> >>> no forward jumps in the common case.
> >>
> >> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
> >> code has very few possibilities for useful optimizations since it
> >> doesn't have many inner loops.
> >>
> >>> With __predict_false:
> >>>
> >>> [snip prologue]
> >>>    0xffffffff8084ecaf <+31>:	mov	0x24(%rbx),%eax
> >>>    0xffffffff8084ecb2 <+34>:	test	$0x40,%ah
> >>>    0xffffffff8084ecb5 <+37>:	jne	0xffffffff8084ece2
> >>
> >> All that this does is as you say -- invert the condition to jump to the
> >> uncommon code.  This made more of a difference on old CPUs (possibly
> >> still on low-end/non-x86).  Now branch predictors are too good for the
> >> slow case to be much slower.
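
For readers following the disassembly: the hint macros are one-line
wrappers around __builtin_expect() in sys/cdefs.h, and the pattern under
discussion looks like the sketch below.  slow_path() is a made-up stand-in
for the uncommon code; the 0x4000 mirrors the "test $0x40,%ah" above.

/* From sys/cdefs.h: */
#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

int	slow_path(int);			/* hypothetical uncommon path */

int
f(int flags)
{

	if (__predict_false(flags & 0x4000))	/* rarely true */
		return (slow_path(flags));
	return (0);
}

With the hint, the common path falls through and only a forward jne to an
out-of-line block remains, as in the disassembly above; without it, the
compiler may lay the code out the other way around.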
> >>
> >> I think x86 branch predictors initially predict forward branches as not
> >> taken and backward branches as taken.  So this branch was initially
> >> mispredicted, and the change fixes this.  But the first branch doesn't
> >> really matter.  Starting up takes hundreds or thousands of cycles for
> >> cache misses.
> > This is only true if the branch predictor memory is large enough to keep
> > the state for the given branch between successive exercises of it.  Even
> > if the predictor state could be attached to every byte in the icache, or
> > more likely, every line in the icache or uop cache, it is still probably
> > too small to survive user->kernel transitions for syscalls.  There might
> > be a performance counter which shows branch predictor mis-predictions.
> >
> > In other words, I suspect that without the manual hint almost all cases
> > might be mis-predictions, and mis-predictions together with the full
> > pipeline flush on a VFS-intensive load might well account for tens of
> > percent of the total cycles on modern cores.
> >
> > Just speculation.
>
> Check benchmarks.
>
> I looked at the mis-prediction counts mainly for a networking
> micro-benchmark almost 10 years ago.  They seemed to be among the least
> of the performance problems (the main ones were general bloat and cache
> misses).  I think the branch-predictor caches on even 10-year-old x86
> are quite large, enough to hold tens or hundreds of syscalls.  Otherwise
> performance would be lower than it is.
>
> Testing shows that the cache size is about 2048 on Athlon-XP.  I might be
> measuring just the size of the L1 Icache interacting with the branch
> predictor:
>
> The program is for i386 and needs some editing:
>
> X int
> X main(void)
> X {
> X 	asm("					\n\
> X 		pushal				\n\
> X 		movl	$192105,%edi		\n\
>
> Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.
>
> X 	1:					\n\
> X 		# beware of clobbering in padding \n\
> X 		pushal				\n\
> X 		xorl	%eax,%eax		\n\
> X 		# repeat next many times, e.g., 2047 times on Athlon-XP \n\
> X 		jz	2f; .p2align 3; 2:	\n\
>
> With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
> After that, each branch takes 10.8 cycles.
>
> I don't understand why the alignment is needed, but without it each branch
> takes 9 cycles instead of 2 starting with just 2 jz's.

My guess is that this is the predictor granularity issue I alluded to
earlier.  E.g., Agner Fog's manuals state that for K8/K10:
==============================
The branch prediction mechanism allows no more than three taken branches
for every aligned 16-byte block of code.
==============================
This would explain it: a short jz is only 2 bytes, so without alignment
there are eight taken branches in each 16-byte block, while .p2align 3
pads each jz out to 8 bytes, leaving only two per block.

The benchmark does not check the cost of misprediction: since the test
consists only of the thread of branches, there is no speculative state
to unwind.

> "jmp" branches are not predicted any better than the always-taken "jz"
> branches.  Alignment is needed similarly.
>
> Change "jz" to "jnz" to see the speed with branches never taken.  This
> takes 2 cycles for any number of branches up to 8K, when the L1 Icache
> runs out.  Now the default prediction of not-taken is correct, so there
> are no mispredictions.
>
> The alignment costs 0.5 cycles with a small number of jnz's and 0.03
> cycles with a large number of jz's or jmp's.  It helps with a large
> number of jnz's.
>
> X 		popal				\n\
> X 		decl	%edi			\n\
> X 		jne	1b			\n\
> X 		popal				\n\
> X 	");
> X 	return (0);
> X }
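
As for the mis-prediction counter I wondered about above: on FreeBSD this
can be counted with hwpmc(4), which provides generic event aliases, so
something like

	pmcstat -p branches -p branch-mispredicts ./branchtest

should report both counts for a test program.  ("./branchtest" is a
made-up name, and which aliases are actually available depends on the
CPU.)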
>
> Timing on Haswell:
> - Haswell only benefits slightly from the alignment and reaches full
>   speed with ".p2align 2"
> - 1 cycle instead of 2 for branch-not-taken
> - 2.1 cycles instead of 2 minimum for branch-taken
> - predictor cache size 4K instead of 2K
> - 12 cycles instead of 10.8 for branches mispredicted by the default for
>   more than 4K jz's.
>
> Bruce
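
P.S.  For anyone who wants to try this without the hand-editing, here is
my attempt at a self-contained version of the same measurement.  It is a
sketch, not Bruce's exact program: it assumes GNU as (.rept/.endr does
the repetition instead of manual duplication), reads the TSC directly
instead of relying on the tsc_freq trick, and declares clobbers instead
of using pushal, so it should also build on amd64.  Vary NBRANCH and the
.p2align to reproduce the different cases; change jz to jnz for the
never-taken numbers.

#include <stdint.h>
#include <stdio.h>

#define	NBRANCH	2048		/* number of jz's; vary this */
#define	NITER	100000
#define	STR1(x)	#x
#define	STR(x)	STR1(x)

/* Note: rdtsc is not serializing; good enough at these iteration counts. */
static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < NITER; i++) {
		__asm__ __volatile__(
		    "	xorl	%%eax,%%eax	\n\t"	/* ZF=1: jz taken */
		    "	.rept	" STR(NBRANCH) "\n\t"
		    "	jz 2f; .p2align 3; 2:	\n\t"	/* beware padding */
		    "	.endr			\n\t"
		    : : : "eax", "cc");
	}
	t1 = rdtsc();
	printf("%.2f cycles/branch\n",
	    (t1 - t0) / ((double)NITER * NBRANCH));
	return (0);
}

With the numbers quoted above, one would expect about 2 cycles per branch
while NBRANCH fits in the predictor (about 2K on Athlon-XP, 4K on Haswell)
and the mispredicted cost beyond that.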