Date: Sat, 12 Mar 2011 06:11:26 -0500
From: Mehmet Erol Sanliturk <m.e.sanliturk@gmail.com>
To: Martin Matuska <mm@freebsd.org>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, freebsd-current@freebsd.org, freebsd-performance@freebsd.org
Subject: Re: FreeBSD Compiler Benchmark: gcc-base vs. gcc-ports vs. clang
Message-ID: <AANLkTi=AmrB_LYwbzaXEM1HFJM362WgzmA5KD0Exxwzy@mail.gmail.com>
In-Reply-To: <4D7B44AF.7040406@FreeBSD.org>
References: <98496.1299861978@critter.freebsd.dk> <4D7B44AF.7040406@FreeBSD.org>
2011/3/12 Martin Matuska <mm@freebsd.org>

> Hi Poul-Henning,
>
> I have redone the test for the majority of the processors, this time taking
> 5 samples of each whole test run, calculating the average, standard
> deviation, relative standard deviation, standard error and relative
> standard error.
>
> The relative standard error is below 0.25% for ~91%, between 0.25% and
> 0.5% for ~7%, 0.5%-1.0% for ~1% and between 1.0%-2.0% for <1% of the
> tests. By a "test" I mean 5 runs for the same setting of the same
> compiler on the same processor.
>
> So let's say I now have the string/base64 test for a Core i7 showing the
> following (score +/- standard deviation):
> gcc421: 82.7892 points +/- 0.8314 (1%)
> gcc45-nocona: 96.0882 points +/- 1.1652 (1.21%)
>
> For a relative comparison of two settings of the same test I could
> calculate the difference of averages = 13.299 (16.06%) points and sum of
> standard deviations = 2.4834 points (3.00%)
>
> Therefore, assuming normal distribution intervals, I could say that:
> With a 95% probability gcc45-nocona is faster than gcc421 by at least
> 10.18% (16.06 - 1.96x3.00) or with a 99.9% probability by at least 6.12%
> (16.06 - 3.2906x3.00).
>
> So I should probably take a significance level (e.g. 95%, 99% or 99.9%)
> and normalize all the test scores for this level. Results out of the
> interval (difference is below zero) are then not significant.
>
> What significance level should I take?
>
> I hope this approach is better :)
>
> On 11.03.2011 17:46, Poul-Henning Kamp wrote:
> > In message <4D7A42CC.8020807@FreeBSD.org>, Martin Matuska writes:
> >
> >> But what I can say, e.g. for the Intel Atom processor, if there are
> >> performance gains in all but one test (that falls 2% behind), generic
> >> perl code (the routines benchmarked) on this processor is very likely to
> >> run faster with that setup.
> >
> > No, actually you cannot say that, unless you run all the tests at
> > least three times for each compiler(+flag), calculate the average
> > and standard deviation of all the tests, and see which, if any of
> > the results are statistically significant.
> >
> > Until you do that, your numbers are meaningless, because we have no
> > idea what the signal/noise ratio is.
> >

In addition to a possible answer by Poul-Henning Kamp, you may consider the
following pages, because the strength (sensitivity) of a hypothesis test is
determined by statistical power computations:

http://en.wikipedia.org/wiki/Statistical_power
http://en.wikipedia.org/wiki/Statistical_hypothesis_testing
http://en.wikipedia.org/wiki/Category:Hypothesis_testing
http://en.wikipedia.org/wiki/Category:Statistical_terminology

Thank you very much.

Mehmet Erol Sanliturk
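
As a rough illustration of the arithmetic in the quoted message, here is a minimal Python sketch. It is not the script used for the benchmark; the function names `summarize` and `lower_bound_pct` and the five sample scores in `runs` are made up for this example. Only the gcc421 and gcc45-nocona averages, the 16.06% and 3.00% figures, and the z-values 1.96 and 3.2906 are taken from the message.

```python
import math
import statistics

def summarize(samples):
    """Mean, sample standard deviation, standard error, and relative forms in percent."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)        # sample standard deviation (n - 1)
    sem = stdev / math.sqrt(len(samples))    # standard error of the mean
    return {
        "mean": mean,
        "stdev": stdev,
        "rel_stdev_pct": 100.0 * stdev / mean,
        "sem": sem,
        "rel_sem_pct": 100.0 * sem / mean,
    }

def lower_bound_pct(diff_pct, spread_pct, z):
    """Bound as written in the quoted message: difference minus z times the spread."""
    return diff_pct - z * spread_pct

# Hypothetical scores for five runs of one test; these are NOT measured data.
runs = [82.1, 83.4, 82.9, 81.8, 83.7]
stats = summarize(runs)

# Averages quoted in the message for string/base64 on a Core i7.
gcc421, gcc45 = 82.7892, 96.0882
diff_pct = 100.0 * (gcc45 - gcc421) / gcc421

# Percent difference and spread as quoted in the message (16.06 and 3.00).
bound_95 = lower_bound_pct(16.06, 3.00, 1.96)
bound_999 = lower_bound_pct(16.06, 3.00, 3.2906)

print(f"relative standard error of sample runs: {stats['rel_sem_pct']:.2f}%")
print(f"difference of averages: {diff_pct:.2f}%")
print(f"95% bound: {bound_95:.2f}%  99.9% bound: {bound_999:.2f}%")
```

With the rounded inputs 16.06 and 3.00, the 95% bound evaluates to 10.18% as quoted, while the 99.9% bound evaluates to about 6.19% rather than the quoted 6.12%.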