Date: Thu, 15 Dec 2011 21:46:27 +0200
From: Ivan Klymenko <fidaj@ukr.net>
To: Attilio Rao <attilio@freebsd.org>
Cc: "O. Hartmann" <ohartman@mail.zedat.fu-berlin.de>, Current FreeBSD <freebsd-current@freebsd.org>, freebsd-stable@freebsd.org, freebsd-performance@freebsd.org, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject: Re: SCHED_ULE should not be the default
Message-ID: <20111215214627.16f472bf@nonamehost.>
In-Reply-To: <CAJ-FndDnk%2BtMCuY=VRkLurRc8qKLuYjeCuuuK=1%2Bk7cyTFumQA@mail.gmail.com>
References: <4EE1EAFE.3070408@m5p.com> <4EE22421.9060707@gmail.com> <4EE6060D.5060201@mail.zedat.fu-berlin.de> <20111213073615.GA69641@icarus.home.lan> <CAJ-FndCoxXV-dOT4QAzt-Qs%2BzUyCGfeFPgbAx%2BpTot8SrVXA7w@mail.gmail.com> <20111215174857.GA28551@icarus.home.lan> <CAJ-FndDnk%2BtMCuY=VRkLurRc8qKLuYjeCuuuK=1%2Bk7cyTFumQA@mail.gmail.com>

On Thu, 15 Dec 2011 20:02:44 +0100, Attilio Rao <attilio@freebsd.org> wrote:

> 2011/12/15 Jeremy Chadwick <freebsd@jdc.parodius.com>:
> > On Thu, Dec 15, 2011 at 05:26:27PM +0100, Attilio Rao wrote:
> >> 2011/12/13 Jeremy Chadwick <freebsd@jdc.parodius.com>:
> >> > On Mon, Dec 12, 2011 at 02:47:57PM +0100, O. Hartmann wrote:
> >> >> > Not fully right, boinc defaults to run on idprio 31 so this
> >> >> > isn't an issue. And yes, there are cases where SCHED_ULE
> >> >> > shows much better performance than SCHED_4BSD. [...]
> >> >>
> >> >> Do we have any proof at hand for such cases where SCHED_ULE
> >> >> performs much better than SCHED_4BSD? Whenever the subject
> >> >> comes up, it is mentioned that SCHED_ULE has better
> >> >> performance on boxes with ncpu > 2. But in the end I see
> >> >> contradictory statements here. People complain about poor
> >> >> performance (especially in scientific environments), and
> >> >> others counter that this is not the case.
> >> >>
> >> >> Within our department, we developed a highly scalable code for
> >> >> planetary science purposes on imagery. It utilizes GPUs
> >> >> via OpenCL if present. Otherwise it grabs as many cores as it
> >> >> can. By the end of this year I'll get a new desktop box based
> >> >> on Intel's new Sandy Bridge-E architecture with plenty of
> >> >> memory. If the colleague who developed the code is willing
> >> >> to perform some benchmarks on the same hardware platform, we'll
> >> >> benchmark both FreeBSD 9.0/10.0 and the most recent Suse. For
> >> >> FreeBSD I also intend to compare performance with both
> >> >> available schedulers.
> >> >
> >> > This is in no way, shape, or form the same kind of benchmark as
> >> > what you're planning to do, but I thought I'd throw it out there
> >> > for folks to take in as they see fit.
> >> >
> >> > I know folks were focused mainly on buildworld.
> >> >
> >> > I personally would find it interesting if someone with a
> >> > higher-end system (e.g. 2 physical CPUs, with 6 or 8 cores per
> >> > CPU) were to do the same test (changing -jX to -j{numofcores} of
> >> > course).
> >> >
> >> > --
> >> > | Jeremy Chadwick                              jdc at parodius.com |
> >> > | Parodius Networking                     http://www.parodius.com/ |
> >> > | UNIX Systems Administrator                 Mountain View, CA, US |
> >> > | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >> >
> >> >
> >> > sched_ule
> >> > ===========
> >> > - time make -j2 buildworld
> >> >   1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
> >> > - time make -j2 buildkernel
> >> >   640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w
> >> >
> >> >
> >> > sched_4bsd
> >> > ============
> >> > - time make -j2 buildworld
> >> >   1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
> >> > - time make -j2 buildkernel
> >> >   638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w
> >> >
> >> >
> >> > software
> >> > ==========
> >> > * sched_ule test:  FreeBSD 8.2-STABLE, Thu Dec  1 04:37:29 PST 2011
> >> > * sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011
> >>
> >> Hi Jeremy,
> >> thanks for the time you spent on this.
> >>
> >> However, I wanted to ask/let you note 3 things:
> >> 1) Did you use 2 different code bases for the test? (one updated on
> >> December 1 and another one on December 12)
> >
> > No; src-all (/usr/src on this system) was not updated between
> > December 1st and December 12th PST.  I do believe I updated it
> > today (15th PST).
> > I can/will obviously hold off so that we have a
> > consistent code base for comparing numbers between schedulers
> > during buildworld and/or buildkernel.
> >
> >> 2) Please note that you should have repeated this test several
> >> times (basically until you get a standard deviation which is
> >> acceptable with ministat) and report the ministat output
> >
> > This is the first time I have heard of ministat(1).  I'm pretty
> > sure I see what it's for and how it applies to this situation, but
> > boy that man page could use some clarification (I have 3 people
> > looking at this thing right now trying to figure out what means
> > what in the graph :-) ). Anyway, graph or not, I see the point.
> >
> > Regarding multiple tests: yup, you're absolutely right, the only
> > way to do it would be to run a sequence of tests repeatedly
> > (probably 10 per scheduler).  Reboots and rm -fr /usr/obj/* would
> > be required after each test too, to guarantee empty kernel caches
> > (of all types) consistently every time.
> >
> > What I posted was supposed to give people just a "general idea" if
> > there was any gigantic difference between the two, and there really
> > isn't. But, as others have stated (and you below), buildworld may
> > not be an effective way to "benchmark" what we're trying to test.
> >
> > Hence me wondering exactly what would make for a good test.
> > Example:
> >
> > 1. Run + background some program that "beats on things" (I really
> >    don't know what; creation/deletion of threads?  CPU benchmark?
> >    bonnie++?), with output going to /dev/null.
> > 2. Run + background "time make -j2 buildworld" with output going
> >    to /dev/null
> > 3. Record/save output from "time".
> > 4. rm -fr /usr/obj && shutdown -r now
> > 5. Repeat all steps ~10 times
> > 6. Adjust kernel configuration file to use other scheduler
> > 7. Repeat steps 1-5.
> >
> > What I'm trying to figure out is what #1 and #2 should be in the
> > above example.
> >
> >> 3) The difference is less than 2% which I suspect is really
> >> statistically insignificant/the same
> >
> > Understood.
> >
> >> I'm not really even surprised ULE is not faster than 4BSD in this
> >> case because usually buildworld/buildkernel tests are driven for
> >> the vast majority by I/O overhead rather than scheduler capacity.
> >> It would be more interesting to analyze how buildworld does while
> >> another type of workload is going on.
> >
> > Yup, agreed/understood, hence me trying to find out what would
> > classify as a good stress test for all of this.
> >
> > I have a testbed system in my garage which I could set up to
> > literally do all of this in a loop, meaning automate the entire
> > above process and just let it go, writing stderr from time to a
> > file (which wouldn't skew the results at all).
> >
> > Let me know what #1 and #2 above, re: "the workloads", should be and
> > I'll be happy to set it up.
>
> My idea, in order to gather meaningful data for both ULE and 4BSD,
> would be to see how well they behave in the following situations:
> - 2 concurrent interactive workloads
> - 2 concurrent cpu-intensive workloads
> - mixed
>
> and having the number of threads for both varying as: N/2, N, N +
> small_amount (1 or 2 or 3, etc), N*2 (where N is the number of
> available CPUs) which automatically translates into:
>
> - 2 concurrent interactive and intensive (A and B workloads):
>   * A N/2 threads, B N/2 threads
>   * A N threads, B N/2 threads
>   * A N + small_amount, B N/2 threads
>   * A N*2 threads, B N/2 threads
>   * A N threads, B N threads
>   * A N + small_amount, B N threads
>   * A N*2 threads, B N threads
>   * A N + small_amount, B N + small_amount threads
>   * A N*2 threads, B N + small_amount threads
>   * A N*2 threads, B N*2 threads
>
> For the mixed case, instead, we should possibly try all the 16
> combinations, and it is likely the most interesting case, to be honest.
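
The thread-count combinations listed above can be enumerated mechanically rather than typed by hand. What follows is only a minimal sketch in Python, not an agreed tool: the function names are hypothetical, N/2 is assumed to round down for odd N, and small_amount is assumed to default to 1. It yields the 10 unordered pairs for two workloads of the same kind and all 16 ordered pairs for the mixed case.

```python
from itertools import combinations_with_replacement, product


def thread_levels(ncpu, small_amount=1):
    """Return the four thread counts from the mail: N/2, N, N+small_amount, N*2.

    Assumption: N/2 is rounded down for odd N.
    """
    levels = [ncpu // 2, ncpu, ncpu + small_amount, ncpu * 2]
    # Guard against degenerate inputs (e.g. ncpu=1) where levels collapse
    # or stop being strictly increasing.
    if levels[0] < 1 or sorted(set(levels)) != levels:
        raise ValueError("ncpu/small_amount do not give four distinct levels")
    return levels


def same_kind_matrix(ncpu, small_amount=1):
    """(A, B) thread counts for two workloads of the same kind.

    A and B are interchangeable here, so only pairs with A >= B are kept,
    which gives the 10 cases listed in the mail, in the same order.
    """
    levels = thread_levels(ncpu, small_amount)
    return [(a, b) for b, a in combinations_with_replacement(levels, 2)]


def mixed_matrix(ncpu, small_amount=1):
    """(A, B) thread counts for the mixed case: all 16 ordered pairs."""
    levels = thread_levels(ncpu, small_amount)
    return list(product(levels, repeat=2))


if __name__ == "__main__":
    for a, b in same_kind_matrix(8):
        print(f"A {a} threads, B {b} threads")
```

The mixed case keeps the order of each pair because A and B are different kinds of workload there, so (N, N/2) and (N/2, N) are distinct tests.
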
>
> About the workload, we could use:
> interactive: buildworld and bonnie++ (I'm not totally sure if
> bonnie++ lets you decide how many threads to run, but I'm sure we can
> replace it with something that really does that)
> cpu-intensive: dnetc and SOMETHINGELSE (please propose something that
> can be set up very easily!)
> mixed case: buildworld and dnetc
>
> About the environment I'd suggest the following things:
> - Try to boot with a maximum of 16 CPUs. I'm sure past that point TLB
>   shootdown overhead is going to be too overwhelming, make doesn't
>   really scale well, and also there could be too much contention on
>   vm_page_lock_queue for interactive threads.
> - Try to reduce the I/O effect by using tmpfs as storage for input and
>   output data when running the benchmark
> - Use 10.0 with both kernel and userland totally debug-free (please
>   remember to set MALLOC_PRODUCTION in jemalloc) and always at the same
>   svn revision, with the only changes being the scheduler switch and the
>   number of threads during the runs
>
> About the test itself I'd suggest the following things:
> - After every test combination, please reboot the machine (like, after
>   you have tested the A N/2 threads and B N/2 threads case on
>   sched_4bsd, reboot the machine before doing A N threads and B N/2
>   threads)
> - For every test combination I suggest running the workloads 4 times,
>   discarding the first one (but keep the value!) and running ministat
>   on the other three. Showing the "uncached" case against the average
>   cached one will give much more indication than expected.
> - Expect ministat to report a difference at the 95% confidence level
>   (or beyond) for the result to be valuable
> - For every difference in performance we find, we should likely start
>   to worry if it is 3% or bigger, and be very concerned from 5%
>   upward
>
> I think we already have some data on ULE being broken in some cases
> (like George's and Steven's case) but we really need to characterize
> more, I think.
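
The bookkeeping around each test combination (keep the first "uncached" run apart, summarize the remaining ones, flag differences of 3% and 5%) could look roughly like the sketch below. It is only an illustration in Python under assumptions: the function names, the dictionary keys, and the wording of the labels are hypothetical, and the percentage is taken relative to the 4BSD mean. It does not replace ministat, which is still what decides whether a difference is statistically significant.

```python
from statistics import mean, stdev


def summarize(runs):
    """Split one combination's wall-clock times (seconds, in run order).

    The first run is reported separately as the "uncached" value; the
    remaining runs are averaged, as suggested in the mail.
    """
    if len(runs) < 3:
        raise ValueError("need the first run plus at least two more")
    first, rest = runs[0], runs[1:]
    return {
        "uncached": first,
        "cached_mean": mean(rest),
        "cached_stdev": stdev(rest),  # sample standard deviation
    }


def percent_diff(ule_mean, bsd_mean):
    """Difference of ULE relative to 4BSD, in percent (positive: ULE slower)."""
    return (ule_mean - bsd_mean) * 100.0 / bsd_mean


def classify(diff_pct):
    """Apply the 3% / 5% thresholds from the mail to a percent difference."""
    magnitude = abs(diff_pct)
    if magnitude >= 5.0:
        return "very concerning"
    if magnitude >= 3.0:
        return "worth worrying about"
    return "below threshold"
```

A caller would feed summarize() the four timings of one combination per scheduler, then pass the two cached means to percent_diff() and classify(); the raw cached timings would still go to ministat unchanged.
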
>
> Now, I understand this seems a gigantic amount of work but I think there
> are many people interested in working on this and we may scatter these
> tests around, to different testers, to find meaningful data.
>
> If it were me, I would start with comparisons involving all the N and N
> + small_amount cases which should be the most interesting.
>
> Do you have questions?
>
> Thanks,
> Attilio
>
>

Perhaps it makes sense to co-write a script to automate these actions?
And place it in /usr/src/tools/sched/...