Date:      Thu, 22 Dec 2011 11:31:45 +0100
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc:        Attilio Rao <attilio@freebsd.org>, Andrey Chernov <ache@nagual.pp.ru>, George Mitchell <george+freebsd@m5p.com>, Doug Barton <dougb@freebsd.org>, freebsd-stable@freebsd.org
Subject:   Re: SCHED_ULE should not be the default
Message-ID:  <20111222103145.GA42457@onelab2.iet.unipi.it>
In-Reply-To: <20111222005250.GA23115@troutmask.apl.washington.edu>
References:  <4EE1EAFE.3070408@m5p.com> <CAJ-FndBSOS3hKYqmPnVkoMhPmowBBqy9-%2BeJJEMTdoVjdMTEdw@mail.gmail.com> <20111215215554.GA87606@troutmask.apl.washington.edu> <CAJ-FndD0vFWUnRPxz6CTR5JBaEaY3gh9y7-Dy6Gds69_aRgfpg@mail.gmail.com> <20111222005250.GA23115@troutmask.apl.washington.edu>

On Wed, Dec 21, 2011 at 04:52:50PM -0800, Steve Kargl wrote:
> On Fri, Dec 16, 2011 at 12:14:24PM +0100, Attilio Rao wrote:
> > 2011/12/15 Steve Kargl <sgk@troutmask.apl.washington.edu>:
> > > On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
> > >>
> > >> I basically went through all the e-mail you just sent and identified 4
> > >> real reports that we could work on, and summarized them in the attached
> > >> Excel file.
> > >> I'd like George, Steve, Doug, Andrey and Mike to review the
> > >> little data there and add more, if they want, or make more important
> > >> clarifications, in particular about the presence (or absence) of Xorg
> > >> in their workload.
> > >
> > > Your summary of my observations appears correct.
> > >
> > > I have grabbed an up-to-date /usr/src, built and
> > > installed world, and built and installed a new
> > > kernel on one of the nodes in my cluster.  It
> > > has
> > >
> > 
> > It seems a perfect environment, just please make sure you made a
> > debug-free userland (setting MALLOC_PRODUCTION in jemalloc basically).
> > 
> > The first thing is, can you try reproducing your case? As far as I got
> > it, for you it was enough to run N + small_amount of CPU-bound threads
> > to show performance penalty, so I'd ask you to start with using dnetc
> > or just your preferred cpu-bound workload and verify you can reproduce
> > the issue.
> > While that happens, please monitor thread bouncing and CPU utilization
> > via 'top' (you don't need to be 100% precise, just get an idea, and
> > keep an eye on things like excessive thread migration, obsessive thread
> > binding, low throughput on CPU).
> > One note: if your workloads need to do I/O, please use tmpfs or
> > memory storage for it, in order to minimize I/O effects.
> > Also, verify this doesn't happen with the 4BSD scheduler, just in case.
> > 
> > Finally, if the problem is still in place, please recompile your
> > kernel by adding:
> > options KTR
> > options KTR_ENTRIES=262144
> > options KTR_COMPILE=(KTR_SCHED)
> > options KTR_MASK=(KTR_SCHED)
> > 
> > And reproduce the issue.
> > When you are in the middle of the scheduling issue go with:
> > # ktrdump -ctf > ktr-ule-problem-YOURNAME.out
> > 
> > and send it to the mailing list along with your dmesg and the
> > information on CPU utilization you gathered with top(1).
> > 
> > That should cover it all, but if you have further questions, please
> > just go ahead.
> 
> Attilio,
> 
> I have placed several files at
> 
> http://troutmask.apl.washington.edu/~kargl/freebsd
> 
> dmesg.txt      --> dmesg for ULE kernel
> summary        --> A summary that includes top(1) output of all runs.
> sysctl.ule.txt --> sysctl -a for the ULE kernel
> ktr-ule-problem-kargl.out.gz 
> 
> I performed a series of tests with both 4BSD and ULE kernels.
> The 4BSD and ULE kernels are identical except of course for the
> scheduler.  Both witness and invariants are disabled, and malloc
> has been compiled without debugging.
> 
> Here's what I did.  On the master node in my cluster, I ran an
> OpenMPI code that sends N jobs off to the node with the kernel
> of interest.  There is communication between the master and
> slaves to generate 16 independent chunks of data.  Note, there
> is no disk IO.  So, for example, N=4 will start 4 essentially
> identical numerically intensive jobs.  At the start of a run,
> the master node instructs each slave job to create a chunk of
> data.  After the data is created, the slave sends it back to the
> master and the master sends instructions to create the next chunk
> of data.  This communication continues until the 16 chunks have
> been assigned, computed, and returned to the master.  
> 
> Here is a rough measurement of the problem with ULE and numerically
> intensive loads.  This command is executed on the master:
> 
> time mpiexec -machinefile mf3 -np N sasmp sas.in
> 
> Since time is executed on the master, only the 'real' time is of
> interest (the summary file includes user and sys times).  This
> command is run 5 times for each N value and up to 10 times for
> some N values with the ULE kernel.  The following table records
> the average 'real' time; the number in (...) is the mean
> absolute deviation.
> 
> #  N         ULE             4BSD
> # -------------------------------------
> #  4    223.27 (0.502)   221.76 (0.551)
> #  5    404.35 (73.82)   270.68 (0.866)
> #  6    627.56 (173.0)   247.23 (1.442)
> #  7    475.53 (84.07)   285.78 (1.421)
> #  8    429.45 (134.9)   223.64 (1.316)

One explanation for ULE taking roughly 1.5-2.5x as long (for N >= 5)
is that the threads are not migrated properly, so you end up with idle
cores and ready threads not running. The other possible explanation
would be that there are migrations, but they are so frequent and
expensive that they completely thrash the caches; this seems
unlikely for this type of task.
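
To put numbers on the slowdown, here is a small sketch (in Python; the
names real_times and slowdown are made up for illustration) that divides
the average ULE 'real' time by the 4BSD one for each N, using only the
figures from the table quoted above. By this arithmetic the ratio is
about 1.0 at N=4 and between roughly 1.5 and 2.5 for N=5..8.

```python
# Average 'real' times copied from the table quoted above.
# Key: N, value: (ULE, 4BSD).
real_times = {
    4: (223.27, 221.76),
    5: (404.35, 270.68),
    6: (627.56, 247.23),
    7: (475.53, 285.78),
    8: (429.45, 223.64),
}


def slowdown(times):
    """Return {N: ULE time / 4BSD time} for every N in the table."""
    return {n: ule / bsd for n, (ule, bsd) in times.items()}


if __name__ == "__main__":
    for n, ratio in slowdown(real_times).items():
        print(f"N={n}: ULE/4BSD = {ratio:.2f}")
```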

Also, perhaps one could build a simple test process that replicates
this workload (so one can run it as part of regression tests):
	1. define a CPU-intensive function f(n) which issues no
	   system calls, optionally touching a lot of memory,
	   where n determines the number of iterations.
	2. by trial and error (or let the program find it),
	   pick a value N1 so that the minimum execution time
	   of f(N1) is in the 10..100ms range.
	3. now run the function f() again from an outer loop so
	   that the total execution time is large (10..100s),
	   again with no intervening system calls.
	4. use an external shell script that reruns a process
	   when it terminates, and run multiple instances
	   in parallel. Instead of the external script one could
	   fork new instances before terminating, but I am a bit
	   unclear how CPU inheritance works when a process forks.
	   Going through the shell possibly breaks the chain.
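
A rough sketch of steps 1-3, in Python for brevity (a C version would
follow the same shape). Apart from f(), every name here (timed,
calibrate, run, the buffer size, the mixing arithmetic) is an assumption
for illustration, not something taken from an existing test. The
interpreter may still allocate memory now and then, so "no system calls"
holds only approximately.

```python
import time


def f(n, buf):
    """Step 1: n passes over buf, pure arithmetic, no explicit system calls."""
    acc = 0
    size = len(buf)
    for i in range(n):
        for j in range(size):
            acc = (acc * 31 + buf[j] + i) & 0xFFFFFFFF
            buf[j] = acc & 0xFF  # touch memory on every iteration
    return acc


def timed(func, *args):
    """Wall-clock time of one call, in seconds."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def calibrate(buf, target=0.010, tries=3):
    """Step 2: double n until the minimum time of f(n) reaches target seconds.

    Returns (N1, minimum time of f(N1)). Because n doubles from below the
    target, the result lands near 10..20 ms if f scales about linearly.
    """
    n = 1
    while True:
        best = min(timed(f, n, buf) for _ in range(tries))
        if best >= target:
            return n, best
        n *= 2


def run(total_seconds=10.0, buf_size=4096):
    """Step 3: repeat f(N1) for about total_seconds.

    The repetition count is fixed up front, so the outer loop itself
    makes no timing calls. Returns the measured elapsed time.
    """
    buf = bytearray(buf_size)
    n1, per_call = calibrate(buf)
    reps = max(1, int(total_seconds / per_call))
    start = time.perf_counter()
    for _ in range(reps):
        f(n1, buf)
    return time.perf_counter() - start


# Example: print(run(30.0)) for about 30 s of pure computation.
```

For step 4, one would start several copies of this from a shell loop
that restarts each copy when it exits, and compare the elapsed time
each copy reports under ULE and under 4BSD.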

cheers
luigi

> These numbers to me demonstrate that ULE is not a good choice
> for a HPC workload.
> 
> If you need more information, feel free to ask.  If you would
> like access to the node, I can probably arrange that.  But,
> we can discuss that off-line.
> 
> -- 
> Steve
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"


