Date: Thu, 22 Dec 2011 16:59:32 -0800
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Adrian Chadd <adrian@freebsd.org>
Cc: freebsd-stable@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: SCHED_ULE should not be the default
Message-ID: <20111223005932.GA47439@troutmask.apl.washington.edu>
In-Reply-To: <CAJ-VmokeyDrKb-yQkzTm8tnOYcRm603hz%2B6nen10F3zFQVmCEQ@mail.gmail.com>
References: <4EE1EAFE.3070408@m5p.com> <CAJ-FndBSOS3hKYqmPnVkoMhPmowBBqy9-%2BeJJEMTdoVjdMTEdw@mail.gmail.com> <20111215215554.GA87606@troutmask.apl.washington.edu> <CAJ-FndD0vFWUnRPxz6CTR5JBaEaY3gh9y7-Dy6Gds69_aRgfpg@mail.gmail.com> <20111222005250.GA23115@troutmask.apl.washington.edu> <20111222103145.GA42457@onelab2.iet.unipi.it> <20111222184531.GA36084@troutmask.apl.washington.edu> <4EF37E7B.4020505@FreeBSD.org> <20111222194740.GA36796@troutmask.apl.washington.edu> <CAJ-VmokeyDrKb-yQkzTm8tnOYcRm603hz%2B6nen10F3zFQVmCEQ@mail.gmail.com>
On Thu, Dec 22, 2011 at 04:23:29PM -0800, Adrian Chadd wrote:
> On 22 December 2011 11:47, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>
> > There is the additional observation in one of my 2008
> > emails (URLs have been posted) that if you have N+1
> > cpu-bound jobs with, say, job0 and job1 ping-ponging
> > on cpu0 (due to ULE's cpu-affinity feature) and if I
> > kill job2 running on cpu1, then neither job0 nor job1
> > will migrate to cpu1.  So, one now has N cpu-bound
> > jobs running on N-1 cpus.
>
> .. and this sounds like a pretty serious regression. Have you ever
> filed a PR for it?

No. I was interacting directly with jeffr in 2008. I got as far as setting up root access on a node for jeffr. Unfortunately, both jeffr and I got busy with real life, and 4BSD allowed me to get my work done.

> > Finally, my initial post in this email thread was to
> > tell O. Hartman to quit beating his head against
> > a wall with ULE (in an HPC environment).  Switch to
> > 4BSD.  This was based on my 2008 observations and
> > I've now wasted 2 days gathering additional information
> > which only re-affirms my recommendation.
>
> I personally don't think this is time wasted. You've done something
> that no one else has actually done: provided actual results from
> real-life testing, rather than a hundred posts of "I remember seeing
> X, so I don't use ULE."
>
> If you can definitely and consistently reproduce that N-1 cpu bound
> job bug, you're now in a great position to easily test and re-report
> KTR/schedtrace results to see what impact they have. Please don't
> underestimate exactly how valuable this is.

I'll try this tomorrow. I first need to modify the code I used in the 2008 test to disable IO, so that it is nearly completely cpu-bound.

> How often are those two jobs migrating between CPUs? How am I supposed
> to read "CPU load"? Why isn't it just sitting at 100% the whole time?
This is my first foray into ktr and schedgraph, so I may have done something incorrectly. In particular, it seems that schedgraph takes the cpu clock as a command line argument, so there is probably some scaling that I'm missing.

> Would you mind repeating this with 4BSD (the N+1 jobs) so we can see
> how the jobs are scheduled/interleaved? Something tells me we'll see
> the jobs being scheduled evenly.

Sure, I'll do this tomorrow as well.

--
Steve
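The N+1 job scenario quoted above can be made concrete with a toy model. This is only a sketch under loudly stated assumptions: jobs that share a CPU split it evenly, and no job ever migrates to another CPU. It reproduces the outcome described in the thread, not ULE's actual affinity or load-balancing logic, and the helpers cpu_shares, idle_cpus and fair_share are hypothetical names chosen for illustration.

```python
from collections import Counter


def cpu_shares(placement):
    """Map each job to the fraction of one CPU it receives.

    Toy model (assumption): jobs on the same CPU split it evenly and
    never migrate. This is NOT the ULE scheduler's algorithm.
    """
    jobs_per_cpu = Counter(placement.values())
    return {job: 1.0 / jobs_per_cpu[cpu] for job, cpu in placement.items()}


def idle_cpus(placement, ncpus):
    """Count CPUs that have no job assigned to them."""
    return ncpus - len(set(placement.values()))


def fair_share(njobs, ncpus):
    """Per-job CPU fraction if the jobs were spread evenly over all CPUs."""
    return min(1.0, ncpus / njobs)


if __name__ == "__main__":
    ncpus = 4
    # N+1 = 5 cpu-bound jobs; job0 and job1 ping-pong on cpu0.
    placement = {"job0": 0, "job1": 0, "job2": 1, "job3": 2, "job4": 3}
    print(cpu_shares(placement))  # job0/job1 get 0.5, the others get 1.0
    print(fair_share(5, ncpus))   # 0.8 each with even time-sharing

    # Kill job2 on cpu1. With no migration, cpu1 sits idle.
    del placement["job2"]
    print(cpu_shares(placement))  # job0/job1 still get 0.5 each
    print(idle_cpus(placement, ncpus), fair_share(4, ncpus))  # 1 idle CPU; 1.0 was possible
```

With 4 CPUs, killing job2 leaves 4 jobs on 3 busy CPUs, an aggregate of 3.0 CPUs of work instead of the 4.0 that moving job0 or job1 to cpu1 would give.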