Message-ID: <4AD2DE71.5060601@zedat.fu-berlin.de>
Date: Mon, 12 Oct 2009 07:44:49 +0000
From: "O. Hartmann"
Organization: Freie Universität Berlin
To: Steve Kargl
Cc: "freebsd-performance@freebsd.org", freebsd-current@freebsd.org
Subject: Re: Scheduler weirdness
In-Reply-To: <20091012044911.GA39479@troutmask.apl.washington.edu>
List-Id: Discussions about the use of FreeBSD-current

Steve Kargl wrote:
> On Mon, Oct 12, 2009 at 03:35:15PM +1100, Alex R wrote:
>> Steve Kargl wrote:
>>> On Mon, Oct 12, 2009 at 01:49:27PM +1100, Alex R wrote:
>>>> Steve Kargl wrote:
>>>>> So, you have 4 cpus and 4 folding-at-home processes and you're
>>>>> trying to use the system with other apps?  Switch to 4BSD.
>>>>
>>>> I thought SCHED_ULE was meant to be a much better choice under an SMP
>>>> environment. Why are you suggesting he rebuild his kernel and use the
>>>> legacy scheduler?
>>>
>>> If you have N cpus and N+1 numerical intensitive applications,
>>> ULE may have poor performance compared to 4BSD.  In OP's case,
>>> he has 4 cpus and 4 numerical intensity (?) applications.  He,
>>> however, also is trying to use the system in some interactive
>>> way.
>>
>> Ah ok. Is this just an accepted thing by the freebsd dev's or are they
>> trying to fix it?
>
> Jeff appears to be extremely busy with other projects.  He is aware of
> the problem, and I have set up my system to give him access when/if it
> is so desired.
>
> Here's the text of my last set of tests that I sent to him
>
> OK, I've managed to recreate the problem.  User kargl launches a mpi
> job on node10 that creates two images on node20.  This is command z
> in the top(1) info.  30 seconds later, user sgk launches a mpi process
> on node10 that creates 8 images on node20.  This is command rivmp in
> top(1) info.  With 8 available cpus, this is a (slightly) oversubscribed
> node.
>
> For 4BSD, I see
>
> last pid:  1432;  load averages:  8.68,  5.65,  2.82  up 0+01:52:14  17:07:22
> 40 processes:  11 running, 29 sleeping
> CPU:  100% user,  0.0% nice,  0.0% system,  0.0% interrupt,  0.0% idle
> Mem: 32M Active, 12M Inact, 203M Wired, 424K Cache, 29M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
>  1428 sgk         1 124    0 81788K  5848K CPU3   6   1:13  78.81% rivmp
>  1431 sgk         1 124    0 81788K  5652K RUN    1   1:13  78.52% rivmp
>  1415 kargl       1 124    0 78780K  4668K CPU7   1   1:38  78.42% z
>  1414 kargl       1 124    0 78780K  4664K CPU0   0   1:37  77.25% z
>  1427 sgk         1 124    0 81788K  5852K CPU4   3   1:13  78.42% rivmp
>  1432 sgk         1 124    0 81788K  5652K CPU2   4   1:13  78.27% rivmp
>  1425 sgk         1 124    0 81788K  6004K CPU5   5   1:12  78.17% rivmp
>  1426 sgk         1 124    0 81788K  5832K RUN    6   1:13  78.03% rivmp
>  1429 sgk         1 124    0 81788K  5788K CPU6   7   1:12  77.98% rivmp
>  1430 sgk         1 124    0 81788K  5764K RUN    2   1:13  77.93% rivmp
>
> Notice, the accumulated times appear reasonable.  At this point in the
> computations, rivmp is doing no communication between processes.  z is
> the netpipe benchmark and is essentially sending messages between the
> two processes over the memory bus.
>
> For ULE, I see
>
> last pid:  1169;  load averages:  7.56,  2.61,  1.02  up 0+00:03:15  17:13:01
> 40 processes:  11 running, 29 sleeping
> CPU:  100% user,  0.0% nice,  0.0% system,  0.0% interrupt,  0.0% idle
> Mem: 31M Active, 9392K Inact, 197M Wired, 248K Cache, 26M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
>  1168 sgk         1 118    0 81788K  5472K CPU6   6   1:18 100.00% rivmp
>  1169 sgk         1 118    0 81788K  5416K CPU7   7   1:18 100.00% rivmp
>  1167 sgk         1 118    0 81788K  5496K CPU5   5   1:18 100.00% rivmp
>  1166 sgk         1 118    0 81788K  5564K RUN    4   1:18 100.00% rivmp
>  1151 kargl       1 118    0 78780K  4464K CPU3   3   1:48  99.27% z
>  1152 kargl       1 110    0 78780K  4464K CPU0   0   1:18  62.89% z
>  1164 sgk         1 113    0 81788K  5592K CPU1   1   0:55  80.76% rivmp
>  1165 sgk         1 110    0 81788K  5544K RUN    0   0:52  62.16% rivmp
>  1163 sgk         1 107    0 81788K  5624K RUN    2   0:40  50.68% rivmp
>  1162 sgk         1 107    0 81788K  5824K CPU2   2   0:39  50.49% rivmp
>
> In the above, processes 1162-1165 are clearly not receiving sufficient time
> slices to keep up with the other 4 rivmp images.  From watching top at a
> 1 second interval, once the 4 rivmp hit 100% CPU, they stayed pinned to
> their cpu and stay at 100% CPU.  It is also seen that processes 1152, 1165
> and 1162, 1163 are stuck on cpus 0 and 2, respectively.
>

This isn't limited to floating-point-intensive applications; even the
operating system itself seems to suffer under SCHED_ULE. I have seen and
reported several performance issues under heavy load: for seconds (if
not minutes!), boxes with 4+ CPUs get as stuck as a UP box does. Those
sticky situations are painful when the box needs to be accessed via X11.

The remaining four FreeBSD 8.0 boxes used for numerical applications in
our lab (others switched to Linux a long time ago) all use SCHED_ULE,
as this scheduler was introduced as the superior successor to the
legacy 4BSD. Well, I'll give 4BSD a chance again.
At the moment, even our 8-core Dell PowerEdge box is in production use,
but if there is something I can do, such as benchmarking, I'll give it
a try.

Regards,
Oliver
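
The imbalance Steve describes can be put in numbers from the two top(1) listings quoted above. The following is only a rough sketch in Python: the TIME values are copied from the rivmp rows of those listings, while the helper names (`to_seconds`, `spread`) and the min/max ratio as a fairness measure are assumptions made for this illustration, not anything used in the thread. The two snapshots were taken at different points in the runs, so only the spread within each scheduler is meaningful, not the absolute times across them.

```python
def to_seconds(t):
    """Convert a top(1) TIME field such as '1:13' (minutes:seconds) to seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)


# Accumulated TIME of the eight rivmp images, copied from the quoted listings.
rivmp_time = {
    "4BSD": ["1:13", "1:13", "1:13", "1:13", "1:12", "1:13", "1:12", "1:13"],
    "ULE": ["1:18", "1:18", "1:18", "1:18", "0:55", "0:52", "0:40", "0:39"],
}


def spread(times):
    """Return (min, max, min/max) of the accumulated CPU times in seconds.

    A min/max ratio close to 1.0 means all images received a similar share
    of CPU time. This ratio is a hypothetical metric chosen for illustration.
    """
    secs = [to_seconds(t) for t in times]
    return min(secs), max(secs), min(secs) / max(secs)


for sched, times in rivmp_time.items():
    lo, hi, ratio = spread(times)
    print(f"{sched}: min={lo}s max={hi}s min/max={ratio:.2f}")
```

With these figures the 4BSD images lie within one second of each other (72 s to 73 s), while under ULE the slowest image has accumulated half the CPU time of the fastest (39 s against 78 s).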