From: "O. Hartmann" <ohartman@zedat.fu-berlin.de>
Organization: Freie Universität Berlin
Date: Mon, 12 Oct 2009 07:44:49 +0000
To: Steve Kargl
Cc: freebsd-performance@freebsd.org, freebsd-current@freebsd.org
Subject: Re: Scheduler weirdness

Steve Kargl wrote:
> On Mon, Oct 12, 2009 at 03:35:15PM +1100, Alex R wrote:
>> Steve Kargl wrote:
>>> On Mon, Oct 12, 2009 at 01:49:27PM +1100, Alex R wrote:
>>>> Steve Kargl wrote:
>>>>> So, you have 4 cpus and 4 folding-at-home processes and you're
>>>>> trying to use the system with other apps? Switch to 4BSD.
>>>>>
>>>> I thought SCHED_ULE was meant to be a much better choice under an SMP
>>>> environment. Why are you suggesting he rebuild his kernel and use the
>>>> legacy scheduler?
>>>>
>>> If you have N cpus and N+1 numerically intensive applications,
>>> ULE may have poor performance compared to 4BSD. In the OP's case,
>>> he has 4 cpus and 4 numerically intensive (?) applications. He,
>>> however, is also trying to use the system in some interactive
>>> way.
>>>
>> Ah, ok. Is this just an accepted thing by the FreeBSD devs, or are
>> they trying to fix it?
>>
> Jeff appears to be extremely busy with other projects. He is aware of
> the problem, and I have set up my system to give him access when/if it
> is so desired.
>
> Here's the text of my last set of tests that I sent to him:
>
> OK, I've managed to recreate the problem. User kargl launches an mpi
> job on node10 that creates two images on node20. This is command z
> in the top(1) info. 30 seconds later, user sgk launches an mpi process
> on node10 that creates 8 images on node20. This is command rivmp in
> the top(1) info. With 8 available cpus, this is a (slightly)
> oversubscribed node.
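(Aside, not part of Steve's mail: this kind of oversubscription can be
reproduced without the MPI codes by forking a few more CPU-bound workers
than there are cpus and comparing how much work each one finishes in a
fixed interval. The little C program below is only a sketch; the worker
count, run time, and output format are made up for illustration and are
not the rivmp/z programs shown in the top output.)

/*
 * Hypothetical oversubscription reproducer: fork NWORKERS CPU-bound
 * processes on an N-cpu box with NWORKERS > N and compare how much
 * work each one gets done in RUNTIME seconds.  With fair scheduling
 * the per-worker totals come out roughly equal; the ULE behaviour
 * described in this thread would show a few workers far behind.
 */
#include <sys/types.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NWORKERS 10     /* e.g. 8 cpus, oversubscribed by 2 processes */
#define RUNTIME  60     /* seconds each worker spins */

int
main(void)
{
        for (int i = 0; i < NWORKERS; i++) {
                pid_t pid = fork();

                if (pid == -1)
                        err(1, "fork");
                if (pid == 0) {
                        volatile unsigned long work = 0;
                        time_t stop = time(NULL) + RUNTIME;

                        /* Pure user-time spin; check the clock rarely. */
                        do {
                                for (int k = 0; k < (1 << 20); k++)
                                        work++;
                        } while (time(NULL) < stop);
                        printf("worker %2d (pid %ld): %lu units\n",
                            i, (long)getpid(), work);
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)          /* collect all workers */
                ;
        return (0);
}

Compile with something like "cc -O2 -std=c99 -o spin spin.c", run it while
watching top -P, and compare the per-worker totals under each scheduler.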
>
> For 4BSD, I see
>
> last pid:  1432;  load averages:  8.68,  5.65,  2.82   up 0+01:52:14  17:07:22
> 40 processes:  11 running, 29 sleeping
> CPU: 100% user,  0.0% nice,  0.0% system,  0.0% interrupt,  0.0% idle
> Mem: 32M Active, 12M Inact, 203M Wired, 424K Cache, 29M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
>  1428 sgk         1 124    0 81788K  5848K CPU3    6   1:13  78.81% rivmp
>  1431 sgk         1 124    0 81788K  5652K RUN     1   1:13  78.52% rivmp
>  1415 kargl       1 124    0 78780K  4668K CPU7    1   1:38  78.42% z
>  1414 kargl       1 124    0 78780K  4664K CPU0    0   1:37  77.25% z
>  1427 sgk         1 124    0 81788K  5852K CPU4    3   1:13  78.42% rivmp
>  1432 sgk         1 124    0 81788K  5652K CPU2    4   1:13  78.27% rivmp
>  1425 sgk         1 124    0 81788K  6004K CPU5    5   1:12  78.17% rivmp
>  1426 sgk         1 124    0 81788K  5832K RUN     6   1:13  78.03% rivmp
>  1429 sgk         1 124    0 81788K  5788K CPU6    7   1:12  77.98% rivmp
>  1430 sgk         1 124    0 81788K  5764K RUN     2   1:13  77.93% rivmp
>
> Notice that the accumulated times appear reasonable. At this point in
> the computations, rivmp is doing no communication between processes.
> z is the netpipe benchmark and is essentially sending messages between
> the two processes over the memory bus.
>
> For ULE, I see
>
> last pid:  1169;  load averages:  7.56,  2.61,  1.02   up 0+00:03:15  17:13:01
> 40 processes:  11 running, 29 sleeping
> CPU: 100% user,  0.0% nice,  0.0% system,  0.0% interrupt,  0.0% idle
> Mem: 31M Active, 9392K Inact, 197M Wired, 248K Cache, 26M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
>  1168 sgk         1 118    0 81788K  5472K CPU6    6   1:18 100.00% rivmp
>  1169 sgk         1 118    0 81788K  5416K CPU7    7   1:18 100.00% rivmp
>  1167 sgk         1 118    0 81788K  5496K CPU5    5   1:18 100.00% rivmp
>  1166 sgk         1 118    0 81788K  5564K RUN     4   1:18 100.00% rivmp
>  1151 kargl       1 118    0 78780K  4464K CPU3    3   1:48  99.27% z
>  1152 kargl       1 110    0 78780K  4464K CPU0    0   1:18  62.89% z
>  1164 sgk         1 113    0 81788K  5592K CPU1    1   0:55  80.76% rivmp
>  1165 sgk         1 110    0 81788K  5544K RUN     0   0:52  62.16% rivmp
>  1163 sgk         1 107    0 81788K  5624K RUN     2   0:40  50.68% rivmp
>  1162 sgk         1 107    0 81788K  5824K CPU2    2   0:39  50.49% rivmp
>
> In the above, processes 1162-1165 are clearly not receiving sufficient
> time slices to keep up with the other 4 rivmp images. From watching top
> at a 1 second interval, once 4 of the rivmp images hit 100% CPU, they
> stayed pinned to their cpu and stayed at 100% CPU. It is also seen that
> processes 1152, 1165 and 1162, 1163 are stuck on cpus 0 and 2,
> respectively.
>

This isn't limited to floating-point-intensive applications; even the
operating system itself seems to suffer under SCHED_ULE. I have seen, and
have reported, several performance issues under heavy load where, for
seconds (if not minutes!), 4+ CPU boxes become as stuck as a UP box.
These sticky situations are painful whenever the box needs to be accessed
via X11.

The remaining four FreeBSD 8.0 boxes used for numerical applications in
our lab (the others switched to Linux a long time ago) all use SCHED_ULE,
since that scheduler was introduced as the superior replacement for the
legacy 4BSD. Well, I'll give 4BSD a chance again. At the moment even our
8-core Dell PowerEdge box is in production use, but if there is something
I can do, namely benchmarking, I'll give it a try.

Regards,
Oliver
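P.S. In case anyone else wants to repeat the 4BSD/ULE comparison:
switching a box back to 4BSD is just a kernel config change and a
rebuild. Roughly, assuming a stock amd64 source tree (the config name
MYKERNEL below is only a placeholder):

  # /usr/src/sys/amd64/conf/MYKERNEL
  include         GENERIC
  ident           MYKERNEL
  nooptions       SCHED_ULE
  options         SCHED_4BSD

  cd /usr/src
  make buildkernel KERNCONF=MYKERNEL
  make installkernel KERNCONF=MYKERNEL
  shutdown -r now

After the reboot, "sysctl kern.sched.name" should report 4BSD instead
of ULE.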