From owner-freebsd-current@FreeBSD.ORG  Thu Jul  7 20:08:46 2011
Date: Thu, 7 Jul 2011 13:08:45 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Andriy Gapon <avg@FreeBSD.org>
Message-ID: <20110707200845.GA77049@troutmask.apl.washington.edu>
References: <20110706170132.GA68775@troutmask.apl.washington.edu>
	<5080.1309971941@critter.freebsd.dk>
	<20110706180001.GA69157@troutmask.apl.washington.edu>
	<4E14A54A.4050106@freebsd.org> <4E155FF9.5090905@FreeBSD.org>
	<20110707151440.GA75537@troutmask.apl.washington.edu>
	<4E160C2F.8020001@FreeBSD.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4E160C2F.8020001@FreeBSD.org>
User-Agent: Mutt/1.4.2.3i
Cc: FreeBSD Current <freebsd-current@FreeBSD.org>, "Hartmann,
	O." <ohartman@zedat.fu-berlin.de>,
	Nathan Whitehorn <nwhitehorn@FreeBSD.org>
Subject: Re: Heavy I/O blocks FreeBSD box for several seconds

On Thu, Jul 07, 2011 at 10:42:39PM +0300, Andriy Gapon wrote:
> on 07/07/2011 18:14 Steve Kargl said the following:
>> 
>> I'm using OpenMPI.  These are N > Ncpu processes not threads,
>
> I used 'thread' in a sense of a kernel thread.  It shouldn't
> actually matter if it's a process or a thread in userland
> in this context.
> 
> > and without
> > loss of generality let N = Ncpu + 1.  It is a classic master-slave
> > situation where 1 process initializes all others.  The N - 1 slave processes
> > are then independent of each other.  After 20 minutes or so of number
> > crunching, each slave sends a few tens of KB of data to the master.  The
> > master collects all the data, writes it to disk, and then sends the
> > slaves the next set of computations to do.  The computations are nearly
> > identical, so each slave finishes its task in the same amount of time.  The
> > problem appears to be that 2 slaves are bound to the same cpu and the
> > remaining N - 3 slaves are each bound to a specific cpu.  The N - 3 slaves
> > finish their task, send data to the master, and then spin (chewing up
> > nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> > This causes a stall in the computation.  When a complete computation
> > takes days, these stalls become problematic.  So, yes, I
> > want the processes to get more uniform access to cpus via migration
> > to other cpus.  This is what 4BSD appears to do.
> 
> I would imagine that periodic rebalancing would take care of this,
> but probably the ULE rebalancing algorithm is not perfect.

:-)
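
To put rough numbers on the stall described in the quoted text, here is a
minimal sketch of a toy model.  It assumes equal time slices, no scheduler
overhead, and that every process (including the waiting master) is always
runnable; the function names and the per-cpu load list are illustrative only,
not anything taken from the scheduler.

```python
def round_time(task_time, cpu_loads):
    """Wall-clock time for one round when processes never migrate.

    cpu_loads[i] is the number of runnable processes stuck on cpu i.
    A process sharing its cpu with k-1 others needs k * task_time,
    so the most crowded cpu sets the pace for the whole round.
    """
    return task_time * max(cpu_loads)


def fair_round_time(task_time, n_procs, n_cpus):
    """Idealised round time if cpu time were shared evenly by migration."""
    return task_time * n_procs / n_cpus


# Hypothetical case matching the description: Ncpu = 4, N = Ncpu + 1 = 5,
# tasks of about 20 minutes, two processes stuck on the same cpu.
TASK_MINUTES = 20.0
stuck = round_time(TASK_MINUTES, [2, 1, 1, 1])
even = fair_round_time(TASK_MINUTES, n_procs=5, n_cpus=4)
print(f"no migration: {stuck:.1f} min, even sharing: {even:.1f} min")
```

Under these assumptions the two processes that share a cpu need twice the
nominal task time, while an even split would cost each process only the
factor N / Ncpu.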

> There was a suggestion on performance@ to try to use a lower value for
> kern.sched.steal_thresh, a value of 1 was recommended:
> http://article.gmane.org/gmane.os.freebsd.performance/3459

node16:kargl[215] uname -a
FreeBSD node16.cimu.org 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r223824M:
Thu Jul  7 11:12:15 PDT 2011 

node16:kargl[216] sysctl -a | grep smp.cpu
kern.smp.cpus: 4

The 4BSD kernel gives the following for N = Ncpu.

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1387 kargl       1  67    0   370M   293M CPU1    1   1:31 98.34% sasmp
 1384 kargl       1  67    0   370M   293M CPU2    2   1:31 98.34% sasmp
 1386 kargl       1  67    0   370M   294M CPU3    3   1:30 98.34% sasmp
 1385 kargl       1  67    0   370M   294M RUN     0   1:31 98.29% sasmp

The 4BSD kernel gives the following for N = Ncpu + 1.

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1417 kargl       1  71    0   370M   294M RUN     0   1:30 79.39% sasmp
 1416 kargl       1  71    0   370M   294M RUN     0   1:30 79.20% sasmp
 1418 kargl       1  71    0   370M   294M CPU2    0   1:29 78.81% sasmp
 1420 kargl       1  71    0   370M   294M CPU1    2   1:30 78.27% sasmp
 1419 kargl       1  70    0   370M   294M CPU3    0   1:30 77.59% sasmp


I then recompiled the kernel to use ULE instead of 4BSD, with the exact
same hardware and an otherwise identical kernel configuration.

The ULE kernel gives the following for N = Ncpu.

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1294 kargl       1 103    0   370M   294M CPU3    3   1:30 100.00% sasmp
 1292 kargl       1 103    0   370M   294M RUN     2   1:30 100.00% sasmp
 1295 kargl       1 103    0   370M   293M CPU0    0   1:30 100.00% sasmp
 1293 kargl       1 103    0   370M   294M CPU1    1   1:28 100.00% sasmp

The ULE kernel gives the following for N = Ncpu + 1.

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1318 kargl       1 103    0   370M   294M CPU0    0   1:31 100.00% sasmp
 1319 kargl       1 103    0   370M   294M RUN     1   1:29 100.00% sasmp
 1322 kargl       1  99    0   370M   294M CPU2    2   1:03 87.26% sasmp
 1320 kargl       1  91    0   370M   294M RUN     3   1:07 60.79% sasmp
 1321 kargl       1  89    0   370M   294M CPU3    3   1:06 55.18% sasmp

node16:root[165] sysctl -w kern.sched.steal_thresh=1
kern.sched.steal_thresh: 2 -> 1

With steal_thresh set to 1, the ULE kernel gives the following for N = Ncpu + 1.

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1396 kargl       1 103    0   366M   291M CPU3    3   1:30 100.00% sasmp
 1397 kargl       1 103    0   366M   291M CPU2    2   1:30 99.17% sasmp
 1400 kargl       1  97    0   366M   291M CPU0    0   1:05 83.25% sasmp
 1399 kargl       1  94    0   366M   291M RUN     1   1:04 73.97% sasmp
 1398 kargl       1  98    0   366M   291M RUN     0   1:01 54.05% sasmp
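
One way to compare the snapshots above is to look at the gap between the
largest and smallest cpu share among the sasmp processes.  The following is a
minimal sketch that assumes the top(1) layout shown in this message, with the
percentage in the second-to-last column; the helper names are made up for
illustration, and the sample rows are copied from the N = Ncpu + 1 listings.

```python
def cpu_shares(top_rows):
    """Return the CPU/WCPU percentages from top(1) process rows.

    Assumes the percentage is the second-to-last whitespace-separated
    field, as in the listings in this message.
    """
    shares = []
    for row in top_rows.strip().splitlines():
        fields = row.split()
        shares.append(float(fields[-2].rstrip("%")))
    return shares


def spread(shares):
    """Gap between the best-served and worst-served process, in points."""
    return max(shares) - min(shares)


BSD4_ROWS = """
 1417 kargl       1  71    0   370M   294M RUN     0   1:30 79.39% sasmp
 1416 kargl       1  71    0   370M   294M RUN     0   1:30 79.20% sasmp
 1418 kargl       1  71    0   370M   294M CPU2    0   1:29 78.81% sasmp
 1420 kargl       1  71    0   370M   294M CPU1    2   1:30 78.27% sasmp
 1419 kargl       1  70    0   370M   294M CPU3    0   1:30 77.59% sasmp
"""

ULE_ROWS = """
 1318 kargl       1 103    0   370M   294M CPU0    0   1:31 100.00% sasmp
 1319 kargl       1 103    0   370M   294M RUN     1   1:29 100.00% sasmp
 1322 kargl       1  99    0   370M   294M CPU2    2   1:03 87.26% sasmp
 1320 kargl       1  91    0   370M   294M RUN     3   1:07 60.79% sasmp
 1321 kargl       1  89    0   370M   294M CPU3    3   1:06 55.18% sasmp
"""

for name, rows in (("4BSD", BSD4_ROWS), ("ULE", ULE_ROWS)):
    print(f"{name}: spread = {spread(cpu_shares(rows)):.2f} points")
```

These are single snapshots, so the numbers are only indicative, but the gap
is under 2 points for 4BSD and about 45 points for ULE, with or without
steal_thresh=1.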

-- 
Steve