Date: Thu, 7 Jul 2011 13:08:45 -0700
From: Steve Kargl
To: Andriy Gapon
Cc: FreeBSD Current, "Hartmann, O.", Nathan Whitehorn
Subject: Re: Heavy I/O blocks FreeBSD box for several seconds
Message-ID: <20110707200845.GA77049@troutmask.apl.washington.edu>
In-Reply-To: <4E160C2F.8020001@FreeBSD.org>

On Thu, Jul 07, 2011 at 10:42:39PM +0300, Andriy Gapon wrote:
> on 07/07/2011 18:14 Steve Kargl said the following:
> >
> > I'm using OpenMPI.  These are N > Ncpu processes not threads,
>
> I used 'thread' in a sense of a kernel thread.  It shouldn't
> actually matter if it's a process or a thread in userland
> in this context.
>
> > and without
> > the loss of generality let N = Ncpu + 1.  It is a classic master-slave
> > situation where 1 process initializes all others.  The N - 1 slave
> > processes are then independent of each other.  After 20 minutes or so
> > of number crunching, each slave sends a few 10s of KB of data to the
> > master.  The master collects all the data, writes it to disk, and then
> > sends the slaves the next set of computations to do.  The computations
> > are nearly identical, so each slave finishes its task in the same
> > amount of time.  The problem appears to be that 2 slaves are bound to
> > the same cpu and the remaining N - 3 slaves are bound to a specific cpu.
> > The N - 3 slaves finish their task, send data to the master, and then
> > spin (chewing up nearly 100% cpu) waiting for the 2 ping-ponging slaves
> > to finish.  This causes a stall in the computation.  When a complete
> > computation takes days to complete, these stalls become problematic.
> > So, yes, I want the processes to get a more uniform access to cpus via
> > migration to other cpus.  This is what 4BSD appears to do.
>
> I would imagine that periodic rebalancing would take care of this,
> but probably the ULE rebalancing algorithm is not perfect. :-)
> There was a suggestion on performance@ to try to use a lower value for
> kern.sched.steal_thresh, a value of 1 was recommended:
> http://article.gmane.org/gmane.os.freebsd.performance/3459

node16:kargl[215] uname -a
FreeBSD node16.cimu.org 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r223824M: Thu Jul  7 11:12:15 PDT 2011
node16:kargl[216] sysctl -a | grep smp.cpu
kern.smp.cpus: 4

The 4BSD kernel gives for N = Ncpu:

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
 1387 kargl       1  67    0   370M   293M CPU1    1   1:31  98.34% sasmp
 1384 kargl       1  67    0   370M   293M CPU2    2   1:31  98.34% sasmp
 1386 kargl       1  67    0   370M   294M CPU3    3   1:30  98.34% sasmp
 1385 kargl       1  67    0   370M   294M RUN     0   1:31  98.29% sasmp

The 4BSD kernel gives for N = Ncpu + 1:

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
 1417 kargl       1  71    0   370M   294M RUN     0   1:30  79.39% sasmp
 1416 kargl       1  71    0   370M   294M RUN     0   1:30  79.20% sasmp
 1418 kargl       1  71    0   370M   294M CPU2    0   1:29  78.81% sasmp
 1420 kargl       1  71    0   370M   294M CPU1    2   1:30  78.27% sasmp
 1419 kargl       1  70    0   370M   294M CPU3    0   1:30  77.59% sasmp

I then recompiled the kernel to use ULE instead of 4BSD, with the exact
same hardware and kernel configuration.

The ULE kernel gives for N = Ncpu:

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
 1294 kargl       1 103    0   370M   294M CPU3    3   1:30 100.00% sasmp
 1292 kargl       1 103    0   370M   294M RUN     2   1:30 100.00% sasmp
 1295 kargl       1 103    0   370M   293M CPU0    0   1:30 100.00% sasmp
 1293 kargl       1 103    0   370M   294M CPU1    1   1:28 100.00% sasmp

The ULE kernel gives for N = Ncpu + 1:

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
 1318 kargl       1 103    0   370M   294M CPU0    0   1:31 100.00% sasmp
 1319 kargl       1 103    0   370M   294M RUN     1   1:29 100.00% sasmp
 1322 kargl       1  99    0   370M   294M CPU2    2   1:03  87.26% sasmp
 1320 kargl       1  91    0   370M   294M RUN     3   1:07  60.79% sasmp
 1321 kargl       1  89    0   370M   294M CPU3    3   1:06  55.18% sasmp

Lowering kern.sched.steal_thresh to 1, as suggested, with the ULE kernel
and N = Ncpu + 1:

node16:root[165] sysctl -w kern.sched.steal_thresh=1
kern.sched.steal_thresh: 2 -> 1

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 1396 kargl       1 103    0   366M   291M CPU3    3   1:30 100.00% sasmp
 1397 kargl       1 103    0   366M   291M CPU2    2   1:30  99.17% sasmp
 1400 kargl       1  97    0   366M   291M CPU0    0   1:05  83.25% sasmp
 1399 kargl       1  94    0   366M   291M RUN     1   1:04  73.97% sasmp
 1398 kargl       1  98    0   366M   291M RUN     0   1:01  54.05% sasmp

-- 
Steve
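
As a rough illustration of why the uneven CPU shares matter for the
master-slave workload described in the quoted text, the sketch below
compares the wall time of one compute round under two idealized
schedulers: one that spreads CPU time evenly across all processes, and
one that never migrates a process off its CPU.  The function names are
hypothetical, the model ignores the master's own work and all I/O, and
it only assumes the 20-minute task length mentioned above; it is not a
description of how 4BSD or ULE actually make decisions.

```python
# Idealized model of one compute round: every process needs `work`
# minutes of CPU time, and the round ends when the slowest one is done.
# Both functions are illustrative assumptions, not scheduler code.

def round_time_with_migration(work, nproc, ncpu):
    """Wall time when CPU time is shared evenly among all processes."""
    # When oversubscribed, each process gets ncpu/nproc of a CPU,
    # so it needs nproc/ncpu times as long to finish.
    return work * max(1.0, nproc / ncpu)

def round_time_without_migration(work, nproc, ncpu):
    """Wall time when processes never leave the CPU they started on."""
    # The most crowded CPU holds ceil(nproc / ncpu) processes,
    # and they take turns, so that CPU finishes last.
    most_crowded = -(-nproc // ncpu)
    return work * most_crowded

work = 20.0          # minutes of CPU time per task
ncpu = 4             # kern.smp.cpus on the test machine
nproc = ncpu + 1     # the N = Ncpu + 1 case

print("even sharing:", round_time_with_migration(work, nproc, ncpu), "min")
print("no migration:", round_time_without_migration(work, nproc, ncpu), "min")
```

Under these assumptions a round takes 25 minutes with even sharing and
40 minutes without migration, during which the processes that finished
early would sit spinning.  The even share of 4/5 = 80% per process is
close to the 77-79% figures in the 4BSD listing for N = Ncpu + 1.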