From owner-freebsd-current@FreeBSD.ORG  Tue Jul 12 08:05:21 2011
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B5781106564A
	for <freebsd-current@FreeBSD.org>; Tue, 12 Jul 2011 08:05:21 +0000 (UTC)
	(envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 056888FC14
	for <freebsd-current@FreeBSD.org>; Tue, 12 Jul 2011 08:05:20 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
	[212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id LAA07126;
	Tue, 12 Jul 2011 11:05:16 +0300 (EEST)
	(envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
	by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1QgXy0-000PkH-0Y; Tue, 12 Jul 2011 11:05:16 +0300
Message-ID: <4E1C003B.4090604@FreeBSD.org>
Date: Tue, 12 Jul 2011 11:05:15 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:5.0) Gecko/20110706 Thunderbird/5.0
MIME-Version: 1.0
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
References: <20110706170132.GA68775@troutmask.apl.washington.edu>
	<5080.1309971941@critter.freebsd.dk>
	<20110706180001.GA69157@troutmask.apl.washington.edu>
	<4E14A54A.4050106@freebsd.org> <4E155FF9.5090905@FreeBSD.org>
	<20110707151440.GA75537@troutmask.apl.washington.edu>
	<4E160C2F.8020001@FreeBSD.org>
	<20110707200845.GA77049@troutmask.apl.washington.edu>
	<ivf221$oo2$1@dough.gmane.org> <4E1B1198.6090308@FreeBSD.org>
	<20110711161654.GA97361@troutmask.apl.washington.edu>
In-Reply-To: <20110711161654.GA97361@troutmask.apl.washington.edu>
X-Enigmail-Version: 1.2pre
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-current@FreeBSD.org
Subject: Re: Heavy I/O blocks FreeBSD box for several seconds
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
X-List-Received-Date: Tue, 12 Jul 2011 08:05:21 -0000

on 11/07/2011 19:16 Steve Kargl said the following:
> On Mon, Jul 11, 2011 at 06:07:04PM +0300, Andriy Gapon wrote:
>> But it's not clear which of the processes are slaves and which is the master.
>> It's also not clear why the master takes so much CPU (on par with the
>> slaves): from my reading of its description (by Steve) it should be doing
>> only light periodic work.
> 
> These are all slave processes.  The master process was on a different
> node in the cluster.  Each process is doing the exact same computation
> with only a small change in a coordinate from (x,y,z) to (x,y+n*dy,z)
> with n = 1, 2, 3, 4.  The small change does not cause a different
> code path, so all should complete in nearly identical times.

OK, the situation is much clearer (to me) now.

>> If it does have to do CPU-heavy work, then I'd imagine that it should
>> spawn only Ncpus - 1 slaves.
> 
> And if you have M users on the system?  Also note, you can get the
> exact same loading problem by launching Ncpu+1 completely independent
> CPU-bound processes.  Ncpu-1 processes will be bound to specific CPUs
> and the remaining 2 processes will ping-pong on one CPU.  This
> ping-ponging will simply kill performance.
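A back-of-the-envelope Python sketch of why the Ncpu+1 case hurts so much when
the slaves must all finish before the job is done.  Assumptions (mine, not
measured): every process needs the same CPU time T, context-switch and cache
costs are ignored, and in the bad case the two sharing processes stay on one
CPU for their entire run.

```python
# Toy makespan model for Ncpu+1 equal CPU-bound processes.
# Assumptions: identical work T per process, zero switching overhead,
# and the job completes only when the slowest process finishes.

def makespan_ideal(ncpu: int, nproc: int, t: float = 1.0) -> float:
    """Completion time if the scheduler spreads the load perfectly."""
    if nproc <= ncpu:
        return t
    # Total work nproc*t is shared evenly across ncpu CPUs.
    return nproc * t / ncpu

def makespan_stuck_pair(t: float = 1.0) -> float:
    """Completion time when Ncpu-1 processes each own a CPU and the
    remaining two share a single CPU and are never migrated."""
    # Each of the sharing pair gets half a CPU, so it needs 2*t.
    return 2 * t

if __name__ == "__main__":
    for ncpu in (2, 4, 8):
        ideal = makespan_ideal(ncpu, ncpu + 1)
        stuck = makespan_stuck_pair()
        print(f"Ncpu={ncpu}: balanced {ideal:.3f}T, stuck pair {stuck:.3f}T")
```

With Ncpu = 4 the balanced case finishes in 1.25T while the stuck pair needs
2T, and the gap grows with Ncpu because the balanced makespan approaches T.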

I'd still argue that if someone cares about finishing some calculations as fast
as possible, then they shouldn't run more than Ncpu CPU-bound processes.  How to
achieve that is a technical/administrative issue.
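One purely illustrative way to apply that rule on the job side, sketched in
Python: size the slave pool from the CPU count, leaving one CPU for the master
if the master itself is CPU-heavy.  `slave_count` and `compute` are
hypothetical names, and `os.cpu_count()` reports the CPUs in the system, not
what other users are already running, so this does not address the
M-users case.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def slave_count(ncpu: int, master_is_cpu_bound: bool) -> int:
    """Slaves to start so the job never has more than ncpu CPU-bound
    processes runnable (counting the master when it is CPU-heavy)."""
    n = ncpu - 1 if master_is_cpu_bound else ncpu
    return max(1, n)

def compute(n: int) -> float:
    # Hypothetical stand-in for one slave's work item (one n in y+n*dy).
    total = 0.0
    for i in range(1, 100_000):
        total += ((n * i) % 7) / i
    return total

if __name__ == "__main__":
    ncpu = os.cpu_count() or 1
    workers = slave_count(ncpu, master_is_cpu_bound=False)
    # The pool never runs more than `workers` CPU-bound processes at once.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(compute, range(1, 5)))
    print(workers, len(results))
```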

But nevertheless, I now see what the problem is.
I think the most useful thing you could provide next (as objective evidence for
the problem at hand) is ktr(4) traces with at least the KTR_SCHED mask enabled.
Perhaps you already have them from your previous sessions with Jeff.

P.S. This is not a promise to actually debug this issue based on the traces :-)
-- 
Andriy Gapon