From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 10 22:29:07 2008
Date: Fri, 10 Oct 2008 15:29:04 -0700
From: Jeremy Chadwick
To: Steve Kargl
Cc: freebsd-hackers@freebsd.org, jeff@freebsd.org
Subject: Re: HPC with ULE vs 4BSD
Message-ID: <20081010222904.GA44873@icarus.home.lan>
In-Reply-To: <20081010213042.GA96822@troutmask.apl.washington.edu>

On Fri, Oct 10, 2008 at 02:30:42PM -0700, Steve Kargl wrote:
> Yes, this is a long email.
>
> In working with a colleague to diagnose poor performance of his MPI
> code, we've discovered that ULE is drastically inferior to 4BSD in
> utilizing a system with 2 physical CPUs (Opteron) and a total of 8
> cores.  We have observed this problem with both the Open MPI and the
> MPICH2 implementations of MPI.
>
> Note, I am using the exact same hardware and FreeBSD-current code
> dated Sep 22, 2008.  The only difference in the kernel config file is
> whether ULE or 4BSD is used.
>
> Using the following command,
>
>   % time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log
>
> we have
>
>   ULE  --> 546.99 real  0.02 user  0.03 sys
>   4BSD --> 218.96 real  0.03 user  0.02 sys
>
> where the machinefile simply tells Open MPI to launch 8 jobs on the
> local node.  Test_mpi uses MPI's scatter, gather, and all_to_all
> functions to transmit various arrays between the 8 jobs.  To get
> meaningful numbers, a number of iterations are done in a tight loop.
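Test_mpi's source is not shown above.  Purely as an illustration of the
kind of timed collective loop described, a minimal sketch might look
like the following; the element count, data type, and output format are
assumptions, not taken from Test_mpi (scatter and gather would be timed
the same way).

    /*
     * sketch.c - illustrative only; not the actual Test_mpi source.
     * Times a tight loop of MPI_Alltoall calls.
     *
     * Build: mpicc -O2 sketch.c -o sketch
     * Run:   mpiexec -n 8 ./sketch
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        const int count = 100000;  /* elements per destination rank (illustrative) */
        const int iters = 100;
        float *sendbuf, *recvbuf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        sendbuf = malloc((size_t)count * nprocs * sizeof(float));
        recvbuf = malloc((size_t)count * nprocs * sizeof(float));
        if (sendbuf == NULL || recvbuf == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (i = 0; i < count * nprocs; i++)
            sendbuf[i] = (float)rank;

        MPI_Barrier(MPI_COMM_WORLD);        /* start all ranks together */
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Alltoall(sendbuf, count, MPI_FLOAT,
                         recvbuf, count, MPI_FLOAT, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("all_to_all: %.8f s per iteration\n", (t1 - t0) / iters);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Launched the same way as above (mpiexec -n 8), every rank spends nearly
all of its time inside the collective, which is what makes the
scheduler's CPU placement visible in top(1) below.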
>
> With ULE, a snapshot of top(1) shows
>
> last pid: 33765;  load averages:  7.98,  7.51,  5.63    up 10+03:20:30  13:13:56
> 43 processes:  9 running, 34 sleeping
> CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
> Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
> 33743 kargl       1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
> 33747 kargl       1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
> 33742 kargl       1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
> 33744 kargl       1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
> 33748 kargl       1 117    0   300M 22768K CPU2   2   4:31  96.39% Test_mpi
> 33741 kargl       1 112    0   299M 43628K CPU1   1   4:40  80.08% Test_mpi
> 33745 kargl       1 113    0   300M 44272K RUN    0   4:27  76.17% Test_mpi
> 33746 kargl       1 109    0   300M 22740K RUN    0   4:25  57.86% Test_mpi
> 33749 kargl       1  44    0  8196K  2280K CPU4   4   0:00   0.20% top
>
> while with 4BSD, a snapshot of top(1) shows
>
> last pid:  1019;  load averages:  7.24,  3.05,  1.25    up 0+00:04:40  13:27:09
> 43 processes:  9 running, 34 sleeping
> CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
> Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
>  1012 kargl       1 126    0   300M 44744K CPU6   6   2:16  99.07% Test_mpi
>  1016 kargl       1 126    0   314M 59256K RUN    4   2:16  99.02% Test_mpi
>  1011 kargl       1 126    0   300M 44652K CPU5   5   2:16  99.02% Test_mpi
>  1013 kargl       1 126    0   300M 44680K CPU2   2   2:16  99.02% Test_mpi
>  1010 kargl       1 126    0   300M 44740K CPU7   7   2:16  99.02% Test_mpi
>  1009 kargl       1 126    0   299M 43884K CPU0   0   2:16  98.97% Test_mpi
>  1014 kargl       1 126    0   300M 44664K CPU1   1   2:16  98.97% Test_mpi
>  1015 kargl       1 126    0   300M 44620K CPU3   3   2:16  98.93% Test_mpi
>   989 kargl       1  96    0  8196K  2460K CPU4   4   0:00   0.10% top
>
> Notice the interesting, or even perhaps odd, scheduling with ULE that
> results in a 20 second gap between the "fastest" job (4:48) and the
> "slowest" (4:25).  With ULE, 2 Test_mpi jobs are always scheduled on
> the same core while one core remains idle.  Also, note the difference
> in the reported load averages.
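For context, the kernel config difference mentioned earlier normally
comes down to a single scheduler option in the config file; the two
kernels compared here presumably differ only in something along these
lines (one option or the other, never both):

    # scheduler selection in the kernel config file
    options    SCHED_ULE     # ULE scheduler
    options    SCHED_4BSD    # traditional 4BSD scheduler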
>
> Various stats are generated by and collected from the MPI program.
> With ULE, the numbers are
>
> Procs  Array size      Kb  Iters  Function    Bandwidth(Mbs)     Time(s)
>     8      800000    3125    100  scatter           12.58386  0.24251367
>     8      800000    3125    100  all_to_all        17.24503  0.17696444
>     8      800000    3125    100  gather            14.82058  0.20591355
>
>     8     1600000    6250    100  scatter           28.25922  0.21598316
>     8     1600000    6250    100  all_to_all      1985.74915  0.00307366
>     8     1600000    6250    100  gather            30.42038  0.20063902
>
>     8     2400000    9375    100  scatter           44.65615  0.20501709
>     8     2400000    9375    100  all_to_all        16.09386  0.56886748
>     8     2400000    9375    100  gather            44.38801  0.20625555
>
>     8     3200000   12500    100  scatter           60.04160  0.20330956
>     8     3200000   12500    100  all_to_all      2157.10010  0.00565900
>     8     3200000   12500    100  gather            59.72242  0.20439614
>
>     8     4000000   15625    100  scatter           86.65769  0.17608117
>     8     4000000   15625    100  all_to_all      2081.25195  0.00733154
>     8     4000000   15625    100  gather            27.47257  0.55541896
>
>     8     4800000   18750    100  scatter           33.02306  0.55447768
>     8     4800000   18750    100  all_to_all       200.09908  0.09150740
>     8     4800000   18750    100  gather            91.08742  0.20102168
>
>     8     5600000   21875    100  scatter          109.82005  0.19452098
>     8     5600000   21875    100  all_to_all        76.87574  0.27788095
>     8     5600000   21875    100  gather            41.67106  0.51264128
>
>     8     6400000   25000    100  scatter           26.92482  0.90674917
>     8     6400000   25000    100  all_to_all        64.74528  0.37707868
>     8     6400000   25000    100  gather            41.29724  0.59117904
>
> and with 4BSD, the numbers are
>
> Procs  Array size      Kb  Iters  Function    Bandwidth(Mbs)     Time(s)
>     8      800000    3125    100  scatter           21.33697  0.14302677
>     8      800000    3125    100  all_to_all      3941.39624  0.00077428
>     8      800000    3125    100  gather            24.75520  0.12327747
>
>     8     1600000    6250    100  scatter           45.20134  0.13502954
>     8     1600000    6250    100  all_to_all      1987.94348  0.00307027
>     8     1600000    6250    100  gather            42.02498  0.14523541
>
>     8     2400000    9375    100  scatter           63.03553  0.14523989
>     8     2400000    9375    100  all_to_all      2015.19580  0.00454312
>     8     2400000    9375    100  gather            66.72807  0.13720272
>
>     8     3200000   12500    100  scatter           91.90541  0.13282169
>     8     3200000   12500    100  all_to_all      2029.62622  0.00601442
>     8     3200000   12500    100  gather            87.99693  0.13872112
>
>     8     4000000   15625    100  scatter          107.48991  0.14195556
>     8     4000000   15625    100  all_to_all      1970.66907  0.00774295
>     8     4000000   15625    100  gather           110.70226  0.13783630
>
>     8     4800000   18750    100  scatter          140.39014  0.13042616
>     8     4800000   18750    100  all_to_all      2401.80054  0.00762367
>     8     4800000   18750    100  gather           134.60948  0.13602717
>
>     8     5600000   21875    100  scatter          152.31958  0.14024661
>     8     5600000   21875    100  all_to_all      2379.12207  0.00897907
>     8     5600000   21875    100  gather           154.60051  0.13817745
>
>     8     6400000   25000    100  scatter          190.03561  0.12847099
>     8     6400000   25000    100  all_to_all      2661.36963  0.00917350
>     8     6400000   25000    100  gather           183.08250  0.13335006
>
> Noting that all communication is over the memory bus, a comparison of
> the Bandwidth columns suggests that ULE is causing the MPI jobs to
> stall waiting for data.  This has a potentially serious negative
> impact on clusters used for HPC.

What surprises me is that you didn't CC the individual who wrote ULE:
Jeff Roberson.  :-)  I've CC'd him here.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
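A note on units, since the program output quoted above doesn't spell
them out: the Bandwidth column appears to be megabytes per second,
consistent with Bandwidth = (Kb / 1024) / Time.  Checking the first
scatter row of each table:

    ULE:  (3125 / 1024) MB / 0.24251367 s = 12.58 MB/s   (reported 12.58386)
    4BSD: (3125 / 1024) MB / 0.14302677 s = 21.34 MB/s   (reported 21.33697)

So the lower ULE bandwidth figures reflect proportionally longer times
spent in the collectives, not a difference in the amount of data moved.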