From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 10 22:29:07 2008
Date: Fri, 10 Oct 2008 15:29:04 -0700
From: Jeremy Chadwick
To: Steve Kargl
Cc: freebsd-hackers@freebsd.org, jeff@freebsd.org
Subject: Re: HPC with ULE vs 4BSD
Message-ID: <20081010222904.GA44873@icarus.home.lan>
In-Reply-To: <20081010213042.GA96822@troutmask.apl.washington.edu>

On Fri, Oct 10, 2008 at 02:30:42PM -0700, Steve Kargl wrote:
> Yes, this is a long email.
>
> In working with a colleague to diagnose poor performance of his MPI
> code, we've discovered that ULE is drastically inferior to 4BSD in
> utilizing a system with 2 physical CPUs (Opteron) and a total of 8
> cores.  We have observed this problem with both the Open MPI and the
> MPICH2 implementations of MPI.
>
> Note, I am using the exact same hardware and FreeBSD-current code
> dated Sep 22, 2008.  The only difference in the kernel config file is
> whether ULE or 4BSD is used.
>
> Using the following command,
>
>   % time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log
>
> we have
>
>   ULE  --> 546.99 real  0.02 user  0.03 sys
>   4BSD --> 218.96 real  0.03 user  0.02 sys
>
> where the machinefile simply tells Open MPI to launch 8 jobs on the
> local node.  Test_mpi uses MPI's scatter, gather, and all_to_all
> functions to transmit various arrays between the 8 jobs.  To get
> meaningful numbers, a number of iterations are done in a tight loop.
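Test_mpi's source is not shown above.  Purely as an illustration of the
kind of timed collective loop described, a minimal sketch might look
like the following; the element count, data type, and output format are
assumptions, not taken from Test_mpi (scatter and gather would be timed
the same way).

    /*
     * sketch.c - illustrative only; not the actual Test_mpi source.
     * Times a tight loop of MPI_Alltoall calls.
     *
     * Build: mpicc -O2 sketch.c -o sketch
     * Run:   mpiexec -n 8 ./sketch
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        const int count = 100000;  /* elements per destination rank (illustrative) */
        const int iters = 100;
        float *sendbuf, *recvbuf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        sendbuf = malloc((size_t)count * nprocs * sizeof(float));
        recvbuf = malloc((size_t)count * nprocs * sizeof(float));
        if (sendbuf == NULL || recvbuf == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (i = 0; i < count * nprocs; i++)
            sendbuf[i] = (float)rank;

        MPI_Barrier(MPI_COMM_WORLD);        /* start all ranks together */
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Alltoall(sendbuf, count, MPI_FLOAT,
                         recvbuf, count, MPI_FLOAT, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("all_to_all: %.8f s per iteration\n", (t1 - t0) / iters);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Launched the same way as above (mpiexec -n 8), every rank spends nearly
all of its time inside the collective, which is what makes the
scheduler's CPU placement visible in top(1) below.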
>
> With ULE, a snapshot of top(1) shows
>
> last pid: 33765;  load averages:  7.98,  7.51,  5.63    up 10+03:20:30  13:13:56
> 43 processes:  9 running, 34 sleeping
> CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
> Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
> 33743 kargl       1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
> 33747 kargl       1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
> 33742 kargl       1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
> 33744 kargl       1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
> 33748 kargl       1 117    0   300M 22768K CPU2   2   4:31  96.39% Test_mpi
> 33741 kargl       1 112    0   299M 43628K CPU1   1   4:40  80.08% Test_mpi
> 33745 kargl       1 113    0   300M 44272K RUN    0   4:27  76.17% Test_mpi
> 33746 kargl       1 109    0   300M 22740K RUN    0   4:25  57.86% Test_mpi
> 33749 kargl       1  44    0  8196K  2280K CPU4   4   0:00   0.20% top
>
> while with 4BSD, a snapshot of top(1) shows
>
> last pid:  1019;  load averages:  7.24,  3.05,  1.25    up 0+00:04:40  13:27:09
> 43 processes:  9 running, 34 sleeping
> CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
> Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
>  1012 kargl       1 126    0   300M 44744K CPU6   6   2:16  99.07% Test_mpi
>  1016 kargl       1 126    0   314M 59256K RUN    4   2:16  99.02% Test_mpi
>  1011 kargl       1 126    0   300M 44652K CPU5   5   2:16  99.02% Test_mpi
>  1013 kargl       1 126    0   300M 44680K CPU2   2   2:16  99.02% Test_mpi
>  1010 kargl       1 126    0   300M 44740K CPU7   7   2:16  99.02% Test_mpi
>  1009 kargl       1 126    0   299M 43884K CPU0   0   2:16  98.97% Test_mpi
>  1014 kargl       1 126    0   300M 44664K CPU1   1   2:16  98.97% Test_mpi
>  1015 kargl       1 126    0   300M 44620K CPU3   3   2:16  98.93% Test_mpi
>   989 kargl       1  96    0  8196K  2460K CPU4   4   0:00   0.10% top
>
> Notice the interesting, or even perhaps odd, scheduling with ULE that
> results in a 20 second gap between the "fastest" job (4:48) and the
> "slowest" (4:25).  With ULE, 2 Test_mpi jobs are always scheduled on
> the same core while one core remains idle.  Also, note the difference
> in the reported load averages.
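For context, the kernel config difference mentioned earlier normally
comes down to a single scheduler option in the config file; the two
kernels compared here presumably differ only in something along these
lines (one option or the other, never both):

    # scheduler selection in the kernel config file
    options    SCHED_ULE     # ULE scheduler
    options    SCHED_4BSD    # traditional 4BSD scheduler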
>
> Various stats are generated by and collected from the MPI program.
> With ULE, the numbers are
>
> Procs  Array size      Kb  Iters  Function    Bandwidth(Mbs)     Time(s)
>     8      800000    3125    100  scatter           12.58386  0.24251367
>     8      800000    3125    100  all_to_all        17.24503  0.17696444
>     8      800000    3125    100  gather            14.82058  0.20591355
>
>     8     1600000    6250    100  scatter           28.25922  0.21598316
>     8     1600000    6250    100  all_to_all      1985.74915  0.00307366
>     8     1600000    6250    100  gather            30.42038  0.20063902
>
>     8     2400000    9375    100  scatter           44.65615  0.20501709
>     8     2400000    9375    100  all_to_all        16.09386  0.56886748
>     8     2400000    9375    100  gather            44.38801  0.20625555
>
>     8     3200000   12500    100  scatter           60.04160  0.20330956
>     8     3200000   12500    100  all_to_all      2157.10010  0.00565900
>     8     3200000   12500    100  gather            59.72242  0.20439614
>
>     8     4000000   15625    100  scatter           86.65769  0.17608117
>     8     4000000   15625    100  all_to_all      2081.25195  0.00733154
>     8     4000000   15625    100  gather            27.47257  0.55541896
>
>     8     4800000   18750    100  scatter           33.02306  0.55447768
>     8     4800000   18750    100  all_to_all       200.09908  0.09150740
>     8     4800000   18750    100  gather            91.08742  0.20102168
>
>     8     5600000   21875    100  scatter          109.82005  0.19452098
>     8     5600000   21875    100  all_to_all        76.87574  0.27788095
>     8     5600000   21875    100  gather            41.67106  0.51264128
>
>     8     6400000   25000    100  scatter           26.92482  0.90674917
>     8     6400000   25000    100  all_to_all        64.74528  0.37707868
>     8     6400000   25000    100  gather            41.29724  0.59117904
>
> and with 4BSD, the numbers are
>
> Procs  Array size      Kb  Iters  Function    Bandwidth(Mbs)     Time(s)
>     8      800000    3125    100  scatter           21.33697  0.14302677
>     8      800000    3125    100  all_to_all      3941.39624  0.00077428
>     8      800000    3125    100  gather            24.75520  0.12327747
>
>     8     1600000    6250    100  scatter           45.20134  0.13502954
>     8     1600000    6250    100  all_to_all      1987.94348  0.00307027
>     8     1600000    6250    100  gather            42.02498  0.14523541
>
>     8     2400000    9375    100  scatter           63.03553  0.14523989
>     8     2400000    9375    100  all_to_all      2015.19580  0.00454312
>     8     2400000    9375    100  gather            66.72807  0.13720272
>
>     8     3200000   12500    100  scatter           91.90541  0.13282169
>     8     3200000   12500    100  all_to_all      2029.62622  0.00601442
>     8     3200000   12500    100  gather            87.99693  0.13872112
>
>     8     4000000   15625    100  scatter          107.48991  0.14195556
>     8     4000000   15625    100  all_to_all      1970.66907  0.00774295
>     8     4000000   15625    100  gather           110.70226  0.13783630
>
>     8     4800000   18750    100  scatter          140.39014  0.13042616
>     8     4800000   18750    100  all_to_all      2401.80054  0.00762367
>     8     4800000   18750    100  gather           134.60948  0.13602717
>
>     8     5600000   21875    100  scatter          152.31958  0.14024661
>     8     5600000   21875    100  all_to_all      2379.12207  0.00897907
>     8     5600000   21875    100  gather           154.60051  0.13817745
>
>     8     6400000   25000    100  scatter          190.03561  0.12847099
>     8     6400000   25000    100  all_to_all      2661.36963  0.00917350
>     8     6400000   25000    100  gather           183.08250  0.13335006
>
> Noting that all communication is over the memory bus, a comparison of
> the Bandwidth columns suggests that ULE is causing the MPI jobs to
> stall waiting for data.  This has a potentially serious negative
> impact on clusters used for HPC.

What surprises me is that you didn't CC the individual who wrote ULE:
Jeff Roberson.  :-)  I've CC'd him here.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
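A note on units, since the program output quoted above doesn't spell
them out: the Bandwidth column appears to be megabytes per second,
consistent with Bandwidth = (Kb / 1024) / Time.  Checking the first
scatter row of each table:

    ULE:  (3125 / 1024) MB / 0.24251367 s = 12.58 MB/s   (reported 12.58386)
    4BSD: (3125 / 1024) MB / 0.14302677 s = 21.34 MB/s   (reported 21.33697)

So the lower ULE bandwidth figures reflect proportionally longer times
spent in the collectives, not a difference in the amount of data moved.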