From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-hackers@freebsd.org
Cc: sgk@troutmask.apl.washington.edu
Date: Fri, 10 Oct 2008 14:30:42 -0700
Message-ID: <20081010213042.GA96822@troutmask.apl.washington.edu>
Subject: HPC with ULE vs 4BSD

Yes, this is a long email.

In working with a colleague to diagnose the poor performance of his MPI
code, we've discovered that ULE is drastically inferior to 4BSD in
utilizing a system with 2 physical CPUs (Opteron) and a total of 8 cores.
We have observed this problem with both the Open MPI and MPICH2
implementations of MPI.  Note, I am using the exact same hardware and
FreeBSD-current code dated Sep 22, 2008; the only difference in the
kernel config file is whether ULE or 4BSD is used.

Using the following command,

% time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log

we have

  ULE  --> 546.99 real    0.02 user    0.03 sys
  4BSD --> 218.96 real    0.03 user    0.02 sys

where the machinefile simply tells Open MPI to launch 8 jobs on the
local node.  Test_mpi uses MPI's scatter, gather, and all_to_all
functions to transmit various arrays between the 8 jobs.  To get
meaningful numbers, a number of iterations are done in a tight loop.
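I haven't attached Test_mpi itself.  As a rough sketch of the kind of
loop it runs (illustrative only, with made-up names and the
800000-element case from the tables below; this is not the actual
Test_mpi source), the pattern is:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/*
 * Illustrative sketch of the benchmark pattern: time NITER collective
 * operations over a fixed-size array and report the average cost per
 * iteration.  Not the actual Test_mpi source.
 */
#define NITER 100

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    int n = 800000;                     /* 4-byte elements == 3125 KB */
    float *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = calloc(n, sizeof(float));
    recvbuf = calloc(n, sizeof(float));

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < NITER; i++)
        MPI_Alltoall(sendbuf, n / nprocs, MPI_FLOAT,
                     recvbuf, n / nprocs, MPI_FLOAT, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("all_to_all: %.8f s per iteration\n", (t1 - t0) / NITER);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The scatter and gather cases are timed the same way with MPI_Scatter
and MPI_Gather.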
With ULE, a snapshot of top(1) shows:

last pid: 33765;  load averages:  7.98,  7.51,  5.63   up 10+03:20:30  13:13:56
43 processes:  9 running, 34 sleeping
CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
33743 kargl       1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
33747 kargl       1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
33742 kargl       1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
33744 kargl       1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
33748 kargl       1 117    0   300M 22768K CPU2   2   4:31  96.39% Test_mpi
33741 kargl       1 112    0   299M 43628K CPU1   1   4:40  80.08% Test_mpi
33745 kargl       1 113    0   300M 44272K RUN    0   4:27  76.17% Test_mpi
33746 kargl       1 109    0   300M 22740K RUN    0   4:25  57.86% Test_mpi
33749 kargl       1  44    0  8196K  2280K CPU4   4   0:00   0.20% top

while with 4BSD, a snapshot of top(1) shows:

last pid:  1019;  load averages:  7.24,  3.05,  1.25   up 0+00:04:40  13:27:09
43 processes:  9 running, 34 sleeping
CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
 1012 kargl       1 126    0   300M 44744K CPU6   6   2:16  99.07% Test_mpi
 1016 kargl       1 126    0   314M 59256K RUN    4   2:16  99.02% Test_mpi
 1011 kargl       1 126    0   300M 44652K CPU5   5   2:16  99.02% Test_mpi
 1013 kargl       1 126    0   300M 44680K CPU2   2   2:16  99.02% Test_mpi
 1010 kargl       1 126    0   300M 44740K CPU7   7   2:16  99.02% Test_mpi
 1009 kargl       1 126    0   299M 43884K CPU0   0   2:16  98.97% Test_mpi
 1014 kargl       1 126    0   300M 44664K CPU1   1   2:16  98.97% Test_mpi
 1015 kargl       1 126    0   300M 44620K CPU3   3   2:16  98.93% Test_mpi
  989 kargl       1  96    0  8196K  2460K CPU4   4   0:00   0.10% top

Notice the interesting, or perhaps even odd, scheduling with ULE, which
results in a 23-second gap between the "fastest" job (4:48) and the
"slowest" (4:25).  With ULE, two Test_mpi jobs are always scheduled on
the same core while one core remains idle.  Also note the difference in
the reported load averages.
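One way to double-check that this is purely a placement decision (we
have not done this in the runs above, so take it as a suggestion) would
be to pin each rank to its own core with cpuset_setaffinity(2) and
re-run; if pinned ULE then matches 4BSD, the scheduler is simply
stacking two ranks on one core.  A minimal sketch of the binding call,
where bind_to_core() is a made-up helper name:

#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdio.h>

/*
 * Bind the calling process to a single core using FreeBSD's
 * cpuset(2) API.  Sketch only; in an MPI job each rank would call
 * something like bind_to_core(rank % ncores) right after MPI_Init().
 */
static int bind_to_core(int core)
{
    cpuset_t mask;

    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    /* id == -1 means "the current process" */
    return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                              sizeof(mask), &mask);
}

int main(void)
{
    if (bind_to_core(0) != 0) {
        perror("cpuset_setaffinity");
        return 1;
    }
    printf("bound to core 0\n");
    return 0;
}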
Various stats are generated by and collected from executing the MPI
program.  With ULE, the numbers are:

Procs  Array size     KB  Iters    Function  Bandwidth(MB/s)     Time(s)
    8      800000   3125    100     scatter         12.58386  0.24251367
    8      800000   3125    100  all_to_all         17.24503  0.17696444
    8      800000   3125    100      gather         14.82058  0.20591355
    8     1600000   6250    100     scatter         28.25922  0.21598316
    8     1600000   6250    100  all_to_all       1985.74915  0.00307366
    8     1600000   6250    100      gather         30.42038  0.20063902
    8     2400000   9375    100     scatter         44.65615  0.20501709
    8     2400000   9375    100  all_to_all         16.09386  0.56886748
    8     2400000   9375    100      gather         44.38801  0.20625555
    8     3200000  12500    100     scatter         60.04160  0.20330956
    8     3200000  12500    100  all_to_all       2157.10010  0.00565900
    8     3200000  12500    100      gather         59.72242  0.20439614
    8     4000000  15625    100     scatter         86.65769  0.17608117
    8     4000000  15625    100  all_to_all       2081.25195  0.00733154
    8     4000000  15625    100      gather         27.47257  0.55541896
    8     4800000  18750    100     scatter         33.02306  0.55447768
    8     4800000  18750    100  all_to_all        200.09908  0.09150740
    8     4800000  18750    100      gather         91.08742  0.20102168
    8     5600000  21875    100     scatter        109.82005  0.19452098
    8     5600000  21875    100  all_to_all         76.87574  0.27788095
    8     5600000  21875    100      gather         41.67106  0.51264128
    8     6400000  25000    100     scatter         26.92482  0.90674917
    8     6400000  25000    100  all_to_all         64.74528  0.37707868
    8     6400000  25000    100      gather         41.29724  0.59117904

and with 4BSD, the numbers are:

Procs  Array size     KB  Iters    Function  Bandwidth(MB/s)     Time(s)
    8      800000   3125    100     scatter         21.33697  0.14302677
    8      800000   3125    100  all_to_all       3941.39624  0.00077428
    8      800000   3125    100      gather         24.75520  0.12327747
    8     1600000   6250    100     scatter         45.20134  0.13502954
    8     1600000   6250    100  all_to_all       1987.94348  0.00307027
    8     1600000   6250    100      gather         42.02498  0.14523541
    8     2400000   9375    100     scatter         63.03553  0.14523989
    8     2400000   9375    100  all_to_all       2015.19580  0.00454312
    8     2400000   9375    100      gather         66.72807  0.13720272
    8     3200000  12500    100     scatter         91.90541  0.13282169
    8     3200000  12500    100  all_to_all       2029.62622  0.00601442
    8     3200000  12500    100      gather         87.99693  0.13872112
    8     4000000  15625    100     scatter        107.48991  0.14195556
    8     4000000  15625    100  all_to_all       1970.66907  0.00774295
    8     4000000  15625    100      gather        110.70226  0.13783630
    8     4800000  18750    100     scatter        140.39014  0.13042616
    8     4800000  18750    100  all_to_all       2401.80054  0.00762367
    8     4800000  18750    100      gather        134.60948  0.13602717
    8     5600000  21875    100     scatter        152.31958  0.14024661
    8     5600000  21875    100  all_to_all       2379.12207  0.00897907
    8     5600000  21875    100      gather        154.60051  0.13817745
    8     6400000  25000    100     scatter        190.03561  0.12847099
    8     6400000  25000    100  all_to_all       2661.36963  0.00917350
    8     6400000  25000    100      gather        183.08250  0.13335006

Noting that all communication is over the memory bus, a comparison of
the Bandwidth columns suggests that ULE is causing the MPI jobs to
stall while waiting for data.  This has a potentially serious negative
impact on clusters used for HPC.
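As a sanity check on how to read the two tables (this is my assumption
about how Test_mpi derives its output, not something taken from its
source), the columns are consistent with
Bandwidth(MB/s) = (KB / 1024) / Time(s):

#include <stdio.h>

/*
 * Reproduce the 3125 KB all_to_all rows from both tables, assuming
 * Bandwidth(MB/s) = (KB / 1024) / Time(s).  Assumption only, not
 * taken from Test_mpi's source.
 */
int main(void)
{
    double kb = 3125.0;
    double t_4bsd = 0.00077428;   /* all_to_all time under 4BSD */
    double t_ule  = 0.17696444;   /* all_to_all time under ULE  */

    double bw_4bsd = (kb / 1024.0) / t_4bsd;   /* ~3941 MB/s */
    double bw_ule  = (kb / 1024.0) / t_ule;    /* ~17 MB/s   */

    printf("4BSD: %.2f MB/s  ULE: %.2f MB/s  ratio: %.0fx\n",
           bw_4bsd, bw_ule, bw_4bsd / bw_ule);
    return 0;
}

A factor of roughly 230 in per-iteration time for the same 3125 KB
transfer is hard to attribute to the memory system itself; the ranks
are presumably spending the difference waiting to be scheduled.

-- 
Steve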