Date: Sun, 12 Sep 2004 12:08:03 +0900 (JST)
From: NAKATA Maho
To: freebsd-amd64@FreeBSD.org, developers@FreeBSD.org
Subject: KSE and SMP problem in FreeBSD/amd64 5.3-BETA3, namely KSE doesn't make use of SMP.
Message-Id: <20040912.120803.607953196.chat95@mac.com>

Dear amd64 freaks,

I noticed that there seems to be a bug in KSE in an SMP configuration.
Here I describe the problem in detail.

The math/atlas port uses SMP through threading: with two processors you
can get nearly double the performance, so KSE is the key technology for
SMP. On amd64, however, KSE doesn't use the second CPU at all.

My machine is:
  Tyan S2885
  Opteron 1.6GHz x 2
  2 GB of memory

I confirmed that:
o FreeBSD/amd64 5.2.1-RELEASE with KSE doesn't work at all (it dumps core
  or gets a memory fault), while without KSE it works well but shows no
  performance gain (selected using libmap.conf; this is not shown here).
o FreeBSD/amd64 5.3-BETA3 with KSE at least works, but doesn't use SMP.
o FreeBSD/i386 5.2.1-RELEASE and 5.3-BETA3 work well.

How to repeat:
(Building math/atlas takes many hours, so I have put the work directories
online; the URLs are at the end of this mail.)

CVSup your ports tree; please use:
# $FreeBSD: ports/math/atlas/Makefile,v 1.27 2004/09/02 00:25:45 maho Exp $

0a. Prepare an Opteron SMP machine and install FreeBSD/amd64 5.3-BETA3.
1a. cd /usr/ports/math/atlas
2a. make
3a. wait for a long time
4a. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5a. make xdlutst (this takes only seconds)
6a. make xdlutst_pt (this takes only seconds)
7a. type ./xdlutst -N 1000 2000 200 (this doesn't use SMP or KSE)

NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0    Col   1000   1000   1000    995     0.301  2210.755 3.821e-02
    0    Col   1200   1200   1200   1194     0.504  2282.569 3.793e-02
    0    Col   1400   1400   1400   1395     0.794  2303.707 2.843e-02
    0    Col   1600   1600   1600   1595     1.156  2360.557 2.893e-02
    0    Col   1800   1800   1800   1793     1.637  2374.130 2.803e-02
    0    Col   2000   2000   2000   1990     2.192  2431.838 2.744e-02

6 cases ran, 6 cases passed

8a. type ./xdlutst_pt -N 2000 3000 200

./xdlutst_pt -N 2000 3000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0    Col   2000   2000   2000   1990     2.286  2332.527 2.744e-02
    0    Col   2200   2200   2200   2194     2.764  2567.795 2.639e-02
    0    Col   2400   2400   2400   2394     3.766  2446.449 2.721e-02
    0    Col   2600   2600   2600   2593     4.722  2480.761 2.472e-02
    0    Col   2800   2800   2800   2795     5.855  2499.038 2.441e-02
    0    Col   3000   3000   3000   2992     7.302  2464.553 2.442e-02

6 cases ran, 6 cases passed

Please see the MFLOP column, which gives the floating-point rate of the
calculation. The performance of a 1.6GHz Opteron is about 2.4 GFLOPS for
LU decomposition.
As you can see, there is no performance gain :(

Typical output of top looks like this:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
  716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
  716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt

The two threads of xdlutst_pt are always running on *ONLY CPU0 or ONLY
CPU1*, never on both at once.

--------------------------------------------------------------------

Next, I tried the i386 version.

0i. Prepare the same Opteron SMP machine as above and install
    FreeBSD/i386 5.3-BETA3. CVSup your ports tree.
1i. cd /usr/ports/math/atlas
2i. make
3i. wait for a long time
4i. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5i. make xdlutst (this takes only seconds)
6i. make xdlutst_pt (this takes only seconds)
7i. type ./xdlutst -N 1000 2000 200 (this doesn't use SMP or KSE)

./xdlutst -N 1000 2000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0    Col   1000   1000   1000    995     0.307  2170.617 3.437e-02
    0    Col   1200   1200   1200   1194     0.522  2204.335 3.482e-02
    0    Col   1400   1400   1400   1395     0.799  2286.888 4.150e-02
    0    Col   1600   1600   1600   1595     1.164  2345.104 3.598e-02
    0    Col   1800   1800   1800   1793     1.616  2405.542 3.601e-02
    0    Col   2000   2000   2000   1990     2.218  2403.157 3.436e-02

6 cases ran, 6 cases passed

8i. type ./xdlutst_pt -N 3000 4000 200 (this uses KSE and so makes full
    use of SMP)

./xdlutst_pt -N 3000 4000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0    Col   3000   3000   3000   2992     7.157  2514.351 3.650e-02
    0    Col   3200   3200   3200   3186     5.127  4259.986 3.207e-02
    0    Col   3400   3400   3400   3392     5.867  4465.006 3.528e-02
    0    Col   3600   3600   3600   3589     6.791  4579.468 3.519e-02
    0    Col   3800   3800   3800   3791     8.510  4297.730 3.285e-02
    0    Col   4000   4000   4000   3995     9.207  4633.234 3.218e-02

6 cases ran, 6 cases passed

Yes, there is a performance gain from using SMP.

Typical output of top looks like this:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  714 root     139    0   301M   300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
  714 root     139    0   301M   300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
  714 root      20    0   301M   300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt

Summary:
The differences between 8a and 8i are:
o There is no performance gain in 8a, whereas 8i gains nearly double.
o The top output indicates that with KSE on amd64 the two threads are
  created correctly, but the scheduling is somewhat odd: the two threads
  run on the same processor, even though top appears to show threads
  spread over different processors.

You can try this easily; the work directories of these two builds are
available:
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-amd64.tar.bz
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-i386.tar.bz

MD5 (atlas-work-opteron_dual-amd64.tar.bz) = 9d9d7e8b00b34a783b7d2172bc404e23
MD5 (atlas-work-opteron_dual-i386.tar.bz) = 8076a753c7b3edaea7bd446c6473f120

Can anybody fix it?

Best regards,
--nakata maho
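
For reference, the "nearly double" versus "no gain" comparison in the summary can be put into numbers with a few lines of Python. This is only a rough sketch: the MFLOP figures are copied from the four tables in this mail, the variable and function names are made up for illustration, and because the i386 single-threaded and threaded runs share no matrix size, it simply compares the best threaded rate against the best single-threaded rate on each platform.

```python
# MFLOP figures copied from the xdlutst / xdlutst_pt tables in this mail.
# Keys are the matrix size N. All names below are illustrative only.
amd64_single = {1000: 2210.755, 1200: 2282.569, 1400: 2303.707,
                1600: 2360.557, 1800: 2374.130, 2000: 2431.838}
amd64_threaded = {2000: 2332.527, 2200: 2567.795, 2400: 2446.449,
                  2600: 2480.761, 2800: 2499.038, 3000: 2464.553}
i386_single = {1000: 2170.617, 1200: 2204.335, 1400: 2286.888,
               1600: 2345.104, 1800: 2405.542, 2000: 2403.157}
i386_threaded = {3000: 2514.351, 3200: 4259.986, 3400: 4465.006,
                 3600: 4579.468, 3800: 4297.730, 4000: 4633.234}


def best_ratio(threaded, single):
    """Ratio of the best threaded MFLOP rate to the best single-threaded one."""
    return max(threaded.values()) / max(single.values())


if __name__ == "__main__":
    print(f"amd64 speedup: {best_ratio(amd64_threaded, amd64_single):.2f}")
    print(f"i386  speedup: {best_ratio(i386_threaded, i386_single):.2f}")
```

With these figures the ratio comes out at roughly 1.06 on amd64 and roughly 1.93 on i386. The matrix sizes differ between the runs, so this is only a coarse indication, not a controlled measurement.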