From: "Roland Wells"
To: "'Jeffrey Racine'"
Date: Sun, 11 Apr 2004 14:20:01 -0500
Message-ID: <024f01c41ffa$029327e0$0c03a8c0@internal.thebeatbox.org>
In-Reply-To: <1081635706.67575.26.camel@x1-6-00-b0-d0-c2-67-0e.twcny.rr.com>
Subject: RE: LAM MPI on dual processor opteron box sees only one cpu...

Jeffrey,

I am not familiar with the LAM MPI issue, but in a dual proc box, you
should also get an additional line towards the bottom in your dmesg,
similar to:

SMP: AP CPU #1 Launched!

-Roland

-----Original Message-----
From: owner-freebsd-cluster@freebsd.org
[mailto:owner-freebsd-cluster@freebsd.org] On Behalf Of Jeffrey Racine
Sent: Saturday, April 10, 2004 5:22 PM
To: freebsd-amd64@freebsd.org; freebsd-cluster@freebsd.org
Subject: LAM MPI on dual processor opteron box sees only one cpu...

Hi.

I am converging on getting a new dual Opteron box running. I am now
setting up and testing LAM MPI; however, the OS is not farming out
the job as expected and only sees one processor.

This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a dual
processor PIV desktop. I am running 5-current.

Basically, mpirun -np 1 binaryfile has the same runtime as mpirun -np 2
binaryfile, while on the dual PIV box it runs in half the time. When I
check top, both processes started by mpirun -np 2 run on CPU 0... here
is the relevant portion from top with -np 2...

9306 jracine 4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam 29307 jracine 119 0 7148K 2372K CPU0 0 0:03 19.53% 19.53% n_lam

I include output from laminfo, dmesg (CPU-relevant info), and lamboot -d
bhost.lam... any suggestions most appreciated, and thanks in advance!

-- laminfo

LAM/MPI: 7.0.4
Prefix: /usr/local
Architecture: amd64-unknown-freebsd5.2
Configured by: root
Configured on: Sat Apr 10 11:22:02 EDT 2004
Configure host: jracine.maxwell.syr.edu
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0.1)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)

-- dmesg sees two cpus...

CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
  Features=0x78bfbff
  AMD Features=0xe0500800
real memory  = 3623813120 (3455 MB)
avail memory = 3494363136 (3332 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1

-- bhost has the requisite information

128.230.130.10 cpu=2 user=jracine

-- Here are the results from lamboot -d bhost.lam

-bash-2.05b$ lamboot -d ~/bhost.lam
n0<29283> ssi:boot: Opening
n0<29283> ssi:boot: opening module globus
n0<29283> ssi:boot: initializing module globus
n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<29283> ssi:boot: module not available: globus
n0<29283> ssi:boot: opening module rsh
n0<29283> ssi:boot: initializing module rsh
n0<29283> ssi:boot:rsh: module initializing
n0<29283> ssi:boot:rsh:agent: rsh
n0<29283> ssi:boot:rsh:username:
n0<29283> ssi:boot:rsh:verbose: 1000
n0<29283> ssi:boot:rsh:algorithm: linear
n0<29283> ssi:boot:rsh:priority: 10
n0<29283> ssi:boot: module available: rsh, priority: 10
n0<29283> ssi:boot: finalizing module globus
n0<29283> ssi:boot:globus: finalizing
n0<29283> ssi:boot: closing module globus
n0<29283> ssi:boot: Selected boot module rsh

LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University

n0<29283> ssi:boot:base: looking for boot schema in following directories:
n0<29283> ssi:boot:base:
n0<29283> ssi:boot:base: $TROLLIUSHOME/etc
n0<29283> ssi:boot:base: $LAMHOME/etc
n0<29283> ssi:boot:base: /usr/local/etc
n0<29283> ssi:boot:base: looking for boot schema file:
n0<29283> ssi:boot:base: /home/jracine/bhost.lam
n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
n0<29283> ssi:boot:rsh: found the following hosts:
n0<29283> ssi:boot:rsh: n0 jracine.maxwell.syr.edu (cpu=2)
n0<29283> ssi:boot:rsh: resolved hosts:
n0<29283> ssi:boot:rsh: n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
n0<29283> ssi:boot:rsh: starting RTE procs
n0<29283> ssi:boot:base:linear: starting
n0<29283> ssi:boot:base:server: opening server TCP socket
n0<29283> ssi:boot:base:server: opened port 49832
n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
n0<29283> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
n-1<29286> ssi:boot: Opening
n-1<29286> ssi:boot: opening module globus
n-1<29286> ssi:boot: initializing module globus
n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<29286> ssi:boot: module not available: globus
n-1<29286> ssi:boot: opening module rsh
n-1<29286> ssi:boot: initializing module rsh
n-1<29286> ssi:boot:rsh: module initializing
n-1<29286> ssi:boot:rsh:agent: rsh
n-1<29286> ssi:boot:rsh:username:
n-1<29286> ssi:boot:rsh:verbose: 1000
n-1<29286> ssi:boot:rsh:algorithm: linear
n-1<29286> ssi:boot:rsh:priority: 10
n-1<29286> ssi:boot: module available: rsh, priority: 10
n-1<29286> ssi:boot: finalizing module globus
n-1<29286> ssi:boot:globus: finalizing
n-1<29286> ssi:boot: closing module globus
n-1<29286> ssi:boot: Selected boot module rsh
n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
n0<29283> ssi:boot:base:server: this connection is expected (n0)
n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
n0<29283> ssi:boot:base:server: closing server socket
n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
n0<29283> ssi:boot:base:server: connected
n0<29283> ssi:boot:base:server: sending number of links (1)
n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: finished sending
n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
n0<29283> ssi:boot:base:linear: finished
n0<29283> ssi:boot:rsh: all RTE procs started
n0<29283> ssi:boot:rsh: finalizing
n0<29283> ssi:boot: Closing
n-1<29286> ssi:boot:rsh: finalizing
n-1<29286> ssi:boot: Closing
_______________________________________________
freebsd-cluster@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-cluster
To unsubscribe, send any mail to "freebsd-cluster-unsubscribe@freebsd.org"
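
The dmesg check suggested in the reply (a "Multiprocessor System Detected" line should be accompanied by an "SMP: AP CPU #1 Launched!" line) could be scripted roughly as follows. This is only a sketch: the function names smp_report and all_aps_launched are made up for illustration, the two patterns are copied from the lines quoted in this thread, and it assumes application processors are numbered 1 through N-1 with CPU 0 as the boot processor.

```python
import re

# Patterns copied from the dmesg lines quoted in this thread.
DETECTED = re.compile(r"Multiprocessor System Detected: (\d+) CPUs")
LAUNCHED = re.compile(r"SMP: AP CPU #(\d+) Launched!")


def smp_report(dmesg_text):
    """Return (cpus_detected, launched_ap_ids) parsed from dmesg text.

    cpus_detected is None when no detection line is present.
    """
    match = DETECTED.search(dmesg_text)
    detected = int(match.group(1)) if match else None
    launched = sorted(int(n) for n in LAUNCHED.findall(dmesg_text))
    return detected, launched


def all_aps_launched(dmesg_text):
    """True when every detected CPU other than the boot CPU reports a launch.

    Assumption (not confirmed by the thread): CPU 0 is the boot processor
    and the application processors are numbered 1..detected-1.
    """
    detected, launched = smp_report(dmesg_text)
    if detected is None:
        return False
    return launched == list(range(1, detected))
```

Applied to the dmesg excerpt posted above, smp_report would find 2 detected CPUs and no "Launched!" line, which is the situation the reply asks about; whether the line is truly absent or was simply left out of the excerpt cannot be told from the post.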