Date: Mon, 12 Apr 2004 09:36:18 -0400
From: "Jeff Racine" <jracine@maxwell.syr.edu>
To: "Bob Willcox" <bob@immure.com>
Cc: freebsd-cluster@freebsd.org
Subject: RE: LAM MPI on dual processor opteron box sees only one cpu...
Message-ID: <32A8B2CB12BFC84D8D11D872C787AA9A015E9AD2@EXCHANGE.forest.maxwell.syr.edu>
Hi Bob. Good to hear someone else has seen this behavior. Re: what
scheduler am I using... ULE... Thanks!

-- Jeff

-----Original Message-----
From: Bob Willcox [mailto:bob@immure.com]
Sent: Mon 4/12/2004 9:13 AM
To: Jeff Racine
Cc: Roland Wells; freebsd-cluster@freebsd.org; freebsd-amd64@freebsd.org
Subject: Re: LAM MPI on dual processor opteron box sees only one cpu...

On Mon, Apr 12, 2004 at 09:04:24AM -0400, Jeffrey Racine wrote:
> Hi Roland.
>
> I do get CPU #1 launched. This is not the problem.
>
> The problem appears to be with the way that -current is scheduling.
>
> With mpirun -np 2 I get the job running on CPU 0 (two instances on one
> proc). However, it turns out that with -np 4 I get the job running on
> CPU 0 and 1, though with 4 instances (and the associated overhead).
> Here is top for -np 4... notice that in the C column it is using both
> procs.
>
>   PID USERNAME PRI NICE  SIZE   RES STATE C  TIME   WCPU    CPU COMMAND
> 96090 jracine  131    0 7148K 2172K CPU1  1  0:19 44.53% 44.53% n_lam
> 96088 jracine  125    0 7148K 2172K RUN   0  0:18 43.75% 43.75% n_lam
> 96089 jracine  136    0 7148K 2172K RUN   1  0:19 42.19% 42.19% n_lam
> 96087 jracine  135    0 7188K 2248K RUN   0  0:19 41.41% 41.41% n_lam
>
> One run (once when I rebooted lam) did allocate the job correctly with
> -np 2, but this is not in general the case. Other systems I use,
> however, correctly farm out -np 2 to CPU 0 and 1...
>
> Thanks, and any suggestions welcome.

What scheduler are you using? I've seen this behavior on my 5-current
Athlon MP system when running two instances of setiathome with the
default SCHED_ULE scheduler. Sometimes it would run both setiathome
processes on the same CPU for hours (even days), leaving one CPU
essentially idle. When I switched to the SCHED_4BSD scheduler it ran
setiathome on both CPUs.

Bob
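(A note for the archives: on 5.x the scheduler is compiled into the
kernel, so trying Bob's suggestion means a rebuild. A minimal sketch,
assuming a custom config copied from GENERIC -- the name MYKERNEL is
illustrative:

    # in /usr/src/sys/amd64/conf/MYKERNEL
    #options        SCHED_ULE       # comment out the default ULE scheduler
    options         SCHED_4BSD      # traditional 4BSD scheduler instead

    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    shutdown -r now

Only one SCHED_* option can be configured at a time, hence the rebuild
and reboot.)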
> -- Jeff
>
> On Sun, 2004-04-11 at 14:20 -0500, Roland Wells wrote:
> > Jeffrey,
> > I am not familiar with the LAM MPI issue, but in a dual proc box you
> > should also get an additional line towards the bottom of your dmesg,
> > similar to:
> >
> > SMP: AP CPU #1 Launched!
> >
> > -Roland
> > -----Original Message-----
> > From: owner-freebsd-cluster@freebsd.org
> > [mailto:owner-freebsd-cluster@freebsd.org] On Behalf Of Jeffrey Racine
> > Sent: Saturday, April 10, 2004 5:22 PM
> > To: freebsd-amd64@freebsd.org; freebsd-cluster@freebsd.org
> > Subject: LAM MPI on dual processor opteron box sees only one cpu...
> >
> > Hi.
> >
> > I am converging on getting a new dual opteron box running. Now I am
> > setting up and testing LAM MPI; however, the OS is not farming out
> > the job as expected, and only uses one processor.
> >
> > This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a dual
> > processor PIV desktop. I am running 5-current. Basically, mpirun -np 1
> > binaryfile has the same runtime as mpirun -np 2 binaryfile, while on
> > the dual PIV box it runs in half the time. When I check top, with
> > mpirun -np 2 both instances run on CPU 0... here is the relevant
> > portion from top with -np 2...
> >
> > 29306 jracine   4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
> > 29307 jracine 119 0 7148K 2372K CPU0   0 0:03 19.53% 19.53% n_lam
> >
> > I include output from laminfo, dmesg (cpu relevant info), and
> > lamboot -d bhost.lam... any suggestions most appreciated, and thanks
> > in advance!
> >
> > -- laminfo
> >
> >            LAM/MPI: 7.0.4
> >             Prefix: /usr/local
> >       Architecture: amd64-unknown-freebsd5.2
> >      Configured by: root
> >      Configured on: Sat Apr 10 11:22:02 EDT 2004
> >     Configure host: jracine.maxwell.syr.edu
> >         C bindings: yes
> >       C++ bindings: yes
> >   Fortran bindings: yes
> >        C profiling: yes
> >      C++ profiling: yes
> >  Fortran profiling: yes
> >      ROMIO support: yes
> >       IMPI support: no
> >      Debug support: no
> >       Purify clean: no
> >           SSI boot: globus (Module v0.5)
> >           SSI boot: rsh (Module v1.0)
> >           SSI coll: lam_basic (Module v7.0)
> >           SSI coll: smp (Module v1.0)
> >            SSI rpi: crtcp (Module v1.0.1)
> >            SSI rpi: lamd (Module v7.0)
> >            SSI rpi: sysv (Module v7.0)
> >            SSI rpi: tcp (Module v7.0)
> >            SSI rpi: usysv (Module v7.0)
> >
> > -- dmesg sees two cpus...
> >
> > CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
> >   Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
> >   Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
> >   AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow!+,3DNow!>
> > real memory  = 3623813120 (3455 MB)
> > avail memory = 3494363136 (3332 MB)
> > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> >  cpu0 (BSP): APIC ID:  0
> >  cpu1 (AP): APIC ID:  1
> >
> > -- bhost has the requisite information
> >
> > 128.230.130.10 cpu=2 user=jracine
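(To make the -np 1 vs. -np 2 timing comparison above easy to reproduce,
a minimal CPU-bound MPI test along these lines can be watched in top's
C column -- a sketch only; the file name and loop bound are arbitrary:

    /* spin.c -- each rank burns CPU and reports its elapsed time, so
     * top(1) shows which CPU every instance lands on. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long i;
        volatile double x = 0.0; /* volatile: keep the loop from being optimized away */
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        t0 = MPI_Wtime();
        for (i = 0; i < 500000000L; i++)  /* bound is arbitrary */
            x += 1.0;
        t1 = MPI_Wtime();

        printf("rank %d of %d: %.2f seconds\n", rank, size, t1 - t0);
        MPI_Finalize();
        return 0;
    }

Built and run with LAM's wrappers after lamboot: mpicc spin.c -o spin,
then mpirun -np 2 ./spin. If both ranks report roughly the -np 1 time,
they ran on separate CPUs; roughly double means they shared CPU 0.)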
> > -- Here are the results from lamboot -d bhost.lam
> >
> > -bash-2.05b$ lamboot -d ~/bhost.lam
> > n0<29283> ssi:boot: Opening
> > n0<29283> ssi:boot: opening module globus
> > n0<29283> ssi:boot: initializing module globus
> > n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
> > n0<29283> ssi:boot: module not available: globus
> > n0<29283> ssi:boot: opening module rsh
> > n0<29283> ssi:boot: initializing module rsh
> > n0<29283> ssi:boot:rsh: module initializing
> > n0<29283> ssi:boot:rsh:agent: rsh
> > n0<29283> ssi:boot:rsh:username: <same>
> > n0<29283> ssi:boot:rsh:verbose: 1000
> > n0<29283> ssi:boot:rsh:algorithm: linear
> > n0<29283> ssi:boot:rsh:priority: 10
> > n0<29283> ssi:boot: module available: rsh, priority: 10
> > n0<29283> ssi:boot: finalizing module globus
> > n0<29283> ssi:boot:globus: finalizing
> > n0<29283> ssi:boot: closing module globus
> > n0<29283> ssi:boot: Selected boot module rsh
> >
> > LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
> >
> > n0<29283> ssi:boot:base: looking for boot schema in following directories:
> > n0<29283> ssi:boot:base:   <current directory>
> > n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
> > n0<29283> ssi:boot:base:   $LAMHOME/etc
> > n0<29283> ssi:boot:base:   /usr/local/etc
> > n0<29283> ssi:boot:base: looking for boot schema file:
> > n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
> > n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
> > n0<29283> ssi:boot:rsh: found the following hosts:
> > n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
> > n0<29283> ssi:boot:rsh: resolved hosts:
> > n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
> > n0<29283> ssi:boot:rsh: starting RTE procs
> > n0<29283> ssi:boot:base:linear: starting
> > n0<29283> ssi:boot:base:server: opening server TCP socket
> > n0<29283> ssi:boot:base:server: opened port 49832
> > n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
> > n0<29283> ssi:boot:rsh: launching locally
> > hboot: performing tkill
> > hboot: tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
> > tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> > hboot: booting...
> > hboot: fork /usr/local/bin/lamd
> > [1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
> > n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:base:server: expecting connection from finite list
> > hboot: attempting to execute
> > n-1<29286> ssi:boot: Opening
> > n-1<29286> ssi:boot: opening module globus
> > n-1<29286> ssi:boot: initializing module globus
> > n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
> > n-1<29286> ssi:boot: module not available: globus
> > n-1<29286> ssi:boot: opening module rsh
> > n-1<29286> ssi:boot: initializing module rsh
> > n-1<29286> ssi:boot:rsh: module initializing
> > n-1<29286> ssi:boot:rsh:agent: rsh
> > n-1<29286> ssi:boot:rsh:username: <same>
> > n-1<29286> ssi:boot:rsh:verbose: 1000
> > n-1<29286> ssi:boot:rsh:algorithm: linear
> > n-1<29286> ssi:boot:rsh:priority: 10
> > n-1<29286> ssi:boot: module available: rsh, priority: 10
> > n-1<29286> ssi:boot: finalizing module globus
> > n-1<29286> ssi:boot:globus: finalizing
> > n-1<29286> ssi:boot: closing module globus
> > n-1<29286> ssi:boot: Selected boot module rsh
> > n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
> > n0<29283> ssi:boot:base:server: this connection is expected (n0)
> > n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
> > n0<29283> ssi:boot:base:server: closing server socket
> > n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
> > n0<29283> ssi:boot:base:server: connected
> > n0<29283> ssi:boot:base:server: sending number of links (1)
> > n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:base:server: finished sending
> > n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
> > n0<29283> ssi:boot:base:linear: finished
> > n0<29283> ssi:boot:rsh: all RTE procs started
> > n0<29283> ssi:boot:rsh: finalizing
> > n0<29283> ssi:boot: Closing
> > n-1<29286> ssi:boot:rsh: finalizing
> > n-1<29286> ssi:boot: Closing
> >
> > _______________________________________________
> > freebsd-cluster@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-cluster
> > To unsubscribe, send any mail to "freebsd-cluster-unsubscribe@freebsd.org"
>
> _______________________________________________
> freebsd-amd64@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-amd64
> To unsubscribe, send any mail to "freebsd-amd64-unsubscribe@freebsd.org"

--
Bob Willcox                 What's done to children, they will do to society.
bob@immure.com
Austin, TX