Date: Mon, 12 Apr 2004 09:36:18 -0400
From: "Jeff Racine" <jracine@maxwell.syr.edu>
To: "Bob Willcox" <bob@immure.com>
Cc: freebsd-cluster@freebsd.org
Subject: RE: LAM MPI on dual processor opteron box sees only one cpu...
Message-ID: <32A8B2CB12BFC84D8D11D872C787AA9A015E9AD2@EXCHANGE.forest.maxwell.syr.edu>
Hi Bob. Good to hear someone else has seen this behavior. Re: what
scheduler am I using... ULE... Thanks!

-- Jeff

-----Original Message-----
From: Bob Willcox [mailto:bob@immure.com]
Sent: Mon 4/12/2004 9:13 AM
To: Jeff Racine
Cc: Roland Wells; freebsd-cluster@freebsd.org; freebsd-amd64@freebsd.org
Subject: Re: LAM MPI on dual processor opteron box sees only one cpu...

On Mon, Apr 12, 2004 at 09:04:24AM -0400, Jeffrey Racine wrote:
> Hi Roland.
>
> I do get CPU #1 launched. This is not the problem.
>
> The problem appears to be with the way that -current is scheduling.
>
> With mpirun -np 2 I get the job running on CPU 0 (two instances on one
> proc). However, it turns out that with -np 4 I get the job running on
> CPU 0 and 1, though with 4 instances (and the associated overhead).
> Here is top for -np 4... notice that in the C column it is using both
> procs.
>
>   PID USERNAME PRI NICE  SIZE   RES STATE C  TIME   WCPU    CPU COMMAND
> 96090 jracine  131    0 7148K 2172K CPU1  1  0:19 44.53% 44.53% n_lam
> 96088 jracine  125    0 7148K 2172K RUN   0  0:18 43.75% 43.75% n_lam
> 96089 jracine  136    0 7148K 2172K RUN   1  0:19 42.19% 42.19% n_lam
> 96087 jracine  135    0 7188K 2248K RUN   0  0:19 41.41% 41.41% n_lam
>
> One run (once when I rebooted lam) did allocate the job correctly with
> -np 2, but this is not in general the case. Other systems I use,
> however, correctly farm out -np 2 to CPU 0 and 1...
>
> Thanks, and any suggestions welcome.

What scheduler are you using? I've seen this behavior on my 5-current
Athlon MP system when running two instances of setiathome with the
default SCHED_ULE scheduler. Sometimes it would run both setiathome
processes on the same CPU for hours (even days), leaving one CPU
essentially idle. When I switched to the SCHED_4BSD scheduler it ran
setiathome on both CPUs.

Bob
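(A note for the archives: on 5.x the scheduler is compiled into the
kernel, so trying Bob's suggestion means a rebuild. A minimal sketch,
assuming a custom config copied from GENERIC -- the name MYKERNEL is
illustrative:

    # in /usr/src/sys/amd64/conf/MYKERNEL
    #options        SCHED_ULE       # comment out the default ULE scheduler
    options         SCHED_4BSD      # traditional 4BSD scheduler instead

    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    shutdown -r now

Only one SCHED_* option can be configured at a time, hence the rebuild
and reboot.)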
> -- Jeff
>
> On Sun, 2004-04-11 at 14:20 -0500, Roland Wells wrote:
> > Jeffrey,
> > I am not familiar with the LAM MPI issue, but in a dual proc box you
> > should also get an additional line towards the bottom of your dmesg,
> > similar to:
> >
> > SMP: AP CPU #1 Launched!
> >
> > -Roland
> > -----Original Message-----
> > From: owner-freebsd-cluster@freebsd.org
> > [mailto:owner-freebsd-cluster@freebsd.org] On Behalf Of Jeffrey Racine
> > Sent: Saturday, April 10, 2004 5:22 PM
> > To: freebsd-amd64@freebsd.org; freebsd-cluster@freebsd.org
> > Subject: LAM MPI on dual processor opteron box sees only one cpu...
> >
> > Hi.
> >
> > I am converging on getting a new dual opteron box running. Now I am
> > setting up and testing LAM MPI; however, the OS is not farming out
> > the job as expected, and only uses one processor.
> >
> > This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a dual
> > processor PIV desktop. I am running 5-current. Basically, mpirun -np 1
> > binaryfile has the same runtime as mpirun -np 2 binaryfile, while on
> > the dual PIV box it runs in half the time. When I check top, with
> > mpirun -np 2 both instances run on CPU 0... here is the relevant
> > portion from top with -np 2...
> >
> > 29306 jracine   4 0 7188K 2448K sbwait 0 0:03 19.53% 19.53% n_lam
> > 29307 jracine 119 0 7148K 2372K CPU0   0 0:03 19.53% 19.53% n_lam
> >
> > I include output from laminfo, dmesg (cpu relevant info), and
> > lamboot -d bhost.lam... any suggestions most appreciated, and thanks
> > in advance!
> >
> > -- laminfo
> >
> >            LAM/MPI: 7.0.4
> >             Prefix: /usr/local
> >       Architecture: amd64-unknown-freebsd5.2
> >      Configured by: root
> >      Configured on: Sat Apr 10 11:22:02 EDT 2004
> >     Configure host: jracine.maxwell.syr.edu
> >         C bindings: yes
> >       C++ bindings: yes
> >   Fortran bindings: yes
> >        C profiling: yes
> >      C++ profiling: yes
> >  Fortran profiling: yes
> >      ROMIO support: yes
> >       IMPI support: no
> >      Debug support: no
> >       Purify clean: no
> >           SSI boot: globus (Module v0.5)
> >           SSI boot: rsh (Module v1.0)
> >           SSI coll: lam_basic (Module v7.0)
> >           SSI coll: smp (Module v1.0)
> >            SSI rpi: crtcp (Module v1.0.1)
> >            SSI rpi: lamd (Module v7.0)
> >            SSI rpi: sysv (Module v7.0)
> >            SSI rpi: tcp (Module v7.0)
> >            SSI rpi: usysv (Module v7.0)
> >
> > -- dmesg sees two cpus...
> >
> > CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
> >   Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
> >   Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
> >   AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow!+,3DNow!>
> > real memory  = 3623813120 (3455 MB)
> > avail memory = 3494363136 (3332 MB)
> > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> >  cpu0 (BSP): APIC ID:  0
> >  cpu1 (AP): APIC ID:  1
> >
> > -- bhost has the requisite information
> >
> > 128.230.130.10 cpu=2 user=jracine
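(To make the -np 1 vs. -np 2 timing comparison above easy to reproduce,
a minimal CPU-bound MPI test along these lines can be watched in top's
C column -- a sketch only; the file name and loop bound are arbitrary:

    /* spin.c -- each rank burns CPU and reports its elapsed time, so
     * top(1) shows which CPU every instance lands on. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long i;
        volatile double x = 0.0; /* volatile: keep the loop from being optimized away */
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        t0 = MPI_Wtime();
        for (i = 0; i < 500000000L; i++)  /* bound is arbitrary */
            x += 1.0;
        t1 = MPI_Wtime();

        printf("rank %d of %d: %.2f seconds\n", rank, size, t1 - t0);
        MPI_Finalize();
        return 0;
    }

Built and run with LAM's wrappers after lamboot: mpicc spin.c -o spin,
then mpirun -np 2 ./spin. If both ranks report roughly the -np 1 time,
they ran on separate CPUs; roughly double means they shared CPU 0.)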
> > -- Here are the results from lamboot -d bhost.lam
> >
> > -bash-2.05b$ lamboot -d ~/bhost.lam
> > n0<29283> ssi:boot: Opening
> > n0<29283> ssi:boot: opening module globus
> > n0<29283> ssi:boot: initializing module globus
> > n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
> > n0<29283> ssi:boot: module not available: globus
> > n0<29283> ssi:boot: opening module rsh
> > n0<29283> ssi:boot: initializing module rsh
> > n0<29283> ssi:boot:rsh: module initializing
> > n0<29283> ssi:boot:rsh:agent: rsh
> > n0<29283> ssi:boot:rsh:username: <same>
> > n0<29283> ssi:boot:rsh:verbose: 1000
> > n0<29283> ssi:boot:rsh:algorithm: linear
> > n0<29283> ssi:boot:rsh:priority: 10
> > n0<29283> ssi:boot: module available: rsh, priority: 10
> > n0<29283> ssi:boot: finalizing module globus
> > n0<29283> ssi:boot:globus: finalizing
> > n0<29283> ssi:boot: closing module globus
> > n0<29283> ssi:boot: Selected boot module rsh
> >
> > LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
> >
> > n0<29283> ssi:boot:base: looking for boot schema in following directories:
> > n0<29283> ssi:boot:base:   <current directory>
> > n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
> > n0<29283> ssi:boot:base:   $LAMHOME/etc
> > n0<29283> ssi:boot:base:   /usr/local/etc
> > n0<29283> ssi:boot:base: looking for boot schema file:
> > n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
> > n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
> > n0<29283> ssi:boot:rsh: found the following hosts:
> > n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
> > n0<29283> ssi:boot:rsh: resolved hosts:
> > n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
> > n0<29283> ssi:boot:rsh: starting RTE procs
> > n0<29283> ssi:boot:base:linear: starting
> > n0<29283> ssi:boot:base:server: opening server TCP socket
> > n0<29283> ssi:boot:base:server: opened port 49832
> > n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
> > n0<29283> ssi:boot:rsh: launching locally
> > hboot: performing tkill
> > hboot: tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
> > tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
> > hboot: booting...
> > hboot: fork /usr/local/bin/lamd
> > [1] 29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
> > n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:base:server: expecting connection from finite list
> > hboot: attempting to execute
> > n-1<29286> ssi:boot: Opening
> > n-1<29286> ssi:boot: opening module globus
> > n-1<29286> ssi:boot: initializing module globus
> > n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
> > n-1<29286> ssi:boot: module not available: globus
> > n-1<29286> ssi:boot: opening module rsh
> > n-1<29286> ssi:boot: initializing module rsh
> > n-1<29286> ssi:boot:rsh: module initializing
> > n-1<29286> ssi:boot:rsh:agent: rsh
> > n-1<29286> ssi:boot:rsh:username: <same>
> > n-1<29286> ssi:boot:rsh:verbose: 1000
> > n-1<29286> ssi:boot:rsh:algorithm: linear
> > n-1<29286> ssi:boot:rsh:priority: 10
> > n-1<29286> ssi:boot: module available: rsh, priority: 10
> > n-1<29286> ssi:boot: finalizing module globus
> > n-1<29286> ssi:boot:globus: finalizing
> > n-1<29286> ssi:boot: closing module globus
> > n-1<29286> ssi:boot: Selected boot module rsh
> > n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
> > n0<29283> ssi:boot:base:server: this connection is expected (n0)
> > n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
> > n0<29283> ssi:boot:base:server: closing server socket
> > n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
> > n0<29283> ssi:boot:base:server: connected
> > n0<29283> ssi:boot:base:server: sending number of links (1)
> > n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
> > n0<29283> ssi:boot:base:server: finished sending
> > n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
> > n0<29283> ssi:boot:base:linear: finished
> > n0<29283> ssi:boot:rsh: all RTE procs started
> > n0<29283> ssi:boot:rsh: finalizing
> > n0<29283> ssi:boot: Closing
> > n-1<29286> ssi:boot:rsh: finalizing
> > n-1<29286> ssi:boot: Closing
> >
> > _______________________________________________
> > freebsd-cluster@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-cluster
> > To unsubscribe, send any mail to "freebsd-cluster-unsubscribe@freebsd.org"
>
> _______________________________________________
> freebsd-amd64@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-amd64
> To unsubscribe, send any mail to "freebsd-amd64-unsubscribe@freebsd.org"

--
Bob Willcox                 What's done to children, they will do to society.
bob@immure.com
Austin, TX