Date:      Sun, 25 May 2003 17:17:30 -0400
From:      Anthony Schneider <anthony@x-anthony.com>
To:        freebsd-current@freebsd.org
Subject:   Re: mpi + shmem issues
Message-ID:  <20030525211730.GA5226@x-anthony.com>
In-Reply-To: <20030525064929.GA96588@x-anthony.com>
References:  <20030525064929.GA96588@x-anthony.com>


additional information:

when recompiling mpich with debugging symbols, configure fails on:
checking that usable shared memory locks were found... no

so, does this mean that mpich somehow exhausted all shmem locks?
after running the program only 10 times, i see this as implausible,
considering
	a) mpich (presumably in MPI_Init()) would only want 1 or
	   2 locks on init
and
	b) any shared memory locks mpich grabs should be freed
	   upon process completion (whether clean or not) by the
	   operating system, no?

well, configure with --with-device=ch_shmem:-usesysv succeeds
(checking that usable shared memory locks were found... yes), so
I will try this out.

But for reference, can anyone make a guess as to why/how a shared
memory application can exhaust locks like this?
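
One possible explanation, offered as a sketch rather than a diagnosis:
System V semaphore sets are kernel objects that outlive the process that
created them. They stay allocated until something calls
semctl(..., IPC_RMID) or they are removed by hand with ipcrm, so a program
that aborts before its cleanup runs can leak a set on every run until the
system-wide limit is reached and semget() starts failing. The C fragment
below demonstrates only that persistence. The function names and the key
argument are made up for illustration, and nothing here is taken from
mpich.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that creates a one-semaphore SysV set
 * under `key` and then exits WITHOUT removing it. Returns the id of the
 * set if it still exists after the child is gone, or -1 on any failure.
 */
int leaked_set_after_child_exit(key_t key)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: create the set, then exit with no cleanup at all. */
        int id = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
        _exit(id < 0 ? 1 : 0);
    }

    /* Parent: wait until the creating process has fully exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Look the set up again; success means it survived its creator. */
    return semget(key, 0, 0600);
}

/* Hypothetical helper: explicitly release a set; returns 0 on success. */
int remove_set(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

Calling leaked_set_after_child_exit() with an unused key should return a
valid id even though the creating process is gone, and passing that id to
remove_set() frees it. Sets left behind this way are listed by ipcs -s and
can be removed with ipcrm, which may avoid the need for a reboot. Note also
that the test program quoted below returns without calling MPI_Finalize(),
which could skip whatever cleanup the device would otherwise perform.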

thank you.

-Anthony.


On Sun, May 25, 2003 at 02:49:30AM -0400, Anthony Schneider wrote:
> Hello,
> My machine is a dual athlon:
> FreeBSD pickle. 5.1-BETA FreeBSD 5.1-BETA #6: Sun May 25 02:16:15 EDT 2003
> anthony@pickle.:/usr/src/sys/i386/compile/PICKLE  i386
>
> I started having this issue, which may or may not exist on uniprocessor
> systems or 4.x systems.  I built mpi with ch_shmem device for shared memory
> programs (instead of the more common rsh/ssh), and something strange
> happens.  For even the most basic little program, the program will launch
> fine (usually) the first time i run it after the system boots, but after a
> few executions, execution starts failing consistently until after i reboot.
>
> as an example, here is a small acknowledgment program:
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main (int argc, char *argv[]) {
>         int mpiRank, mpiSize;
>
>         MPI_Init (&argc, &argv);
>         MPI_Comm_rank (MPI_COMM_WORLD, &mpiRank);
>
>         printf ("#%d here\n", mpiRank);
>
>         return 0;
>
> }
>
> and here is the history of executing it:
>
> pickle:anthony:/home/anthony/src/mpi:6% mpirun -np 2 ./foo
> #0 here
> #1 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:7% mpirun -np 2 ./foo
> #0 here
> pickle:anthony:/home/anthony/src/mpi:8% #1 here
>
> pickle:anthony:/home/anthony/src/mpi:8% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:9% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:10% mpirun -np 2 ./foo
> #1 here
> #0 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:11% mpirun -np 2 ./foo
> #0 here
> #1 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:12% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:13% mpirun -np 2 ./foo
> #1 here
> #0 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:14% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:15% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:16% mpirun -np 2 ./foo
> semget failed for setnum =  0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:17% mpirun -np 2 ./foo
> semget failed for setnum =  0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:18% mpirun -np 2 ./foo
> semget failed for setnum =  0
> Abort trap (core dumped)
>
> ... (continues until i reboot)
>
> the first run that aborts is strange, but since it is not something
> i've witnessed previously, i'd like to forget that and focus on
> the repeated semget failures.  i would normally be looking into
> the mpi implementation (mpich 1.2.5), but since after semget fails
> once it never seems to succeed again with other mpi programs, i
> think this to be a freebsd problem.
>
> i'm running a (barely) custom kernel, with nothing added to it.
> i just cvsup'd and rebuilt less than an hour ago, and the problem
> has persisted from beta #5 through beta #6.
>
> any suggestions?
>
> thank you for your help.
>
> -Anthony.



