Date: Sun, 25 May 2003 02:49:30 -0400
From: Anthony Schneider <anthony@x-anthony.com>
To: freebsd-current@freebsd.org
Subject: mpi + shmem issues
Message-ID: <20030525064929.GA96588@x-anthony.com>

Hello,

My machine is a dual Athlon:

    FreeBSD pickle. 5.1-BETA FreeBSD 5.1-BETA #6: Sun May 25 02:16:15 EDT 2003 anthony@pickle.:/usr/src/sys/i386/compile/PICKLE i386

I started having this issue, which may or may not exist on uniprocessor systems or 4.x systems. I built MPI with the ch_shmem device for shared memory programs (instead of the more common rsh/ssh), and something strange happens. Even the most basic little program will (usually) launch fine the first time I run it after the system boots, but after a few executions it starts failing consistently until I reboot.

As an example, here is a small acknowledgment program:

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int mpiRank, mpiSize;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &mpiRank);
        printf ("#%d here\n", mpiRank);
        return 0;
    }

and here is the history of executing it:

    pickle:anthony:/home/anthony/src/mpi:6% mpirun -np 2 ./foo
    #0 here
    #1 here
    Child process exited unexpectedly 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:7% mpirun -np 2 ./foo
    #0 here
    pickle:anthony:/home/anthony/src/mpi:8% #1 here
    pickle:anthony:/home/anthony/src/mpi:8% mpirun -np 2 ./foo
    #0 here
    #1 here
    pickle:anthony:/home/anthony/src/mpi:9% mpirun -np 2 ./foo
    #0 here
    #1 here
    pickle:anthony:/home/anthony/src/mpi:10% mpirun -np 2 ./foo
    #1 here
    #0 here
    Child process exited unexpectedly 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:11% mpirun -np 2 ./foo
    #0 here
    #1 here
    Child process exited unexpectedly 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:12% mpirun -np 2 ./foo
    #0 here
    #1 here
    pickle:anthony:/home/anthony/src/mpi:13% mpirun -np 2 ./foo
    #1 here
    #0 here
    Child process exited unexpectedly 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:14% mpirun -np 2 ./foo
    #0 here
    #1 here
    pickle:anthony:/home/anthony/src/mpi:15% mpirun -np 2 ./foo
    #0 here
    #1 here
    pickle:anthony:/home/anthony/src/mpi:16% mpirun -np 2 ./foo
    semget failed for setnum = 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:17% mpirun -np 2 ./foo
    semget failed for setnum = 0
    Abort trap (core dumped)
    pickle:anthony:/home/anthony/src/mpi:18% mpirun -np 2 ./foo
    semget failed for setnum = 0
    Abort trap (core dumped)
    ... (continues until I reboot)

The first run that aborts is strange, but since it is not something I've witnessed previously, I'd like to set that aside and focus on the repeated semget failures.

I would normally be looking into the MPI implementation (MPICH 1.2.5), but once semget fails it never seems to succeed again, even for other MPI programs, so I think this is a FreeBSD problem. I'm running a (barely) custom kernel, with nothing added to it. I just cvsup'd and rebuilt less than an hour ago, and the problem has persisted from beta #5 through beta #6.

Any suggestions? Thank you for your help.

-Anthony.
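
One way to narrow down the repeated "semget failed" errors might be a small probe that calls semget() directly, outside of MPI, and reports the errno. The sketch below is not taken from MPICH or FreeBSD; `probe_semget`, `describe_semget_error` and the `SEMPROBE_MAIN` macro are names invented here for illustration. It assumes only the standard System V semaphore calls. If the probe reports ENOSPC once mpirun has started failing, that would suggest the system-wide semaphore limit has been reached, possibly by sets left behind by earlier runs; this is an assumption, not something the transcript confirms.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: create a private set of nsems semaphores and remove
 * it again. Returns 0 on success, otherwise the errno value reported by
 * semget() or semctl(). */
int probe_semget(int nsems)
{
    assert(nsems > 0);

    int id = semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);
    if (id == -1)
        return errno;

    /* Remove the set immediately so the probe itself leaks nothing. */
    if (semctl(id, 0, IPC_RMID) == -1)
        return errno;

    return 0;
}

/* Hypothetical helper: turn the probe result into a short message.
 * ENOSPC is the errno semget() uses when a system semaphore limit
 * would be exceeded. */
const char *describe_semget_error(int err)
{
    switch (err) {
    case 0:
        return "ok";
    case ENOSPC:
        return "system-wide semaphore limit reached";
    default:
        return strerror(err);
    }
}

/* Build with -DSEMPROBE_MAIN (an illustrative macro name) to get a
 * standalone program. */
#ifdef SEMPROBE_MAIN
int main(void)
{
    int err = probe_semget(1);
    printf("semget probe: %s\n", describe_semget_error(err));
    return err == 0 ? 0 : 1;
}
#endif
```

Running the probe right after boot and again once mpirun has begun to fail would show whether plain semget() breaks too. `ipcs -s` lists the semaphore sets that currently exist, and `ipcrm` can remove stale ones without a reboot. It may also be worth noting that the test program returns from main without calling MPI_Finalize, which the MPI standard requires; whether that is what leaves semaphores behind in this case is a guess.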
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20030525064929.GA96588>