From: Anthony Schneider
To: freebsd-current@freebsd.org
Date: Sun, 25 May 2003 17:17:30 -0400
Subject: Re: mpi + shmem issues
Message-ID: <20030525211730.GA5226@x-anthony.com>
In-Reply-To: <20030525064929.GA96588@x-anthony.com>

additional information: when recompiling mpich for debugging symbols,
configure fails on:

checking that usable shared memory locks were found... no

so, does this mean that mpich somehow exhausted all shmem locks?
after running the program only 10 times, i see this as infeasible,
considering a) mpich (presumably in MPI_Init()) would only want 1 or 2
locks on init and b) any shared memory locks mpich grabs should be freed
upon process completion (whether clean or not) by the operating system, no?

well, configure with --with-device=ch_shmem:-usesysv succeeds (checking
that usable shared memory locks were found... yes), so I will try this
out. But for reference, can anyone make a guess as to why/how a shared
memory application can exhaust locks like this?

thank you.
-Anthony.

On Sun, May 25, 2003 at 02:49:30AM -0400, Anthony Schneider wrote:
> Hello,
> My machine is a dual athlon:
> FreeBSD pickle. 5.1-BETA FreeBSD 5.1-BETA #6: Sun May 25 02:16:15 EDT 2003
> anthony@pickle.:/usr/src/sys/i386/compile/PICKLE i386
>
> I started having this issue, which may or may not exist on uniprocessor
> systems or 4.x systems. I built mpi with ch_shmem device for shared memory
> programs (instead of the more common rsh/ssh), and something strange
> happens. For even the most basic little program, the program will launch
> fine (usually) the first time i run it after the system boots, but after a
> few executions, execution starts failing consistently until after i reboot.
>
> as an example, here is a small acknowledgment program:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main (int argc, char *argv[]) {
>   int mpiRank, mpiSize;
>
>   MPI_Init (&argc, &argv);
>   MPI_Comm_rank (MPI_COMM_WORLD, &mpiRank);
>
>   printf ("#%d here\n", mpiRank);
>
>   return 0;
>
> }
>
> and here is the history of executing it:
>
> pickle:anthony:/home/anthony/src/mpi:6% mpirun -np 2 ./foo
> #0 here
> #1 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:7% mpirun -np 2 ./foo
> #0 here
> pickle:anthony:/home/anthony/src/mpi:8% #1 here
>
> pickle:anthony:/home/anthony/src/mpi:8% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:9% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:10% mpirun -np 2 ./foo
> #1 here
> #0 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:11% mpirun -np 2 ./foo
> #0 here
> #1 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:12% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:13% mpirun -np 2 ./foo
> #1 here
> #0 here
> Child process exited unexpectedly 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:14% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:15% mpirun -np 2 ./foo
> #0 here
> #1 here
> pickle:anthony:/home/anthony/src/mpi:16% mpirun -np 2 ./foo
> semget failed for setnum = 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:17% mpirun -np 2 ./foo
> semget failed for setnum = 0
> Abort trap (core dumped)
> pickle:anthony:/home/anthony/src/mpi:18% mpirun -np 2 ./foo
> semget failed for setnum = 0
> Abort trap (core dumped)
>
> ...
> (continues until i reboot)
>
> the first run that aborts is strange, but since it is not something
> i've witnessed previously, i'd like to forget that and focus on
> the repeated semget failures. i would normally be looking into
> the mpi implementation (mpich 1.2.5), but since after semget fails
> once it never seems to succeed again with other mpi programs, i
> think this to be a freebsd problem.
>
> i'm running a (barely) custom kernel, with nothing added to it.
> i just cvsup'd and rebuilt less than an hour ago, and the problem
> has persisted from beta #5 through beta #6.
>
> any suggestions?
>
> thank you for your help.
>
> -Anthony.
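
for reference, a minimal sketch of the behaviour asked about above,
assuming the locks in question are System V semaphore sets (the semget
failure suggests so, but that is an assumption about mpich's ch_shmem
device, not something confirmed here). System V semaphore sets are not
tied to the lifetime of the process that created them: they stay
allocated in the kernel until someone removes them with
semctl(IPC_RMID) or ipcrm, or the machine reboots. The helper names
below (create_lock_set, remove_lock_set, lock_set_exists,
leak_from_child) are hypothetical and only illustrate the idea.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

/* Hypothetical helper: create a private System V semaphore set holding
 * one semaphore. Returns the set id, or -1 with errno set (for example
 * ENOSPC once the system-wide limit on sets has been reached). */
int create_lock_set(void)
{
    return semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
}

/* Hypothetical helper: explicitly remove a semaphore set.
 * Returns 0 on success, -1 on error. */
int remove_lock_set(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}

/* Hypothetical helper: returns 1 if the set is still known to the
 * kernel, 0 otherwise (GETVAL fails once the set has been removed). */
int lock_set_exists(int semid)
{
    return semctl(semid, 0, GETVAL) != -1;
}

/* Fork a child that creates a semaphore set and exits WITHOUT removing
 * it, much like a program that aborts before cleaning up. The id is
 * passed back to the parent through a pipe. Returns the leaked set id,
 * or -1 on failure. */
int leak_from_child(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: create the set, report its id, exit with no cleanup. */
        close(fds[0]);
        int semid = create_lock_set();
        if (write(fds[1], &semid, sizeof semid) != (ssize_t)sizeof semid)
            _exit(1);
        _exit(0);
    }

    /* Parent: read the id and wait for the child to be gone. */
    close(fds[1]);
    int semid = -1;
    ssize_t n = read(fds[0], &semid, sizeof semid);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    return n == (ssize_t)sizeof semid ? semid : -1;
}
```

calling leak_from_child() and then lock_set_exists() on the returned id
from the parent should show the set still present after the child has
exited; it only goes away after remove_lock_set(). If something similar
is happening here, each aborted run would leave its sets behind until
the limit is hit. ipcs -s lists the sets currently allocated and
ipcrm -s <id> removes one, which may be worth trying instead of a
reboot.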