From owner-freebsd-stable@FreeBSD.ORG Thu Mar 1 01:19:09 2007 Return-Path: X-Original-To: freebsd-stable@freebsd.org Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 710D316A402 for ; Thu, 1 Mar 2007 01:19:09 +0000 (UTC) (envelope-from doconnor@gsoft.com.au) Received: from cain.gsoft.com.au (cain.gsoft.com.au [203.31.81.10]) by mx1.freebsd.org (Postfix) with ESMTP id 6628A13C442 for ; Thu, 1 Mar 2007 01:19:08 +0000 (UTC) (envelope-from doconnor@gsoft.com.au) Received: from inchoate.gsoft.com.au (inchoate.gsoft.com.au [203.31.81.25]) (authenticated bits=0) by cain.gsoft.com.au (8.13.8/8.13.8) with ESMTP id l211J6En061646 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2007 11:49:06 +1030 (CST) (envelope-from doconnor@gsoft.com.au) From: "Daniel O'Connor" To: freebsd-stable@freebsd.org Date: Thu, 1 Mar 2007 11:48:54 +1030 User-Agent: KMail/1.9.5 MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart6375052.DeoMG3dhqs"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Message-Id: <200703011149.02601.doconnor@gsoft.com.au> X-Spam-Score: -3.161 () ALL_TRUSTED,AWL,BAYES_00 X-Scanned-By: MIMEDefang 2.58 on 203.31.81.10 Subject: fork wedging (I think) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2007 01:19:09 -0000 --nextPart6375052.DeoMG3dhqs Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline We run a set of meteor and wind radar systems based on FreeBSD and one of t= hem I am seeing an odd problem that does not occur very often (once a month perhaps). The main process we use for gathering data regularly fork()s children off to process data (at each 30 minute acquisition cycle). The data is read from a= n=20 external acquisition chassis using a (slow) custom PCI card. Typical data=20 rates are 1-2Mbyte/sec. It will also fork off processes to send email, although unfortunately I don't know how to extract a Tcl stack trace from=20 inside gdb :( The problem seems to occur when the acquisition process (called the Recorde= r) tries to fork off a new child. I gathered some information during an earlier failure (mid January) show below. I tried to get a crash dump today when it failed again but as I ran sysctl -a the system wedged. I think processes got stuck in allproc (I pressed ctrl-t on an ls process running another session opened before I ran sysctl). UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND 19999 1453 1 44 -8 -5 17056 11380 piperd Igdb $GSHOME/libexec/Recorder GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain condition= s. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"... (gdb) eureka:~>fg gdb /usr/home/radar/skiymet/libexec/Recorder (gdb) (gdb) attach 1453 Attaching to program: /usr/home/radar/skiymet/libexec/Recorder, process 1453 Reading symbols from /usr/local/lib/libtcl84.so.1...done. Loaded symbols for /usr/local/lib/libtcl84.so.1 Reading symbols from /lib/libm.so.4...done. Loaded symbols for /lib/libm.so.4 Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /usr/home/radar/skiymet/lib/acquisition//synccheck.plu= gin...done. Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//synccheck.plugin Reading symbols from /usr/home/radar/skiymet/lib/acquisition//integrate.plu= gin...done. Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//integrate.plugin Reading symbols from /libexec/ld-elf.so.1...done. Loaded symbols for /libexec/ld-elf.so.1 0x00000008009c395c in read () from /lib/libc.so.6 (gdb) bt #0 0x00000008009c395c in read () from /lib/libc.so.6 #1 0x000000080072f77f in TclpCreateProcess () from /usr/local/lib/libtcl84= =2Eso.1 #2 0x0000000800717d25 in TclCreatePipeline () from /usr/local/lib/libtcl84= =2Eso.1 #3 0x00000008007186d0 in Tcl_OpenCommandChannel () from /usr/local/lib/lib= tcl84.so.1 #4 0x0000000800704af8 in Tcl_ExecObjCmd () from /usr/local/lib/libtcl84.so= =2E1 #5 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #6 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #7 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so= =2E1 #8 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.= so.1 #9 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #10 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #11 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so= =2E1 #12 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.= so.1 #13 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #14 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1 #15 0x0000000800721511 in Tcl_AfterObjCmd () from /usr/local/lib/libtcl84.s= o.1 #16 0x0000000800720ccc in Tcl_CreateTimerHandler () from /usr/local/lib/lib= tcl84.so.1 #17 0x0000000800711fea in Tcl_ServiceEvent () from /usr/local/lib/libtcl84.= so.1 #18 0x0000000800712357 in Tcl_DoOneEvent () from /usr/local/lib/libtcl84.so= =2E1 #19 0x000000000041b272 in gsio_read (interp=3D0x58d400, iov=3D0xe14408, iov= cnt=3D4, timeout=3D81000) at gsio.c:234 #20 0x000000000040a3fb in rawdata_wakeup (interp=3D0x58d400, inqueue=3D0x7f= ffffff6aa0, outqueue=3D0xdf, flags=3D0x7fffffffa584, token=3D0xffffff000713c340) at data.c:179 #21 0x00000000004327b3 in proc_run (interp=3D0x58d400) at procedure.c:528 #22 0x00000000004306b5 in ud_Stream (cd=3D0xd, interp=3D0x58d400, argc=3D2,= argv=3D0x0) at procedure.c:157 #23 0x00000008006c840b in TclInvokeStringCommand () from /usr/local/lib/lib= tcl84.so.1 #24 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #25 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #26 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so= =2E1 #27 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.= so.1 #28 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #29 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #30 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so= =2E1 #31 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.= so.1 #32 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #33 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #34 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so= =2E1 #35 0x00000008006cbc86 in Tcl_EvalObjEx () from /usr/local/lib/libtcl84.so.1 #36 0x00000008006ceca2 in Tcl_CatchObjCmd () from /usr/local/lib/libtcl84.s= o.1 #37 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #38 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84= =2Eso.1 #39 0x00000008006f485a in Tcl_ExprObj () from /usr/local/lib/libtcl84.so.1 #40 0x00000008006ca3b3 in Tcl_ExprBooleanObj () from /usr/local/lib/libtcl8= 4.so.1 #41 0x00000008006d1979 in Tcl_IfObjCmd () from /usr/local/lib/libtcl84.so.1 #42 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl= 84.so.1 #43 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1 #44 0x000000080070b092 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so= =2E1 #45 0x000000080070ea30 in Tcl_Main () from /usr/local/lib/libtcl84.so.1 #46 0x00000000004281fe in main (argc=3D13, argv=3D0x7fffffff6aa0) at main.c= :49 (gdb) #19 0x000000000041b272 in gsio_read (interp=3D0x58d400, iov=3D0xe14408, iov= cnt=3D4, timeout=3D81000) at gsio.c:234 234 Tcl_DoOneEvent(TCL_ALL_EVENTS | TCL_DONT_WAIT); (gdb) #20 0x000000000040a3fb in rawdata_wakeup (interp=3D0x58d400, inqueue=3D0x7f= ffffff6aa0, outqueue=3D0xdf, flags=3D0x7fffffffa584, token=3D0xffffff000713c340) at data.c:179 179 if (dop->do_ifmod->m.ifmod.if_read(interp, dop->do_iov, DATA_NB= UFS, timeout)) eureka:~>kill 1453 eureka:~>kill 1453 1453: No such process eureka:~>ps -axlwww | grep Rec UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND 19999 12293 1 44 96 -5 17056 16 user m Dkill 12293 eureka:~>kill -9 12293 eureka:~>kill -9 12293 eureka:~>fstat /dev/gsio0 USER CMD PID FD MOUNT INUM MODE SZ|DV R/W NAME radar Recorder 12293 3 /dev 149 crw-rw-r-- gsio0 rw /dev/= gsio0 Interestingly close-on-exec is set on that file descriptor, so it shouldn't= =20 be keeping the gsio0 device busy like that. Does anyone have any suggestions for debugging it? I will try and get a cra= sh dump the next time it happens although it's going to be difficult if I can't use sysctl :( (I don't have console access and the system is in a very remote location ju= st to make it extra fun) The FreeBSD version is 6.1 with some minor mods to add the various custom drivers we're using. Anyone have any suggestions for tracking the problem down? Or better yet a solution :) I can upgrade it to RELENG_6 if it will fix the problem but I'd rather not do so without a reasonable idea it will improve things. Thanks. =2D-=20 Daniel O'Connor software and network engineer for Genesis Software - http://www.gsoft.com.au "The nice thing about standards is that there are so many of them to choose from." -- Andrew Tanenbaum GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C --nextPart6375052.DeoMG3dhqs Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQBF5ioG5ZPcIHs/zowRAk0UAJwN56HBPsbnfbnfz/ahtP1+ls8uawCbBWDr G3E2kFexpzj1Cx1ZDn4ftHs= =5sDk -----END PGP SIGNATURE----- --nextPart6375052.DeoMG3dhqs--