Date: Thu, 10 Jan 2008 14:26:40 +0100 (CET) From: Lars Erik Gullerud <lerik@nolink.net> To: freebsd-stable@freebsd.org, rancid-discuss@shrubbery.net Subject: Rancid/Expect failing on FreeBSD/SMP systems Message-ID: <20080110132211.D75682@electra.nolink.net>
next in thread | raw e-mail | index | archive | help
My apologies for posting this both to the Rancid list and FreeBSD-STABLE, however I am not sure where to start troubleshooting this issue - I am suspecting it is a FreeBSD issue, but I am thinking we are probably not the only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (since it is quite popular in ISP environments), so hopefully someone can look at it from the RANCID angle and give some helpful input on how to troubleshoot this further. The problem: After finally giving in and starting to phase out some of our oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on some fresh hardware, I got around to start moving our RANCID server. This however, has been the start of a real nightmare. I don't think the problems I am seeing are in RANCID itself, however it can be reliable reproduced every time i run RANCID and I have not been able to reproduce it in any other way with pure expect test-cases directly. What happens: Expect processes "hang" during RANCID runs, and go into infinite loops eating 100% CPU (on one CPU core). The problem is reliably reproduced everytime we do a full rancid-run, but the actual device it chokes on varies between runs so it is not device-related. It does seem to happen most often when collecting Juniper M-series gear with large configurations though, using jrancid and ssh. We can NOT seem to reproduce it by running jrancid (or any other) on a single device at at time - which is somewhat confusing at is DOES happen when setting PAR_COUNT to 1 and doing a rancid-run (which should IMHO be pretty much the same as doing sequential single device runs...) Our environment: We run RANCID extensively to collect a few hundred devices, including Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone, FortiNet/FortiGate, etc. We want to start storing CPE configs in addition to our own core gear in RANCID now, which means we will be putting several thousand routers into RANCID, which also explains the need for fresher hardware... RANCID version does not seem to matter, I have tested with both some ancient 2.3.0 scripts and 2.3.2a7, same behaviour. Using the same RANCID instance (I have tarballed it up and installed it on a bunch of servers, i.e. using the same CVS and the same router.db files etc.), it fails on: FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335) FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon) Both have perl-5.8.8_1, expect 5.43.0_3 and tcl-8.4.16,1 built from ports. It however seems to work fine on: Linux CentOS 4.5 x86-64, 4 x CPU cores (2 x dual Xeon 5130) FreeBSD 4.11 i386, UP kernel, 1 x CPU core (1 x single-core Xeon) FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4) (Linux box has Expect 5.42 and Tcl 8.3...) So it only seems to be on newer FreeBSD with SMP. (If anyone have RANCID working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me know...) Now, for some details, if anyone has any ideas. What is actually happening is this, when truss'ing the stuck Expect-process: fcntl(4,F_GETFL,) = 0 (0x0) fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device' fcntl(4,F_GETFL,) = 0 (0x0) fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device' <looping ad nauseum> So, which device is it trying to fcntl, and what is it trying to do? lsof shows the following: expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs expect 1417 rancid rtd VDIR 0,81 512 2 / expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null expect 1417 rancid 2r VCHR 0,24 0t0 24 /dev/null expect 1417 rancid 3r VCHR 0,24 0t0 24 /dev/null expect 1417 rancid 4r VCHR 0,24 0t0 24 /dev/null file descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null to BLOCKING mode (which is failing)? Why should it be playing with /dev/null at all? Well, digging a little, this is what the lsof output looked like 10 seconds earlier: expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs expect 1417 rancid rtd VDIR 0,81 512 2 / expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null expect 1417 rancid 1u PIPE 0x38bfcf8 0 ->0xffffff00038bfba0 expect 1417 rancid 2w VREG 0,86 76 7583772 /local (/dev/mfid0s1f) expect 1417 rancid 3u VCHR 0,108 0t0 108 /dev/ttyp2 expect 1417 rancid 4u VCHR 0,117 0t45 117 /dev/ptyp7 ssh 1418 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs ssh 1418 rancid rtd VDIR 0,81 512 2 / ssh 1418 rancid txt unknown file system type: 8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE ssh 1418 rancid 0u VCHR 0,118 0t0 118 /dev/ttyp7 ssh 1418 rancid 1u VCHR 0,118 0t0 118 /dev/ttyp7 ssh 1418 rancid 2u VCHR 0,118 0t0 118 /dev/ttyp7 ssh 1418 rancid 3w VREG 0,86 76 7583772 /local (/dev/mfid0s1f) ssh 1418 rancid 4u IPv4 0xffffff008c030240 0t0 TCP *:27776->*:49323 ssh 1418 rancid 5u VCHR 0,118 0t45 118 /dev/ttyp7 Here, fd 4 is actually a pty (pty7), which seems to be a fork to PID 1418, the ssh session to the router, and everything seems to be normal. PID 1418 is no longer there on the most recent lsof, so 1418 seems to have died(?) and PID 1417 now has /dev/null on its file descriptor 4. I don't know why that is, but why is it trying to fcntl it to Blocking I/O mode? Here is a gdb attach to the PID and a backtrace: (gdb) bt #0 0x0000000800aefc9c in fcntl () from /lib/libc.so.7 #1 0x00000000004072c5 in ?? () #2 0x00000008006a8c18 in StackSetBlockMode () from /usr/local/lib/libtcl84.so.1 #3 0x00000008006a8c54 in SetBlockMode () from /usr/local/lib/libtcl84.so.1 #4 0x00000008006acf75 in Tcl_SetChannelOption () from /usr/local/lib/libtcl84.so.1 #5 0x00000008006aeda0 in TclFinalizeIOSubsystem () from /usr/local/lib/libtcl84.so.1 #6 0x0000000800697f74 in Tcl_FinalizeThread () from /usr/local/lib/libtcl84.so.1 #7 0x0000000800698081 in Tcl_Finalize () from /usr/local/lib/libtcl84.so.1 #8 0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1 #9 0x0000000000409610 in ?? () #10 0x00000008006742be in TclInvokeStringCommand () from /usr/local/lib/libtcl84.so.1 #11 0x0000000800675944 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1 #12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1 #13 0x00000008006b55d9 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so.1 #14 0x00000008006b5690 in Tcl_EvalFile () from /usr/local/lib/libtcl84.so.1 #15 0x0000000000404f58 in ?? () #16 0x0000000000404d47 in ?? () >From the functions it is running in Tcl, it seems it is Tcl's cleanup code that is failing, when it is trying to restore a Tcl "channel" to normal mode during an exit event. This is where my clue runs out, and I am at a loss as to how to proceed from here. I have tried digging in both Tcl and Expect source code to see if can catch anything obvious, but alas, this is somewhat outside my area of expertise (I am a networking guy, not a programmer)... Any suggestions on how to proceed to find and fix this issue would be welcome, as the only other option for us is to abandon FreeBSD and go with Linux on the server, and we have already replaced too many FreeBSD boxes with Linux for my liking, I don't want to see yet another one go... Regards, Lars Erik Gullerud
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080110132211.D75682>