Date: Thu, 10 Jan 2008 16:04:02 +0100 (CET)
From: Mohacsi Janos <mohacsi@niif.hu>
To: Lars Erik Gullerud <lerik@nolink.net>
Cc: freebsd-stable@freebsd.org, rancid-discuss@shrubbery.net
Subject: Re: [rancid] Rancid/Expect failing on FreeBSD/SMP systems
Message-ID: <20080110160036.A40486@mignon.ki.iif.hu>
In-Reply-To: <20080110132211.D75682@electra.nolink.net>
References: <20080110132211.D75682@electra.nolink.net>
Hi Lars,

You should use the expect-devel port to avoid hanging on the pty. Have a
look at http://www.freebsd.org/cgi/query-pr.cgi?pr=118452

Janos Mohacsi
Network Engineer, Research Associate, Head of Network Planning and Projects
NIIF/HUNGARNET, HUNGARY
Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882

On Thu, 10 Jan 2008, Lars Erik Gullerud wrote:

> My apologies for posting this both to the Rancid list and FreeBSD-STABLE;
> however, I am not sure where to start troubleshooting this issue. I
> suspect it is a FreeBSD issue, but I am thinking we are probably not the
> only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (since it is
> quite popular in ISP environments), so hopefully someone can look at it
> from the RANCID angle and give some helpful input on how to troubleshoot
> this further.
>
> The problem: After finally giving in and starting to phase out some of
> our oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on some
> fresh hardware, I got around to moving our RANCID server. This, however,
> has been the start of a real nightmare. I don't think the problems I am
> seeing are in RANCID itself, but they can be reliably reproduced every
> time I run RANCID, and I have not been able to reproduce them in any
> other way with pure Expect test cases directly.
>
> What happens:
>
> Expect processes "hang" during RANCID runs and go into infinite loops
> eating 100% CPU (on one CPU core). The problem is reliably reproduced
> every time we do a full rancid-run, but the actual device it chokes on
> varies between runs, so it is not device-related. It does seem to happen
> most often when collecting Juniper M-series gear with large
> configurations, though, using jrancid and ssh.
>
> We can NOT seem to reproduce it by running jrancid (or any other script)
> on a single device at a time - which is somewhat confusing, as it DOES
> happen when setting PAR_COUNT to 1 and doing a rancid-run (which should
> IMHO be pretty much the same as doing sequential single-device runs...)
>
> Our environment:
>
> We run RANCID extensively to collect a few hundred devices, including
> Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone,
> FortiNet/FortiGate, etc. We want to start storing CPE configs in
> addition to our own core gear in RANCID now, which means we will be
> putting several thousand routers into RANCID - which also explains the
> need for fresher hardware...
>
> RANCID version does not seem to matter; I have tested with both some
> ancient 2.3.0 scripts and 2.3.2a7, same behaviour.
>
> Using the same RANCID instance (I have tarballed it up and installed it
> on a bunch of servers, i.e. using the same CVS and the same router.db
> files etc.), it fails on:
>
> FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
> FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)
>
> Both have perl-5.8.8_1, expect-5.43.0_3 and tcl-8.4.16,1 built from ports.
>
> It however seems to work fine on:
>
> Linux CentOS 4.5, x86-64, 4 x CPU cores (2 x dual Xeon 5130)
> FreeBSD 4.11, i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
> FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)
>
> (The Linux box has Expect 5.42 and Tcl 8.3...)
>
> So it only seems to happen on newer FreeBSD with SMP. (If anyone has
> RANCID working okay on FreeBSD 6.x/7.x on SMP systems at all, please let
> me know...)
>
> Now, for some details, if anyone has any ideas.
> What is actually happening is this, when truss'ing the stuck Expect
> process:
>
> fcntl(4,F_GETFL,)      = 0 (0x0)
> fcntl(4,F_SETFL,0x0)   ERR#25 'Inappropriate ioctl for device'
> fcntl(4,F_GETFL,)      = 0 (0x0)
> fcntl(4,F_SETFL,0x0)   ERR#25 'Inappropriate ioctl for device'
> <looping ad nauseam>
>
> So, which device is it trying to fcntl, and what is it trying to do?
> lsof shows the following:
>
> expect  1417 rancid  cwd  VDIR  0,86  2048  7607662  /local/rancid/var/core/configs
> expect  1417 rancid  rtd  VDIR  0,81   512        2  /
> expect  1417 rancid   0r  VCHR  0,24   0t0       24  /dev/null
> expect  1417 rancid   2r  VCHR  0,24   0t0       24  /dev/null
> expect  1417 rancid   3r  VCHR  0,24   0t0       24  /dev/null
> expect  1417 rancid   4r  VCHR  0,24   0t0       24  /dev/null
>
> File descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null
> to BLOCKING mode (which is failing)? Why should it be playing with
> /dev/null at all? Well, digging a little, this is what the lsof output
> looked like 10 seconds earlier:
>
> expect  1417 rancid  cwd  VDIR  0,86  2048  7607662  /local/rancid/var/core/configs
> expect  1417 rancid  rtd  VDIR  0,81   512        2  /
> expect  1417 rancid   0r  VCHR  0,24   0t0       24  /dev/null
> expect  1417 rancid   1u  PIPE  0x38bfcf8  0  ->0xffffff00038bfba0
> expect  1417 rancid   2w  VREG  0,86    76  7583772  /local (/dev/mfid0s1f)
> expect  1417 rancid   3u  VCHR  0,108  0t0      108  /dev/ttyp2
> expect  1417 rancid   4u  VCHR  0,117 0t45      117  /dev/ptyp7
> ssh     1418 rancid  cwd  VDIR  0,86  2048  7607662  /local/rancid/var/core/configs
> ssh     1418 rancid  rtd  VDIR  0,81   512        2  /
> ssh     1418 rancid  txt  unknown file system type: 8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
> ssh     1418 rancid   0u  VCHR  0,118  0t0      118  /dev/ttyp7
> ssh     1418 rancid   1u  VCHR  0,118  0t0      118  /dev/ttyp7
> ssh     1418 rancid   2u  VCHR  0,118  0t0      118  /dev/ttyp7
> ssh     1418 rancid   3w  VREG  0,86    76  7583772  /local (/dev/mfid0s1f)
> ssh     1418 rancid   4u  IPv4  0xffffff008c030240  0t0  TCP *:27776->*:49323
> ssh     1418 rancid   5u  VCHR  0,118 0t45      118  /dev/ttyp7
>
> Here, fd 4 is actually a pty (ptyp7), which seems to be a fork
> to PID 1418, the ssh session to the router, and everything seems to be
> normal.
>
> PID 1418 is no longer there in the most recent lsof, so 1418 seems to
> have died(?), and PID 1417 now has /dev/null on its file descriptor 4.
> I don't know why that is, but why is it trying to fcntl it to blocking
> I/O mode? Here is a gdb attach to the PID and a backtrace:
>
> (gdb) bt
> #0  0x0000000800aefc9c in fcntl () from /lib/libc.so.7
> #1  0x00000000004072c5 in ?? ()
> #2  0x00000008006a8c18 in StackSetBlockMode () from /usr/local/lib/libtcl84.so.1
> #3  0x00000008006a8c54 in SetBlockMode () from /usr/local/lib/libtcl84.so.1
> #4  0x00000008006acf75 in Tcl_SetChannelOption () from /usr/local/lib/libtcl84.so.1
> #5  0x00000008006aeda0 in TclFinalizeIOSubsystem () from /usr/local/lib/libtcl84.so.1
> #6  0x0000000800697f74 in Tcl_FinalizeThread () from /usr/local/lib/libtcl84.so.1
> #7  0x0000000800698081 in Tcl_Finalize () from /usr/local/lib/libtcl84.so.1
> #8  0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
> #9  0x0000000000409610 in ?? ()
> #10 0x00000008006742be in TclInvokeStringCommand () from /usr/local/lib/libtcl84.so.1
> #11 0x0000000800675944 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
> #12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
> #13 0x00000008006b55d9 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so.1
> #14 0x00000008006b5690 in Tcl_EvalFile () from /usr/local/lib/libtcl84.so.1
> #15 0x0000000000404f58 in ?? ()
> #16 0x0000000000404d47 in ?? ()
>
> From the functions it is running in Tcl, it seems it is Tcl's cleanup
> code that is failing when it tries to restore a Tcl "channel" to
> normal mode during an exit event.
>
> This is where my clue runs out, and I am at a loss as to how to
> proceed from here.
> I have tried digging in both the Tcl and Expect source code to see if
> I can catch anything obvious, but alas, this is somewhat outside my
> area of expertise (I am a networking guy, not a programmer)...
>
> Any suggestions on how to proceed to find and fix this issue would be
> welcome, as the only other option for us is to abandon FreeBSD and go
> with Linux on the server. We have already replaced too many FreeBSD
> boxes with Linux for my liking; I don't want to see yet another one
> go...
>
> Regards,
> Lars Erik Gullerud
>
> _______________________________________________
> Rancid-discuss mailing list
> Rancid-discuss@shrubbery.net
> http://www.shrubbery.net/mailman/listinfo.cgi/rancid-discuss