Date: Tue, 4 Dec 2007 01:44:44 -0600
From: Dan Nelson
To: "Support (Rudy)"
Cc: freebsd-questions@freebsd.org
Subject: Re: cron pile up! Lot's of "cron: running job (cron)"
Message-ID: <20071204074444.GB12505@dan.emsphone.com>
In-Reply-To: <4754DD17.6050701@monkeybrains.net>
X-OS: FreeBSD 7.0-BETA3

In the last episode (Dec 03), Support (Rudy) said:
> Below is part of the cron... Seems like any random cronjob can get
> clogged up... load varies from 0.2 to 1.0 on this dual-core box. I
> rebooted the box -- cron's continue to slowly pile up.
>
> One of the cronjobs that is 'stuck' is this one: /root/bin/raid-status.sh
> which can be found here:
> http://www.monkeybrains.net/~rudy/example/raid_status.html
>
> Forgot to mention, I am running:
> 6.2-STABLE FreeBSD 6.2-STABLE #3: Thu May 31 01:18:15 PDT 2007
>
> OH, ps shows this:
> 58383  ??  D      0:00.00 cron: running job (cron)
> 58384  ??  IVs    0:00.00 cron: running job (cron)

In general, when troubleshooting, "ps axlw" is a more useful command.
Among other columns, it adds MWCHAN, which details exactly why a
process is stuck in the D state.

Anyway, cron does a fork and then a vfork, creating a child (process 2)
and a grandchild (process 3). I'm sort of surprised at the amount of
code between vfork and exec in the grandchild in
/src/usr.sbin/cron/cron/do_command.c . Since process 3 is actually
using process 2's address space, one must be extremely careful not to
modify static variables or change other global state that would affect
the parent once it resumes execution, and all the logging,
environment-setting, and user-context calls are certain to mess with
the parent's state, especially with nss modules in the mix. I'd
personally recompile cron with all vforks replaced with fork and see
what happens.

It couldn't hurt to update to a newer kernel version along the RELENG_6
branch as a test, I guess. Note that your uname will change to
6.3-PRERELEASE, but apart from causing lsof to complain, you should be
okay.

> /var/log/cron has this entry:
> Dec 3 20:16:00 pita /usr/sbin/cron[58384]: (root) CMD (/root/bin/raid-status.sh CRON)
>
> BUT there is no 'raid-status.sh' stuck in the "ps axw". Seems like the
> vfork set off the cronjob, it ran, but then cron didn't 'stop' executing.
> Any debugging tips?

Can you tell if raid-status.sh ever ran? That is, is process 2 stuck at
the start of vfork or at the end?

BTW, here's a minimal example of the danger of putting code between
vfork and exec:

    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int i = 1;

        switch (vfork()) {
        case -1:
            err(1, "vfork failed");
            break;
        case 0:
            /* child: still running in the parent's address space */
            i = 2;
            execl("/usr/bin/true", "true", (char *)NULL);
            _exit(0);
            break;
        default:
            break;
        }
        printf("in parent, i is %d\n", i);
        return 0;
    }

-- 
	Dan Nelson
	dnelson@allantgroup.com