From: "Jack L. Stone"
To: freebsd-questions@freebsd.org
Date: Sun, 07 Sep 2003 10:29:00 -0500
Subject: Random crash and/or reboots
Message-Id: <3.0.5.32.20030907102900.01393408@sage-one.net>

Mail server: 4.8-RELEASE-p3

A while back, on a couple of occasions, I posted a query about some bad behavior on my mail server. For the past several months, it has been either crashing and rebooting or just rebooting. It's ALWAYS triggered by an SSH login, but at random and ONLY at the "su" to root. Usually the longest it goes before a reboot is about 2+ weeks, contrasted with 2 in a row right after a reboot -- so really no pattern. It has never happened directly at the console.

I have replaced every single piece of hardware, e.g., PSU, cables, NICs, and finally swapped the whole machine, except for the hard disk that contains the system. That had to remain in the new machine. Even then, I have since moved the entire system & contents to another new HD. Thus, I concluded it to be a software problem.

There are no indications of anything in the logs, and no core dumps. It just stops and reboots, at any random time it picks. Only a couple of times has it crashed without the remote login.

One tip was that I might have stale NFS mounttab entries -- I cleared them out, but the problem persisted. That tip was suggested when I mentioned that on a couple or more of the occurrences, I managed to get to the console quickly enough to see (in bright bold) "lockmgr locking against myself" -- or close to that. My google of that error does mention stale mounts, but mostly turns up esoteric code stuff. No fix found anywhere.

Then, on this list, I saw the thread about others having mysterious reboots, and one suggestion was to run lsof(8) in a continuous loop so that a log of open files would be captured when these reboots occurred. I have captured 6 of these logs. I don't see anything that jumps out as being a common file problem.

I have placed 6 text files at the URLs below containing only the last 300 lines of each log, which should contain enough info for a comparison. (I let the logs grow to 200MB before restarting the lsof loop each time -- of course these samples are chopped off at the moment of crash/reboot, so each holds the 300 lines before that moment.)

I am at a loss, other than rebuilding the system from scratch, but that is no assurance of a fix.

The one thing unique here is that it is the mail server and runs spamd (spamassassin-2.55), spamass-milter-2.0 (which has a lock file) and procmail-3.22 (which does a lot of locking). I am suspicious of the locking going on with the above spam-fighting programs, which may clash when an SSH login & su occurs.
I believe a lock is required for it too...??

I would appreciate anyone's time and effort to look at these files and see if anything is spotted that I don't see. The most recent is 6-lsof.txt, and they work backwards from there. 6-lsof.txt is from just this morning.

http://sageweb/tmp/1-lsof.txt
http://sageweb/tmp/2-lsof.txt
http://sageweb/tmp/3-lsof.txt
http://sageweb/tmp/4-lsof.txt
http://sageweb/tmp/5-lsof.txt
http://sageweb/tmp/6-lsof.txt

Much obliged!

Best regards,
Jack L. Stone,
Administrator

SageOne Net
http://www.sage-one.net
jackstone@sage-one.net
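
For anyone comparing the six captures, here is a minimal sketch of one way to look for entries common to every log. It is only an illustration and assumes the files hold lsof's default column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME); lines where a column is blank will not split cleanly, so treat the result as a hint rather than proof. The function names open_entries, common_entries and common_in_files are made up for this example.

```python
def open_entries(log_text):
    """Return the set of (command, name) pairs seen in one lsof log excerpt."""
    entries = set()
    for line in log_text.splitlines():
        # Assumption: default lsof layout with nine whitespace-separated columns.
        parts = line.split(None, 8)
        # Skip blank lines and the repeated "COMMAND PID USER ..." header rows
        # that a continuous lsof loop writes at the top of every pass.
        if len(parts) < 2 or parts[0] == "COMMAND":
            continue
        # With all nine columns present, NAME is everything after the eighth
        # field; otherwise fall back to the last token on the line.
        name = parts[8] if len(parts) == 9 else parts[-1]
        entries.add((parts[0], name.rstrip()))
    return entries


def common_entries(logs):
    """Return the (command, name) pairs present in every log, sorted."""
    sets = [open_entries(text) for text in logs]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def common_in_files(paths):
    """Read each log file from disk and return the entries they all share."""
    logs = []
    for path in paths:
        with open(path, errors="replace") as handle:
            logs.append(handle.read())
    return common_entries(logs)
```

Calling common_in_files with the six downloaded file names would list whatever command/file pairs were open in all six captures. Entries that are always open (shared libraries, /dev/null and so on) will show up too, so the interesting part is anything lock-related that appears in every capture.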