Date:      Sun, 07 Sep 2003 10:29:00 -0500
From:      "Jack L. Stone" <jackstone@sage-one.net>
To:        freebsd-questions@freebsd.org
Subject:   Random crash and/or reboots
Message-ID:  <3.0.5.32.20030907102900.01393408@sage-one.net>

Mail server: 4.8-RELEASE-p3

A while back, on a couple of occasions, I posted a query about some bad
behavior on my mail server. For the past several months, it has been either
crashing and rebooting or just rebooting. It's ALWAYS triggered by an SSH
login, but at random and ONLY at the "su" to root. The longest it has gone
between reboots is about 2+ weeks, yet it has also happened twice in a row
right after a reboot, so there is really no pattern. It has never happened
directly at the console.

I have replaced every single piece of hardware, e.g., PSU, cables, NICs,
and finally swapped the whole machine, except for the hard disk that
contains the system. That had to remain in the new machine. Since then, I
have also moved the entire system & contents to another new HD. Thus, I
concluded it to be a software problem.

There are no indications of anything in the logs, and no core dumps. It
just stops and reboots, at any random time it picks. Only a couple of times
has it crashed without a remote login.

One tip was that I might have stale NFS mounttab entries. I cleared them
out, but the problem persisted.

The above tip was suggested when I mentioned that on at least a couple of
the occurrences, I managed to get to the console quickly enough to see (in
bright bold) "lockmgr locking against myself" -- or close to that. Googling
that error turns up some mentions of stale mounts, but mostly esoteric
discussion of code. No fix found anywhere.

Then, on this list, I saw the thread about others having mysterious reboots,
and one suggestion was to run lsof(8) in a continuous loop so that a log
of open files would be captured when these reboots occurred. I have
captured 6 of these logs. I don't see anything that jumps out as being a
common file problem. I have placed 6 text files at the URLs below
containing only 300 lines of each log, which should contain enough info for
a comparison. (I let the logs grow to 200MB before restarting the lsof loop
each time. Of course, these samples are cut off at the moment of the
crash/reboot and hold the last 300 lines logged before that moment.)
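
For anyone wanting to set up the same kind of capture, here is a rough sketch of such a loop in Python 3. It is only an illustration and not the exact loop I ran: the function names, the ".old" rename, the 5-second interval and the "-n" option to lsof are all assumptions chosen for the example. The one point that matters is syncing the log to disk after every pass, so the last snapshot has a chance of surviving a sudden reboot.

```python
import os
import subprocess
import time


def append_snapshot(log_path, command=("lsof", "-n")):
    """Append one timestamped snapshot of the command's output to log_path."""
    result = subprocess.run(list(command), stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, check=False)
    stamp = "=== %s ===\n" % time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "ab") as log:
        log.write(stamp.encode())
        log.write(result.stdout)
        log.flush()
        # Ask the OS to write the data out now; a crash may follow at any time.
        os.fsync(log.fileno())


def rotate_if_large(log_path, max_bytes=200 * 1024 * 1024):
    """Rename the log to log_path + '.old' once it grows past max_bytes."""
    if os.path.exists(log_path) and os.path.getsize(log_path) > max_bytes:
        os.replace(log_path, log_path + ".old")
        return True
    return False


def run_loop(log_path, interval=5.0, iterations=None, command=("lsof", "-n")):
    """Take snapshots repeatedly; iterations=None means run until interrupted."""
    count = 0
    while iterations is None or count < iterations:
        rotate_if_large(log_path)
        append_snapshot(log_path, command)
        time.sleep(interval)
        count += 1
```

Calling run_loop("/var/tmp/lsof.log") (a hypothetical path) would keep appending until the machine goes down; the tail of that file is then the record of what was open just before the reboot.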

I am at a loss, other than rebuilding the system from scratch, but that is
no assurance of a fix. The one thing unique here is that it is the mail
server and runs spamd (spamassassin-2.55), spamass-milter-2.0 (which has a
lock file) and procmail-3.22 (which does a lot of locking).

I am suspicious of the locking going on with the above spam-fight programs,
which may clash when an SSH login & su occurs. I believe a lock is required
for that too...??

Would appreciate anyone's time and effort to look at these files and see
if anything is spotted that I don't see. The most recent is 6-lsof.txt, and
the numbering works backwards from there. 6-lsof.txt was captured just this
morning.

http://sageweb/tmp/1-lsof.txt
http://sageweb/tmp/2-lsof.txt
http://sageweb/tmp/3-lsof.txt
http://sageweb/tmp/4-lsof.txt
http://sageweb/tmp/5-lsof.txt
http://sageweb/tmp/6-lsof.txt
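
If it helps with the comparison, a crude way to look for files that show up in every sample is sketched below in Python 3. It assumes the usual lsof layout where the file name is the last column, and simply takes the last whitespace-separated field of each line, so names containing spaces or trailing annotations would be misread. The function names are made up for the example.

```python
from collections import Counter


def names_in_log(lines):
    """Return the set of last-column (NAME) values from lsof-style lines."""
    names = set()
    for line in lines:
        fields = line.split()
        # Skip blank or one-word lines and the "COMMAND PID USER ..." header rows.
        if len(fields) < 2 or fields[0] == "COMMAND":
            continue
        names.add(fields[-1])
    return names


def common_names(logs):
    """Return, sorted, the names that appear in every one of the given logs."""
    counts = Counter()
    for lines in logs:
        counts.update(names_in_log(lines))
    return sorted(name for name, seen in counts.items() if seen == len(logs))
```

Each element of logs is the list of lines from one sample, e.g. open("6-lsof.txt").read().splitlines(). Anything returned is only a candidate for a closer look, since many files (libraries, /dev entries) are open all the time anyway.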

Much obliged!

Best regards,
Jack L. Stone,
Administrator

SageOne Net
http://www.sage-one.net
jackstone@sage-one.net


