From owner-freebsd-stable@FreeBSD.ORG  Thu Nov 24 23:14:02 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1E6D416A937
	for <freebsd-stable@freebsd.org>; Thu, 24 Nov 2005 23:14:00 +0000 (GMT)
	(envelope-from dan@syz.com)
Received: from mail.clearwave.ca (h139-142-194-114.gtcust.grouptelecom.net
	[139.142.194.114])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E57D9459B9
	for <freebsd-stable@freebsd.org>; Thu, 24 Nov 2005 21:36:11 +0000 (GMT)
	(envelope-from dan@syz.com)
Received: from localhost (localhost.clearwave.ca [127.0.0.1])
	by mail.clearwave.ca (Postfix) with ESMTP id 9DF821037A33
	for <freebsd-stable@freebsd.org>; Thu, 24 Nov 2005 14:35:58 -0700 (MST)
Received: from mail.clearwave.ca ([127.0.0.1])
	by localhost (mail.clearwave.ca [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 73943-05 for <freebsd-stable@freebsd.org>;
	Thu, 24 Nov 2005 14:35:49 -0700 (MST)
Received: from [192.168.2.108] (h139-142-196-33.gtcust.grouptelecom.net
	[139.142.196.33])
	by mail.clearwave.ca (Postfix) with ESMTP id E3C661037A2F
	for <freebsd-stable@freebsd.org>; Thu, 24 Nov 2005 14:35:49 -0700 (MST)
Mime-Version: 1.0 (Apple Message framework v746.2)
Content-Transfer-Encoding: 7bit
Message-Id: <83BB6E6C-AF9B-4B1F-9D89-C170E98AECFF@syz.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
To: freebsd-stable@freebsd.org
From: Dan Charrois <dan@syz.com>
Date: Thu, 24 Nov 2005 14:36:01 -0700
X-Mailer: Apple Mail (2.746.2)
X-Virus-Scanned: amavisd-new at clearwave.ca
Subject: Re: FreeBSD unstable on Dell 1750 using SMP?
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Nov 2005 23:14:02 -0000

Hi Kris, Rutger, and others who have commented on this thread.

I'm happy to hear that I'm not the only one experiencing problems  
like this.  I posted a similar question a month or so ago about a  
PowerEdge 2850 using SMP (dual Xeons) and never received any  
responses that helped solve the problem, or even any indication that  
others had the same problem.  As you know, troubleshooting this is
quite difficult, since it can take weeks for the machine to go down,
and then the "auto-reboot" leaves no clues in the log file as to why.
The machine has simply started again as if someone had pulled the
plug on it.  I've been pulling my hair out.

My machine crashed twice in the last month or so, within two weeks of
each other.  Both times, it happened just as cron was about to run
the mysqlhotcopy script that backs up some SQL databases hosted on
that machine, so I thought it might have something to do with that.
(The job was running from root's crontab, so I figured some bug in it
might have caused things to go weird.)  I changed it to run under a
less privileged user and the machine hasn't died for about 2 1/2
weeks.  But that's hardly conclusive evidence that the problem is
solved - it's probably planning on surviving just long enough to fail
at the point I need it most.  It sounds as though memory buffer
allocations are going wacky or something, in which case anything
could take it down given the wrong combination of events.

In any case, we're running the amd64 version of FreeBSD
5.4-RELEASE-p6 (FreeBSD 5.4-RELEASE-p6 #3: Fri Aug  5 18:18:10 MDT 2005).

A netstat -m (which I'd never tried before) yields:

18446744073709551402 mbufs in use
49/25600 mbuf clusters in use (current/max)
0/0/0 sfbufs in use (current/peak/max)
44 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
884 calls to protocol drain routines

Obviously, the count of mbufs in use on that machine is way out to
lunch.  And interestingly, my maximum of 25600 mbuf clusters is
identical to the figure in the other netstat -m reports from people
having this problem.
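
As a side note on that first figure: 18446744073709551402 is only 214
short of 2^64, so it is what a small negative count would look like if
printed as an unsigned 64-bit number.  That reading is an assumption
based on the arithmetic alone, not something confirmed from the kernel
or netstat source.  A minimal Python sketch of the conversion (the
helper name as_signed64 is made up for illustration):

```python
# Reinterpret the "mbufs in use" figure as a signed 64-bit integer.
# Assumption: the counter is a 64-bit value that went negative and was
# printed as unsigned; this is not confirmed from the FreeBSD source.

REPORTED = 18446744073709551402  # value shown by netstat -m on the amd64 box


def as_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit value as two's-complement signed."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


print(as_signed64(REPORTED))  # -214
```

If that reading is right, the counter has been decremented 214 more
times than it was incremented, which would fit a statistics-counting
problem rather than the machine really holding that many mbufs.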

Another machine (an older single-CPU Dell) on which I'm running the
i386 version of FreeBSD 5.4-RELEASE-p5 (FreeBSD 5.4-RELEASE-p5 #1: Thu
Jul 21 22:30:46 MDT 2005) has a more sane netstat -m:

130 mbufs in use
128/8896 mbuf clusters in use (current/max)
0/177/2480 sfbufs in use (current/peak/max)
288 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
208493 requests for I/O initiated by sendfile
26697 calls to protocol drain routines

But here's about where any troubleshooting on my own reaches its
limit.  I noticed that Kris mentioned it was a known problem in the
stats counting for SMP machines and had been fixed, but I haven't
been able to find a reference to that fix, or any indication of how
to apply it.  Was the bug purely an accounting error in what netstat
reports, or is it something that could have taken down the machine
as has been happening?

If switching to single-CPU mode works, it's good to hear that I have
an option if things continue to act up.  But I'd really rather not
have to "dumb down" the machine to one CPU when there is the
potential of two.  Most of the time it's not under a huge load, but
periodically there are massive spikes, and that's where having two
CPUs really helps.

If anyone can shed further light on a fix for this problem, it would  
be greatly appreciated!

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213