From: Chris
To: FreeBSD Questions
Date: Fri, 10 Nov 2006 09:16:02 -0800
Subject: 6.x hangs on AMD64 again
Message-Id: <83083882-E193-445F-AF3D-E3ECD1E243B1@hughes.net>

I've posted several questions (under two other IDs, though always under the name Chris) since March while trying to put up a Tyan quad dual S4882. I've run it on 6.0 STABLE as of about March, 6.1 RELEASE in several flavors from May through September, and finally 6.2 PRERELEASE as of mid-October. Early on I found issues with transition states on the bge interface and found a memory chip that was marginal, and I have tested repeatedly throughout this period. Every time we place the system back in production, it hangs after 4-7 days of running, with no indication of what the problem might be.

I've tried to think of where the problems could be, and it would seem that 6.x AMD64 exhibits this type of issue for many people who put a server under heavy load. I've seen many unresolved posts here and elsewhere that describe strikingly similar scenarios.

When in full production, it runs 5 websites out of a prefork, non-SSL Apache 2.2.3, with light access to a ports-installed mysql 4.19 via Perl CGI (not mod_perl) and heavy access to Perl-generated and flat HTML archive pages (for the sake of discussion, I just counted 300K page views in one day on one of the sites). This computer does not breathe hard at all; during peak hours top shows it staying at 80+% idle. I've not opened up any service to the point where it could fill the 8 GB of RAM by spawning too many processes. Process count peaks at about 180 because it services the request backlog so quickly. Active memory is usually about 250 MB and inactive varies. The configuration is very simple and it runs nothing else other than rsyncd and sshd.

The hang seems to have nothing to do with peak access times; in fact, it will suddenly hang at our slowest time of the day. I ran for over a month without a hang when the machine was relegated to low-traffic websites.

We've spent a lot to get clean dedicated power and installed a hardware monitoring device to let us see what's going on, but it was no help. The temperature of the computer room is nicely down, given that it's winter here and the facility is kept fairly cold. There is no AC, but the computer room remains at about 70 degrees F.

I'm aware of the warning about running 6.2 PRERELEASE in production, but the symptoms have not differed across any 6.x version, and 6.2 PRERELEASE was the only way to pick up the extensive changes to the bge driver without hacking.

I need opinions on how to debug this, and possibly even on who I should go to and pay to take a closer look at this scenario. Here are the questions and ideas I've thought of. Is there any validity in these, or do you have other ideas?

1. I've wondered if AMD64 SMP was a bad idea.
Should I be using i386 for stability? It's the one thing I've not tried.

2. Should ACPI be off as a precaution, just to rule it out? It's not blacklisted. I had it turned off for a long time when testing, but the results were muddy.

3. Should I reduce the system to 4 GB of RAM to attempt to skirt the issue? Is 6.x less reliable over 4 GB?

4. Where can I find the meanings of all the vmstat -z variables? I'm dumping them to another server every two minutes, giving the percentage change on each sample, but I'm unsure whether I can correlate this to anything meaningful without good definitions. I've just started this but will need information.

5. Does mysql use linux threads, and could that be the mistake that's taking us out?

Even wild goose chases will be welcome at this point ;-).

Thanks,
Chris Pratt
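
For reference, the sampling described in question 4 could be sketched roughly as below. This is only an illustration, not the script actually in use: the function names are hypothetical, the sample text is made up, and the column layout (a header line starting with ITEM, then "zone: value, value, ..." rows) is an assumption about what vmstat -z prints on this system.

```python
def parse_vmstat_z(text):
    """Parse vmstat -z style text into {zone_name: {column_name: int}}.

    Assumes a header line beginning with ITEM and rows of the form
    'zone name: n, n, n, ...'. Lines that do not fit are skipped.
    """
    zones = {}
    columns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.upper().startswith("ITEM"):
            columns = line.split()[1:]  # column names after ITEM
            continue
        name, sep, rest = line.rpartition(":")
        if not sep:
            continue
        try:
            values = [int(v) for v in rest.replace(",", " ").split()]
        except ValueError:
            continue
        # Fall back to positional names if no header was seen.
        names = columns or ["col%d" % i for i in range(len(values))]
        zones[name.strip()] = dict(zip(names, values))
    return zones


def percent_change(prev, curr, column="USED"):
    """Percentage change of one column between two parsed samples.

    Zones missing from either sample are skipped. A zone that grows
    from zero has no defined percentage and is reported as None.
    """
    changes = {}
    for zone, fields in curr.items():
        old = prev.get(zone, {}).get(column)
        new = fields.get(column)
        if old is None or new is None:
            continue
        if old == 0:
            changes[zone] = None if new else 0.0
        else:
            changes[zone] = 100.0 * (new - old) / old
    return changes


# Made-up samples, two minutes apart, for illustration only.
SAMPLE_A = """\
ITEM            SIZE     LIMIT      USED      FREE  REQUESTS
UMA Kegs:        240,        0,       70,        5,       70
mbuf:            256,        0,      400,      100,    90000
"""
SAMPLE_B = """\
ITEM            SIZE     LIMIT      USED      FREE  REQUESTS
UMA Kegs:        240,        0,       70,        5,       70
mbuf:            256,        0,      500,       50,    95000
"""

if __name__ == "__main__":
    delta = percent_change(parse_vmstat_z(SAMPLE_A), parse_vmstat_z(SAMPLE_B))
    for zone, pct in sorted(delta.items()):
        print(zone, pct)
```

A zone whose USED count climbs steadily from sample to sample right up to a hang would be the kind of pattern worth chasing; a flat series would at least rule that zone out.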