Message-ID: <496BA9AE.10801@max.af.czu.cz>
Date: Mon, 12 Jan 2009 21:35:58 +0100
From: Tomas Randa
To: Garance A Drosihn
Cc: freebsd-stable@FreeBSD.org, Robert Watson
References: <042FE04A-2F8D-47DD-8454-7BBA3791D7A8@inoc.net>
Subject: Re: Big problems with 7.1 locking up :-(

Hello,

I have similar problems. The last "good" kernel I have is from the
stable branch as of October 8. After the next upgrade I saw big
performance problems. I tried ULE, 4BSD, etc., but nothing helped except
downgrading the system. Now I am trying 7.1-p1 and the problems are back
again. MySQL spends a lot of time waiting with the status "waiting for
opening table" or "waiting for close tables".

I have 32-bit FreeBSD with PAE, 1x Xeon 5420, a Supermicro motherboard,
and an Areca SATA controller.
Could the problem be in the "da" device, for example?

Thanks
Tomas Randa

Garance A Drosihn wrote:
> At 2:55 PM +0000 1/12/09, Robert Watson wrote:
>> On Fri, 9 Jan 2009, Garance A Drosihn wrote:
>>
>>> At 2:39 PM -0500 1/9/09, Robert Blayzor wrote:
>>>> On Jan 8, 2009, at 8:58 PM, Pete French wrote:
>>>>> I have a number of HP 1U servers, all of which were running 7.0
>>>>> perfectly happily. I have been testing 7.1 in its various
>>>>> incarnations for the last couple of months on our test server and
>>>>> it has performed perfectly.
>>>>
>>>> I noticed a problem with 7.0 on a couple of Dell servers. [...]
>>>> We've since then compiled the kernel under the BSD scheduler to
>>>> rule that out, and so far so good.
>>>>
>>>> Since ULE is now default in 7.1 and not in 7.0, perhaps you can try
>>>> that?
>>>
>>> FWIW, the other guy I know who is having this problem had already
>>> switched to using ULE under 7.0-release, and did not have any
>>> problems with it. So *his* problem was probably not related to
>>> SCHED_ULE, unless something has recently changed there.
>>>
>>> Turns out he hasn't reverted back to 7.0-release just yet, so he's
>>> going to try SCHED_4BSD and see if that helps his situation.
>>
>> Scheduler changes always come with some risk of exposing bugs that
>> have existed in the code for a long time but never really manifested
>> themselves. ULE is well shaken-out, having been under development for
>> at least five years, but it is possible that some problems will
>> become visible as a result of the switch. I would encourage people
>> to stick with ULE, but if you're having a stability problem then
>> experimenting with scheduler as a variable that could be triggering
>> the problem may well be useful to help track down the bug.
>
> Just to follow up on this: My friend did switch back to a 7.1 kernel with
> SCHED_4BSD, and he still ran into problems.
> The error messages weren't the same, but errors did happen in the
> same high disk-I/O situations as the lockup happened with SCHED_ULE.
> At this point he's fallen back to the 7.0-kernel that he had been
> running (which also has SCHED_ULE), and all the problems have gone
> away. So at the moment he's running with a 7.0-ish kernel and the
> 7.1-release userland, without the hanging problems.
>
> So the problem is something in the kernel, but it is *NOT* the scheduler
> (at least, not in his case).
>
> He is not eager to do a whole lot of experiments to track down the
> problem, since this is happening on busy production machines and he
> can't afford to have a lot of downtime on them (especially now that the
> semester at RPI has started up). The systems have some large (2 TB)
> filesystems on them, and the lockups occur in high disk-I/O situations.
> He's seeing the problem on one system which is a dual CPU quad-core
> xeon, and another which is a 64 bit P4 with hyperthreading. The one
> thing in common between the two setups is that the boot drives + a
> 3ware controller (with its array of RAID disks) is moved from one
> machine to the other one:
>
> "its a 3ware 9500 12 port model, the boot drive is connected to
> an ICH6 in IDE mode, and yes, I've run it in single, single with
> hyper threading, and 8 way mode. All 64 bit."
>
> We still have no idea where the problem really is. For all we know,
> someone spilled a Pepsi on it when he wasn't looking...