From: Scott Long
Date: Thu, 30 Oct 2008 10:02:17 -0600
To: Oliver Lehmann
Cc: stable@freebsd.org, smp@freebsd.org
Subject: Re: 3Ware 9000 series hangs under load
Message-ID: <4909DA89.9060804@samsco.org>
In-Reply-To: <20081029170728.be7cc7ab.lehmann@ans-netz.de>

Oliver Lehmann wrote:
> Hi,
>
> I've problems with my 3ware controller. Having heavy I/O load (e.g.
> running 40 port builds the day over with tinderbox which involves
> un-taring a whole FreeBSD tree 40 times), my system hangs with the well
> known
>
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
>
> error. I've opened a ticket at 3ware and after half a month of
> dummy-testings (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from the 2nd lvl support and he
> told me:
>
> There are 2 things that you can try,
> 1, disable apic in your bootloader.conf file, or RMA the controller.
>
> The error that you have is generally caused by an interrupt problem,
> defective backplane, bad drive or bad controller.
>
> and after I told him that I intend to use the 2 CPUs I have and not
> falling back to one CPU for ever he responded:
>
> Yes I do understand about disabling APIC, but the feature is sometimes
> not stable in all dual proc systems. There are many variables, the
> CPU's have to be matched down to the Lot #, the motherboard must have a
> good design and the kernel supporting APIC must be stable. But, it is a
> good test to see if it is software or hardware.
>
> So what I did now, was compiling a kernel w/o apic/smp and I'm running
> this configuration now for 3 days stressing the system w/o running into
> the swap_pager problem. Can it be still a controller problem or is it
> more likely a problem of FreeBSD's smp/apic implementation or the board
> I'm using (Intel L440GX).
>
> I'm asking because I'm not sure which problem it is now and before
> telling it 3ware and having them responding "ok it is a FreeBSD problem"
> or "ok it is a board problem" I'd like to know what can be the case here.
>
> (please keep me CCed, I'm not subscribed to smp@)
>
> Further information (and the history) on this topic can be found here
> (and following):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html
>
>

The probability that it's a problem in the generic interrupt/APIC code
in FreeBSD is low.  That code has matured quite well over the last 5
years, and it is very solid for just about every other hardware
configuration out there.  I'd suspect the following things in the
following order:

1. Driver bug.  Driver might be losing an interrupt, or might be
   deadlocking due to coding/design problems.
2. Defective controller.
3. Buggy firmware on the controller.  FreeBSD does tend to push I/O
   controllers a lot harder than other OSes, resulting in strange bugs
   sometimes being found.
4. Defective motherboard.

The fact that it's running fine with SMP/APIC disabled could easily mean
that it's not taking as high a load, and is thus avoiding problems.
It could also mean that latent bugs in the driver are not being exposed.
I don't have a lot of time to spend debugging this, but I'd suggest that
you either take up AMCC's offer to RMA the board, or put a spare ATA
drive in the chassis and set it up as a dump partition, then get a
crashdump of the system when it gets into this state.

Scott
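
For the dump-partition route, a minimal sketch of the /etc/rc.conf side
follows.  It assumes the spare ATA disk shows up as ad1 and carries a
swap-type partition ad1s1b large enough to hold the dump (sizing it to at
least the amount of RAM is the conservative choice).  The device name and
the crash directory are illustrative assumptions, not details from this
thread.

```sh
# /etc/rc.conf fragment (sh syntax).
# ASSUMPTION: ad1s1b is the partition on the spare ATA drive; substitute
# whatever device the spare disk actually gets on your system.
dumpdev="/dev/ad1s1b"   # where the kernel writes the dump
dumpdir="/var/crash"    # where savecore(8) copies it on the next boot

# To arm the dump device without rebooting (as root):
#   dumpon /dev/ad1s1b
#
# After the next boot, savecore should leave a vmcore.N file in
# /var/crash, which can then be opened with:
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0
```

Because the machine hangs rather than panics, the dump will probably have
to be forced by hand from the kernel debugger, which in turn assumes a
kernel built with debugger support.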