From owner-freebsd-smp@FreeBSD.ORG Wed Oct 29 16:34:03 2008
Date: Wed, 29 Oct 2008 17:07:28 +0100
From: Oliver Lehmann
To: stable@freebsd.org
Cc: smp@freebsd.org
Message-Id: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
Subject: 3Ware 9000 series hangs under load

Hi,

I have problems with my 3ware controller. Under heavy I/O load (e.g.
running 40 port builds over the course of a day with tinderbox, which
involves untarring a whole FreeBSD tree 40 times), my system hangs with
the well-known

swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096

error.
I've opened a ticket at 3ware and, after half a month of dummy tests
(are your drives fine, can you run a stress test), it looks like I was
redirected to someone from 2nd-level support, who told me:

  There are 2 things that you can try,
  1, disable apic in your bootloader.conf file, or RMA the controller.

  The error that you have is generally caused by an interrupt problem,
  defective backplane, bad drive or bad controller.

and after I told him that I intend to use the 2 CPUs I have and not
fall back to one CPU forever, he responded:

  Yes I do understand about disabling APIC, but the feature is sometimes
  not stable in all dual proc systems. There are many variables, the
  CPU's have to be matched down to the Lot #, the motherboard must have a
  good design and the kernel supporting APIC must be stable. But, it is a
  good test to see if it is software or hardware.

So what I did now was compile a kernel w/o APIC/SMP, and I have been
running this configuration for 3 days, stressing the system, w/o running
into the swap_pager problem. Can it still be a controller problem, or is
it more likely a problem with FreeBSD's SMP/APIC implementation or the
board I'm using (Intel L440GX)?

I'm asking because I'm not sure which problem it is now, and before
telling 3ware and having them respond "ok it is a FreeBSD problem" or
"ok it is a board problem", I'd like to know what can be the case here.
(please keep me CCed, I'm not subscribed to smp@)

Further information (and the history) on this topic can be found here
(and following):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html

--
Oliver Lehmann
http://www.pofo.de/
http://wishlist.ans-netz.de/

From owner-freebsd-smp@FreeBSD.ORG Thu Oct 30 13:27:15 2008
Date: Thu, 30 Oct 2008 09:27:02 -0400
From: John Baldwin
To: freebsd-smp@freebsd.org
Cc: freebsd-stable@freebsd.org, Oliver Lehmann
Message-Id: <200810300927.03620.jhb@freebsd.org>
In-Reply-To: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
References: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
Subject: Re: 3Ware 9000 series hangs under load

On Wednesday 29 October 2008 12:07:28 pm Oliver Lehmann wrote:
> Hi,
>
> I have problems with my 3ware controller. Under heavy I/O load (e.g.
> running 40 port builds over the course of a day with tinderbox, which
> involves untarring a whole FreeBSD tree 40 times), my system hangs with
> the well-known
>
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
>
> error. I've opened a ticket at 3ware and, after half a month of
> dummy tests (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from 2nd-level support, who
> told me:
>
>   There are 2 things that you can try,
>   1, disable apic in your bootloader.conf file, or RMA the controller.
>
>   The error that you have is generally caused by an interrupt problem,
>   defective backplane, bad drive or bad controller.
>
> and after I told him that I intend to use the 2 CPUs I have and not
> fall back to one CPU forever, he responded:
>
>   Yes I do understand about disabling APIC, but the feature is sometimes
>   not stable in all dual proc systems. There are many variables, the
>   CPU's have to be matched down to the Lot #, the motherboard must have a
>   good design and the kernel supporting APIC must be stable. But, it is a
>   good test to see if it is software or hardware.
>
> So what I did now was compile a kernel w/o APIC/SMP, and I have been
> running this configuration for 3 days, stressing the system, w/o running
> into the swap_pager problem. Can it still be a controller problem, or is
> it more likely a problem with FreeBSD's SMP/APIC implementation or the
> board I'm using (Intel L440GX)?
>
> I'm asking because I'm not sure which problem it is now, and before
> telling 3ware and having them respond "ok it is a FreeBSD problem" or
> "ok it is a board problem", I'd like to know what can be the case here.
>
> (please keep me CCed, I'm not subscribed to smp@)
>
> Further information (and the history) on this topic can be found here
> (and following):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html

FYI, you can disable APIC support w/o recompiling your kernel. Just set
'hint.apic.0.disabled=1' in the loader. If the problem is that the card
stops triggering interrupts after being up for a while, then it is likely
not a FreeBSD bug. If FreeBSD doesn't get the interrupt routing and setup
correct, then the card will not work at all, starting at boot.

You can also try just disabling SMP while leaving APIC enabled by setting
'kern.smp.disabled=1' from the loader. If that fixes the issue, then it
may be that the 3ware driver simply has a race condition that is more
easily triggered on SMP boxes.
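
The two loader settings mentioned above can also be kept across reboots
in /boot/loader.conf. A minimal sketch, assuming the stock loader.conf
mechanism and that only one setting is active at a time so the two cases
stay distinguishable (the comments and layout are illustrative, not
taken from this thread):

```conf
# /boot/loader.conf (sketch; enable one test at a time)

# Test 1: disable the APIC entirely. SMP depends on the APIC, so this
# should also leave only one CPU running.
hint.apic.0.disabled="1"

# Test 2: keep the APIC enabled but run on a single CPU.
# Uncomment this line and comment out the one above to try it.
#kern.smp.disabled="1"
```

For a one-off test, the same values can be typed at the loader prompt
instead, e.g. `set hint.apic.0.disabled=1` followed by `boot`.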
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Thu Oct 30 16:21:14 2008
Date: Thu, 30 Oct 2008 17:21:30 +0100
From: Oliver Lehmann
To: Scott Long
Cc: stable@freebsd.org, smp@freebsd.org
Message-Id: <20081030172130.c163755e.lehmann@ans-netz.de>
In-Reply-To: <4909DA89.9060804@samsco.org>
References: <20081029170728.be7cc7ab.lehmann@ans-netz.de> <4909DA89.9060804@samsco.org>
Subject: Re: 3Ware 9000 series hangs under load

Scott Long wrote:

> or put a spare ATA
> drive in the chassis and set it up as a dump partition, then get a
> crashdump of the system when it gets into this state.

The system does not panic by itself, so some time ago I tried debugging
with KDB by panicking it by hand after it got stuck again.
Here is what I did back then (but I guess this doesn't tell much):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-October/045578.html

--
Oliver Lehmann
http://www.pofo.de/
http://wishlist.ans-netz.de/

From owner-freebsd-smp@FreeBSD.ORG Thu Oct 30 16:34:19 2008
Date: Thu, 30 Oct 2008 10:02:17 -0600
From: Scott Long
To: Oliver Lehmann
Cc: stable@freebsd.org, smp@freebsd.org
Message-ID: <4909DA89.9060804@samsco.org>
In-Reply-To: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
References: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
Subject: Re: 3Ware 9000 series hangs under load

Oliver Lehmann wrote:
> Hi,
>
> I have problems with my 3ware controller. Under heavy I/O load (e.g.
> running 40 port builds over the course of a day with tinderbox, which
> involves untarring a whole FreeBSD tree 40 times), my system hangs with
> the well-known
>
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
>
> error. I've opened a ticket at 3ware and, after half a month of
> dummy tests (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from 2nd-level support, who
> told me:
>
>   There are 2 things that you can try,
>   1, disable apic in your bootloader.conf file, or RMA the controller.
>
>   The error that you have is generally caused by an interrupt problem,
>   defective backplane, bad drive or bad controller.
>
> and after I told him that I intend to use the 2 CPUs I have and not
> fall back to one CPU forever, he responded:
>
>   Yes I do understand about disabling APIC, but the feature is sometimes
>   not stable in all dual proc systems. There are many variables, the
>   CPU's have to be matched down to the Lot #, the motherboard must have a
>   good design and the kernel supporting APIC must be stable. But, it is a
>   good test to see if it is software or hardware.
>
> So what I did now was compile a kernel w/o APIC/SMP, and I have been
> running this configuration for 3 days, stressing the system, w/o running
> into the swap_pager problem. Can it still be a controller problem, or is
> it more likely a problem with FreeBSD's SMP/APIC implementation or the
> board I'm using (Intel L440GX)?
>
> I'm asking because I'm not sure which problem it is now, and before
> telling 3ware and having them respond "ok it is a FreeBSD problem" or
> "ok it is a board problem", I'd like to know what can be the case here.
>
> (please keep me CCed, I'm not subscribed to smp@)
>
> Further information (and the history) on this topic can be found here
> (and following):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html

The probability that it's a problem in the generic interrupt/APIC code in
FreeBSD is low. That code has matured quite well over the last 5 years,
and it is very solid for just about every other hardware configuration
out there. I'd suspect the following things in the following order:

1. Driver bug. The driver might be losing an interrupt, or might be
   deadlocking due to coding/design problems.
2. Defective controller.
3. Buggy firmware on the controller. FreeBSD does tend to push I/O
   controllers a lot harder than other OSes, resulting in strange bugs
   sometimes being found.
4. Defective motherboard.

The fact that it's running fine with SMP/APIC disabled could easily mean
that it's not taking as high a load, and is thus avoiding problems. It
could also mean that latent bugs in the driver are not being exposed. I
don't have a lot of time to spend debugging this, but I'd suggest that
you either take up AMCC's offer to RMA the board, or put a spare ATA
drive in the chassis and set it up as a dump partition, then get a
crashdump of the system when it gets into this state.

Scott
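
The dump-partition suggestion above could look roughly like the following
in /etc/rc.conf. This is only a sketch under assumptions: the device name
/dev/ad4s1b is hypothetical and stands for a swap-type partition on the
spare ATA drive, large enough to hold the dump; dumpdev and dumpdir are
the standard rc.conf variables for crash dumps.

```conf
# /etc/rc.conf (sketch; /dev/ad4s1b is a hypothetical partition on the
# spare ATA drive, not a device named anywhere in this thread)
dumpdev="/dev/ad4s1b"     # where the kernel writes the dump on panic
dumpdir="/var/crash"      # where savecore(8) stores it on the next boot
```

Keeping the dump device off the 3ware array matters here, since the array
is presumably what stops responding when the hang occurs. Once the machine
is stuck, a panic forced from the debugger should write the dump to that
partition, and savecore(8) should pick it up on the next boot for
inspection with kgdb(1).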