Date: Thu, 30 Oct 2008 10:02:17 -0600
From: Scott Long <scottl@samsco.org>
To: Oliver Lehmann <lehmann@ans-netz.de>
Cc: stable@freebsd.org, smp@freebsd.org
Subject: Re: 3Ware 9000 series hangs under load
Message-ID: <4909DA89.9060804@samsco.org>
In-Reply-To: <20081029170728.be7cc7ab.lehmann@ans-netz.de>
References: <20081029170728.be7cc7ab.lehmann@ans-netz.de>

Oliver Lehmann wrote:
> Hi,
>
> I've problems with my 3ware controller. Having heavy I/O load (e.g.
> running 40 port builds the day over with tinderbox which involves
> un-taring a whole FreeBSD tree 40 times), my system hangs with the well
> known
>
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096
>
> error. I've opened a ticket at 3ware and after half a month of
> dummy-testings (are your drives fine, can you run a stress test), it
> looks like I was redirected to someone from the 2nd lvl support and he
> told me:
>
> There are 2 things that you can try,
> 1, disable apic in your bootloader.conf file, or RMA the controller.
>
> The error that you have is generally caused by an interrupt problem,
> defective backplane, bad drive or bad controller.
>
> and after I told him that I intend to use the 2 CPUs I have and not
> falling back to one CPU for ever he responded:
>
> Yes I do understand about disabling APIC, but the feature is sometimes
> not stable in all dual proc systems. There are many variables, the
> CPU's have to be matched down to the Lot #, the motherboard must have a
> good design and the kernel supporting APIC must be stable. But, it is a
> good test to see if it is software or hardware.
>
> So what I did now, was compiling a kernel w/o apic/smp and I'm running
> this configuration now for 3 days stressing the system w/o running into
> the swap_pager problem. Can it be still a controller problem or is it
> more likely a problem of FreeBSD's smp/apic implementation or the board
> I'm using (Intel L440GX).
>
> I'm asking because I'm not sure which problem it is now and before
> telling it 3ware and having them responding "ok it is a FreeBSD problem"
> or "ok it is a board problem" I'd like to know what can be the case here.
>
> (please keep me CCed, I'm not subscribed to smp@)
>
> Further information (and the history) on this topic can be found here
> (and following):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045500.html
>

The probability that it's a problem in the generic interrupt/APIC code
in FreeBSD is low. That code has matured quite well over the last 5
years, and it is very solid for just about every other hardware
configuration out there. I'd suspect the following things in the
following order:

1. Driver bug. The driver might be losing an interrupt, or might be
   deadlocking due to coding/design problems.
2. Defective controller.
3. Buggy firmware on the controller. FreeBSD does tend to push I/O
   controllers a lot harder than other OS's, resulting in strange bugs
   sometimes being found.
4. Defective motherboard.

The fact that it's running fine with SMP/APIC disabled could easily mean
that it's not taking as high a load, and is thus avoiding problems.
It could also mean that latent bugs in the driver are not being exposed.

I don't have a lot of time to spend debugging this, but I'd suggest that
you either take up AMCC's offer to RMA the board, or put a spare ATA
drive in the chassis and set it up as a dump partition, then get a
crashdump of the system when it gets into this state.

Scott
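
For reference, a minimal sketch of the dump setup suggested above. It
assumes the spare ATA drive shows up as ad4 and carries a swap-type
partition ad4s1b at least as large as RAM; both names are hypothetical,
so check dmesg for the real device.

```sh
# /etc/rc.conf entries (device name is an assumption; substitute your own)
dumpdev="/dev/ad4s1b"   # partition that receives the kernel dump
dumpdir="/var/crash"    # where savecore(8) stores vmcore.N on the next boot

# To enable the dump device right away, without rebooting, run as root:
dumpon /dev/ad4s1b

# After the crash and reboot, recover the dump by hand if the rc
# scripts did not already do so:
savecore /var/crash /dev/ad4s1b
```

Since the symptom here is a hang rather than a panic, the dump would
probably have to be triggered by hand from the kernel debugger, which
assumes a kernel built with options KDB and DDB.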