Date: Mon, 24 Oct 2005 13:22:39 -0500
From: Dan Rue <drue@therub.org>
To: Vinod Kashyap <vkashyap@amcc.com>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: twa kernel panic under heavy IO
Message-ID: <20051024182239.GJ38097@therub.org>
In-Reply-To: <2B3B2AA816369A4E87D7BE63EC9D2F26D89125@SDCEXCHANGE01.ad.amcc.com>
References: <2B3B2AA816369A4E87D7BE63EC9D2F26D89125@SDCEXCHANGE01.ad.amcc.com>

On Mon, Oct 24, 2005 at 11:07:28AM -0700, Vinod Kashyap wrote:
> > After going around with 3ware web support, this issue has
> > been concluded, but not resolved.  I tried my 3ware 9500 on
> > FreeBSD 5.3, 5.4, and 5-STABLE.  With all of these versions
> > of OS and driver (I never changed the driver version
> > manually), I received hard lock ups and reboots (though,
> > interestingly, no kernel panics).
> >
> > 3ware had me check and troubleshoot a number of
> > possibilities, until they finally decided it was a hardware
> > problem and issued me a replacement card.  However, in the
> > meantime, I upgraded to FreeBSD
> > 6.0RC1 and the machine is now working flawlessly.  I returned
> > the replacement card unused.
> >
> > I can only conclude that this means that there is a large
> > (timing?) bug in the twa driver in FreeBSD 5.3/5.4/5-STABLE
> > (as opposed to an isolated hardware problem with my setup).
> >
> > I have pasted the full conversation with 3ware on my website
> > for those interested here:
> > http://therub.org/9500.txt (sorry for the poor formatting)
> >
> > At one point, I received the following error message just
> > before the machine locked up:
> >
> > >Oct 12 11:36:13 leopard kernel: initiate_write_filepage: already
> > >started
> >
> > I grepped for that error message in the FreeBSD kernel
> > source, and found it in sys/ufs/ffs/ffs_softdep.c on line
> > 3580.  What makes it really interesting is the comment above
> > where the error is thrown:
> >
> >     if (pagedep->pd_state & IOSTARTED) {
> >         /*
> >          * This can only happen if there is a driver that does not
> >          * understand chaining. Here biodone will reissue the call
> >          * to strategy for the incomplete buffers.
> >          */
> >         printf("initiate_write_filepage: already started\n");
> >         return;
> >     }
> >
> > I know this is a 3ware issue.  I am posting this resolution
> > response here in hopes that it may help someone else that
> > hits this bug - and with the hope that publicly it will get
> > the attention of the 3ware FreeBSD driver team/individual.
> >
>
> The error messages you are seeing are consistent with bad hardware.
> The hardware is becoming unavailable for the driver to talk to it.
> This other message "initiate_write_filepage..." is different but did
> you see the machine hang after this message got printed?  I don't
> think it's related to the hang.
>

The initiate_write_filepage message occurred right before the hang.
Here's the full log from that time:

Oct  6 17:00:32 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb0; Missing bits: [MC_RDY,]
Oct  6 17:00:33 leopard last message repeated 399 times
Oct  6 17:00:36 leopard kernel: ected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
Oct  6 17:00:36 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
Oct  6 17:00:36 leopard last message repeated 296 times
Oct  6 17:01:37 leopard kernel: initiate_write_filepage: already started
Oct  6 17:01:37 leopard last message repeated 83 times
Oct  6 17:01:37 leopard kernel: twa0: ERROR: (0x05: 0x210b): Request timed out!: request = 0xc23fb0a0
Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1108): Resetting controller...:
Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x005e): Cache synchronized after power fail: unit=0
Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
Oct  6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1107): Controller reset done!:

If it's a hardware problem, why would it run fine on 6.0?  The hang was
very easy to trigger, and I've put the 6.0 machine through the gauntlet
trying to recreate the problem.

Thanks for looking into this (again) for me,
Dan