Date: Mon, 24 Oct 2005 11:32:48 -0700
From: "Vinod Kashyap" <vkashyap@amcc.com>
To: "Dan Rue" <drue@therub.org>
Cc: freebsd-stable@FreeBSD.org
Subject: RE: twa kernel panic under heavy IO
Message-ID: <2B3B2AA816369A4E87D7BE63EC9D2F26D89149@SDCEXCHANGE01.ad.amcc.com>

> -----Original Message-----
> From: Dan Rue [mailto:drue@therub.org]
> Sent: Monday, October 24, 2005 11:23 AM
> To: Vinod Kashyap
> Cc: freebsd-stable@FreeBSD.org
> Subject: Re: twa kernel panic under heavy IO
>
> On Mon, Oct 24, 2005 at 11:07:28AM -0700, Vinod Kashyap wrote:
> > > After going around with 3ware web support, this issue has been
> > > concluded, but not resolved. I tried my 3ware 9500 on FreeBSD 5.3,
> > > 5.4, and 5-STABLE. With all of these versions of OS and driver (I
> > > never changed the driver version manually), I received hard lock ups
> > > and reboots (though, interestingly, no kernel panics).
> > >
> > > 3ware had me check and troubleshoot a number of possibilities, until
> > > they finally decided it was a hardware problem and issued me a
> > > replacement card. However, in the meantime, I upgraded to FreeBSD
> > > 6.0RC1 and the machine is now working flawlessly. I returned the
> > > replacement card unused.
> > >
> > > I can only conclude that this means that there is a large
> > > (timing?) bug in the twa driver in FreeBSD 5.3/5.4/5-STABLE (as
> > > opposed to an isolated hardware problem with my setup).
> > >
> > > I have pasted the full conversation with 3ware on my website for
> > > those interested here:
> > > http://therub.org/9500.txt (sorry for the poor formatting)
> > >
> > > At one point, I received the following error message just before the
> > > machine locked up:
> > >
> > > >Oct 12 11:36:13 leopard kernel: initiate_write_filepage: already
> > > >started
> > >
> > > I grepped for that error message in the FreeBSD kernel source, and
> > > found it in sys/ufs/ffs/ffs_softdep.c on line 3580. What makes it
> > > really interesting is the comment above where the error is thrown:
> > >
> > > if (pagedep->pd_state & IOSTARTED) {
> > >         /*
> > >          * This can only happen if there is a driver that does not
> > >          * understand chaining. Here biodone will reissue the call
> > >          * to strategy for the incomplete buffers.
> > >          */
> > >         printf("initiate_write_filepage: already started\n");
> > >         return;
> > > }
> > >
> > > I know this is a 3ware issue. I am posting this resolution response
> > > here in hopes that it may help someone else that hits this bug - and
> > > with the hope that publicly it will get the attention of the 3ware
> > > FreeBSD driver team/individual.
> > >
> >
> > The error messages you are seeing are consistent with bad hardware.
> > The hardware is becoming unavailable for the driver to talk to it.
> > This other message "initiate_write_filepage..." is different but did
> > you see the machine hang after this message got printed? I don't
> > think it's related to the hang.
> >
>
> The initiate_write_filepage occurred right before the hang.
> Here's the full log from that time:
>
> Oct 6 17:00:32 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb0; Missing bits: [MC_RDY,]
> Oct 6 17:00:33 leopard last message repeated 399 times
> Oct 6 17:00:36 leopard kernel: ected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct 6 17:00:36 leopard kernel: twa0: ERROR: (0x16: 0x1301): Missing expected status bit(s): status reg = 0x15025bb2; Missing bits: [MC_RDY,]
> Oct 6 17:00:36 leopard last message repeated 296 times
> Oct 6 17:01:37 leopard kernel: initiate_write_filepage: already started
> Oct 6 17:01:37 leopard last message repeated 83 times
> Oct 6 17:01:37 leopard kernel: twa0: ERROR: (0x05: 0x210b): Request timed out!: request = 0xc23fb0a0
> Oct 6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1108): Resetting controller...:
> Oct 6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x005e): Cache synchronized after power fail: unit=0
> Oct 6 17:01:37 leopard kernel: twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
> Oct 6 17:01:37 leopard kernel: twa0: INFO: (0x16: 0x1107): Controller reset done!:
>

Ok, that message is preceded by those same messages that indicate
that the hardware became unavailable. So, that message seems to have
been the result of the same hardware issue I mentioned.

>
> If it's a hardware problem, why would it run fine on 6.0?
> The hang was very easy to trigger, and I've put the 6.0
> machine through the gauntlet trying to recreate the problem.
>

That's a valid question. It could be only a matter of time...

> Thanks for looking into this (again) for me,
> Dan
>
--------------------------------------------------------
CONFIDENTIALITY NOTICE: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and contains
information that is confidential and proprietary to Applied Micro
Circuits Corporation or its subsidiaries. It is to be used solely for
the purpose of furthering the parties' business relationship. All
unauthorized review, use, disclosure or distribution is prohibited. If
you are not the intended recipient, please contact the sender by reply
e-mail and destroy all copies of the original message.
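
For readers following the log, the "Missing expected status bit(s)" lines suggest that the driver compares the controller's status register against a mask of bits it expects to be set and reports whichever ones are absent. The following is only a minimal sketch of that kind of check, not the twa driver's actual code: the function names are hypothetical, and the bit position used for MC_RDY (bit 13, 0x2000) is an assumption made for illustration that should be checked against the driver headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * ASSUMPTION: illustrative mask for the "microcontroller ready" (MC_RDY)
 * status bit. The real value must be taken from the driver headers.
 */
#define EXAMPLE_STATUS_MC_RDY 0x00002000u

/* Return the subset of 'expected' bits that are NOT set in 'status_reg'. */
uint32_t
missing_status_bits(uint32_t status_reg, uint32_t expected)
{
	return (expected & ~status_reg);
}

/*
 * Write a short description of the missing bits into 'buf'.
 * Returns 1 if any expected bit is missing, 0 otherwise
 * (in which case 'buf' is set to the empty string).
 */
int
format_missing_bits(uint32_t status_reg, uint32_t expected,
    char *buf, size_t len)
{
	uint32_t missing = missing_status_bits(status_reg, expected);

	assert(buf != NULL && len > 0);
	if (missing == 0) {
		buf[0] = '\0';
		return (0);
	}
	snprintf(buf, len, "status reg = 0x%08x; missing mask = 0x%08x",
	    (unsigned int)status_reg, (unsigned int)missing);
	return (1);
}
```

Under the assumed mask, calling missing_status_bits(0x15025bb0, EXAMPLE_STATUS_MC_RDY) with the first status value from the log returns a non-zero result, meaning the expected bit is absent from that register value. Whether that reflects failing hardware or a driver timing problem is exactly the open question in this thread; the check itself cannot distinguish the two.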