Date: Mon, 29 Mar 2010 12:07:51 +0200
From: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To: Alexander Motin <mav@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID: <4BB07BF7.6070602@omnilan.de>
In-Reply-To: <4B8E1B3D.306@FreeBSD.org>
References: <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B8E1489.2070306@omnilan.de> <4B8E1B3D.306@FreeBSD.org>
[-- Attachment #1 --]
Alexander Motin wrote on 03.03.2010 09:18 (localtime):
> Harald Schmalzbauer wrote:
>> Alexander Motin wrote on 23.02.2010 16:10 (localtime):
>>> Harald Schmalzbauer wrote:
>>>> I'm frequently getting my machine locked with ahcichX timeouts:
>>>> ahcich2: Timeout on slot 0
>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr
>>>> 00000000
>>>> ahcich2: Timeout on slot 8
>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr
>>>> 00000000
>>>> ahcich2: Timeout on slot 8
>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr
>>>> 00000000
>>>> ...
>>> Looking that is (Interrupt status) is zero and `rs == cs | ss` (running
>>> command bitmasks in driver and hardware), controller doesn't report
>>> command completion. Looking on TFD status 0xc0 with BUSY bit set, I
>>> would suppose that either disk stuck in command processing for some
>>> reason, or controller missed command completion status.
>>>
>>> Have you noticed 30 second (default ATA timeout) pause before timeout
>>> message printed? Just want to be sure that driver waited enough before
>>> give up.
>>>
>>>> This happens when backup over GbE overloads ZFS/HDD capabilities.
>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
>>>> up almost immediately, but from it still happens.
>>>> When I don't use ahci but ataahci (the old driver if I understand things
>>>> correct) I also see the ZFS burst write congestion, but this doesn't
>>>> lead to controller timeouts, thus blocking the machine.
>>>>
>>>> Sometimes the machine recovers from the disk lock, but most often I have
>>>> to reboot.
>>> How it looks when it doesn't? Can you send me full log messages?
>> Hello, this morning I had a stall, but the machine recovered after about
>> one Minute. Here's what I got from the kernel:
>> ahcich2: Timeout on slot 29
>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr
>> 00000000
>> em1: watchdog timeout -- resetting
>> em1: watchdog timeout -- resetting
>> ahcich2: Timeout on slot 10
>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr
>> 00000000
>> ahcich2: Timeout on slot 18
>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr
>> 00000000
>> ahcich2: Timeout on slot 2
>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr
>> 00000000
>> ahcich2: Timeout on slot 2
>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr
>> 00000000
>>
>> Does this tell you something useful?
>
> It doesn't. Looking on logged register content - commands are indeed
> still running and no interrupts requested. Interesting to see em1
> watchdog timeout there. Aren't they related somehow?

I now have the drives running in another server with an ICH7 chipset.
Using UFS, the complete machine locks up for ~30 secs with a disk load of
3.5 MB/s. But I don't get any timeout messages, and the machine has always
recovered.
Changing to the old ata driver solves the problem.

Any chance to get this problem fixed? I couldn't see lockups on another
OS with NCQ in AHCI mode enabled.
I'd ship such a disk to anyone who is willing to debug.

Thanks,

-Harry

[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.13 (FreeBSD)

iEYEARECAAYFAkuwe/gACgkQLDqVQ9VXb8jVJgCgslySg8t/r8/CTXmC+a8ETW+8
7m4AoNBouWumrT4qwXBsEBPbvFvNNW31
=mo2w
-----END PGP SIGNATURE-----
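
The register check Alexander describes in the quoted text (is is zero, rs == cs | ss, BUSY set in TFD) can be sketched in a few lines of Python. This is only an illustration: the function name `decode_dump` and the keys it returns are hypothetical, and it assumes the wrapped log line has been rejoined so that the serr value sits on the same line as the other registers. The bit values 0x80 (BSY) and 0x40 (DRDY) are the standard ATA status bits.

```python
import re

# Standard ATA status register bits, found in the low byte of TFD.
ATA_BSY = 0x80   # device busy
ATA_DRDY = 0x40  # device ready

# Matches one register dump as printed after "Timeout on slot N".
# Assumption: the e-mail's line wrap before the serr value has been undone.
DUMP_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<intr>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) "
    r"ss (?P<ss>[0-9a-f]+) rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) "
    r"serr (?P<serr>[0-9a-f]+)"
)

def decode_dump(line):
    """Return a summary of one register dump, or None if the line is not one."""
    m = DUMP_RE.search(line)
    if m is None:
        return None
    regs = {k: int(v, 16) for k, v in m.groupdict().items() if k != "chan"}
    return {
        "channel": m.group("chan"),
        # is == 0: the controller has not raised any interrupt
        "no_interrupt_pending": regs["intr"] == 0,
        # rs == cs | ss: driver and hardware agree on what is still running
        "rs_matches_hw": regs["rs"] == (regs["cs"] | regs["ss"]),
        "busy": bool(regs["tfd"] & ATA_BSY),
        "ready": bool(regs["tfd"] & ATA_DRDY),
        "link_errors": regs["serr"] != 0,
    }

if __name__ == "__main__":
    sample = ("ahcich2: is 00000000 cs 00000001 ss 00000000 "
              "rs 00000001 tfd c0 serr 00000000")
    print(decode_dump(sample))
```

Run over the dumps quoted in this thread, every entry with tfd c0 reports BUSY set with no interrupt pending, while the last one (tfd 40) reports the device ready but still has commands outstanding.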
