Date: Sun, 14 Mar 2010 12:12:19 +0100 From: Harald Schmalzbauer <h.schmalzbauer@omnilan.de> To: Jeremy Chadwick <freebsd@jdc.parodius.com> Cc: Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org Subject: Re: ahcich timeouts, only with ahci, not with ataahci Message-ID: <4B9CC493.30009@omnilan.de> In-Reply-To: <4B9C034B.90900@omnilan.de> References: <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B8E1489.2070306@omnilan.de> <4B8E1B3D.306@FreeBSD.org> <4B8E1DA9.2090406@omnilan.de> <20100303110647.GA51588@icarus.home.lan> <4B9C034B.90900@omnilan.de>
next in thread | previous in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig2C6DCC30587B025E71A51C11 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable Harald Schmalzbauer schrieb am 13.03.2010 22:27 (localtime): > Am 03.03.2010 12:06, schrieb Jeremy Chadwick: >> On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote: >>> Alexander Motin schrieb am 03.03.2010 09:18 (localtime): >>>> Harald Schmalzbauer wrote: >>>>> Alexander Motin schrieb am 23.02.2010 16:10 (localtime): >>>>>> Harald Schmalzbauer wrote: >>>>>>> I'm frequently getting my machine locked with ahcichX timeouts: >>>>>>> ahcich2: Timeout on slot 0 >>>>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 s= err >>>>>>> 00000000 >>>>>>> ahcich2: Timeout on slot 8 >>>>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 s= err >>>>>>> 00000000 >>>>>>> ahcich2: Timeout on slot 8 >>>>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 s= err >>>>>>> 00000000 >>>>>>> ... >>>>>> Looking that is (Interrupt status) is zero and `rs =3D=3D cs | ss`= (running >>>>>> command bitmasks in driver and hardware), controller doesn't repor= t >>>>>> command completion. Looking on TFD status 0xc0 with BUSY bit set, = I >>>>>> would suppose that either disk stuck in command processing for som= e >>>>>> reason, or controller missed command completion status. >>>>>> >>>>>> Have you noticed 30 second (default ATA timeout) pause before time= out >>>>>> message printed? Just want to be sure that driver waited enough be= fore >>>>>> give up. >>>>>> >>>>>>> This happens when backup over GbE overloads ZFS/HDD capabilities.= >>>>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from lo= cking >>>>>>> up almost immediately, but from it still happens. >>>>>>> When I don't use ahci but ataahci (the old driver if I understand= things >>>>>>> correct) I also see the ZFS burst write congestion, but this does= n't >>>>>>> lead to controller timeouts, thus blocking the machine. >>>>>>> >>>>>>> Sometimes the machine recovers from the disk lock, but most often= I have >>>>>>> to reboot. >>>>>> How it looks when it doesn't? Can you send me full log messages? >>>>> Hello, this morning I had a stall, but the machine recovered after = about >>>>> one Minute. Here's what I got from the kernel: >>>>> ahcich2: Timeout on slot 29 >>>>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 ser= r >>>>> 00000000 >>>>> em1: watchdog timeout -- resetting >>>>> em1: watchdog timeout -- resetting >>>>> ahcich2: Timeout on slot 10 >>>>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 ser= r >>>>> 00000000 >>>>> ahcich2: Timeout on slot 18 >>>>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 ser= r >>>>> 00000000 >>>>> ahcich2: Timeout on slot 2 >>>>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 ser= r >>>>> 00000000 >>>>> ahcich2: Timeout on slot 2 >>>>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 ser= r >>>>> 00000000 >>>>> >>>>> Does this tell you something useful? >>>> It doesn't. Looking on logged register content - commands are indeed= >>>> still running and no interrupts requested. Interesting to see em1 >>>> watchdog timeout there. Aren't they related somehow? >>> dmesg | grep "irq 18": >>> uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq >>> 18 at device 26.0 on pci0 >>> uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq >>> 18 at device 29.2 on pci0 >>> em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port >>> 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 >>> at device 2.0 on pci3 >>> ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f >>> mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0 >>> >>> The don't share the same IRQ at least. >>> dmesg | grep "irq 21" >>> uhci1: <Intel 82801I (ICH9) USB controller> port 0x20a0-0x20bf irq >>> 21 at device 26.1 on pci0 >>> ahci0: <Intel ICH9 AHCI SATA controller> port >>> 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f= >>> mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0 >>> >>> The em1 has no cable attached. I get many of these em watchdog >>> timeouts. Never thought they could be related to ahci. I'll see if >>> the em watchdog timeouts happens in any relation to disk usage. >> Please provide output from the commands I provided. dmesg|grep is not= >> sufficient for helping track this down, specifically with regards to t= he >> em1 watchdog timeouts. >=20 > Sorry for the delay, here's the details: > hostb0@pci0:0:0:0: class=3D0x060000 card=3D0x34d08086 chip=3D0x29f= 08086 > rev=3D0x00 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '3200 Chipset (Bearlake) Processor to I/O Controller= ' > class =3D bridge > subclass =3D HOST-PCI > em0@pci0:0:25:0: class=3D0x020000 card=3D0x34d08086 chip=3D0x10b= d8086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D 'Intel 82566DM Gigabit Ethernet Adapter (82566DM)' > class =3D network > subclass =3D ethernet > uhci0@pci0:0:26:0: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293= 78086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont= roller' > class =3D serial bus > subclass =3D USB > uhci1@pci0:0:26:1: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293= 88086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont= roller' > class =3D serial bus > subclass =3D USB > ehci0@pci0:0:26:7: class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293= c8086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Cont= roller' > class =3D serial bus > subclass =3D USB > pcib1@pci0:0:28:0: class=3D0x060400 card=3D0x29408086 chip=3D0x294= 08086 > rev=3D0x02 hdr=3D0x01 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 1' > class =3D bridge > subclass =3D PCI-PCI > pcib2@pci0:0:28:4: class=3D0x060400 card=3D0x29488086 chip=3D0x294= 88086 > rev=3D0x02 hdr=3D0x01 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 5' > class =3D bridge > subclass =3D PCI-PCI > uhci2@pci0:0:29:0: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293= 48086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont= roller' > class =3D serial bus > subclass =3D USB > uhci3@pci0:0:29:1: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293= 58086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont= roller' > class =3D serial bus > subclass =3D USB > uhci4@pci0:0:29:2: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293= 68086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont= roller' > class =3D serial bus > subclass =3D USB > ehci1@pci0:0:29:7: class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293= a8086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Cont= roller' > class =3D serial bus > subclass =3D USB > pcib3@pci0:0:30:0: class=3D0x060401 card=3D0x34d08086 chip=3D0x244= e8086 > rev=3D0x92 hdr=3D0x01 > vendor =3D 'Intel Corporation' > device =3D '82801 Family (ICH2/3/4/5/6/7/8/9,63xxESB) Hub > Interface to PCI Bridge' > class =3D bridge > subclass =3D PCI-PCI > isab0@pci0:0:31:0: class=3D0x060100 card=3D0x34d08086 chip=3D0x291= 68086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IR (ICH9R) LPC Interface Controller' > class =3D bridge > subclass =3D PCI-ISA > ahci0@pci0:0:31:2: class=3D0x010601 card=3D0x34d08086 chip=3D0x292= 28086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) 6 port SATA AHCI Contro= ller' > class =3D mass storage > subclass =3D SATA > ichsmb0@pci0:0:31:3: class=3D0x0c0500 card=3D0x34d08086 chip=3D0x293= 08086 > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D '82801IB/IR/IH (ICH9 Family) SMBus Controller' > class =3D serial bus > subclass =3D SMBus > vgapci0@pci0:2:0:0: class=3D0x030000 card=3D0x01018086 chip=3D0x052= 2102b > rev=3D0x02 hdr=3D0x00 > vendor =3D 'Matrox Electronic Systems Ltd.' > device =3D 'Matrox G200e (ServerEngines) - English (G200e)' > class =3D display > subclass =3D VGA > em1@pci0:3:2:0: class=3D0x020000 card=3D0x34d08086 chip=3D0x10768086 re= v=3D0x05 > hdr=3D0x00 > vendor =3D 'Intel Corporation' > device =3D 'Gigabit Ethernet Controller (82541EI)' > class =3D network > subclass =3D ethernet > banana:~>20: vmstat -i > interrupt total rate > irq4: uart0 27392 0 > irq18: em1 uhci0++ 178 0 > irq19: uhci3 22 0 > irq23: uhci2 ehci1 2 0 > cpu0: timer 231629257 1917 > irq256: em0 2559286 21 > irq257: ahci0 1602196 13 > cpu1: timer 231628629 1917 > Total 467446962 3869 >=20 > Looks very similar to the problem here: > http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-03/msg00143= =2Ehtml For the records, again I have those timeouts, but with different status bits: em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting ahcich2: Timeout on slot 14 ahcich2: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd c0 serr=20 00000000 ahcich2: Timeout on slot 14 ahcich2: is 00000000 cs fff81fff ss ffffdfff rs ffffdfff tfd c0 serr=20 00000000 ahcich2: Timeout on slot 10 ahcich2: is 00000000 cs 00000c00 ss 00000000 rs 00000c00 tfd c0 serr=20 00000000 ahcich2: Timeout on slot 6 ahcich2: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd c0 serr=20 00000000 ahcich2: Timeout on slot 2 ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr=20 00000000 ahcich2: Timeout on slot 30 ahcich2: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr=20 00000000 ahcich2: Timeout on slot 30 ahcich2: is 00000000 cs 1ffffffc ss dfffffff rs dfffffff tfd c0 serr=20 00000000 em1: watchdog timeout -- resetting em1: watchdog timeout -- resetting The machine is now hangig for several hours, no login at resial console=20 possible. To emhazise, em1 is not connected/used, em0 is the productive=20 interface. I'll have to drive on site... What do these bits tell compared to the hangs above, especcially the=20 middle one, which the machine recovered from? No interrupts requested=20 and `rs =3D=3D cs | ss` meaning controller doesn't report command=20 completion. That's identical. Any other useful info? If not I won't post = future timeout messages. Thanks, -Harry --------------enig2C6DCC30587B025E71A51C11 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.13 (FreeBSD) iEYEARECAAYFAkucxJkACgkQLDqVQ9VXb8i1hgCbBebgmb/9f/kJzHygp4o1JgZk DP8AnR9YDn3ixtGINVp8QVb0QaieYQj1 =NzeA -----END PGP SIGNATURE----- --------------enig2C6DCC30587B025E71A51C11--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4B9CC493.30009>