From owner-freebsd-stable@FreeBSD.ORG Sat Mar 13 21:28:13 2010 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D834C106566B; Sat, 13 Mar 2010 21:28:13 +0000 (UTC) (envelope-from h.schmalzbauer@omnilan.de) Received: from host.omnilan.net (host.omnilan.net [62.245.232.135]) by mx1.freebsd.org (Postfix) with ESMTP id C10588FC19; Sat, 13 Mar 2010 21:28:12 +0000 (UTC) Received: from [172.21.1.33] (akima-win.flintsbach.schmalzbauer.de [172.21.1.33]) (authenticated bits=0) by host.omnilan.net (8.13.8/8.13.8) with ESMTP id o2DLS7Ki025275 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 13 Mar 2010 22:28:11 +0100 (CET) (envelope-from h.schmalzbauer@omnilan.de) X-Authentication-Warning: smtp.dmz.omnilan.net: Host akima-win.flintsbach.schmalzbauer.de [172.21.1.33] claimed to be [172.21.1.33] Message-ID: <4B9C034B.90900@omnilan.de> Date: Sat, 13 Mar 2010 22:27:39 +0100 From: Harald Schmalzbauer Organization: OmniLAN User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.1.5) Gecko/20091204 Lightning/1.0b1 Thunderbird/3.0 MIME-Version: 1.0 To: Jeremy Chadwick References: <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B8E1489.2070306@omnilan.de> <4B8E1B3D.306@FreeBSD.org> <4B8E1DA9.2090406@omnilan.de> <20100303110647.GA51588@icarus.home.lan> In-Reply-To: <20100303110647.GA51588@icarus.home.lan> X-Enigmail-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig99F70BB8ABD28A47139A7119" Cc: Alexander Motin , freebsd-stable@freebsd.org Subject: Re: ahcich timeouts, only with ahci, not with ataahci X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Mar 2010 21:28:14 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig99F70BB8ABD28A47139A7119 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Am 03.03.2010 12:06, schrieb Jeremy Chadwick: > On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote: >> Alexander Motin schrieb am 03.03.2010 09:18 (localtime): >>> Harald Schmalzbauer wrote: >>>> Alexander Motin schrieb am 23.02.2010 16:10 (localtime): >>>>> Harald Schmalzbauer wrote: >>>>>> I'm frequently getting my machine locked with ahcichX timeouts: >>>>>> ahcich2: Timeout on slot 0 >>>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 se= rr >>>>>> 00000000 >>>>>> ahcich2: Timeout on slot 8 >>>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 se= rr >>>>>> 00000000 >>>>>> ahcich2: Timeout on slot 8 >>>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 se= rr >>>>>> 00000000 >>>>>> ... >>>>> Looking that is (Interrupt status) is zero and `rs =3D=3D cs | ss` = (running >>>>> command bitmasks in driver and hardware), controller doesn't report= >>>>> command completion. Looking on TFD status 0xc0 with BUSY bit set, I= >>>>> would suppose that either disk stuck in command processing for some= >>>>> reason, or controller missed command completion status. >>>>> >>>>> Have you noticed 30 second (default ATA timeout) pause before timeo= ut >>>>> message printed? Just want to be sure that driver waited enough bef= ore >>>>> give up. >>>>> >>>>>> This happens when backup over GbE overloads ZFS/HDD capabilities. >>>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from loc= king >>>>>> up almost immediately, but from it still happens. >>>>>> When I don't use ahci but ataahci (the old driver if I understand = things >>>>>> correct) I also see the ZFS burst write congestion, but this doesn= 't >>>>>> lead to controller timeouts, thus blocking the machine. >>>>>> >>>>>> Sometimes the machine recovers from the disk lock, but most often = I have >>>>>> to reboot. >>>>> How it looks when it doesn't? Can you send me full log messages? >>>> Hello, this morning I had a stall, but the machine recovered after a= bout >>>> one Minute. Here's what I got from the kernel: >>>> ahcich2: Timeout on slot 29 >>>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr= >>>> 00000000 >>>> em1: watchdog timeout -- resetting >>>> em1: watchdog timeout -- resetting >>>> ahcich2: Timeout on slot 10 >>>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr= >>>> 00000000 >>>> ahcich2: Timeout on slot 18 >>>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr= >>>> 00000000 >>>> ahcich2: Timeout on slot 2 >>>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr= >>>> 00000000 >>>> ahcich2: Timeout on slot 2 >>>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr= >>>> 00000000 >>>> >>>> Does this tell you something useful? >>> >>> It doesn't. Looking on logged register content - commands are indeed >>> still running and no interrupts requested. Interesting to see em1 >>> watchdog timeout there. Aren't they related somehow? >> >> dmesg | grep "irq 18": >> uhci0: port 0x20c0-0x20df irq >> 18 at device 26.0 on pci0 >> uhci4: port 0x2040-0x205f irq >> 18 at device 29.2 on pci0 >> em1: port >> 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 >> at device 2.0 on pci3 >> ichsmb0: port 0x2000-0x201f >> mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0 >> >> The don't share the same IRQ at least. >> dmesg | grep "irq 21" >> uhci1: port 0x20a0-0x20bf irq >> 21 at device 26.1 on pci0 >> ahci0: port >> 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f >> mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0 >> >> The em1 has no cable attached. I get many of these em watchdog >> timeouts. Never thought they could be related to ahci. I'll see if >> the em watchdog timeouts happens in any relation to disk usage. >=20 > Please provide output from the commands I provided. dmesg|grep is not > sufficient for helping track this down, specifically with regards to th= e > em1 watchdog timeouts. Sorry for the delay, here's the details: hostb0@pci0:0:0:0: class=3D0x060000 card=3D0x34d08086 chip=3D0x29f08= 086 rev=3D0x00 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '3200 Chipset (Bearlake) Processor to I/O Controller' class =3D bridge subclass =3D HOST-PCI em0@pci0:0:25:0: class=3D0x020000 card=3D0x34d08086 chip=3D0x10bd8= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D 'Intel 82566DM Gigabit Ethernet Adapter (82566DM)' class =3D network subclass =3D ethernet uhci0@pci0:0:26:0: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x29378= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Contro= ller' class =3D serial bus subclass =3D USB uhci1@pci0:0:26:1: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x29388= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Contro= ller' class =3D serial bus subclass =3D USB ehci0@pci0:0:26:7: class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293c8= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Contro= ller' class =3D serial bus subclass =3D USB pcib1@pci0:0:28:0: class=3D0x060400 card=3D0x29408086 chip=3D0x29408= 086 rev=3D0x02 hdr=3D0x01 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 1' class =3D bridge subclass =3D PCI-PCI pcib2@pci0:0:28:4: class=3D0x060400 card=3D0x29488086 chip=3D0x29488= 086 rev=3D0x02 hdr=3D0x01 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 5' class =3D bridge subclass =3D PCI-PCI uhci2@pci0:0:29:0: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x29348= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Contro= ller' class =3D serial bus subclass =3D USB uhci3@pci0:0:29:1: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x29358= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Contro= ller' class =3D serial bus subclass =3D USB uhci4@pci0:0:29:2: class=3D0x0c0300 card=3D0x34d08086 chip=3D0x29368= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Contro= ller' class =3D serial bus subclass =3D USB ehci1@pci0:0:29:7: class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293a8= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Contro= ller' class =3D serial bus subclass =3D USB pcib3@pci0:0:30:0: class=3D0x060401 card=3D0x34d08086 chip=3D0x244e8= 086 rev=3D0x92 hdr=3D0x01 vendor =3D 'Intel Corporation' device =3D '82801 Family (ICH2/3/4/5/6/7/8/9,63xxESB) Hub Interface to PCI Bridge' class =3D bridge subclass =3D PCI-PCI isab0@pci0:0:31:0: class=3D0x060100 card=3D0x34d08086 chip=3D0x29168= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IR (ICH9R) LPC Interface Controller' class =3D bridge subclass =3D PCI-ISA ahci0@pci0:0:31:2: class=3D0x010601 card=3D0x34d08086 chip=3D0x29228= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) 6 port SATA AHCI Controll= er' class =3D mass storage subclass =3D SATA ichsmb0@pci0:0:31:3: class=3D0x0c0500 card=3D0x34d08086 chip=3D0x29308= 086 rev=3D0x02 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D '82801IB/IR/IH (ICH9 Family) SMBus Controller' class =3D serial bus subclass =3D SMBus vgapci0@pci0:2:0:0: class=3D0x030000 card=3D0x01018086 chip=3D0x05221= 02b rev=3D0x02 hdr=3D0x00 vendor =3D 'Matrox Electronic Systems Ltd.' device =3D 'Matrox G200e (ServerEngines) - English (G200e)' class =3D display subclass =3D VGA em1@pci0:3:2:0: class=3D0x020000 card=3D0x34d08086 chip=3D0x10768086 rev=3D= 0x05 hdr=3D0x00 vendor =3D 'Intel Corporation' device =3D 'Gigabit Ethernet Controller (82541EI)' class =3D network subclass =3D ethernet banana:~>20: vmstat -i interrupt total rate irq4: uart0 27392 0 irq18: em1 uhci0++ 178 0 irq19: uhci3 22 0 irq23: uhci2 ehci1 2 0 cpu0: timer 231629257 1917 irq256: em0 2559286 21 irq257: ahci0 1602196 13 cpu1: timer 231628629 1917 Total 467446962 3869 Looks very similar to the problem here: http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-03/msg00143.h= tml Thanks, -Harry --------------enig99F70BB8ABD28A47139A7119 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.12 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkucA2wACgkQLDqVQ9VXb8gK/gCgor2kS8lJcVdSi2zqI+aMA68S DisAn3To1/x7eJpkU20Dxqp7wCQjfYA/ =W7lO -----END PGP SIGNATURE----- --------------enig99F70BB8ABD28A47139A7119--