Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 14 Mar 2010 12:12:19 +0100
From:      Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <4B9CC493.30009@omnilan.de>
In-Reply-To: <4B9C034B.90900@omnilan.de>
References:  <1266934981.00222684.1266922202@10.7.7.3>	<4B83EFD4.8050403@FreeBSD.org>	<4B8E1489.2070306@omnilan.de>	<4B8E1B3D.306@FreeBSD.org>	<4B8E1DA9.2090406@omnilan.de>	<20100303110647.GA51588@icarus.home.lan> <4B9C034B.90900@omnilan.de>

next in thread | previous in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig2C6DCC30587B025E71A51C11
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable

Harald Schmalzbauer schrieb am 13.03.2010 22:27 (localtime):
> Am 03.03.2010 12:06, schrieb Jeremy Chadwick:
>> On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote:
>>> Alexander Motin schrieb am 03.03.2010 09:18 (localtime):
>>>> Harald Schmalzbauer wrote:
>>>>> Alexander Motin schrieb am 23.02.2010 16:10 (localtime):
>>>>>> Harald Schmalzbauer wrote:
>>>>>>> I'm frequently getting my machine locked with ahcichX timeouts:
>>>>>>> ahcich2: Timeout on slot 0
>>>>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 s=
err
>>>>>>> 00000000
>>>>>>> ahcich2: Timeout on slot 8
>>>>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 s=
err
>>>>>>> 00000000
>>>>>>> ahcich2: Timeout on slot 8
>>>>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 s=
err
>>>>>>> 00000000
>>>>>>> ...
>>>>>> Looking that is (Interrupt status) is zero and `rs =3D=3D cs | ss`=
 (running
>>>>>> command bitmasks in driver and hardware), controller doesn't repor=
t
>>>>>> command completion. Looking on TFD status 0xc0 with BUSY bit set, =
I
>>>>>> would suppose that either disk stuck in command processing for som=
e
>>>>>> reason, or controller missed command completion status.
>>>>>>
>>>>>> Have you noticed 30 second (default ATA timeout) pause before time=
out
>>>>>> message printed? Just want to be sure that driver waited enough be=
fore
>>>>>> give up.
>>>>>>
>>>>>>> This happens when backup over GbE overloads ZFS/HDD capabilities.=

>>>>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from lo=
cking
>>>>>>> up almost immediately, but from it still happens.
>>>>>>> When I don't use ahci but ataahci (the old driver if I understand=
 things
>>>>>>> correct) I also see the ZFS burst write congestion, but this does=
n't
>>>>>>> lead to controller timeouts, thus blocking the machine.
>>>>>>>
>>>>>>> Sometimes the machine recovers from the disk lock, but most often=
 I have
>>>>>>> to reboot.
>>>>>> How it looks when it doesn't? Can you send me full log messages?
>>>>> Hello, this morning I had a stall, but the machine recovered after =
about
>>>>> one Minute. Here's what I got from the kernel:
>>>>> ahcich2: Timeout on slot 29
>>>>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 ser=
r
>>>>> 00000000
>>>>> em1: watchdog timeout -- resetting
>>>>> em1: watchdog timeout -- resetting
>>>>> ahcich2: Timeout on slot 10
>>>>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 ser=
r
>>>>> 00000000
>>>>> ahcich2: Timeout on slot 18
>>>>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 ser=
r
>>>>> 00000000
>>>>> ahcich2: Timeout on slot 2
>>>>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 ser=
r
>>>>> 00000000
>>>>> ahcich2: Timeout on slot 2
>>>>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 ser=
r
>>>>> 00000000
>>>>>
>>>>> Does this tell you something useful?
>>>> It doesn't. Looking on logged register content - commands are indeed=

>>>> still running and no interrupts requested. Interesting to see em1
>>>> watchdog timeout there. Aren't they related somehow?
>>> 	dmesg | grep "irq 18":
>>> uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq
>>> 18 at device 26.0 on pci0
>>> uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq
>>> 18 at device 29.2 on pci0
>>> em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port
>>> 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18
>>> at device 2.0 on pci3
>>> ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f
>>> mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0
>>>
>>> The don't share the same IRQ at least.
>>> dmesg | grep "irq 21"
>>> uhci1: <Intel 82801I (ICH9) USB controller> port 0x20a0-0x20bf irq
>>> 21 at device 26.1 on pci0
>>> ahci0: <Intel ICH9 AHCI SATA controller> port
>>> 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f=

>>> mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0
>>>
>>> The em1 has no cable attached. I get many of these em watchdog
>>> timeouts. Never thought they could be related to ahci. I'll see if
>>> the em watchdog timeouts happens in any relation to disk usage.
>> Please provide output from the commands I provided.  dmesg|grep is not=

>> sufficient for helping track this down, specifically with regards to t=
he
>> em1 watchdog timeouts.
>=20
> Sorry for the delay, here's the details:
> hostb0@pci0:0:0:0:      class=3D0x060000 card=3D0x34d08086 chip=3D0x29f=
08086
> rev=3D0x00 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '3200 Chipset (Bearlake) Processor to I/O Controller=
'
>     class      =3D bridge
>     subclass   =3D HOST-PCI
> em0@pci0:0:25:0:        class=3D0x020000 card=3D0x34d08086 chip=3D0x10b=
d8086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D 'Intel 82566DM Gigabit Ethernet Adapter (82566DM)'
>     class      =3D network
>     subclass   =3D ethernet
> uhci0@pci0:0:26:0:      class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293=
78086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> uhci1@pci0:0:26:1:      class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293=
88086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> ehci0@pci0:0:26:7:      class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293=
c8086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> pcib1@pci0:0:28:0:      class=3D0x060400 card=3D0x29408086 chip=3D0x294=
08086
> rev=3D0x02 hdr=3D0x01
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 1'
>     class      =3D bridge
>     subclass   =3D PCI-PCI
> pcib2@pci0:0:28:4:      class=3D0x060400 card=3D0x29488086 chip=3D0x294=
88086
> rev=3D0x02 hdr=3D0x01
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) PCIe Root Port 5'
>     class      =3D bridge
>     subclass   =3D PCI-PCI
> uhci2@pci0:0:29:0:      class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293=
48086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> uhci3@pci0:0:29:1:      class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293=
58086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> uhci4@pci0:0:29:2:      class=3D0x0c0300 card=3D0x34d08086 chip=3D0x293=
68086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB Universal Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> ehci1@pci0:0:29:7:      class=3D0x0c0320 card=3D0x34d08086 chip=3D0x293=
a8086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) USB2 Enhanced Host Cont=
roller'
>     class      =3D serial bus
>     subclass   =3D USB
> pcib3@pci0:0:30:0:      class=3D0x060401 card=3D0x34d08086 chip=3D0x244=
e8086
> rev=3D0x92 hdr=3D0x01
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801 Family (ICH2/3/4/5/6/7/8/9,63xxESB) Hub
> Interface to PCI Bridge'
>     class      =3D bridge
>     subclass   =3D PCI-PCI
> isab0@pci0:0:31:0:      class=3D0x060100 card=3D0x34d08086 chip=3D0x291=
68086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IR (ICH9R) LPC Interface Controller'
>     class      =3D bridge
>     subclass   =3D PCI-ISA
> ahci0@pci0:0:31:2:      class=3D0x010601 card=3D0x34d08086 chip=3D0x292=
28086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) 6 port SATA AHCI Contro=
ller'
>     class      =3D mass storage
>     subclass   =3D SATA
> ichsmb0@pci0:0:31:3:    class=3D0x0c0500 card=3D0x34d08086 chip=3D0x293=
08086
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D '82801IB/IR/IH (ICH9 Family) SMBus Controller'
>     class      =3D serial bus
>     subclass   =3D SMBus
> vgapci0@pci0:2:0:0:     class=3D0x030000 card=3D0x01018086 chip=3D0x052=
2102b
> rev=3D0x02 hdr=3D0x00
>     vendor     =3D 'Matrox Electronic Systems Ltd.'
>     device     =3D 'Matrox G200e (ServerEngines) - English (G200e)'
>     class      =3D display
>     subclass   =3D VGA
> em1@pci0:3:2:0: class=3D0x020000 card=3D0x34d08086 chip=3D0x10768086 re=
v=3D0x05
> hdr=3D0x00
>     vendor     =3D 'Intel Corporation'
>     device     =3D 'Gigabit Ethernet Controller (82541EI)'
>     class      =3D network
>     subclass   =3D ethernet
> banana:~>20: vmstat -i
> interrupt                          total       rate
> irq4: uart0                        27392          0
> irq18: em1 uhci0++                   178          0
> irq19: uhci3                          22          0
> irq23: uhci2 ehci1                     2          0
> cpu0: timer                    231629257       1917
> irq256: em0                      2559286         21
> irq257: ahci0                    1602196         13
> cpu1: timer                    231628629       1917
> Total                          467446962       3869
>=20
> Looks very similar to the problem here:
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-03/msg00143=
=2Ehtml

For the records,

again I have those timeouts, but with different status bits:

em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
ahcich2: Timeout on slot 14
ahcich2: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd c0 serr=20
00000000
ahcich2: Timeout on slot 14
ahcich2: is 00000000 cs fff81fff ss ffffdfff rs ffffdfff tfd c0 serr=20
00000000
ahcich2: Timeout on slot 10
ahcich2: is 00000000 cs 00000c00 ss 00000000 rs 00000c00 tfd c0 serr=20
00000000
ahcich2: Timeout on slot 6
ahcich2: is 00000000 cs 00000040 ss 00000000 rs 00000040 tfd c0 serr=20
00000000
ahcich2: Timeout on slot 2
ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr=20
00000000
ahcich2: Timeout on slot 30
ahcich2: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr=20
00000000
ahcich2: Timeout on slot 30
ahcich2: is 00000000 cs 1ffffffc ss dfffffff rs dfffffff tfd c0 serr=20
00000000
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting

The machine is now hangig for several hours, no login at resial console=20
possible. To emhazise, em1 is not connected/used, em0 is the productive=20
interface. I'll have to drive on site...
What do these bits tell compared to the hangs above, especcially the=20
middle one, which the machine recovered from? No interrupts requested=20
and `rs =3D=3D cs | ss` meaning controller doesn't report command=20
completion. That's identical. Any other useful info? If not I won't post =

future timeout messages.

Thanks,

-Harry


--------------enig2C6DCC30587B025E71A51C11
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.13 (FreeBSD)

iEYEARECAAYFAkucxJkACgkQLDqVQ9VXb8i1hgCbBebgmb/9f/kJzHygp4o1JgZk
DP8AnR9YDn3ixtGINVp8QVb0QaieYQj1
=NzeA
-----END PGP SIGNATURE-----

--------------enig2C6DCC30587B025E71A51C11--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4B9CC493.30009>