FreeBSD Mail Archives

Date:      Wed, 3 Mar 2010 03:06:47 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <20100303110647.GA51588@icarus.home.lan>
In-Reply-To: <4B8E1DA9.2090406@omnilan.de>
References:  <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B8E1489.2070306@omnilan.de> <4B8E1B3D.306@FreeBSD.org> <4B8E1DA9.2090406@omnilan.de>

On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote:
> Alexander Motin wrote on 03.03.2010 09:18 (localtime):
> >Harald Schmalzbauer wrote:
> >>Alexander Motin wrote on 23.02.2010 16:10 (localtime):
> >>>Harald Schmalzbauer wrote:
> >>>>I'm frequently getting my machine locked with ahcichX timeouts:
> >>>>ahcich2: Timeout on slot 0
> >>>>ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr
> >>>>00000000
> >>>>ahcich2: Timeout on slot 8
> >>>>ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr
> >>>>00000000
> >>>>ahcich2: Timeout on slot 8
> >>>>ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr
> >>>>00000000
> >>>>...
> >>>Since is (interrupt status) is zero and `rs == cs | ss` (the running
> >>>command bitmasks in the driver and hardware), the controller hasn't
> >>>reported command completion. Given TFD status 0xc0 with the BUSY bit
> >>>set, I would suppose that either the disk got stuck processing a command
> >>>for some reason, or the controller missed the command completion status.
> >>>
> >>>Have you noticed a 30 second (default ATA timeout) pause before the
> >>>timeout message is printed? I just want to be sure that the driver
> >>>waited long enough before giving up.
> >>>
> >>>>This happens when backup over GbE overloads ZFS/HDD capabilities.
> >>>>I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
> >>>>up almost immediately, but it still happens.
> >>>>When I don't use ahci but ataahci (the old driver, if I understand things
> >>>>correctly) I also see the ZFS burst write congestion, but this doesn't
> >>>>lead to controller timeouts that block the machine.
> >>>>
> >>>>Sometimes the machine recovers from the disk lock, but most often I have
> >>>>to reboot.
> >>>How does it look when it doesn't? Can you send me the full log messages?
> >>Hello, this morning I had a stall, but the machine recovered after about
> >> one minute. Here's what I got from the kernel:
> >>ahcich2: Timeout on slot 29
> >>ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr
> >>00000000
> >>em1: watchdog timeout -- resetting
> >>em1: watchdog timeout -- resetting
> >>ahcich2: Timeout on slot 10
> >>ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr
> >>00000000
> >>ahcich2: Timeout on slot 18
> >>ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr
> >>00000000
> >>ahcich2: Timeout on slot 2
> >>ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr
> >>00000000
> >>ahcich2: Timeout on slot 2
> >>ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr
> >>00000000
> >>
> >>Does this tell you something useful?
> >
> >It doesn't. Looking at the logged register content, commands are indeed
> >still running and no interrupts were requested. It is interesting to see
> >the em1 watchdog timeout there. Aren't they related somehow?
> 
> 	dmesg | grep "irq 18":
> uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq
> 18 at device 26.0 on pci0
> uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq
> 18 at device 29.2 on pci0
> em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port
> 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18
> at device 2.0 on pci3
> ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f
> mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0
> 
> They don't share the same IRQ at least.
> dmesg | grep "irq 21"
> uhci1: <Intel 82801I (ICH9) USB controller> port 0x20a0-0x20bf irq
> 21 at device 26.1 on pci0
> ahci0: <Intel ICH9 AHCI SATA controller> port
> 0x2408-0x240f,0x2414-0x2417,0x2400-0x2407,0x2410-0x2413,0x2020-0x203f
> mem 0xe1a21000-0xe1a217ff irq 21 at device 31.2 on pci0
> 
> The em1 has no cable attached. I get many of these em watchdog
> timeouts. Never thought they could be related to ahci. I'll see if
> the em watchdog timeouts happen in any relation to disk usage.

Please provide the output of the commands I provided.  dmesg|grep is not
sufficient for helping track this down, specifically with regard to the
em1 watchdog timeouts.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
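
For readers following the register dumps quoted above, here is a minimal
Python sketch of the checks Alexander describes: is == 0 (no interrupt
pending), rs == cs | ss (driver and hardware agree on which commands are
still running), and the BUSY bit in tfd. The function name decode_dump and
the returned field names are hypothetical, chosen for illustration; they
are not part of the ahci driver. It assumes the wrapped "serr" value has
been re-joined onto the same line as the rest of the dump, and that the
low byte of tfd carries the standard ATA status bits (0x80 BSY, 0x40 DRDY).

```python
import re

# Standard ATA status register bits, carried in the low byte of "tfd".
ATA_BSY = 0x80   # device busy
ATA_DRDY = 0x40  # device ready

# Matches the register dump printed after "Timeout on slot N".
# Assumption: the "serr" value wrapped by the mail client is re-joined.
DUMP_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) "
    r"ss (?P<ss>[0-9a-f]+) rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) "
    r"serr (?P<serr>[0-9a-f]+)"
)


def decode_dump(line):
    """Decode one register dump line into the checks discussed in the thread."""
    m = DUMP_RE.search(line)
    if m is None:
        raise ValueError("not an ahcich register dump: %r" % line)
    regs = {k: int(m.group(k), 16)
            for k in ("is", "cs", "ss", "rs", "tfd", "serr")}
    return {
        "channel": m.group("chan"),
        # Non-zero "is" would mean the controller raised an interrupt.
        "interrupt_pending": regs["is"] != 0,
        # rs == cs | ss: every command the driver considers running is
        # still outstanding in the hardware bitmasks.
        "driver_matches_hardware": regs["rs"] == (regs["cs"] | regs["ss"]),
        # BSY set in the task file status: the device is still busy.
        "busy": bool(regs["tfd"] & ATA_BSY),
        # Slot numbers whose bits are set in the driver's running mask.
        "running_slots": [i for i in range(32) if (regs["rs"] >> i) & 1],
    }


if __name__ == "__main__":
    sample = ("ahcich2: is 00000000 cs 00000100 ss 00000000 "
              "rs 00000100 tfd c0 serr 00000000")
    print(decode_dump(sample))
```

Fed the "Timeout on slot 8" dump from the first report, this reports slot 8
as the only running command, no pending interrupt, and BUSY set, which
matches the reading given in the thread. The final dump (tfd 40) decodes
with BUSY clear, consistent with the machine recovering.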

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100303110647.GA51588>
