From: "Keegan,Nate"
To: "freebsd-questions@freebsd.org"
Date: Fri, 12 Oct 2012 09:41:56 -0700
Subject: ahcich Timeouts SATA SSD

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected
to 4 x Intel SASUC8I (LSI 3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap

The SSDs are connected to the on-board SATA ports on the motherboard.

This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts on the on-board SATA devices. The only change to the system since it was built was the addition of an SSD for swap (32 GB swap device), and this issue did not appear until several months after that was added.

My initial thought was that I might have a bad SSD, so I swapped out one of the Crucial SSDs. The problem happened again a few days later.

I then moved to systematically replacing items such as SATA cables, memory, motherboard, etc., and the problem continued. For example, I swapped out the 4 SATA cables for brand new ones and waited to see if the problem happened again. Once it did, I moved on to replacing the motherboard with an identical motherboard, waited, and so on.

I could not find an obvious hardware-related explanation for this behavior, so about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver, as I found some evidence that this was helpful.
The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device
ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117

When this happens, the only reliable way to recover the system is a hard power cycle via IPMI (cutting power rather than hitting reset). A power cycle is not needed every single time, but more often than not it is, because the on-board AHCI portion of the BIOS does not always see the disks after the event without one.

I have searched extensively on this and have seen the issue reported for both FreeNAS and FreeBSD, but with no clear-cut resolution or root cause. Some people had a bad SSD, others had to disable NCQ or power management on their SSD, others pointed at particular brands of SSD (Samsung), etc. Nothing conclusive so far.
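For anyone reading the register dumps in the timeout output above: the cs, ss and rs fields appear to be 32-bit masks with one bit per command slot. My understanding (an assumption on my part, not something I have confirmed against the driver documentation) is that cs is the port's command-issue register, ss is the SATA-active register used for NCQ, and rs is the driver's own record of running slots. A minimal sh sketch for turning such a mask into slot numbers follows. The function name slots_in_mask is made up for illustration, and it assumes a shell whose arithmetic is wider than 32 bits.

```shell
#!/bin/sh
# Sketch: list the slot numbers whose bits are set in a 32-bit hex mask
# copied from the cs/ss/rs fields of an ahcich timeout message.
# slots_in_mask is a hypothetical helper name, not part of FreeBSD.
# Assumes the shell does arithmetic in more than 32 bits.
slots_in_mask() {
    mask=$((0x$1))          # argument is hex without the 0x prefix
    out=""
    i=0
    while [ "$i" -lt 32 ]; do
        # Test bit i of the mask.
        if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
            out="${out:+$out }$i"
        fi
        i=$((i + 1))
    done
    echo "$out"
}

slots_in_mask e0000000   # prints: 29 30 31
slots_in_mask 80000003   # prints: 0 1 31
```

Read this way, the first timeout above (ss e0000000, rs e0000000) would have slots 29, 30 and 31 outstanding when slot 29 timed out.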
At the present time the issue happens every 1-2 hours unless I have the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"
# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1
hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh
CAMCONTROL=/sbin/camcontrol
$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null
exit 0

I also pulled the Intel SSDs, as they were showing ASR and hardware reset counts that kept incrementing. Removing both of these disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or so of operation before the issue pops up again. If I remove these two items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk, and that is it for SSDs on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at this point) to see if that makes a difference, as I found a reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If there is a way to gather more information when this happens, post up information, etc., I'm open to trying it.

What is driving me crazy is that I can't seem to come up with a concrete explanation as to why this is happening now and not back when the system was built. The issue only seems to happen when the system is idle. The SSDs do not see much action other than hosting the OS, scripts, etc., while the drives on the Intel/LSI controllers are where the actual I/O happens.
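As an aside on the rc.d script earlier in this message: the same camcontrol call can be written as a loop, so the disk list only has to change in one place now that the number of SSDs keeps changing. This is only a sketch of the approach. The function name disable_ncq and the ability to override CAMCONTROL from the environment are assumptions added for illustration. The camcontrol invocation itself is the one from the original script.

```shell
#!/bin/sh
# Sketch: limit each listed disk to a single tagged opening, which is
# how the original script disables NCQ. disable_ncq is a hypothetical
# helper name; CAMCONTROL may be overridden from the environment.
CAMCONTROL="${CAMCONTROL:-/sbin/camcontrol}"

disable_ncq() {
    rc=0
    for disk in "$@"; do
        # Same invocation as the original script, once per disk.
        if ! "$CAMCONTROL" tags "$disk" -N 1 > /dev/null; then
            echo "warning: could not limit tags on $disk" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Only act when camcontrol is actually present on this machine.
if [ -x "$CAMCONTROL" ]; then
    disable_ncq ada0 ada1 ada2 ada3
fi
```

Unlike the original, this version reports a disk that has gone missing instead of failing silently, which may be useful given that devices are dropping off the bus.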
The system logs do not show anything prior to the event. The OS will still respond to ping requests after the issue occurs, and if you have an active SSH session you will remain connected to the system until you attempt to do something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't

* Ditch SSD for straight SATA disks (I plan on doing this next week, before the next likely occurrence sometime Wednesday morning), as perhaps there is some odd SATA/SSD interaction with FreeBSD or with the controller that I'm not aware of (I haven't had this happen with plain SATA and FreeBSD before)

* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

I'm open to suggestions, direction, etc. to see if I can nail down what is going on and put this issue to bed, not only for myself but for anyone else who might run into it, before I lose what little hair and sanity I have left...heh

- Nate