From owner-freebsd-questions@FreeBSD.ORG  Fri Oct 12 16:47:30 2012
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id B1F939EB
 for <freebsd-questions@freebsd.org>; Fri, 12 Oct 2012 16:47:30 +0000 (UTC)
 (envelope-from nate.keegan@prescott-az.gov)
Received: from tungsten.cityofprescott.net (tungsten.cityofprescott.net
 [63.229.103.74])
 by mx1.freebsd.org (Postfix) with ESMTP id 8B7DC8FC16
 for <freebsd-questions@freebsd.org>; Fri, 12 Oct 2012 16:47:30 +0000 (UTC)
Received: from tungsten.cityofprescott.net (localhost [127.0.0.1])
 by tungsten.cityofprescott.net (Postfix) with ESMTP id 44FCB8C8433
 for <freebsd-questions@freebsd.org>; Fri, 12 Oct 2012 09:39:13 -0700 (MST)
Received: from obsidian.ad.cityofprescott.org (unknown [172.30.0.160])
 by tungsten.cityofprescott.net (Postfix) with ESMTP id 1A14C8C8427
 for <freebsd-questions@freebsd.org>; Fri, 12 Oct 2012 09:39:13 -0700 (MST)
Received: from Obsidian.ad.cityofprescott.org ([fe80::7106:4d83:5091:556a]) by
 obsidian.ad.cityofprescott.org ([fe80::7106:4d83:5091:556a%10]) with
 mapi; Fri, 12 Oct 2012 09:41:55 -0700
From: "Keegan,Nate" <nate.keegan@prescott-az.gov>
To: "freebsd-questions@freebsd.org" <freebsd-questions@freebsd.org>
Date: Fri, 12 Oct 2012 09:41:56 -0700
Subject: ahcich Timeouts SATA SSD
Thread-Topic: ahcich Timeouts SATA SSD
Thread-Index: Ac2omH3EcOLB9LszTTuYbb2QW7sZHQ==
Message-ID: <0488BA670C8E594D93BE0556FEB89063054C373D29@obsidian.ad.cityofprescott.org>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 12 Oct 2012 16:47:30 -0000

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI 3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
SSDs are connected to the on-board SATA ports on the motherboard

This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts on the on-board SATA devices. The only change to the system since it was built was the addition of an SSD for swap (32 GB swap device), and this issue did not appear until several months after that was added.

My initial thought was that I might have a bad SSD, so I swapped out one of the Crucial SSDs, and the problem happened again a few days later.

I then moved on to systematically replacing items such as SATA cables, memory, motherboard, etc., and the problem continued. For example, I swapped out the 4 SATA cables for brand new ones and waited to see if the problem happened again. Once it did, I replaced the motherboard with an identical one, waited, etc.

I could not find an obvious hardware-related explanation for this behavior, so about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver, as I found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117
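
To see how often each channel is affected, messages like the ones above can be tallied per channel. The following is only a minimal sketch: the function name count_ahci_timeouts is made up for illustration, and it assumes the kernel messages keep the "ahcichN: Timeout on slot" / "ahcichN: Poll timeout on slot" wording shown above.

```shell
#!/bin/sh
# Count ahcich timeout events per channel from kernel log text on stdin.
# Matches both "Timeout on slot" and "Poll timeout on slot" lines, with or
# without a syslog prefix in front of the channel name.
# count_ahci_timeouts is a hypothetical helper name, not a system tool.
count_ahci_timeouts() {
    awk '
        match($0, /ahcich[0-9]+: (Poll t|T)imeout on slot/) {
            chan = substr($0, RSTART, RLENGTH)
            sub(/:.*/, "", chan)      # keep only the channel name
            count[chan]++
        }
        END {
            for (c in count) printf "%s %d\n", c, count[c]
        }
    ' | sort
}

# Example usage:
#   dmesg | count_ahci_timeouts
#   count_ahci_timeouts < /var/log/messages
```

If only one channel ever shows up, that would point more at a specific port or device than at the controller as a whole, although the counts by themselves do not prove either.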

When this happens the only way to recover the system is a hard power cycle via IPMI (cutting the power rather than hitting reset). Not every occurrence requires this, but more often than not it does, as the on-board AHCI portion of the BIOS does not always see the disks after the event without a full power cycle.

I have done a bunch of Google work on this and have seen the issue reported on FreeNAS and FreeBSD, but with no clear-cut resolution in terms of how to address it or what causes it. Some people had a bad SSD, others had to disable NCQ or power management on their SSD, others had trouble with particular brands of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0
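
The script above hardcodes four devices and ignores failures, so it cannot tell whether the setting was actually applied once drives are pulled. A minimal sketch of a variant that takes the device list as arguments and reports failures follows; the function name limit_tags and the ability to override CAMCONTROL from the environment are assumptions added for illustration, while the camcontrol invocation itself is the same one used above.

```shell
#!/bin/sh
# Assumption: CAMCONTROL may be overridden from the environment for a dry run.
CAMCONTROL=${CAMCONTROL:-/sbin/camcontrol}

# Limit each named device to a single tag, as the script above does to
# disable NCQ, and report any device where the command fails.
# limit_tags is a hypothetical helper name.
limit_tags() {
    rc=0
    for dev in "$@"; do
        if ! $CAMCONTROL tags "$dev" -N 1 > /dev/null; then
            echo "failed to set tags on $dev" >&2
            rc=1
        fi
    done
    return $rc
}

# Example usage:
#   limit_tags ada0 ada1 ada2 ada3
```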

I went ahead and pulled the Intel SSDs as they were showing ASR and hardware reset counts that kept incrementing. Removing both of these disks from the system did not change the situation.

The combination of the /boot/loader.conf settings and this script gets me 6 days or so of operation before the issue pops up again. If I remove these two items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk, and that is it for SSDs on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at this point) to see if that makes a difference, as I found a reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If there is a way to gather more information when this happens, post up information, etc., I'm open to trying it.
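
One possible way to gather more information is to record device state at regular intervals to a location that does not live on the SSDs, so that the last snapshot before an event may survive it. This is only a sketch under assumptions: the function name snapshot and the path /tank/ahci-debug.log are hypothetical, and smartctl is assumed to be installed from the smartmontools port.

```shell
#!/bin/sh
# snapshot LOGFILE CMD...
# Append a timestamp and the output of each command string to LOGFILE.
# snapshot is a hypothetical helper name, not a system tool.
snapshot() {
    logfile=$1
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        for cmd in "$@"; do
            echo "--- $cmd"
            # Record a failing command instead of aborting the snapshot.
            sh -c "$cmd" 2>&1 || echo "(exit status $?)"
        done
    } >> "$logfile"
}

# Example usage from cron, every minute (path is a placeholder on a
# pool that is not backed by the on-board SATA ports):
#   snapshot /tank/ahci-debug.log 'camcontrol devlist' 'smartctl -a /dev/ada0'
```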

What is driving me crazy is that I can't seem to come up with a concrete explanation as to why now and not back when the system was built. The issue only seems to happen when the system is idle, and the SSDs do not see much action other than hosting the OS, scripts, etc., while the disks on the Intel/LSI controllers are where the actual I/O is.

The system logs do not show anything prior to the event. The OS will respond to ping requests after the issue occurs, and an active SSH session remains connected until you attempt to do something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (I plan on doing this next week, before the next likely occurrence sometime Wed am), as perhaps there is some odd SATA/SSD interaction with FreeBSD or with the controller that I'm not aware of (I haven't had this happen with plain SATA and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

I'm open to suggestions, direction, etc. to see if I can nail down what is going on and put this issue to bed, not only for myself but for anyone else who might run into it, before I lose what little hair and sanity I have left...heh

- Nate