Date:      Sun, 14 Oct 2012 18:28:56 +0800 (CST)
From:      LW <litelwang@126.com>
To:        freebsd-hardware@freebsd.org
Subject:   Install Program can't boot after AMD-A75 onboard-raid is activated
Message-ID:  <284a4ec5.a626.13a5ed1c3c5.Coremail.litelwang@126.com>

Hardware list:
1. Motherboard (1): MSI A75MA-P35 with the latest UEFI BIOS (AMD A75
   chipset; supports SATA3, USB3, socket FM1, DDR3, UEFI)
2. CPU (1): AMD A4-3400 (socket FM1)
3. Memory (1): DDR3 1600 4GB
4. Hard disks (2): Seagate SATA3 500GB
5. USB DVD-ROM (1)

Problem description:
1. If IDE or AHCI mode is activated, everything is OK and it runs fast.
2. If RAID mode is activated (WITHOUT any RAID defined, or with RAID1
   defined), the boot loader (install CD or DVD, 32- or 64-bit version,
   FreeBSD 9 release) runs but reboots very soon. Only about 3 lines of
   messages are shown.

I have read some messages on this mailing list but found nothing
important that solves my problem. Thanks!

Date:      Sun, 14 Oct 2012 16:03:39 -0700
From:      nate keegan <nate.keegan@gmail.com>
To:        freebsd-hardware@freebsd.org
Subject:   ahcich Timeouts SATA SSD
Message-ID:  <CABVjXfeV9VvF6sJC3Tb78z=jP+2sF+OJ2q0euCZkNqN_Yjs9ag@mail.gmail.com>

I originally posted this to the FreeBSD hardware forum and then on
freebsd-questions at the direction of a moderator in the forum.

Based on the kinds of posts I see on freebsd-questions, this list
might be the better place for this issue, as it looks like some sort
of strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives.

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI
3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
SSDs are connected to the on-board SATA ports on the motherboard

This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts to the
on-board SATA devices. The only change to the system since it was
built was to add an SSD for swap (32 GB swap device), and this issue
did not happen until several months after this was added.

My initial thought was that I might have a bad SSD drive so I swapped
out one of the Crucial SSD drives and the problem happened again a few
days later.

I then moved to systematically replacing items such as SATA cables,
memory, motherboard, etc., and the problem continued. For example, I
swapped out the 4 SATA cables with brand new SATA cables and waited to
see if the problem happened again. Once it did, I moved on to replacing
the motherboard with an identical motherboard, waited, etc.

I could not find an obvious hardware related explanation for this
behavior so about a week and a half ago I did a fresh install of
FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver as
I found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117
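
A minimal sketch for reading the register dumps above, assuming that
cs, ss and rs are one-bit-per-slot masks (which matches "Timeout on
slot 29" appearing next to rs e0000000) and that the low byte of tfd is
the ATA status register, where 0x80 is BSY and 0x40 is DRDY. The
function names slots_from_mask and tfd_status are made up for this
illustration.

```shell
#!/bin/sh

# Print the slot numbers whose bits are set in a 32-bit hex mask,
# such as the cs/ss/rs values in an ahcich timeout line.
slots_from_mask() {
    mask=$((0x$1))
    slot=0
    out=""
    while [ "$slot" -lt 32 ]; do
        if [ $(( ($mask >> $slot) & 1 )) -eq 1 ]; then
            out="${out}${out:+ }${slot}"
        fi
        slot=$(($slot + 1))
    done
    echo "$out"
}

# Name the ATA status bits set in the low byte of a hex tfd value.
tfd_status() {
    st=$(( 0x$1 & 0xff ))
    flags=""
    [ $(( $st & 0x80 )) -ne 0 ] && flags="$flags BSY"
    [ $(( $st & 0x40 )) -ne 0 ] && flags="$flags DRDY"
    [ $(( $st & 0x08 )) -ne 0 ] && flags="$flags DRQ"
    [ $(( $st & 0x01 )) -ne 0 ] && flags="$flags ERR"
    echo "${flags# }"
}

slots_from_mask e0000000
tfd_status 80
```

Read this way, the first dump has slots 29 to 31 outstanding, and the
later tfd 80 values would mean the device is still reporting busy after
the reset.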

When this happens, the only reliable way to recover the system is a
hard power cycle via IPMI (cutting the power rather than hitting
reset). I cannot say that a power cycle is necessary every time, but
more often than not it is, as the on-board AHCI portion of the BIOS
does not always see the disks after the event without one.

I have done a bunch of Google work on this and have seen the issue
appear in FreeNAS and FreeBSD, but no clear-cut resolution in terms of
how to address it or what causes it. Some people had a bad SSD, others
had to disable NCQ or power management on their SSD, others blamed
particular brands of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have
the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

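# See ahci(4)
# Limit each channel to SATA revision 1 (1.5 Gbps)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

# Power management level 1: the device may initiate PM state changes
hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1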
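
The same hints can be generated rather than typed out per channel. This
is only a sketch: the function name ahci_hints is hypothetical, and the
channel count of 4 is assumed from the listing above.

```shell
#!/bin/sh

# Emit the sata_rev and pm_level hints shown above for channels 0..N-1.
ahci_hints() {
    count=$1
    ch=0
    while [ "$ch" -lt "$count" ]; do
        echo "hint.ahcich.$ch.sata_rev=1"
        echo "hint.ahcich.$ch.pm_level=1"
        ch=$(($ch + 1))
    done
}

# Print hints for four channels; review before adding to /boot/loader.conf.
ahci_hints 4
```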

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh
# Limit each on-board SATA disk to one outstanding tag, which
# effectively disables NCQ on it.

CAMCONTROL=/sbin/camcontrol

for disk in ada0 ada1 ada2 ada3; do
    $CAMCONTROL tags "$disk" -N 1 > /dev/null
done

exit 0
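
Since disks have been pulled from this system over time, a variant that
skips absent devices may be useful. This is a sketch, not the script in
use: the limit_tags function and the DRY_RUN variable are assumptions
added for illustration, and only the camcontrol invocation comes from
the script above.

```shell
#!/bin/sh

CAMCONTROL=/sbin/camcontrol

# Limit each listed disk to one outstanding tag, skipping devices that
# are not present. With DRY_RUN=1 the commands are printed, not run.
limit_tags() {
    for disk in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$CAMCONTROL tags $disk -N 1"
        elif [ -e "/dev/$disk" ]; then
            "$CAMCONTROL" tags "$disk" -N 1 > /dev/null
        else
            echo "skipping $disk: /dev/$disk not present" >&2
        fi
    done
}

# Show what would be run for the four on-board disks.
DRY_RUN=1
limit_tags ada0 ada1 ada2 ada3
```

Setting DRY_RUN=0 (or leaving it unset) runs the commands for real.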

I went ahead and pulled the Intel SSDs as they were showing ASR and
hardware reset counts that kept incrementing. Removing both of these
disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or
so of operation before the issue pops up again. If I remove these two
items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for
SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and
ada1 at this point) to see if that makes a difference as I found a
reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If
there is a way to gather more information when this happens, post more
details, etc., I'm open to trying it.
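
One possible way to gather more information is to take periodic
snapshots of controller and disk state (from cron, for example) so
there is something to compare against after the next event. This is a
sketch under assumptions: the snapshot function name is hypothetical,
smartctl is assumed to be installed from smartmontools, and the log
should live on a disk that is not behind the failing controller.

```shell
#!/bin/sh

# Append a timestamped snapshot of controller and disk state to a log
# file. Tools that are not installed are skipped.
snapshot() {
    log=$1
    shift
    {
        echo "=== snapshot $(date -u '+%Y-%m-%dT%H:%M:%SZ') ==="
        if command -v dmesg > /dev/null 2>&1; then
            dmesg 2>&1 | tail -n 50
        fi
        if command -v camcontrol > /dev/null 2>&1; then
            camcontrol devlist -v 2>&1
            for disk in "$@"; do
                camcontrol tags "$disk" -v 2>&1
            done
        fi
        if command -v smartctl > /dev/null 2>&1; then
            for disk in "$@"; do
                smartctl -a "/dev/$disk" 2>&1
            done
        fi
        echo "=== end ==="
    } >> "$log"
    return 0
}

# Example call (the log path is a placeholder):
#   snapshot /path/on/other/pool/ahci-snapshots.log ada0 ada1
```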

What is driving me crazy is that I can't come up with a concrete
explanation as to why this is happening now and not back when the
system was built. The issue only seems to happen when the system is
idle. The SSD drives do not see much action other than hosting the OS,
scripts, etc., while the drives on the Intel/LSI controllers are where
the actual I/O happens.

The system logs do not show anything prior to the event. The OS still
responds to ping requests after the issue, and an active SSH session
remains connected to the system until you attempt to run something
like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks. I plan on doing this next week,
before the next likely occurrence sometime Wednesday morning, as
perhaps there is some odd SATA/SSD interaction with FreeBSD or with
the controller that I'm not aware of (I haven't had this happen with
plain SATA disks and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system

I'm open to suggestions, direction, etc to see if I can nail down what
is going on and put this issue to bed for not only myself but for
anyone else who might run into it in the future.


