From owner-freebsd-stable@FreeBSD.ORG  Thu Feb  9 16:34:17 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 408A3106564A
	for <freebsd-stable@freebsd.org>; Thu,  9 Feb 2012 16:34:17 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta05.emeryville.ca.mail.comcast.net
	(qmta05.emeryville.ca.mail.comcast.net [76.96.30.48])
	by mx1.freebsd.org (Postfix) with ESMTP id 1EF438FC14
	for <freebsd-stable@freebsd.org>; Thu,  9 Feb 2012 16:34:16 +0000 (UTC)
Received: from omta21.emeryville.ca.mail.comcast.net ([76.96.30.88])
	by qmta05.emeryville.ca.mail.comcast.net with comcast
	id XrqN1i0011u4NiLA5saGme; Thu, 09 Feb 2012 16:34:16 +0000
Received: from koitsu.dyndns.org ([67.180.84.87])
	by omta21.emeryville.ca.mail.comcast.net with comcast
	id XsaF1i0081t3BNj8hsaFhc; Thu, 09 Feb 2012 16:34:16 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 73789102C1E; Thu,  9 Feb 2012 08:34:15 -0800 (PST)
Date: Thu, 9 Feb 2012 08:34:15 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Mike Tancsa <mike@sentex.net>
Message-ID: <20120209163415.GA96451@icarus.home.lan>
References: <4F32E289.4080806@sentex.net>
	<mailpost.1328736521.3202974.81071.mailing.freebsd.stable@FreeBSD.cs.nctu.edu.tw>
	<4F32F5B0.2060203@FreeBSD.org>
	<20120208223819.GA27488@icarus.home.lan>
	<4F32FB5E.7050102@FreeBSD.org> <4F33DB75.1080202@sentex.net>
	<20120209152240.GA95470@icarus.home.lan>
	<4F33F056.6070300@sentex.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4F33F056.6070300@sentex.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Alexander Motin <mav@FreeBSD.org>, freebsd-stable@FreeBSD.org
Subject: Re: siisch1: Error while READ LOG EXT
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 09 Feb 2012 16:34:17 -0000

On Thu, Feb 09, 2012 at 11:12:06AM -0500, Mike Tancsa wrote:
> {snipping}
>
> > So which Port Multiplier did you add?  The one at scbus0 or scbus1?
> 
> 1
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 0 lun 0 (pass5,ada4)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 1 lun 0 (pass6,ada5)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 2 lun 0 (pass7,ada6)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 3 lun 0 (pass8,ada7)
> <WDC WD2002FAEX-007BA0 05.01D05>   at scbus1 target 4 lun 0 (pass9,ada8)
> <Port Multiplier 37261095 1706>    at scbus1 target 15 lun 0 (pass10,pmp0)

I'll provide analysis for all 5 of these disks below.

> > A full dmesg (not just a snippet) would probably be helpful here.  What
> > you provided in your first post was too terse, especially given how many
> > disks you have in this system.  :-)
> > 
> > I really see no problem with looking at all disks -- specifically disks
> > ada0 through ada3, and ada4 through ada8 -- to determine which one may
> > be having problems.  You're welcome to run "smartctl -a" on each one and
> > put them up on the web, preferably segregated by disk name (e.g.
> > ada0.txt, ada1.txt, etc.) and I can review them all.
> 
> Actually, I just had a look at another server at our DR site. Its
> hardware has not changed in a bit, but I did bring the kernel uptodate.
> Its now logging the odd 'READ LOG EXT' error as well.  Its kernel is
> from Jan 22.  Prior to that kernel update, I had not seen these errors.
>  Something in the driver (ahci or cam layer?) that has changed perhaps ?
> 
> Feb  4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT

Perhaps, but mav@ would be the authority on that.

> http://www.tancsa.com/ahci.txt

So here are the results of analysis for disks ada4 through ada8:

ada4
  --> When the below errors happened are 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 13 CRC errors.  These would be
      caused by issues between the disk and the device its
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirms above problem:
      ID      Size     Value  Description
      0x0001  2           13  Command failed due to ICRC error
      0x0002  2           13  R_ERR response for data FIS
      0x0003  2           13  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submit which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada5
  --> When the below errors happened are 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 11 CRC errors.  These would be
      caused by issues between the disk and the device its
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirms above problem:
      ID      Size     Value  Description
      0x0001  2           11  Command failed due to ICRC error
      0x0002  2           11  R_ERR response for data FIS
      0x0003  2           11  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submit which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada6
  --> When the below errors happened are 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 8 CRC errors.  These would be
      caused by issues between the disk and the device its
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirms above problem:
      ID      Size     Value  Description
      0x0001  2            8  Command failed due to ICRC error
      0x0002  2            8  R_ERR response for data FIS
      0x0003  2            8  R_ERR response for device-to-host data FIS
  --> Given this behaviour, possibly the ATA commands submit which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada7
  --> When the below errors happened are 100% unknown.  Just noting
      that here.
  --> SMART attribute 199 shows 13 CRC errors.  These would be
      caused by issues between the disk and the device its
      attached to (port multiplier I guess).  Causes could be
      bad SATA cables, bad ports, dirty/dusty ports, or flaky
      PCB (on the disk itself).
  --> SATA PHY log/counters confirms above problem:
      ID      Size     Value  Description
      0x0001  2           13  Command failed due to ICRC error
      0x0002  2           13  R_ERR response for data FIS
      0x0003  2           13  R_ERR response for device-to-host data FIS
  --> There are an additional two anomalies shown for this drive in
      the SATA PHY log as well:
      0x0005  2            1  R_ERR response for non-data FIS
      0x0006  2            1  R_ERR response for device-to-host non-data FIS
  --> Given this behaviour, possibly the ATA commands submit which
      experienced errors were NCQ-related.
  --> The NCQ command error log does have non-zero values in it.
      The format of the output is proprietary, sadly, and smartmontools
      does not know how to decode it.  But, compare it to your other
      drives and you'll see there is non-zero data there.
  --> This is a likely candidate for the behaviour seen on this PM.

ada8
  --> Looks great, no issues.


Now, here is an important fact:

I can see all of these drives are brand new or at least unused until
recently.  Attribute 9 confirms that.  So you have 5 drives, 4 of which
are seeing CRC errors, but all were roughly put into use at the same
time (188 to 234 power-on hours).

So, for drives ada4 through ada7:

You will probably need to "track these drives" on a regular basis.  That
is to say, set up some cronjob or similar that logs the above output to
a file (appends data to it), specifically output from smartctl -A (not
-a and not -x) and smartctl -l sataphy on a per-disk basis.  smartd can
track SMART attribute changes, but does not track GPLog changes.  Make
sure to put timestamps in your logs.

The next time you see READ LOG EXT errors, you should review your logs
and see if any of the counters I focused on above have increased.  If
so, you've found your likely candidate.

CRC errors of this nature are considered "communication errors", which
is what mav@ eluded to in his previous post.

As for fixing the problem: I have no idea how you would go about this.
Use of port multipliers involves additional cables, possibly of shoddy
quality, or other components which may not be decent/reliable.  CRC
errors can also be caused by electrical interference, and I HAVE seen
this be the root cause on a few systems, but those systems had SATA
cables that had been accidentally cut on internals of the case, thus had
bare copper exposed.

I do not know if cable length could explain the CRC errors when a PM is
in place.  E.g. controller <--> cable <--> PM <--> cable <--> disk.  I
do not think that the limits of SATA cable lengths applies to the TOTAL
length between controller and disk in the above topology, but I do not
know.  Someone more familiar with PMs would need to comment here.

You could simply just have a bad break-out cable, or a bad PM.

Overall, this is just one of the many reasons why I avoid PMs, as well
as avoid eSATA (especially eSATA).  CRC errors are one of the most
difficult and annoying things to track down because there are so many
possibilities for root causes, and many are "hardware-level" to the
point where practically EE people need to be involved.

Good luck.

-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |