From owner-freebsd-stable@FreeBSD.ORG Thu Feb 9 16:34:17 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 408A3106564A for ; Thu, 9 Feb 2012 16:34:17 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from qmta05.emeryville.ca.mail.comcast.net (qmta05.emeryville.ca.mail.comcast.net [76.96.30.48]) by mx1.freebsd.org (Postfix) with ESMTP id 1EF438FC14 for ; Thu, 9 Feb 2012 16:34:16 +0000 (UTC) Received: from omta21.emeryville.ca.mail.comcast.net ([76.96.30.88]) by qmta05.emeryville.ca.mail.comcast.net with comcast id XrqN1i0011u4NiLA5saGme; Thu, 09 Feb 2012 16:34:16 +0000 Received: from koitsu.dyndns.org ([67.180.84.87]) by omta21.emeryville.ca.mail.comcast.net with comcast id XsaF1i0081t3BNj8hsaFhc; Thu, 09 Feb 2012 16:34:16 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 73789102C1E; Thu, 9 Feb 2012 08:34:15 -0800 (PST) Date: Thu, 9 Feb 2012 08:34:15 -0800 From: Jeremy Chadwick To: Mike Tancsa Message-ID: <20120209163415.GA96451@icarus.home.lan> References: <4F32E289.4080806@sentex.net> <4F32F5B0.2060203@FreeBSD.org> <20120208223819.GA27488@icarus.home.lan> <4F32FB5E.7050102@FreeBSD.org> <4F33DB75.1080202@sentex.net> <20120209152240.GA95470@icarus.home.lan> <4F33F056.6070300@sentex.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F33F056.6070300@sentex.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Alexander Motin , freebsd-stable@FreeBSD.org Subject: Re: siisch1: Error while READ LOG EXT X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Feb 2012 16:34:17 -0000 On Thu, Feb 09, 2012 at 11:12:06AM -0500, Mike Tancsa wrote: > {snipping} > > > So which Port Multiplier did you add? The one at scbus0 or scbus1? > > 1 > at scbus1 target 0 lun 0 (pass5,ada4) > at scbus1 target 1 lun 0 (pass6,ada5) > at scbus1 target 2 lun 0 (pass7,ada6) > at scbus1 target 3 lun 0 (pass8,ada7) > at scbus1 target 4 lun 0 (pass9,ada8) > at scbus1 target 15 lun 0 (pass10,pmp0) I'll provide analysis for all 5 of these disks below. > > A full dmesg (not just a snippet) would probably be helpful here. What > > you provided in your first post was too terse, especially given how many > > disks you have in this system. :-) > > > > I really see no problem with looking at all disks -- specifically disks > > ada0 through ada3, and ada4 through ada8 -- to determine which one may > > be having problems. You're welcome to run "smartctl -a" on each one and > > put them up on the web, preferably segregated by disk name (e.g. > > ada0.txt, ada1.txt, etc.) and I can review them all. > > Actually, I just had a look at another server at our DR site. Its > hardware has not changed in a bit, but I did bring the kernel uptodate. > Its now logging the odd 'READ LOG EXT' error as well. Its kernel is > from Jan 22. Prior to that kernel update, I had not seen these errors. > Something in the driver (ahci or cam layer?) that has changed perhaps ? > > Feb 4 11:12:36 offsite kernel: siisch1: Error while READ LOG EXT Perhaps, but mav@ would be the authority on that. > http://www.tancsa.com/ahci.txt So here are the results of analysis for disks ada4 through ada8: ada4 --> When the below errors happened are 100% unknown. Just noting that here. --> SMART attribute 199 shows 13 CRC errors. These would be caused by issues between the disk and the device its attached to (port multiplier I guess). Causes could be bad SATA cables, bad ports, dirty/dusty ports, or flaky PCB (on the disk itself). --> SATA PHY log/counters confirms above problem: ID Size Value Description 0x0001 2 13 Command failed due to ICRC error 0x0002 2 13 R_ERR response for data FIS 0x0003 2 13 R_ERR response for device-to-host data FIS --> Given this behaviour, possibly the ATA commands submit which experienced errors were NCQ-related. --> The NCQ command error log does have non-zero values in it. The format of the output is proprietary, sadly, and smartmontools does not know how to decode it. But, compare it to your other drives and you'll see there is non-zero data there. --> This is a likely candidate for the behaviour seen on this PM. ada5 --> When the below errors happened are 100% unknown. Just noting that here. --> SMART attribute 199 shows 11 CRC errors. These would be caused by issues between the disk and the device its attached to (port multiplier I guess). Causes could be bad SATA cables, bad ports, dirty/dusty ports, or flaky PCB (on the disk itself). --> SATA PHY log/counters confirms above problem: ID Size Value Description 0x0001 2 11 Command failed due to ICRC error 0x0002 2 11 R_ERR response for data FIS 0x0003 2 11 R_ERR response for device-to-host data FIS --> Given this behaviour, possibly the ATA commands submit which experienced errors were NCQ-related. --> The NCQ command error log does have non-zero values in it. The format of the output is proprietary, sadly, and smartmontools does not know how to decode it. But, compare it to your other drives and you'll see there is non-zero data there. --> This is a likely candidate for the behaviour seen on this PM. ada6 --> When the below errors happened are 100% unknown. Just noting that here. --> SMART attribute 199 shows 8 CRC errors. These would be caused by issues between the disk and the device its attached to (port multiplier I guess). Causes could be bad SATA cables, bad ports, dirty/dusty ports, or flaky PCB (on the disk itself). --> SATA PHY log/counters confirms above problem: ID Size Value Description 0x0001 2 8 Command failed due to ICRC error 0x0002 2 8 R_ERR response for data FIS 0x0003 2 8 R_ERR response for device-to-host data FIS --> Given this behaviour, possibly the ATA commands submit which experienced errors were NCQ-related. --> The NCQ command error log does have non-zero values in it. The format of the output is proprietary, sadly, and smartmontools does not know how to decode it. But, compare it to your other drives and you'll see there is non-zero data there. --> This is a likely candidate for the behaviour seen on this PM. ada7 --> When the below errors happened are 100% unknown. Just noting that here. --> SMART attribute 199 shows 13 CRC errors. These would be caused by issues between the disk and the device its attached to (port multiplier I guess). Causes could be bad SATA cables, bad ports, dirty/dusty ports, or flaky PCB (on the disk itself). --> SATA PHY log/counters confirms above problem: ID Size Value Description 0x0001 2 13 Command failed due to ICRC error 0x0002 2 13 R_ERR response for data FIS 0x0003 2 13 R_ERR response for device-to-host data FIS --> There are an additional two anomalies shown for this drive in the SATA PHY log as well: 0x0005 2 1 R_ERR response for non-data FIS 0x0006 2 1 R_ERR response for device-to-host non-data FIS --> Given this behaviour, possibly the ATA commands submit which experienced errors were NCQ-related. --> The NCQ command error log does have non-zero values in it. The format of the output is proprietary, sadly, and smartmontools does not know how to decode it. But, compare it to your other drives and you'll see there is non-zero data there. --> This is a likely candidate for the behaviour seen on this PM. ada8 --> Looks great, no issues. Now, here is an important fact: I can see all of these drives are brand new or at least unused until recently. Attribute 9 confirms that. So you have 5 drives, 4 of which are seeing CRC errors, but all were roughly put into use at the same time (188 to 234 power-on hours). So, for drives ada4 through ada7: You will probably need to "track these drives" on a regular basis. That is to say, set up some cronjob or similar that logs the above output to a file (appends data to it), specifically output from smartctl -A (not -a and not -x) and smartctl -l sataphy on a per-disk basis. smartd can track SMART attribute changes, but does not track GPLog changes. Make sure to put timestamps in your logs. The next time you see READ LOG EXT errors, you should review your logs and see if any of the counters I focused on above have increased. If so, you've found your likely candidate. CRC errors of this nature are considered "communication errors", which is what mav@ eluded to in his previous post. As for fixing the problem: I have no idea how you would go about this. Use of port multipliers involves additional cables, possibly of shoddy quality, or other components which may not be decent/reliable. CRC errors can also be caused by electrical interference, and I HAVE seen this be the root cause on a few systems, but those systems had SATA cables that had been accidentally cut on internals of the case, thus had bare copper exposed. I do not know if cable length could explain the CRC errors when a PM is in place. E.g. controller <--> cable <--> PM <--> cable <--> disk. I do not think that the limits of SATA cable lengths applies to the TOTAL length between controller and disk in the above topology, but I do not know. Someone more familiar with PMs would need to comment here. You could simply just have a bad break-out cable, or a bad PM. Overall, this is just one of the many reasons why I avoid PMs, as well as avoid eSATA (especially eSATA). CRC errors are one of the most difficult and annoying things to track down because there are so many possibilities for root causes, and many are "hardware-level" to the point where practically EE people need to be involved. Good luck. -- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |