Date: Fri, 9 Feb 2007 00:37:44 +1100 (EST)
From: Ian Smith <smithi@nimnet.asn.au>
To: Richard Lynch <ceo@l-i-e.com>
In-Reply-To: <33987.216.230.84.67.1170885541.squirrel@www.l-i-e.com>
Message-ID: <Pine.BSF.3.96.1070208215816.20114A-100000@gaia.nimnet.asn.au>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: perryh@pluto.rain.com, freebsd-questions@freebsd.org
Subject: Re: READ_DMA48 error interpretation

On Wed, 7 Feb 2007, Richard Lynch wrote:
 > [I've tried to snip away a lot of stuff, without losing any context...]

I'll prune a bit too, but will backtrack to earlier context, so thanks.

 > On Tue, February 6, 2007 2:50 am, Ian Smith wrote:
 > > On Mon, 5 Feb 2007 01:13:31 -0600 (CST) Richard Lynch <ceo@l-i-e.com>
 > > wrote:
 > >  > On Tue, January 16, 2007 3:21 pm, Chuck Swiger wrote:
 > >  > > On Jan 16, 2007, at 1:13 PM, Richard Lynch wrote:
 > >  > ...
 > >  > >> +ad1: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=404955007
 > >  > >> +ad1: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=404955007
 > >  > >> +g_vfs_done():ad1s1[READ(offset=207336931328, length=16384)]error = 5

Looks like a not ready error maybe.  The only value in your ad1.txt that
looks like it's ever been anywhere near any error threshold is ID# 11,
Calibration_Retry_Count, and its current value is fine.  Power glitch?
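
For what it's worth, the hex status and error values in that log line can
be decoded bit by bit, and the byte offset in the g_vfs_done() line can be
tied back to the LBA.  A rough Python sketch follows.  The bit names other
than the four shown in your log are the standard ATA register
abbreviations as I recall them, and the 63-sector start of slice ad1s1 is
an assumption (the conventional value), not something read from your box.

```python
# ATA status and error register bits, highest bit first so the output
# order matches the kernel message.  Only READY, DSC, ERROR and
# NID_NOT_FOUND appear in the quoted log; the other names are standard
# ATA abbreviations and are an assumption here.
STATUS_BITS = [
    (0x80, "BUSY"), (0x40, "READY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERROR"),
]
ERROR_BITS = [
    (0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "NID_NOT_FOUND"),
    (0x08, "MCR"), (0x04, "ABRT"), (0x02, "NM"),
]

SECTOR_SIZE = 512          # bytes per sector
ASSUMED_SLICE_START = 63   # assumption: ad1s1 starts at sector 63


def decode(value, table):
    """Return the names of all bits set in a register value."""
    return [name for mask, name in table if value & mask]


def offset_to_lba(byte_offset, slice_start=ASSUMED_SLICE_START):
    """Convert a byte offset within the slice to an absolute disk LBA."""
    return byte_offset // SECTOR_SIZE + slice_start


print(decode(0x51, STATUS_BITS))    # status=51 from the log (hex)
print(decode(0x10, ERROR_BITS))     # error=10 from the log (hex)
print(offset_to_lba(207336931328))  # offset from the g_vfs_done() line
```

With that assumed slice start the offset works out to 404955007, the same
LBA the ata driver reported, so both messages concern the same sector.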

Are you getting any other hard looking errors in /var/log/messages?  Is
fsck happy?  It never hurts to run 'fsck -n' whenever you feel the urge.

 > >  > > Try installing the sysutils/smartmontools port and run a drive
 > > self-
 > 
 > >  > I ran the short test on the problem drives, and it said everything
 > >  > was fine.
 > >  >
 > >  > I'll try the long test at a later date.

Only your ad3.txt referred to below shows a (short) test having been
completed and logged.  You might check the smartctl -a results after
running at least short tests initially (looks like the long ones will
take 4-5 hours for your 4 drives) as Chuck has since suggested.

 > >  > #2. Sequences like this show up a fair amount:
 > >  > Device: /dev/ad2, SMART Prefailure Attribute: 3 Spin_Up_Time changed from 152 to 153
 > >  > Device: /dev/ad2, SMART Prefailure Attribute: 3 Spin_Up_Time changed from 153 to 152
 > >  > Device: /dev/ad0, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 250

I'm not sure of the degree of logging you're having smartd use here, but
these small changes of value, especially up and down by 1 but a long way
from any error threshold, seem to be excessive and relatively trivial
(perhaps debug-level detail?), i.e. most likely nothing of any concern.

I suggest reading man smartctl under '-A, --attributes' and then you'll
know as much as I do about what these may mean, and maybe worry less ..
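
If you'd rather sift those smartd messages than read them all, something
like the Python sketch below would pick the device, attribute and old/new
values out of each line and flag the one-step wobbles.  The regular
expression is modelled only on the three messages you quoted, and the
max_delta cutoff of 1 is my own arbitrary choice, not a smartd setting.
Note it can't judge distance from the threshold, since the log lines
don't carry that.

```python
import re

# Pattern modelled on the quoted smartd messages; other smartd message
# formats are not handled (assumption: one message per line).
CHANGE_RE = re.compile(
    r"Device: (?P<dev>[^,\s]+), SMART (?P<kind>\w+) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_change(line):
    """Return a dict describing an attribute change, or None if no match."""
    m = CHANGE_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),
        "kind": m.group("kind"),
        "id": int(m.group("id")),
        "name": m.group("name"),
        "old": int(m.group("old")),
        "new": int(m.group("new")),
    }


def is_trivial(change, max_delta=1):
    """True when the normalised value moved by at most max_delta."""
    return abs(change["new"] - change["old"]) <= max_delta


SAMPLES = [
    "Device: /dev/ad2, SMART Prefailure Attribute: 3 Spin_Up_Time changed from 152 to 153",
    "Device: /dev/ad0, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 250",
]

for line in SAMPLES:
    change = parse_change(line)
    if change is not None:
        print(change["dev"], change["name"], change["old"], "->",
              change["new"], "trivial" if is_trivial(change) else "look closer")
```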

 > Here are all the smartctl -a outputs:
 > 
 > http://l-i-e.com/ad0.txt
 > http://l-i-e.com/ad1.txt
 > http://l-i-e.com/ad2.txt
 > http://l-i-e.com/ad3.txt
 > 
 > ad3 is giving the most errors...
 > ad1 gives a fair amount though

Do you mean according to that fine-detail logging of attribute changes?
Or real read/write/seek etc. errors being logged to messages?

 > And the ad0 and ad2 seem to be giving the spinup errors.

None of those reports seem to indicate any problems really, though if
anyone else cares to peek and notices any anomalies, I'm all eyes.

As for temperatures, the readings for all 4 drives seem very cool, but
then it is winter over there .. Temperature_Celsius for ad0 to ad3 being
36, 27, 22 and 18 degrees C, each present and worst value well clear of
error thresholds .. did you interpret those values as temperatures?

 > ad0 is pretty much full
 > ad1 is the one I'm filling up currently
 > ad2 and ad3 have no actual content on them yet, but will "soon"
 > 
 > All the drives are kind of in an old PC tower (XT? AT???), except the
 > outer casing is, errr, not there...  Just the framework.

Might be worth checking that your power supply is up to handling 4 big
drives, but they weren't running more than mildly warm when reported.

 > ad2 and ad3 are in one of these Thermaltake iCage things:
 > http://www.performance-pcs.com/catalog/index.php?main_page=product_info&cPath=257&products_id=3533
 > which converts the old-school floppy drive[s] bay into an IDE bay, and
 > puts a big honking fan blowing on them.

These too were running nice and cool, 22 and 18C, when reported.  Cf my
40GB laptop drive (at smartctl version 5.36 [i386-portbld-freebsd5.5],
rather more recent than your 5.33 freebsd6.0) this afternoon:

 194 Temperature_Celsius  0x0022  100  100  000   Old_age  Always  -  40 (Lifetime Min/Max 13/49)

 > I'm not claiming it's "good enough" but I tried.
 > 
 > I left the iCage "bay" between them empty for airflow/cooling.
 > 
 > ad0 and ad1 are in the usual IDE bay of a tower.
 > I have a fan in there, but without the cover to shape the airflow,
 > perhaps that is not doing much useful...

Perhaps it wasn't properly warmed up when you ran those reports, but on
the data you've provided you don't have any sort of temperature problem. 

 > I can touch the exposed front and back top (above IDE cable) and lay
 > my finger along it.  It's "hot" but not like, "ouch hot" :-)

Over 70C or so is too hot to touch except momentarily.  You're cool :)

 > I don't think it's 100C+ hot, as that's boiling -- but perhaps the
 > thermometer is somewhere inside or...
 > 
 > Seems more likely, though, that that number is Fahrenheit (sp?) and
 > not Celcius..

The VALUE and WORST numbers don't measure temperature, but the drive's
idea of its own scale of 'toohottedness'; none of them show a problem.
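
To make the distinction concrete, here's a small Python sketch that splits
an attribute line such as my laptop's above into its columns and reports
how far the normalised VALUE sits above THRESH.  It assumes the usual
ten-column 'smartctl -A' layout (ID, name, flag, value, worst, thresh,
type, updated, when_failed, raw); the SmartAttr and margin names are just
mine for illustration.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalised value, not a physical unit
    worst: int    # lowest normalised value recorded
    thresh: int   # failure threshold for the normalised value
    raw: str      # vendor-specific raw field, kept as text


def parse_attr(line):
    """Split one attribute line into fields (assumes the 10-column layout)."""
    f = line.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[9].strip())


def margin(attr):
    """Headroom between the normalised value and its threshold."""
    return attr.value - attr.thresh


LINE = ("194 Temperature_Celsius  0x0022  100  100  000   "
        "Old_age  Always  -  40 (Lifetime Min/Max 13/49)")

attr = parse_attr(LINE)
print(attr.name, "value", attr.value, "thresh", attr.thresh,
      "margin", margin(attr), "raw:", attr.raw)
```

In that line the degrees show up in the raw column, while VALUE and WORST
are the drive's own normalised scores.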

 > [..]

Relax :) but portupgrade your smartmontools, and recheck it in summer.

The only figures that look a tad high to me are the "195
Hardware_ECC_Recovered" values on the 2 Samsungs (ad1 and ad3), but
neither drive thinks they're a problem (value/worst/thresh 100/100/0),
and they could well be byteswapped - search for 'Samsung' in smartctl(8)
about that possibility.
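
In case 'byteswapped' is unclear, the Python fragment below shows what
swapping the two bytes of a 16-bit quantity does to a number.  The sample
value is made up purely for illustration and is not from your reports;
which fields (if any) are affected on your drives is for smartctl(8) to
say, not this sketch.

```python
def swap16(value):
    """Swap the two bytes of a 16-bit quantity."""
    return int.from_bytes(value.to_bytes(2, "little"), "big")


example = 0x0100  # hypothetical value, not taken from any drive report
print(hex(example), "read with its bytes swapped is", hex(swap16(example)))
```

A small count stored the wrong way round can thus read as a much larger
one, which is why a big-looking figure needn't mean much by itself.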

Cheers, Ian