Date: Fri, 9 Feb 2007 00:37:44 +1100 (EST)
From: Ian Smith
To: Richard Lynch
Cc: perryh@pluto.rain.com, freebsd-questions@freebsd.org
In-Reply-To: <33987.216.230.84.67.1170885541.squirrel@www.l-i-e.com>
Subject: Re: READ_DMA48 error interpretation

On Wed, 7 Feb 2007, Richard Lynch wrote:

> [I've tried to snip away a lot of stuff, without losing any context...]

I'll prune a bit too, but will backtrack to earlier context, so thanks.

> On Tue, February 6, 2007 2:50 am, Ian Smith wrote:
> > On Mon, 5 Feb 2007 01:13:31 -0600 (CST) Richard Lynch wrote:
> > > On Tue, January 16, 2007 3:21 pm, Chuck Swiger wrote:
> > > > On Jan 16, 2007, at 1:13 PM, Richard Lynch wrote:
> > > ...
> > > >> +ad1: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=404955007
> > > >> +ad1: FAILURE - READ_DMA48 status=51
> > > >> error=10
> > > >> LBA=404955007
> > > >> +g_vfs_done():ad1s1[READ(offset=207336931328, length=16384)]error = 5

Looks like a not ready error maybe.  The only value in your ad1.txt that
looks like it's ever been anywhere near any error threshold is ID# 11,
Calibration_Retry_Count, and its current value is fine.  Power glitch?

Are you getting any other hard-looking errors in /var/log/messages?  Is
fsck happy?  It never hurts to run 'fsck -n' whenever you feel the urge.

> > > > Try installing the sysutils/smartmontools port and run a drive
> > > > self-
> > >
> > > I ran the short test on the problem drives, and it said everything
> > > was fine.
> > >
> > > I'll try the long test at a later date.

Only your ad3.txt, referred to below, shows a (short) test having been
completed and logged.  You might check the smartctl -a results after
running at least short tests initially (looks like the long ones will
take 4-5 hours for your 4 drives), as Chuck has since suggested.

> > > #2. Sequences like this show up a fair amount:
> > > Device: /dev/ad2, SMART Prefailure Attribute: 3 Spin_Up_Time
> > > changed from 152 to 153
> > > Device: /dev/ad2, SMART Prefailure Attribute: 3 Spin_Up_Time
> > > changed from 153 to 152
> > > Device: /dev/ad0, SMART Prefailure Attribute: 8
> > > Seek_Time_Performance changed from 251 to 250

I'm not sure of the degree of logging you're having smartd use here, but
reports of these small changes of value, especially up and down by 1 and
a long way from any error threshold, seem excessive and relatively
trivial (perhaps debug-level detail?), i.e. most likely nothing of any
concern.

I suggest reading man smartctl under '-A, --attributes' and then you'll
know as much as I do about what these may mean, and maybe worry less ..
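To make the value-versus-threshold point concrete, here is a rough Python sketch of how one row of 'smartctl -A' output can be read. It is not part of smartmontools; the names SmartAttr, parse_attr_row, margin and has_failed are made up for illustration. It assumes the usual column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and the rule from smartctl(8) that an attribute has failed when its normalized value is less than or equal to its threshold. The sample row is the Temperature_Celsius line from the laptop drive quoted further down in this message.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    """One row of the 'smartctl -A' attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int    # normalized VALUE column (not a physical unit)
    worst: int    # lowest normalized value the drive has recorded
    thresh: int   # vendor threshold; failed when value <= thresh
    raw: str      # RAW_VALUE column, vendor-specific (here: degrees C)


def parse_attr_row(row: str) -> SmartAttr:
    # Assumed column order:
    # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    # Split at most 9 times so the raw value keeps its embedded spaces.
    f = row.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[9].strip())


def margin(attr: SmartAttr) -> int:
    """Distance of the normalized value above its threshold."""
    return attr.value - attr.thresh


def has_failed(attr: SmartAttr) -> bool:
    """True when the normalized value has reached or crossed the threshold."""
    return attr.value <= attr.thresh


row = ("194 Temperature_Celsius 0x0022 100 100 000 "
       "Old_age Always - 40 (Lifetime Min/Max 13/49)")
attr = parse_attr_row(row)
print(attr.name, "margin:", margin(attr), "failed:", has_failed(attr))
print("raw value:", attr.raw)
```

Seen this way, a Spin_Up_Time value wobbling between 152 and 153 only matters if it is getting close to that attribute's own threshold, which is listed in the THRESH column of the smartctl -a output.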
> Here are all the smartctl -a outputs:
>
> http://l-i-e.com/ad0.txt
> http://l-i-e.com/ad1.txt
> http://l-i-e.com/ad2.txt
> http://l-i-e.com/ad3.txt
>
> ad3 is giving the most errors...
> ad1 gives a fair amount though

Do you mean according to that fine-detail logging of attribute changes?
Or real read/write/seek etc. errors being logged to messages?

> And the ad0 and ad2 seem to be giving the spinup errors.

None of those reports seems to indicate any problems really, though if
anyone else cares to peek and notices any anomalies, I'm all eyes.

As for temperatures, the readings for all 4 drives seem very cool, but
then it is winter over there .. Temperature_Celsius for ad0 to ad3 being
36, 27, 22 and 18 degrees C, with each present and worst value well
clear of error thresholds .. did you interpret those values (present and
worst) as temperatures?

> ad0 is pretty much full
> ad1 is the one I'm filling up currently
> ad2 and ad3 have no actual content on them yet, but will "soon"
>
> All the drives are kind of in an old PC tower (XT? AT???), except the
> outer casing is, errr, not there... Just the framework.

Might be worth checking that your power supply is up to handling 4 big
drives, but they weren't running more than mildly warm when reported.

> ad2 and ad3 are in one of these Thermaltake iCage things:
> http://www.performance-pcs.com/catalog/index.php?main_page=product_info&cPath=257&products_id=3533
> which converts the old-school floppy drive[s] bay into an IDE bay, and
> puts a big honking fan blowing on them.

These too were running nice and cool, 22 and 18C, when reported.  Cf. my
40GB laptop drive (at smartctl version 5.36 [i386-portbld-freebsd5.5],
rather more recent than your 5.33 freebsd6.0) this afternoon:

  194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 40 (Lifetime Min/Max 13/49)

> I'm not claiming it's "good enough" but I tried.
>
> I left the iCage "bay" between them empty for airflow/cooling.
>
> ad0 and ad1 are in the usual IDE bay of a tower.
> I have a fan in there, but without the cover to shape the airflow,
> perhaps that is not doing much useful...

Perhaps it wasn't properly warmed up when you ran those reports, but on
the data you've provided you don't have any sort of temperature problem.

> I can touch the exposed front and back top (above IDE cable) and lay
> my finger along it. It's "hot" but not like, "ouch hot" :-)

Over 70C or so is too hot to touch except momentarily.  You're cool :)

> I don't think it's 100C+ hot, as that's boiling -- but perhaps the
> thermometer is somewhere inside or...
>
> Seems more likely, though, that that number is Fahrenheit (sp?) and
> not Celcius..

The VALUE and WORST numbers don't measure temperature; they are the
drive's idea of its own scale of 'toohottedness', and none of them shows
a problem.

> [..]

Relax :) but portupgrade your smartmontools, and recheck it in summer.

The only figures that look a tad high to me are the "195
Hardware_ECC_Recovered" readings on the 2 Samsungs (ad1 and ad3), but
neither drive thinks they're a problem (value/worst/thresh 100/100/0),
and they could well be byteswapped - search for 'Samsung' in smartctl(8)
about that possibility.

Cheers, Ian