From owner-freebsd-stable@FreeBSD.ORG  Tue Sep 16 23:17:05 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C7B611065670
	for <stable@freebsd.org>; Tue, 16 Sep 2008 23:17:05 +0000 (UTC)
	(envelope-from clint@0lsen.net)
Received: from belle.0lsen.net (belle.0lsen.net [75.150.32.89])
	by mx1.freebsd.org (Postfix) with ESMTP id 9632A8FC13
	for <stable@freebsd.org>; Tue, 16 Sep 2008 23:17:05 +0000 (UTC)
	(envelope-from clint@0lsen.net)
Received: by belle.0lsen.net (Postfix, from userid 1001)
	id 7C2CD7962D; Tue, 16 Sep 2008 16:16:55 -0700 (PDT)
Date: Tue, 16 Sep 2008 16:16:55 -0700
From: Clint Olsen <clint.olsen@gmail.com>
To: Jeremy Chadwick <koitsu@FreeBSD.org>
Message-ID: <20080916231655.GC19665@0lsen.net>
References: <20080916170452.GB4861@0lsen.net>
	<20080916175858.GA70396@icarus.home.lan>
	<20080916181903.GC7540@0lsen.net>
	<20080916185401.GA71275@icarus.home.lan>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20080916185401.GA71275@icarus.home.lan>
User-Agent: Mutt/1.4.2.3i
Organization: NULlsen Network
X-Disclaimer: Mutt Bites!
X-0lsen-net-MailScanner-Information: Please contact the ISP for more
	information
X-MailScanner-ID: 7C2CD7962D.57A00
X-0lsen-net-MailScanner: Found to be clean
X-0lsen-net-MailScanner-From: clint@0lsen.net
X-Spam-Status: No
Cc: stable@freebsd.org
Subject: Re: Help debugging DMA_READ errors
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Sep 2008 23:17:05 -0000

On Sep 16, Jeremy Chadwick wrote:
> That's very strange then.  Something definitely tried to utilise acd0 at
> that hour of the night.  What is acd0 connected to, ATA-wise?  Again, I
> assume it's PATA, but I'd like to know the primary/secondary and
> master/slave organisation, since you are using a PATA disk too.

What's the best way to give you this?  Generally with disks I try to
separate them from DVD/CD drives, so I don't think they are on the same
chain.  Is the question whether or not the DVD/CD is a slave to the PATA
disk?

acd0: CDRW <Hewlett-Packard DVD Writer 100/1.37> at ata1-master UDMA33
 
> Looks fine, although I swore ATA controllers listed their IRQs.  atapci0
> doesn't appear to have an IRQ associated with it (should be 14 or 15),
> so that's a little odd to me.  vmstat -i would help here.
 
interrupt                          total       rate
irq1: atkbd0                          14          0
irq6: fdc0                             1          0
irq12: psm0                         1624          0
irq14: ata0                       410187         14
irq15: ata1                       225418          7
irq18: uhci2+                     111881          3
irq22: skc0                       260062          9
cpu0: timer                     56551841       1999
Total                           57561028       2035

> Okay, there are some problems with your disks, but it's going to be
> impossible for me to determine if the below problems caused what you saw.
> First, ad0:

I just freed up a 300G SATA disk, so I can swap out the PATA drive if you
think it's worth the effort.

> 1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4.  You can safely use
> the disks during this time.  After a few minutes (depends on how much
> disk I/O is happening; the more I/O, the longer the test takes to
> complete), you should see an entry in the SMART self-test log saying
> Completed.  Once you see that, you should run smartctl -a on the disk
> again, and see if the attributes labelled "Offline" are different than
> they were before.
> 
> 2) Consider running smartd.  I do not normally advocate this, but in
> your case, it may be the only way to see which attribute values are
> actually changing on you if/when the issue happens again.  Any time a
> value changes, it'll be logged via syslog.  You can set up smartd.conf
> to ignore certain attributes (e.g. temperature, since that has a
> tendency to fluctuate up and down a degree).
 
I'm looking at that.  The sample conf file that comes with it isn't the
easiest on the eyes, so I haven't figured out what configuration I want or
how to set it up yet.

My external hard drive is running at around 50 C in that small external
enclosure.  That sounds bad.

190 Airflow_Temperature_Cel 0x0022   050   043   045    Old_age   Always In_the_past 50 (Lifetime Min/Max 32/53)
194 Temperature_Celsius     0x0022   050   057   000    Old_age   Always -       50 (0 21 0 0)
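
For reference, the two rows above can be read mechanically.  The sketch
below is only an illustration: it assumes the usual smartctl -A column
order (ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED,
RAW_VALUE), and the names SmartAttr, parse_attr and crossed_threshold are
made up for this example.  It shows why attribute 190 is flagged
In_the_past: its worst normalized value (43) has dipped to or below its
threshold (45), while attribute 194 has a threshold of 0, i.e. none.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    """One row of a SMART attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int        # current normalized value
    worst: int        # worst normalized value recorded
    thresh: int       # failure threshold (0 means no threshold)
    when_failed: str  # "-", "In_the_past" or "FAILING_NOW"
    raw: str          # raw value column, kept as text


def parse_attr(line: str) -> SmartAttr:
    # Assumed column order:
    # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    # The raw column may contain spaces, so split at most 9 times.
    f = line.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[8], f[9].strip())


def crossed_threshold(attr: SmartAttr) -> bool:
    # True when the worst normalized value has reached the threshold.
    return attr.thresh > 0 and attr.worst <= attr.thresh


# The two rows pasted above, copied as-is.
ROWS = [
    "190 Airflow_Temperature_Cel 0x0022   050   043   045    Old_age   "
    "Always In_the_past 50 (Lifetime Min/Max 32/53)",
    "194 Temperature_Celsius     0x0022   050   057   000    Old_age   "
    "Always -       50 (0 21 0 0)",
]

for row in ROWS:
    attr = parse_attr(row)
    print(attr.name, attr.raw.split()[0], crossed_threshold(attr))
```

The first word of the raw column is the temperature in degrees Celsius for
both attributes here, which is where the 50 comes from.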

> If/when this happens again, you should be able to look at your logs and
> see what counters have changed.  For example if you see something like
> Power_Cycle_Count or Start_Stop_Count increase, you have disks which are
> losing power.
> 
> Welcome to the pain of debugging disk problems.  :-)
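
The before/after comparison described in the quoted text can also be done
by hand from two saved snapshots.  This is a minimal sketch, not what
smartd does internally: the function name changed_counters and all of the
numbers are hypothetical, chosen only to show the idea of diffing raw
values while ignoring attributes that fluctuate, such as temperature.

```python
from typing import Dict, Iterable, Tuple


def changed_counters(before: Dict[str, int],
                     after: Dict[str, int],
                     ignore: Iterable[str] = ()) -> Dict[str, Tuple[int, int]]:
    """Return {name: (old, new)} for attributes whose raw value changed."""
    skip = set(ignore)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and name not in skip and before[name] != after[name]
    }


# Hypothetical raw values from two snapshots of the same disk.
before = {"Power_Cycle_Count": 120, "Reallocated_Sector_Ct": 0,
          "Temperature_Celsius": 49}
after = {"Power_Cycle_Count": 121, "Reallocated_Sector_Ct": 0,
         "Temperature_Celsius": 50}

# Temperature is ignored because it drifts by a degree or so on its own.
print(changed_counters(before, after, ignore=["Temperature_Celsius"]))
```

With these made-up numbers only Power_Cycle_Count shows up as changed,
which is the kind of result that would point at a disk losing power.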

Thanks :)

-Clint