Date: Tue, 16 Sep 2008 16:16:55 -0700
From: Clint Olsen
To: Jeremy Chadwick
Cc: stable@freebsd.org
Subject: Re: Help debugging DMA_READ errors
Message-ID: <20080916231655.GC19665@0lsen.net>
In-Reply-To: <20080916185401.GA71275@icarus.home.lan>

On Sep 16, Jeremy Chadwick wrote:
> That's very strange then.  Something definitely tried to utilise acd0 at
> that hour of the night.  What is acd0 connected to, ATA-wise?  Again, I
> assume it's PATA, but I'd like to know the primary/secondary and
> master/slave organisation, since you are using a PATA disk too.

What's the best way to give you this?  Generally with disks I try to
separate them from DVD/CD drives, so I don't think they are on the same
chain.  Is the question whether or not the DVD/CD is a slave to the PATA
disk?

acd0: CDRW at ata1-master UDMA33

> Looks fine, although I swore ATA controllers listed their IRQs.  atapci0
> doesn't appear to have an IRQ associated with it (should be 14 or 15),
> so that's a little odd to me.  vmstat -i would help here.

interrupt                          total       rate
irq1: atkbd0                          14          0
irq6: fdc0                             1          0
irq12: psm0                         1624          0
irq14: ata0                       410187         14
irq15: ata1                       225418          7
irq18: uhci2+                     111881          3
irq22: skc0                       260062          9
cpu0: timer                     56551841       1999
Total                           57561028       2035

> Okay, there are some problems with your disks, but it's going to be
> impossible for me to determine if the below problems caused what you saw.
> First, ad0:

I just freed up a 300G SATA disk, so I can swap out the PATA drive if you
think it's worth the effort.

> 1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4.  You can safely use
> the disks during this time.  After a few minutes (depends on how much
> disk I/O is happening; the more I/O, the longer the test takes to
> complete), you should see an entry in the SMART self-test log saying
> Completed.  Once you see that, you should run smartctl -a on the disk
> again, and see if the attributes labelled "Offline" are different than
> they were before.
>
> 2) Consider running smartd.  I do not normally advocate this, but in
> your case, it may be the only way to see which attribute values are
> actually changing on you if/when the issue happens again.  Any time a
> value changes, it'll be logged via syslog.  You can set up smartd.conf
> to ignore certain attributes (e.g. temperature, since that has a
> tendency to fluctuate up and down a degree).

I'm looking at that.  The sample conf file that comes with it isn't the
easiest on the eyes, so I haven't figured out what configuration I want or
how to set it up yet.
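
For reference, the smartd setup suggested in the quoted text (log attribute
changes, but ignore temperature) might look roughly like the sketch below.
It is only a starting point under stated assumptions: the device names are
the two disks named above (/dev/ad0 and /dev/ad4), and attributes 190 and
194 are assumed to be the temperature attributes to ignore, as in the
smartctl output in this message.  The file location depends on how
smartmontools was installed.

```
# smartd.conf sketch (assumed devices: /dev/ad0 and /dev/ad4).
# Comment out any DEVICESCAN line in the sample file first; smartd
# ignores every line that follows a DEVICESCAN entry.
#
# -a      default monitoring set, which includes tracking changes in
#         attribute values and reporting them via syslog
# -I ID   ignore attribute ID when tracking changes (here the two
#         temperature attributes, which fluctuate by a degree or so)
/dev/ad0 -a -I 190 -I 194
/dev/ad4 -a -I 190 -I 194
```

With something like this in place, any change in the remaining attributes
(Power_Cycle_Count, for example) should show up in the system log with a
timestamp, which makes it easier to line up against the DMA errors.
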
My external hard drive is running around 50 in that small external
enclosure.  That sounds bad.

190 Airflow_Temperature_Cel 0x0022   050   043   045    Old_age   Always   In_the_past 50 (Lifetime Min/Max 32/53)
194 Temperature_Celsius     0x0022   050   057   000    Old_age   Always       -       50 (0 21 0 0)

> If/when this happens again, you should be able to look at your logs and
> see what counters have changed.  For example if you see something like
> Power_Cycle_Count or Stop_Start_Count increase, you have disks which are
> losing power.
>
> Welcome to the pain of debugging disk problems. :-)

Thanks :)

-Clint