Date: Tue, 16 Sep 2008 22:43:26 -0700
From: Jeremy Chadwick
To: Clint Olsen
Cc: stable@freebsd.org
Subject: Re: Help debugging DMA_READ errors
Message-ID: <20080917054326.GC81776@icarus.home.lan>
In-Reply-To: <20080916231655.GC19665@0lsen.net>

On Tue, Sep 16, 2008 at 04:16:55PM -0700, Clint Olsen wrote:
> On Sep 16, Jeremy Chadwick wrote:
> > That's very strange then. Something definitely tried to utilise acd0 at
> > that hour of the night. What is acd0 connected to, ATA-wise? Again, I
> > assume it's PATA, but I'd like to know the primary/secondary and
> > master/slave organisation, since you are using a PATA disk too.
>
> What's the best way to give you this? Generally with disks I try to
> separate them from DVD/CD drives, so I don't think they are on the same
> chain. Is the question whether or not the DVD/CD is a slave to the PATA
> disk?

Correct. I wanted to see if it was on the same primary or secondary
controller as the ad0 disk which emitted errors.

> acd0: CDRW at ata1-master UDMA33

...and it doesn't appear to be. Taken from your previous mails:

ad0: 114473MB at ata0-master UDMA100
acd0: CDRW at ata1-master UDMA33

What this confirms is that there are two separate PATA cables (one for
the ad0 disk, sitting on primary-master on IRQ 14, and one for the acd0
DVD drive, sitting on secondary-master on IRQ 15).

So that would mean, in the case of "bad cables", you would have *three*
separate cables (2xPATA, 1xSATA) which would all have gone bad at the
same time. This is highly, highly unlikely.

> > Looks fine, although I swore ATA controllers listed their IRQs. atapci0
> > doesn't appear to have an IRQ associated with it (should be 14 or 15),
> > so that's a little odd to me. vmstat -i would help here.
>
> interrupt                          total       rate
> irq1: atkbd0                          14          0
> irq6: fdc0                             1          0
> irq12: psm0                         1624          0
> irq14: ata0                       410187         14
> irq15: ata1                       225418          7
> irq18: uhci2+                     111881          3
> irq22: skc0                       260062          9
> cpu0: timer                     56551841       1999
> Total                           57561028       2035

IRQ sharing is in effect, despite an APIC being used. But I doubt this
is an interrupt problem.
IRQ18 is also shared with at least one other device; it's definitely
shared with the USB controller, but the "+" indicates there are even
more devices associated with the IRQ.

Piecing together things from previous mails:

ad0  is on ata0 (which is atapci0, Intel ICH5 UDMA100 controller; IRQ 14)
acd0 is on ata1 (which is atapci0, Intel ICH5 UDMA100 controller; IRQ 15)
ad4  is on ata2 (which is atapci1, Intel ICH5 SATA150 controller; IRQ 18)
ad6  is on ata3 (which is atapci1, Intel ICH5 SATA150 controller; IRQ 18)

> > Okay, there are some problems with your disks, but it's going to be
> > impossible for me to determine if the below problems caused what you saw.
> > First, ad0:
>
> I just freed up a 300G SATA disk, so I can swap out the PATA drive if you
> think it's worth the effort.

With regards to ad0, it's entirely your call. I'm pedantic about bad
blocks, even if they've been remapped successfully, but that's just me.
Others are more relaxed about it all.

> > 1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4. You can safely use
> > the disks during this time. After a few minutes (depends on how much
> > disk I/O is happening; the more I/O, the longer the test takes to
> > complete), you should see an entry in the SMART self-test log saying
> > Completed. Once you see that, you should run smartctl -a on the disk
> > again, and see if the attributes labelled "Offline" are different than
> > they were before.
> >
> > 2) Consider running smartd. I do not normally advocate this, but in
> > your case, it may be the only way to see which attribute values are
> > actually changing on you if/when the issue happens again. Any time a
> > value changes, it'll be logged via syslog. You can set up smartd.conf
> > to ignore certain attributes (e.g. temperature, since that has a
> > tendency to fluctuate up and down a degree).
>
> I'm looking at that. The sample conf file that comes with it isn't the
> easiest on the eyes, so I haven't figure out what configuration I want or
> how to set it up yet.

The example configuration is overzealous with comments and is badly
formatted, making it difficult to read.

The simple version:

If smartd sees the string DEVICESCAN (before any disk definitions),
it'll simply probe SMART stats periodically for all disks attached at
the time smartd was started. (If disk definitions are seen first, then
it ignores DEVICESCAN from that point forward.) The problem with
DEVICESCAN is that you can't give each device its own flags (see below).

Each disk is configured on its own line in the config. The flags you
can pass it do many different things: ignore certain changing attributes
(-I), send mail to an address on attribute change (-m), and many other
things -- see smartd.conf(5).

> My external hard drive is running around 50 in that small external
> enclosure. That sounds bad.
>
> 190 Airflow_Temperature_Cel 0x0022   050   043   045    Old_age   Always   In_the_past 50 (Lifetime Min/Max 32/53)
> 194 Temperature_Celsius     0x0022   050   057   000    Old_age   Always       -       50 (0 21 0 0)

I covered this in another mail; yes, the temperature is of concern, but
it's not causing the DMA errors you're seeing on other disks. :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
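
For reference, a minimal smartd.conf sketch of the per-disk setup
described above. It is only an illustration, not a tested configuration:
the device names /dev/ad0 and /dev/ad4 are taken from this thread, while
the file location, the choice of -a, the ignored attribute (194,
Temperature_Celsius) and the "root" mail address are all assumptions.
Check every directive against smartd.conf(5) before relying on it.

```
# Hypothetical smartd.conf (on a FreeBSD ports install this is assumed
# to live at /usr/local/etc/smartd.conf).
#
# No DEVICESCAN line is used, so each disk can carry its own flags.
#
#   -a       monitor all SMART properties of the disk
#   -I 194   ignore attribute 194 (Temperature_Celsius) when tracking
#            attribute changes, since it drifts by a degree or so
#   -m root  mail warnings to root (the address is an assumption)
/dev/ad0 -a -I 194 -m root
/dev/ad4 -a -I 194 -m root
```

With one line per disk, a flag such as -I can be dropped or changed for
a single drive without affecting the other, which is exactly what
DEVICESCAN does not allow.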