Date: Tue, 16 Sep 2008 11:54:01 -0700 From: Jeremy Chadwick <koitsu@FreeBSD.org> To: Clint Olsen <clint.olsen@gmail.com> Cc: stable@freebsd.org Subject: Re: Help debugging DMA_READ errors Message-ID: <20080916185401.GA71275@icarus.home.lan> In-Reply-To: <20080916181903.GC7540@0lsen.net> References: <20080916170452.GB4861@0lsen.net> <20080916175858.GA70396@icarus.home.lan> <20080916181903.GC7540@0lsen.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Sep 16, 2008 at 11:19:03AM -0700, Clint Olsen wrote: > On Sep 16, Jeremy Chadwick wrote: > > acd0 is a CD/DVD drive. ad4 is a hard disk. What exactly were you > > doing with the system at the time these errors appeared? Were you using > > the CD/DVD drive? Was there a disc in the drive that was mounted? > > If none of these things, I'm baffled as to what would read acd0 and > > cause what you see here. > > I was not at the system at the time. I never have had a disk in the drive > nor is /cdrom mounted currently. I have dump backups that run in the > middle of the night on the various filesystems. That's very strange then. Something definitely tried to utilise acd0 at that hour of the night. What is acd0 connected to, ATA-wise? Again, I assume it's PATA, but I'd like to know the primary/secondary and master/slave organisation, since you are using a PATA disk too. > > Can you please provide full details of what these disks are connected > > to? I'd like to see dmesg output for ata0, ata2, and ata3, as well as > > the atapci devices those ataX devices are attached to, ditto with > > vmstat -i output. Are there any other errors in your logs around > > that time (e.g. watchdog timeouts of any kind on network devices, etc.?) > > # dmesg | grep -i ata > atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0 > ata0: <ATA channel 0> on atapci0 > ata1: <ATA channel 1> on atapci0 > atapci1: <Intel ICH5 SATA150 controller> port 0xeff0-0xeff7,0xefe4-0xefe7,0xefa8-0xefaf,0xefe0-0xefe3,0xef60-0xef6f irq 18 at device 31.2 on pci0 > ata2: <ATA channel 0> on atapci1 > ata3: <ATA channel 1> on atapci1 Looks fine, although I swore ATA controllers listed their IRQs. atapci0 doesn't appear to have an IRQ associated with it (should be 14 or 15), so that's a little odd to me. vmstat -i would help here. > See attached. Okay, there are some problems with your disks, but it's going to be impossible for me to determine if the below problems caused what you saw. First, ad0: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE > 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 7 > 194 Temperature_Celsius 0x0022 111 253 000 Old_age Always - 39 > 196 Reallocated_Event_Count 0x0032 193 193 000 Old_age Always - 7 You have 7 bad sectors on ad0, and all 7 of those sectors have been successfully remapped (meaning spare sectors on the disk (there is a limited/reserved number of them) have been used). Sector reallocation on ATA and SATA disks happens transparently, meaning the OS is never told by the disk/controller what's going on behind the scenes. However, reallocation can take time -- consider the situation where a bad sector/block cannot be read or written to. The disk may attempt to re-read/write that sector/block a few times before giving up, marking it bad, and using a reallocated spare. FreeBSD will begin spitting out DMA errors after 5 full seconds of not receiving responses to ATA commands. So, if the above reallocations caused what you saw, then the reallocations took >5 seconds. Since you don't have SMART stats for your disks **before** this issue began, we're not going to be able to determine if the reallocations are what caused it. This is why some people run smartd. That said: I do not see any sign of DMA read/write errors in the SMART log. That's good. The temperature of this drive is also acceptable (39C), which is also okay. On to ad4: > 194 Temperature_Celsius 0x0022 104 253 000 Old_age Always - 46 > 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 2 There are two CRC errors reported here, detected during DMA. These could have resulted in data corruption on your drive; CRC errors cannot correct data, it's just a simple way of determining if what data was sent or received was intact. However, there's no sign of DMA errors in the SMART log. I'm not sure what to make of that; I really would expect there to be some. Additionally, this disk is running at 46C, which borders on the high end of warm. You may want to check cooling or general airflow around the drive. I strongly doubt temperature is the problem here though. Regarding both disks: Some statistics are "offline", which means those attributes/stats will not be updated until you run a short/long/offline SMART test. Do not let the word "offline" make you think the drive is actually taken offline -- it isn't. I would recommend two things: 1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4. You can safely use the disks during this time. After a few minutes (depends on how much disk I/O is happening; the more I/O, the longer the test takes to complete), you should see an entry in the SMART self-test log saying Completed. Once you see that, you should run smartctl -a on the disk again, and see if the attributes labelled "Offline" are different than they were before. 2) Consider running smartd. I do not normally advocate this, but in your case, it may be the only way to see which attribute values are actually changing on you if/when the issue happens again. Any time a value changes, it'll be logged via syslog. You can set up smartd.conf to ignore certain attributes (e.g. temperature, since that has a tendency to fluctuate up and down a degree). If/when this happens again, you should be able to look at your logs and see what counters have changed. For example if you see something like Power_Cycle_Count or Stop_Start_Count increase, you have disks which are losing power. Welcome to the pain of debugging disk problems. :-) -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080916185401.GA71275>