Date:        Mon, 19 Jul 2010 13:33:20 -0700
From:        Jeremy Chadwick <freebsd@jdc.parodius.com>
To:          Mike Tancsa <mike@sentex.net>
Cc:          freebsd-stable@freebsd.org
Subject:     Re: deadlock or bad disk ? RELENG_8
Message-ID:  <20100719203320.GB21088@icarus.home.lan>
In-Reply-To: <201007191237.o6JCbmj7049339@lava.sentex.ca>
References:  <201007182108.o6IL88eG043887@lava.sentex.ca> <20100718211415.GA84127@icarus.home.lan> <201007182142.o6ILgDQW044046@lava.sentex.ca> <20100719023419.GA91006@icarus.home.lan> <201007190301.o6J31Hs1045607@lava.sentex.ca> <20100719033424.GA92607@icarus.home.lan> <201007191237.o6JCbmj7049339@lava.sentex.ca>
On Mon, Jul 19, 2010 at 08:37:50AM -0400, Mike Tancsa wrote:
> At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:
> >>
> >> yes, da0 is a RAID volume with 4 disks behind the scenes.
> >
> >Okay, so can you get full SMART statistics for all 4 of those disks?
> >The adjusted/calculated values for SMART thresholds won't be helpful
> >here, one will need the actual raw SMART data.  I hope the Areca CLI can
> >provide that.
>
> I thought there was, but I cant seem to get the current smartctl to
> work with the card.
>
>        -d TYPE, --device=TYPE
>               Specifies the type of the device.  The valid arguments to this
>               option are ata, scsi, sat, marvell, 3ware,N, areca,N, usbcy-
>               press, usbjmicron, usbsunplus, cciss,N, hpt,L/M (or hpt,L/M/N),
>               and test.
>
> # smartctl -a -d areca,0 /dev/arcmsr0
> smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> /dev/arcmsr0: Unknown device type 'areca,0'
> =======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE],
> usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N,
> cciss,N, atacam, test <=======
>
> Use smartctl -h to get a usage summary

According to the official smartctl documentation and man page, the
"areca,N" argument is only supported on Linux.  Bummer.

  Areca SATA RAID controllers are currently supported under Linux only.
  To look at SATA disks behind Areca RAID controllers, use syntax such
  as:

    smartctl -a -d areca,2 /dev/sg2
    smartctl -a -d areca,3 /dev/sg3

> The latest CLI tool only gives this info
>
> CLI> disk info drv=1
> Drive Information
> ===============================================================
> IDE Channel                : 1
> Model Name                 : ST31000340AS
> Serial Number              : 3QJ07F1N
> Firmware Rev.              : SD15
> Disk Capacity              : 1000.2GB
> Device State               : NORMAL
> Timeout Count              : 0
> Media Error Count          : 0
> Device Temperature         : 29 C
> SMART Read Error Rate      : 108(6)
> SMART Spinup Time          : 91(0)
> SMART Reallocation Count   : 100(36)
> SMART Seek Error Rate      : 81(30)
> SMART Spinup Retries       : 100(97)
> SMART Calibration Retries  : N.A.(N.A.)
> ===============================================================
> GuiErrMsg<0x00>: Success.
>
> CLI> disk smart drv=1
> S.M.A.R.T Information For Drive[#01]
>   # Attribute Items                           Flag   Value  Thres  State
> ===============================================================================
>   1 Raw Read Error Rate                       0x0f   108    6      OK
>   3 Spin Up Time                              0x03   91     0      OK
>   4 Start/Stop Count                          0x32   100    20     OK
>   5 Reallocated Sector Count                  0x33   100    36     OK
>   7 Seek Error Rate                           0x0f   81     30     OK
>   9 Power-on Hours Count                      0x32   79     0      OK
>  10 Spin Retry Count                          0x13   100    97     OK
>  12 Device Power Cycle Count                  0x32   100    20     OK
> 194 Temperature                               0x22   29     0      OK
> 197 Current Pending Sector Count              0x12   100    0      OK
> 198 Off-line Scan Uncorrectable Sector Count  0x10   100    0      OK
> 199 Ultra DMA CRC Error Count                 0x3e   200    0      OK
> ===============================================================================
> GuiErrMsg<0x00>: Success.

Yeah, this isn't going to help much.  The raw SMART data isn't being
shown.  I downloaded the Areca CLI manual dated 2010/07 which doesn't
state anything other than what you've already shown.  Bummer.

> >If so, think about what would happen if heavy I/O happened on
> >both da0 and da1 at the same time.  I talk about this a bit more below.
>
> No different than any other single disk being heavily worked.
> Again, this particular hardware configuration has been beaten about
> for a couple of years.  So I am not sure why all of a sudden it would
> be not possible to do

That's a very good question, and I don't have an answer to it.  I also
would have a hard time believing that suddenly out of nowhere heavy I/O
would exhibit this problem.  I'm just going over possibilities.  For
example, I see that the da1 RAID volume is labelled "backup1", so if
you were storing backups there, possibly the I/O degrades over time as
a result of there being more data/files, etc...  Wouldn't have seen it
a year ago, but might see it now.  Just thinking out loud.

> >situation (since you'd then be dedicating an entire disk to just swap).
> >Others may have other advice.  You mention in a later mail that the
> >ada[0-3] disks make up a ZFS pool of some sort.  You might try splitting
> >ada0 into two slices, one for swap and the other used as a pool member.
>
> That seems like it would just move the problem you are trying to get
> me to avoid to a different set of disks.  If putting swap on a raid
> array is a bad thing, I am not sure how moving it to a ZFS raid
> array will help.

The idea wasn't to move swap to ZFS (that's a bad idea from what I
remember, something about crash dumps not working in that situation).
My idea was to move swap to a dedicated partition on a disk that
happens to also be used for ZFS.  E.g.:

  ada0
    ada0s1a =   20GB = swap
    ada0s1b =  980GB = ZFS pool
  ada1      = 1000GB = ZFS pool
  ada2      = 1000GB = ZFS pool
  ada3      = 1000GB = ZFS pool
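For completeness, the partitioning side of that would look roughly like
the sketch below.  This is untested and the specifics are my own
assumptions: I'm using GPT here, so the pieces come out as ada0p1 and
ada0p2 rather than the s1a/s1b labels above, and the 20GB figure is
just the number from the example.

  # Rough sketch only; device names and sizes are illustrative.
  gpart create -s gpt ada0               # put a GPT scheme on the disk
  gpart add -t freebsd-swap -s 20G ada0  # -> ada0p1, swap
  gpart add -t freebsd-zfs ada0          # -> ada0p2, rest of the disk for ZFS
  swapon /dev/ada0p1                     # enable the new swap now
  echo '/dev/ada0p1 none swap sw 0 0' >> /etc/fstab   # and at boot

The catch is getting ada0p2 back into the pool in place of the bare
disk: the partition is smaller than the whole drive, so in practice
that probably means recreating the pool and restoring its contents
rather than doing a quick "zpool replace".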
Again, this isn't a solution for the problem.  I'm in no way trying to
dissuade anyone from figuring out the root cause.  But quite often on
the list, if someone can't get an answer to "why", they want to know
what they can do as a workaround.  There just happen to be reports of
this problem going all the way back to RELENG_6, and all the posts I've
read so far have been from people who had swap backed by some sort of
RAID.

> >Again: I don't think this is necessarily a bad disk problem.  The only
> >way you'd be able to determine that would be to monitor on a per-disk
> >basis the I/O response time of each disk member on the Areca.  If the
> >CLI tools provide this, awesome.  Otherwise you'll probably need to
> >involve Areca Support.
>
> In the past when I have had bad disks on the areca, it did catch and
> flag device timeouts.  There were no such alerts leading up to this
> situation.

Yeah, which makes it sound more like a driver issue or something.  I
really don't know what to say.  Areca does officially support FreeBSD,
so they might have some ideas.

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |