Date: Mon, 19 Jul 2010 08:37:50 -0400
From: Mike Tancsa <mike@sentex.net>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: deadlock or bad disk ? RELENG_8
Message-ID: <201007191237.o6JCbmj7049339@lava.sentex.ca>
In-Reply-To: <20100719033424.GA92607@icarus.home.lan>
References: <201007182108.o6IL88eG043887@lava.sentex.ca> <20100718211415.GA84127@icarus.home.lan> <201007182142.o6ILgDQW044046@lava.sentex.ca> <20100719023419.GA91006@icarus.home.lan> <201007190301.o6J31Hs1045607@lava.sentex.ca> <20100719033424.GA92607@icarus.home.lan>
At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:

> > yes, da0 is a RAID volume with 4 disks behind the scenes.
>
>Okay, so can you get full SMART statistics for all 4 of those disks?
>The adjusted/calculated values for SMART thresholds won't be helpful
>here, one will need the actual raw SMART data.  I hope the Areca CLI can
>provide that.

I thought there was a way, but I can't seem to get the current smartctl
to work with the card.

       -d TYPE, --device=TYPE
              Specifies the type of the device.  The valid arguments to
              this option are ata, scsi, sat, marvell, 3ware,N, areca,N,
              usbcypress, usbjmicron, usbsunplus, cciss,N, hpt,L/M (or
              hpt,L/M/N), and test.

# smartctl -a -d areca,0 /dev/arcmsr0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/arcmsr0: Unknown device type 'areca,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE], usbcypress[,X],
usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N, cciss,N, atacam,
test <=======
Use smartctl -h to get a usage summary

The latest CLI tool only gives this info:

CLI> disk info drv=1
Drive Information
===============================================================
IDE Channel                : 1
Model Name                 : ST31000340AS
Serial Number              : 3QJ07F1N
Firmware Rev.              : SD15
Disk Capacity              : 1000.2GB
Device State               : NORMAL
Timeout Count              : 0
Media Error Count          : 0
Device Temperature         : 29 C
SMART Read Error Rate      : 108(6)
SMART Spinup Time          : 91(0)
SMART Reallocation Count   : 100(36)
SMART Seek Error Rate      : 81(30)
SMART Spinup Retries       : 100(97)
SMART Calibration Retries  : N.A.(N.A.)
===============================================================
GuiErrMsg<0x00>: Success.
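The per-drive counters in the `disk info` output above (Timeout Count,
Media Error Count) can at least be watched from a script.  A minimal
sketch, assuming the output format quoted above; the `check_counters`
helper is a hypothetical name, not part of the Areca CLI:

```shell
#!/bin/sh
# Hypothetical helper: scan Areca CLI "disk info" output for nonzero
# error counters.  The field names match the output quoted above; this
# is an illustration, not an official Areca tool.
check_counters() {
    awk -F': *' '
        /Timeout Count|Media Error Count/ {
            sub(/ +$/, "", $1)          # trim padding from the label
            if ($2 + 0 != 0) { print "ALERT:", $1, "=", $2; bad = 1 }
        }
        END { exit bad }                # nonzero exit if anything tripped
    '
}

# Typical use would be something like:
#   cli64 disk info drv=1 | check_counters || mail -s "areca alert" root
printf 'Timeout Count : 0\nMedia Error Count : 0\n' | check_counters \
    && echo "counters clean"
```

Run periodically from cron, a nonzero exit status would flag a drive
before the controller drops it, under the assumption that the CLI's
field names stay stable across firmware revisions.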
CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
  # Attribute Items                            Flag   Value  Thres  State
===============================================================================
  1 Raw Read Error Rate                        0x0f   108    6      OK
  3 Spin Up Time                               0x03   91     0      OK
  4 Start/Stop Count                           0x32   100    20     OK
  5 Reallocated Sector Count                   0x33   100    36     OK
  7 Seek Error Rate                            0x0f   81     30     OK
  9 Power-on Hours Count                       0x32   79     0      OK
 10 Spin Retry Count                           0x13   100    97     OK
 12 Device Power Cycle Count                   0x32   100    20     OK
194 Temperature                                0x22   29     0      OK
197 Current Pending Sector Count               0x12   100    0      OK
198 Off-line Scan Uncorrectable Sector Count   0x10   100    0      OK
199 Ultra DMA CRC Error Count                  0x3e   200    0      OK
===============================================================================
GuiErrMsg<0x00>: Success.
CLI>

The obvious ones (timeout, media error, etc.) are all zero.

>Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
>actually share the same physical disks on the Areca controller.  Is that
>correct?

Yes.

>If so, think about what would happen if heavy I/O happened on
>both da0 and da1 at the same time.  I talk about this a bit more below.

No different than any other single disk being heavily worked.  Again,
this particular hardware configuration has been beaten on for a couple
of years, so I am not sure why it would suddenly stop being able to
cope with the same load.

> > Prior to someone rebooting it, it had been stuck in this state for a
> > good 90min.  Apart from upgrading to a later RELENG_8 to get the
> > security patches, the machine had been running a few versions of
> > RELENG_8 doing the same workloads every week without issue.
>
>Then I would say you'd need to roll back kernel+world to a previous date
>and try to figure out when the issue began, if that is indeed the case.

Possibly.  The box only gets a heavy workout periodically, when it does
an rsync to our DR site.

>It would also help if you could provide timestamps of those messages;
>are they all happening at once, or gradual over time?
>If over time, do
>they all happen around the same time every day, etc.?  You see where I'm
>going with this.

Every couple of seconds, I think.  If it happens again, I will time it.

>situation (since you'd then be dedicating an entire disk to just swap).
>Others may have other advice.  You mention in a later mail that the
>ada[0-3] disks make up a ZFS pool of some sort.  You might try splitting
>ada0 into two slices, one for swap and the other used as a pool member.

That seems like it would just move the problem you are trying to get me
to avoid onto a different set of disks.  If putting swap on a RAID array
is a bad thing, I am not sure how moving it to a ZFS RAID array will
help.

>Again: I don't think this is necessarily a bad disk problem.  The only
>way you'd be able to determine that would be to monitor on a per-disk
>basis the I/O response time of each disk member on the Areca.  If the
>CLI tools provide this, awesome.  Otherwise you'll probably need to
>involve Areca Support.

In the past, when I have had bad disks on the Areca, it did catch and
flag device timeouts.  There were no such alerts leading up to this
situation.

        ---Mike

--------------------------------------------------------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet since 1994    www.sentex.net
Cambridge, Ontario Canada        www.sentex.net/mike
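PS: for the per-disk response-time monitoring suggested above, stock
iostat(8) gets partway there (it only sees the da/ada devices the kernel
exposes, not the individual members behind the Areca).  A hedged sketch:
the column layout (device r/s w/s kr/s kw/s wait svc_t %b for `iostat -x`
on 8.x) and hence the $7 field index are assumptions to check against
your release's header line, and `flag_slow_disks` is an illustrative
name, not an existing tool:

```shell
#!/bin/sh
# Hedged sketch: flag devices whose average service time (svc_t, in ms)
# exceeds a threshold.  Assumes the FreeBSD 8-era `iostat -x` column
# order: device r/s w/s kr/s kw/s wait svc_t %b -- verify $7 is svc_t
# on your release before trusting the output.
flag_slow_disks() {
    awk -v limit="${1:-100}" '
        $1 ~ /^a?da[0-9]+$/ && $7 + 0 > limit {
            print $1 " svc_t=" $7 "ms"
        }'
}

# Live use (runs until interrupted, sampling every 5 seconds):
#   iostat -x 5 | flag_slow_disks 100
```

Sustained high svc_t on one member-facing device while the others stay
quiet is the kind of asymmetry that would point at a single sick disk
rather than a general load problem.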