Date: Sat, 28 Mar 2015 06:13:57 +0000
From: Ken Moffat <zarniwhoop@ntlworld.com>
To: CK <nibbana@gmx.us>
Cc: freebsd-questions@freebsd.org
Subject: Re: smartctl
Message-ID: <20150328061357.GA18597@milliways>
In-Reply-To: <0LzskF-1ZWnak3ftL-0150PB@mail.gmx.com>
References: <0LzskF-1ZWnak3ftL-0150PB@mail.gmx.com>

On Fri, Mar 27, 2015 at 09:05:29PM -0800, CK wrote:
> Regarding the unexpected loss of files from the filesystem under various
> loads, is the appended 'smartctl' data sufficient to make the determination
> that the loss of files while the operating system is in use could be due to
> the condition of the drive?
>

Drives fail. Sometimes smartctl reports problems _if_ you run the
tests, other times they fail suddenly. The drive is old (only
40GB), so although the hours are only 12540 (roughly 520 days) I
suspect it might have been running "round the clock". Apparently it
is a 5400rpm PATA drive - I used to use a pair of 5400rpm drives for
RAID1 on a previous server, but I think I bought those 6 or more
years ago, and even then they were 320GB. So old age seems a
possible answer.

> I didn't think so at first, because:
>
> 1) I would expect a FreeBSD error to the effect of "unable to read/write
> /dev/ada0" or "block checksum does not match block data".
>
> 2) I would expect that all data read/written to from a drive is verified to be
> correct by FreeBSD with checksums, and that it is guaranteed to be correct
> if there are no serious and fatal errors reported by the operating system.

I cannot comment on that (except in VMs I'm a Linux user), but if
the drive's write cache is enabled then technically all bets are off
- most modern drives will do that to improve throughput. You can
also get filesystem errors, and unfortunate use of 'rm -rf'.

>
> But I may be wrong in these assumptions. Anybody know for sure? I have never
> seen FreeBSD report any filesystem r/w errors. My past experience has only
> taught me that when a drive begins to make very bad noises, this generally
> accompanies obvious and serious problems; and that a drive fails when the
> mechanical parts fail, but not due to wear on heads/platters or other things
> that may cause failures that are not detected/reported by the operating
> system.
>

My experience is limited (starting with two or three machines,
mostly with one drive each, through to the current day where I have
4 desktop machines with one drive each, and a machine used as a
server with 3 drives). But recently I seem to have to replace at
least one drive every year (although the last one was "just in case"
because the SMART checks were often reporting unreadable sectors -
not permanent errors, and it was in RAID-1 so ok while the other one
still worked), and I've discarded others because they became too slow
or too antiquated (IDE, SATAv1).

But I would seriously suggest that if you have installed
smartmontools then you ought to run some of the tests - on a server
I tend to run long tests daily, at a time when I hope it is quiet,
but on desktops less frequently. For a laptop I probably only run
them when I think about it and know it will be on mains power.

> I can't see how the loss of files could occur without FreeBSD noticing it and
> reporting on it. Does FreeBSD just trust drives to do everything correctly
> at all times?
>
> --
>
> smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Caviar WDxxxAB
> Device Model:     WDC WD400AB-22CDB0
> Serial Number:    WD-WMA9T1222658
> Firmware Version: 22.04A22
> User Capacity:    40,020,664,320 bytes [40.0 GB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA/ATAPI-5 (minor revision not indicated)
> Local Time is:    Fri Mar 27 20:35:32 2015 AKDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
>                                         was suspended by an interrupting command from host.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 ( 2376) seconds.
> Offline data collection
> capabilities:                    (0x3b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         No Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         No General Purpose Logging support.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        (  42) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   102   099   021    Pre-fail  Always       -       3975
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       58
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1

I've had recent drives which started to give problems
(particularly, unreadable sectors) around the time the Reallocated
Sector Count became non-zero.

>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12540
>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
> 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
>

I would try running some self-tests.

>
> Selective Self-tests/Logging not supported
>
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"

ĸen
-- 
Nanny Ogg usually went to bed early. After all, she was an old lady.
Sometimes she went to bed as early as 6 a.m.
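
For anyone who wants to watch the reallocation counters mentioned above without reading the table by eye, here is a minimal sketch in Python. It only parses text in the layout of the attribute table quoted in this message; it does not call smartctl itself. The choice of attributes to watch (5, 196, 197, 198) and the names parse_attributes and flag_attributes are assumptions made for illustration, not anything defined by smartmontools.

```python
import re

# Attribute IDs to watch for signs of media trouble. This set is an
# assumption for the sketch, not an official smartmontools list.
WATCHED_IDS = {5, 196, 197, 198}

# A few rows copied from the attribute table quoted in the message.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
"""

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only a leading integer raw value is captured).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return a list of dicts, one per attribute row found in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def flag_attributes(rows, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is non-zero."""
    return [r for r in rows if r["id"] in watched and r["raw"] > 0]


if __name__ == "__main__":
    for r in flag_attributes(parse_attributes(SAMPLE)):
        print(f'{r["id"]:>3} {r["name"]}: raw value {r["raw"]}')
```

On the sample rows this flags Reallocated_Sector_Ct and Reallocated_Event_Count, the two counters that are non-zero in the quoted output. A non-zero count is a hint to run the self-tests, not proof that the drive caused the lost files.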