Date:      Mon, 11 Oct 2004 08:19:07 -0400
From:      Eduard Martinescu <martines@rochester.rr.com>
To:        Ion-Mihai Tetcu <itetcu@apropo.ro>
Cc:        current@freebsd.org
Subject:   Re: TIMEOUT - WRITE_DMA and smart questions
Message-ID:  <1097497147.29958.3.camel@sauron.crafts4life.com>
In-Reply-To: <20041011140931.7934d78b@it.buh.cameradicommercio.ro>
References:  <20041011140931.7934d78b@it.buh.cameradicommercio.ro>

Ion-Mihai,

For more information on smartmontools (smartctl, smartd), check out the
SourceForge site, http://smartmontools.sourceforge.net

If you have specific questions, you can email the support list (link on
the page above).
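
As a rough starting point for reading the attribute table in the smartctl
output quoted below, here is a minimal Python sketch. It assumes the usual
smartctl -a column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH,
TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The WATCH set and the review()
function are hypothetical names chosen for illustration, and the choice of
which raw counters to watch is an assumption, not something smartmontools
prescribes.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table from smartctl -a.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Assumption: counters where a non-zero raw value is worth a closer look.
WATCH = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# A few rows copied from the quoted smartctl output.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
"""


def review(text):
    """Return (attribute_name, note) pairs for rows that deserve attention."""
    findings = []
    for line in text.splitlines():
        line = line.lstrip("> ")  # tolerate mail-quoted input
        m = ROW.match(line)
        if not m:
            continue
        name = m.group(2)
        value, thresh, raw = int(m.group(3)), int(m.group(5)), int(m.group(9))
        if thresh > 0 and value <= thresh:
            # Normalized value at or below the vendor threshold.
            findings.append((name, "failing: value %d <= thresh %d" % (value, thresh)))
        elif name in WATCH and raw > 0:
            findings.append((name, "watch: raw count %d" % raw))
    return findings


if __name__ == "__main__":
    for name, note in review(SAMPLE):
        print("%-24s %s" % (name, note))
```

To run it against a live disk, feed it the text of smartctl -a /dev/ad0
instead of SAMPLE. A non-zero raw count is only a hint to keep an eye on
the trend over time; it is not a verdict on the drive.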

Ed

On Mon, 2004-10-11 at 07:09, Ion-Mihai Tetcu wrote:
> [ please reply only on questions@ if this is not appropriate for current@ ]
> 
> Hi,
> 
> While doing nothing special, the system started printing TIMEOUT -
> WRITE_DMA errors and eventually, after an atacontrol mode 0 PIO4 PIO4,
> hung completely at 04:20.
> 
> After the restart I've got a few more TIMEOUTs but no hang; however, the
> machine is idle.
> 
> SMART was enabled as seen below, but smartd wasn't running (stupid, huh
> :-/ ).
> 
> Obvious question: is the hdd dying ?
> 
> Second question, as I'm not familiar with SMART: how much can one trust
> SMART reports ?
> 
> Third question: could you suggest some settings for smartd ? I'm asking
> this because I don't fully understand the man pages for smartctl and
> smartd; a link explaining more about SMART would also be appreciated.
> 
> 
> System details:
> 
> Local system status (last daily mail):
>  3:01AM  up 2 days, 11:56, 2 users, load averages: 1.04, 1.07, 0.95
> 
>  % uname -a
> FreeBSD it.buh.cameradicommercio.ro 5.3-BETA7 FreeBSD 5.3-BETA7 #3: Mon Oct  4 21:57:25 EEST 2004     root@it.buh.tecnik93.com:/usr/obj/usr/src/sys/IT53_d  i386
> 
> Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
> Oct 11 04:07:02 it kernel: ata0: reiniting channel ..
> Oct 11 04:07:02 it kernel: ata0: reset tp1 mask=03 ostat0=d0 ostat1=d0
> Oct 11 04:07:02 it kernel: ad0: stat=0xd0 err=0xd0 lsb=0xd0 msb=0xd0
> Oct 11 04:07:02 it last message repeated 95 times
> Oct 11 04:07:02 it kernel: ad0: stat=0x50 err=0x01 lsb=0x00 msb=0x00
> Oct 11 04:07:02 it kernel: ata0-slave:  stat=0x00 err=0x01 lsb=0x00 msb=0x00
> Oct 11 04:07:02 it kernel: ata0: reset tp2 stat0=50 stat1=00 devices=0x1<ATA_MASTER>
> Oct 11 04:07:02 it kernel: ata0: resetting done ..
> Oct 11 04:07:02 it kernel: ad0: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
> Oct 11 04:07:02 it kernel: ad0: setting PIO4 on VIA 8235 chip
> Oct 11 04:07:02 it kernel: ad0: setting UDMA100 on VIA 8235 chip
> Oct 11 04:07:02 it kernel: ata0: device config done ..
> Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): error 22
> Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): Unretryable Error
> Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): error 22
> Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): Unretryable Error
> .........
> 
>  # grep LBA /var/log/messages
> Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
> Oct 11 04:07:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165839908
> Oct 11 04:08:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165849220
> Oct 11 04:09:12 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165851556
> Oct 11 04:09:32 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165859748
> Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
> Oct 11 04:11:23 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210916
> Oct 11 04:11:36 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186211044
> Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=0
> Oct 11 04:13:21 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309294340
> Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156
> Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156
> Oct 11 04:15:04 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421796
> Oct 11 04:15:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=130261540
> Oct 11 04:16:10 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421892
> Oct 11 04:16:53 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=173918724
> Oct 11 04:18:50 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309924420
> Oct 11 04:19:14 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4920283
> Oct 11 04:40:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4918975
> Oct 11 04:40:56 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6067199
> Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
> 
>  # grep sw /var/log/messages
> Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s1e, blkno: 14841, size: 4096
> Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 14381, size: 4096
> Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 60732, size: 4096
> Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33481, size: 4096
> Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33488, size: 4096
> 
> 
> 
> The disk is:
>  # atacontrol cap 0 0
> ATA channel 0, Master, device ad0:
> 
> Protocol              ATA/ATAPI revision 6
> device model          WDC WD1600JB-00EVA0
> serial number         WD-WCAEK1298992
> firmware revision     15.05R15
> cylinders             16383
> heads                 16
> sectors/track         63
> lba supported         268435455 sectors
> lba48 supported       312579695 sectors
> dma supported
> overlap not supported
> 
> Feature                      Support  Enable    Value   Vendor
> write cache                    yes      no
> read ahead                     yes      yes
> dma queued                     no       no      0/0x00
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      no       no      0/0x00
> automatic acoustic management  yes      yes     254/0xFE        128/0x80
> 
>  # smartctl -a /dev/ad0
> smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF INFORMATION SECTION ===
> Device Model:     WDC WD1600JB-00EVA0
> Serial Number:    WD-WCAEK1298992
> Firmware Version: 15.05R15
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Mon Oct 11 12:37:32 2004 EEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> The SMART RETURN STATUS return value (smartmontools -H option/Directive)
>  can not be retrieved with this version of ATAng, please do not rely on this value
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x05) Offline data collection activity
>                                         was aborted by an interrupting command from host.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (  40) The self-test routine was interrupted
>                                         by the host with a hard or soft reset.
> Total time to complete Offline
> data collection:                 (5061) seconds.
> Offline data collection
> capabilities:                    (0x79) SMART execute Offline immediate.
>                                         No Auto Offline data collection support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         No General Purpose Logging support.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        (  67) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   155   147   021    Pre-fail  Always       -       2775
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       464
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
>   7 Seek_Error_Rate         0x000b   200   199   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3360
>  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       462
> 194 Temperature_Celsius     0x0022   124   253   000    Old_age   Always       -       26
> 196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
> 200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended captive    Interrupted (host reset)      80%        77         -
> # 2  Extended offline    Aborted by host               90%        77         -
> # 3  Conveyance offline  Completed without error       00%        76         -
> # 4  Short offline       Completed without error       00%        76         -
> # 5  Conveyance offline  Completed without error       00%       233         -
> # 6  Short captive       Interrupted (host reset)      90%       233         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> 
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> 
> Thanks,
-- 
Eduard Martinescu <martines@rochester.rr.com>


