FreeBSD Mail Archives

Date:      Thu, 6 Nov 2014 00:32:40 +0100
From:      Kai Gallasch <k@free.de>
To:        freebsd-stable@freebsd.org
Subject:   10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Message-ID:  <20141106003240.344dedf6@orwell>


Hi.

I am not sure whether this is related to 10.1 or is rather a problem
with the SSD model and/or the AHCI controller.

I am currently running 10.1 RC4 r273903 on a ZFS-on-root server with two
mirror pools. One of the pools is a mirror consisting of two Samsung
SSD 850 PRO 512GB drives.

When I start a zpool scrub on this pool, the result is:

# zpool status -v ssdpool
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 73K in 0h8m with 0 errors on Thu Nov  6 00:00:16 2014
config:

	NAME              STATE     READ WRITE CKSUM
	ssdpool           ONLINE       0     0     0
	  mirror-0        ONLINE       0     0     0
	    gpt/ssdpool0  ONLINE       0     0    17
	    gpt/ssdpool1  ONLINE       0     0    29

When I run 'zpool clear', the pool status looks OK again. But when I
start another zpool scrub, the same thing happens and the above status
"One or more devices has experienced an unrecoverable error" is shown
again.


I find the following kernel messages in the output of 'dmesg' (after
running zpool scrub twice):


ahcich2: Timeout on slot 15 port 0
ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 23 port 0
ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 3 port 0
ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
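
The cs, ss and rs values in the three timeout reports above are 32-bit
masks with one bit per command slot. The following is a minimal sketch,
not part of the original report, that turns those masks into slot
numbers. It assumes that cs is the port's command-issue register, ss the
SATA active register and rs the driver's own list of running slots; the
helper name slots() and the reports list are made up for illustration.

```python
def slots(mask_hex):
    """Return the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if (mask >> bit) & 1]


# (timed-out slot, cs, ss) copied from the three dmesg reports.
# The rs value equals ss in every report, so it is not repeated here.
reports = [
    (15, "000f0000", "000f8000"),
    (23, "0f000000", "0f800000"),
    (3, "00000030", "00000038"),
]

for slot, cs, ss in reports:
    # Slots that are still marked active in ss but no longer set in cs.
    ss_only = sorted(set(slots(ss)) - set(slots(cs)))
    print(f"slot {slot}: cs={slots(cs)} ss={slots(ss)} ss-only={ss_only}")
```

In each of the three reports the slot named in the "Timeout on slot"
line is the only one set in ss but not in cs.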


Also, smartctl shows no errors on ada2.
Here is the output:

# smartctl -a -q noserial /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 10.1-RC4 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM01B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov  6 00:02:04 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  33) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   068   000    Old_age   Always       -       30
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       400466433

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       147         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I wonder what the possible reason for this is. Both SSDs are new.
Is this a common problem with ZFS and SSDs (for example, AHCI timeouts
because the data rate is too high for the bus)?

K.

-- 
PGP-KeyID = 0xE401B671927D4A5C



