Date: Thu, 6 Nov 2014 00:32:40 +0100
From: Kai Gallasch <k@free.de>
To: freebsd-stable@freebsd.org
Subject: 10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Message-ID: <20141106003240.344dedf6@orwell>

Hi.

Not sure if this is 10.1 related or more a problem of the SSD model
and/or AHCI controller.

I am currently running 10.1 RC4 r273903 on a ZFS-on-root server with two
mirror pools. One of the pools is a mirror consisting of two Samsung SSD
850 PRO 512GB SSDs.

When I start a scrub on this pool, the result of the scrub is:

# zpool status -v ssdpool
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 73K in 0h8m with 0 errors on Thu Nov 6 00:00:16 2014
config:

        NAME              STATE     READ WRITE CKSUM
        ssdpool           ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            gpt/ssdpool0  ONLINE       0     0    17
            gpt/ssdpool1  ONLINE       0     0    29

When I do a 'zpool clear', the pool status looks OK again. But when I
start another zpool scrub, the same thing happens and the above status
"One or more devices has experienced an unrecoverable error" shows up
again.

I find the following kernel messages in the output of 'dmesg' (after
running zpool scrub two times):

ahcich2: Timeout on slot 15 port 0
ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 23 port 0
ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 3 port 0
ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command

Besides, smartctl shows no errors on ada2. Here is the output:

# smartctl -a -q noserial /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 10.1-RC4 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM01B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 6 00:02:04 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   068   000    Old_age   Always       -       30
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       400466433

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       147         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I wonder what the possible reason for this is. Both SSDs are new.

Is this a common problem with ZFS and SSDs (for example, AHCI timeouts
because of high data rates on the bus)?

K.

-- 
PGP-KeyID = 0xE401B671927D4A5C
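
A note on reading the kernel messages quoted above: the three timeout records can be decoded mechanically, which makes it easier to see which slots and which disk regions were involved. The following is a minimal Python sketch, not something from the original message. It assumes that the twelve ACB bytes are printed in the order command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp, and that "cs", "ss" and "rs" are per-slot bitmasks (command issue, SATA active, and the driver's running slots). The function names decode_acb and decode_status are made up for illustration.

```python
import re

def decode_acb(acb_hex):
    """Decode a 12-byte ACB dump (assumed field order, see note above)."""
    b = [int(x, 16) for x in acb_hex.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    # 48-bit LBA: low/mid/high bytes, then the three "exp" bytes.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    # For FPDMA (NCQ) commands the sector count is carried in the features fields.
    sectors = b[1] | b[9] << 8
    return {"command": b[0], "lba": lba, "sectors": sectors}

def bits(mask):
    """Return the list of bit positions set in a 32-bit mask."""
    return [i for i in range(32) if mask >> i & 1]

STATUS_RE = re.compile(r"cs ([0-9a-f]{8}) ss ([0-9a-f]{8}) rs ([0-9a-f]{8})")

def decode_status(line):
    """Decode the cs/ss/rs slot masks from an 'ahcichN: is ...' line."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("no cs/ss/rs fields found")
    cs, ss, rs = (int(g, 16) for g in m.groups())
    return {
        "cs": bits(cs),
        "ss": bits(ss),
        "rs": bits(rs),
        # Slots set in ss but not in cs.
        "ss_only": bits(ss & ~cs),
    }

# The three records from the dmesg output in this message.
records = [
    ("is 00000000 cs 000f0000 ss 000f8000 rs 000f8000", "60 8b a6 1d 56 40 0d 00 00 00 00 00"),
    ("is 00000000 cs 0f000000 ss 0f800000 rs 0f800000", "60 1b 23 81 bc 40 06 00 00 00 00 00"),
    ("is 00000000 cs 00000030 ss 00000038 rs 00000038", "60 26 bd 18 8e 40 12 00 00 00 00 00"),
]
for status, acb in records:
    s = decode_status(status)
    a = decode_acb(acb)
    print(s["ss_only"], hex(a["lba"]), a["sectors"])
```

Under these assumptions, the slot named in each "Timeout on slot N" line (15, 23, 3) is the one slot that is set in ss and rs but not in cs, and the three reads target different LBAs. That observation alone does not say whether the drive, the cabling or the controller is at fault.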