Date: Wed, 06 May 2020 20:21:57 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 224496] mpr and mps drivers seems to have issues with large seagate drives
Message-ID: <bug-224496-227-gfJzjX1g5r@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-224496-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-224496-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496

--- Comment #41 from Sharad Ahlawat <freebsd@diyit.org> ---
(In reply to Sharad Ahlawat from comment #28)

Symptom: CAM retry and timeout messages leading to controller aborts and resets

Cause: slow disks or SMR disks

Workaround: increase the CAM timeout defaults

❯ sysctl kern.cam.da.default_timeout=90
kern.cam.da.default_timeout: 60 -> 90
❯ sysctl kern.cam.ada.default_timeout=60
kern.cam.ada.default_timeout: 30 -> 60

And disable NCQ on the SMR Seagates:

❯ cat cam-tags.sh
#!/usr/local/bin/bash
# shrink the Native Command Queue depth down to 1, effectively disabling queuing
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d")" -f1`; do
    camcontrol tags $Disk -N 1
    # camcontrol tags $Disk -v
done

If you only have SMRs in your setup and use a UPS you could also:

❯ sysctl vfs.zfs.cache_flush_disable=1
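If you want these workaround settings to survive a reboot, something along these lines should do it (a sketch, not part of my original setup; /root/cam-tags.sh is just an example path for the script above, and on my understanding vfs.zfs.cache_flush_disable could also be set from /boot/loader.conf):

❯ cat /etc/sysctl.conf
# give slow/SMR disks more time before CAM declares a timeout
kern.cam.da.default_timeout=90
kern.cam.ada.default_timeout=60
# only if every data disk is SMR and the box is on a UPS
vfs.zfs.cache_flush_disable=1

❯ cat /etc/rc.local
# re-apply the NCQ depth of 1 after every boot, once CAM has enumerated the disks
/root/cam-tags.sh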
Solution: don't use slow disks or SMR disks with ZFS

The long version:

I am obliged to post this update, given the driver downgrade workaround I previously posted on this thread before getting to the root cause of these log entries, which appeared in my messages file after upgrading to 12.x:

Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 more tries remain

Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 SMID 621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated tgt 11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tries remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 00 01 45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data)

Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeout
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
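A quick way to see which devices these CAM errors pile up on, given the message format above (a sketch; the sed expression just extracts the da/ada device name from each line):

❯ grep "CAM status" /var/log/messages | sed -E 's/.*\(([a-z]+[0-9]+):.*/\1/' | sort | uniq -c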
❯ smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   178   000    Old_age   Always       -       419

Note: Every controller ABORTED COMMAND increases the ID# 199 count
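A quick way to keep an eye on that counter across the pool disks (a sketch; the da1-da4 names are from the camcontrol devlist further below, adapt to your setup):

❯ for Disk in da1 da2 da3 da4; do
      printf "%s: " $Disk
      # print attribute 199's name and raw value from the SMART attribute table
      smartctl -A /dev/$Disk | awk '$1 == 199 { print $2, "raw =", $10 }'
  done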
Messages are correlated with sustained disk operation events:
- scrubs
- large file transfers
- cron jobs kicking off on multiple jails accessing the same pool
- etc.

Ruled out any issues with:
- loose cables
- cable quality
- controller hardware, BIOS and firmware
- power supply

❯ mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9207-4i4e
   Board Assembly: H3-25434-00H
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed  Min  Max  Device
0       0001        0009       N         6.0    1.5  6.0  SAS Initiator
1       0002        000a       N         6.0    1.5  6.0  SAS Initiator
2       0003        000b       N         6.0    1.5  6.0  SAS Initiator
3       0004        000c       N         6.0    1.5  6.0  SAS Initiator
4       0005        000d       N         6.0    1.5  6.0  SAS Initiator
5       0006        000e       N         6.0    1.5  6.0  SAS Initiator
6       0007        000f       N         6.0    1.5  6.0  SAS Initiator
7       0008        0010       N         6.0    1.5  6.0  SAS Initiator

Devices:
B____T  SAS Address       Handle  Parent  Device       Speed  Enc   Slot  Wdt
00  10  4433221100000000  0009    0001    SATA Target  6.0    0001  00    1
00  09  4433221101000000  000a    0002    SATA Target  6.0    0001  01    1
00  11  4433221102000000  000b    0003    SATA Target  6.0    0001  02    1
00  07  4433221103000000  000c    0004    SATA Target  6.0    0001  03    1
00  13  4433221104000000  000d    0005    SATA Target  6.0    0001  07    1
00  04  4433221105000000  000e    0006    SATA Target  6.0    0001  06    1
00  14  4433221106000000  000f    0007    SATA Target  6.0    0001  05    1
00  12  4433221107000000  0010    0008    SATA Target  6.0    0001  04    1

Enclosures:
Slots  Logical ID        SEPHandle  EncHandle  Type
08     500605b0073400a0             0001       Direct Attached SGPIO

Expanders:
NumPhys  SAS Address  DevHandle  Parent  EncHandle  SAS Level

❯ camcontrol devlist
<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001>             at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001>             at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001>             at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001>             at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80>       at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80>       at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45>           at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05>           at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05>           at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02>           at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06>           at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44>           at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45>           at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45>           at scbus10 target 14 lun 0 (pass17,da7)

Notice: da4 has a different firmware revision (SN06), which is most probably why I see relatively fewer timeout errors on da1, da2 and da3, the other disks in the same ZFS pool.

To understand these error messages:
https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB

❯ cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition

https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition

❯ cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)

https://en.wikipedia.org/wiki/Key_Code_Qualifier
B 47 00 Aborted Command - SCSI parity error

https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED

READ(16)/WRITE(16) CDB byte layout:
- Byte 0-1   : other
- Byte 2-9   : 8 bytes, the logical block address
- Byte 10-13 : 4 bytes, the transfer length (in logical blocks)
- Byte 14-15 : other

So in my logs we are writing 0x100 = 256 blocks of data at a time and throwing errors, writing to consecutive blocks. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub, which I associate with the abort and controller reset events.
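For example, a small helper can pull the LBA and transfer length back out of the CDBs printed in the log (a sketch; cdb-decode.sh is just a name I picked):

❯ cat cdb-decode.sh
#!/usr/local/bin/bash
# decode the LBA and transfer length from a READ(16)/WRITE(16) CDB as logged by CAM
# usage: ./cdb-decode.sh 8a 00 00 00 00 01 b4 c0 a1 d8 00 00 01 00 00 00
b=("$@")                                              # the 16 CDB bytes, in hex
lba=$(( 16#${b[2]}${b[3]}${b[4]}${b[5]}${b[6]}${b[7]}${b[8]}${b[9]} ))
len=$(( 16#${b[10]}${b[11]}${b[12]}${b[13]} ))
printf "opcode 0x%s  LBA 0x%x (%d)  length %d blocks\n" "${b[0]}" "$lba" "$lba" "$len"

Run against the WRITE(16) CDBs logged above, it shows back-to-back 0x100 = 256 block transfers at consecutive LBAs.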
Summary: the FreeBSD 12.x and mps driver code is more efficient than 11.x, pushing more data back and forth to the disks, and that has led to these timeout events.

I also experimented and found that reducing the memory clock speed in the BIOS is another workaround that stops these timeout errors with 12.x. Basically that is what my earlier posted workaround of using the 11.x mps driver with 12.x was effectively doing: slowing down the data transfer rates so the disks could keep up.

Disclaimer: These observations, analysis and workarounds are specific to my setup; please adapt and incorporate accordingly.

--
You are receiving this mail because:
You are the assignee for the bug.
