Date: Wed, 06 May 2020 20:21:57 +0000 From: bugzilla-noreply@freebsd.org To: bugs@FreeBSD.org Subject: [Bug 224496] mpr and mps drivers seems to have issues with large seagate drives Message-ID: <bug-224496-227-gfJzjX1g5r@https.bugs.freebsd.org/bugzilla/> In-Reply-To: <bug-224496-227@https.bugs.freebsd.org/bugzilla/> References: <bug-224496-227@https.bugs.freebsd.org/bugzilla/>
next in thread | previous in thread | raw e-mail | index | archive | help
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D224496 --- Comment #41 from Sharad Ahlawat <freebsd@diyit.org> --- (In reply to Sharad Ahlawat from comment #28) Symptom: CAM retry and timeout messages leading to controller aborts and resets Cause: slow disks or using SMR disks Workaround: Increase the CAM timeout defaults =E2=9D=AF sysctl kern.cam.da.default_timeout=3D90 kern.cam.da.default_timeout: 60 -> 90 =E2=9D=AF sysctl kern.cam.ada.default_timeout=3D60 kern.cam.ada.default_timeout: 30 -> 60 And disable NCQ on SMR Seagates: =E2=9D=AF cat cam-tags.sh #!/usr/local/bin/bash #shrinking the command Native Command Queue down to 1 effectively disabling queuing for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d"= )" -f1`; do camcontrol tags $Disk -N 1 ; # camcontrol tags $Disk -v done If you only have SMRs in your setup and use an UPS you could also: =E2=9D=AF sysctl vfs.zfs.cache_flush_disable=3D1 Solution: don't use slow disks and SMRs disks with ZFS The long version: I am obliged to post this update given the driver downgrade workaround I previously posted on this thread before getting to the root cause for these logs in my messages file after upgrading to 12.x Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0 Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 = 00 00 00 40 00 00 00 00 00 00 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo= ut Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 310= 00ms (tfd =3D 00000080) Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0 Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec = 00 00 00 00 40 00 00 00 00 00 00 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 mo= re tries remain Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1039 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1357 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1933 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 37 c8 00 00 00 b0 00 00 Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 38 78 00 00 00 58 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a1 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 = SMID 621 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 476 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 321 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1873 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1852 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1742 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 387 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 2104 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a2 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a3 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a4 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a5 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a6 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a7 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a8 d8 00 00 00 c0 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 37 c0 00 00 00 08 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 36 c0 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected) Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data) Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0 Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 6= 0 80 80 44 e8 40 c9 01 00 00 00 00 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo= ut Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 310= 00ms (tfd =3D 00000080) =E2=9D=AF smartctl -a /dev/da4 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED=20 WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always=20= =20=20=20=20=20 - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always=20= =20=20=20=20=20 - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline= =20=20=20=20=20 - 0 199 UDMA_CRC_Error_Count=C2=A0 =C2=A0 0x003e =C2=A0 200 =C2=A0 178 =C2=A0 0= 00=C2=A0 =C2=A0 Old_age =C2=A0 Always =C2=A0 =C2=A0 =C2=A0 - =C2=A0 =C2=A0 =C2=A0 419 Note: Every controller ABORTED COMMAND increases the ID# 199 count Messages are correlated with sustained disk operations events: scrubs large file transfers cron jobs kicking off on multiple jails accessing the same pool, et= c. Ruled out any issues with: loose cable cable quality controller hardware, bios and firmware power supply =E2=9D=AF mpsutil show all Adapter: mps0 Adapter: Board Name: SAS9207-4i4e Board Assembly: H3-25434-00H Chip Name: LSISAS2308 Chip Revision: ALL BIOS Revision: 7.39.02.00 Firmware Revision: 20.00.07.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 1.5 6.0 SAS Initiator 1 0002 000a N 6.0 1.5 6.0 SAS Initiator 2 0003 000b N 6.0 1.5 6.0 SAS Initiator 3 0004 000c N 6.0 1.5 6.0 SAS Initiator 4 0005 000d N 6.0 1.5 6.0 SAS Initiator 5 0006 000e N 6.0 1.5 6.0 SAS Initiator 6 0007 000f N 6.0 1.5 6.0 SAS Initiator 7 0008 0010 N 6.0 1.5 6.0 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot = Wdt 00 10 4433221100000000 0009 0001 SATA Target 6.0 0001 00 = 1 00 09 4433221101000000 000a 0002 SATA Target 6.0 0001 01 = 1 00 11 4433221102000000 000b 0003 SATA Target 6.0 0001 02 = 1 00 07 4433221103000000 000c 0004 SATA Target 6.0 0001 03 = 1 00 13 4433221104000000 000d 0005 SATA Target 6.0 0001 07 = 1 00 04 4433221105000000 000e 0006 SATA Target 6.0 0001 06 = 1 00 14 4433221106000000 000f 0007 SATA Target 6.0 0001 05 = 1 00 12 4433221107000000 0010 0008 SATA Target 6.0 0001 04 = 1 Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500605b0073400a0 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level =E2=9D=AF camcontrol devlist <Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus0 target 0 lun 0 (ada0,pass0) <Samsung SSD 850 PRO 256GB EXM02B6Q> at scbus1 target 0 lun 0 (ada1,pass1) <Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus2 target 0 lun 0 (ada2,pass2) <Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus3 target 0 lun 0 (ada3,pass3) <ST8000DM004-2CX188 0001> at scbus4 target 0 lun 0 (ada4,pass4) <ST8000DM004-2CX188 0001> at scbus5 target 0 lun 0 (ada5,pass5) <ST8000DM004-2CX188 0001> at scbus6 target 0 lun 0 (ada6,pass6) <ST8000DM004-2CX188 0001> at scbus7 target 0 lun 0 (ada7,pass7) <WDC WD20EFRX-68EUZN0 80.00A80> at scbus8 target 0 lun 0 (ada8,pass8) <WDC WD20EFRX-68EUZN0 80.00A80> at scbus9 target 0 lun 0 (ada9,pass9) <ATA ST4000DX001-1CE1 CC45> at scbus10 target 4 lun 0 (pass10,da0) <ATA ST4000NM0024-1HT SN05> at scbus10 target 7 lun 0 (pass11,da1) <ATA ST4000NM0024-1HT SN05> at scbus10 target 9 lun 0 (pass12,da2) <ATA ST4000NM0024-1HT SN02> at scbus10 target 10 lun 0 (pass13,da3) <ATA ST4000NM0024-1HT SN06> at scbus10 target 11 lun 0 (pass14,da4) <ATA ST4000DX001-1CE1 CC44> at scbus10 target 12 lun 0 (pass15,da5) <ATA ST4000DX001-1CE1 CC45> at scbus10 target 13 lun 0 (pass16,da6) <ATA ST4000DX001-1CE1 CC45> at scbus10 target 14 lun 0 (pass17,da7) Notice: da4 has a different firmware revision and most probably the reason = that I see relatively fewer timeouts errors on da1, da2 and da3, the other disks= in the same ZFS pool. To understand these errors messages: https://en.wikipedia.org/wiki/SCSI_command https://en.wikipedia.org/wiki/SCSI_CDB =E2=9D=AF cat /var/log/messages | grep "SCSI status" Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition =E2=9D=AF cat /var/log/messages | grep "SCSI sense" Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected) https://en.wikipedia.org/wiki/Key_Code_Qualifier B 47 00 Aborted Command - SCSI parity error https://www.t10.org/lists/asc-num.htm#ASC_47 47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED CPB: - Byte 0-1 : other - Byte 2-9 : 8 bytes is the logical block address - Byte 10-13 : 4 bytes is the transfer length - Byte 14-5 : other So in my logs we are writing 0x100 =3D 256 bytes of data and throwing error= s, writing to consecutive blocks. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub which I associated with the abort and controller reset events. Summary: 12.x FreeBSD and mps driver code is more efficient than 11.x, pushing more = data back and forth from the disks and that has led to these timeout events I also experimented and found that reducing the memory clock speed in the b= ios is another workaround to stop these timeout errors with 12.x. Basically tha= t is what my earlier posted workaround of using 11.x mps drivers with 12.x was effectively doing, slowing down the data transfer rates so the disks could = keep up. Disclaimer: These observations, analysis and workarounds are specific to my setup, plea= se adapt and incorporate accordingly. --=20 You are receiving this mail because: You are the assignee for the bug.=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-224496-227-gfJzjX1g5r>