From owner-freebsd-bugs@freebsd.org Wed May 6 20:21:59 2020 Return-Path: Delivered-To: freebsd-bugs@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 521DB2DEDE7 for ; Wed, 6 May 2020 20:21:59 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mailman.nyi.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 49HSgC1Q86z48cY for ; Wed, 6 May 2020 20:21:59 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: by mailman.nyi.freebsd.org (Postfix) id 305B02DEDE1; Wed, 6 May 2020 20:21:59 +0000 (UTC) Delivered-To: bugs@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 2FEB42DEDE0 for ; Wed, 6 May 2020 20:21:59 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 49HSgC0jyXz48cX for ; Wed, 6 May 2020 20:21:59 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2610:1c1:1:606c::50:1d]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 13CF05358 for ; Wed, 6 May 2020 20:21:59 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org ([127.0.1.5]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id 046KLw8Q044134 for ; Wed, 6 May 2020 20:21:58 GMT (envelope-from bugzilla-noreply@freebsd.org) Received: (from www@localhost) by kenobi.freebsd.org (8.15.2/8.15.2/Submit) id 046KLwQE044112 for bugs@FreeBSD.org; Wed, 6 May 2020 20:21:58 GMT (envelope-from bugzilla-noreply@freebsd.org) X-Authentication-Warning: kenobi.freebsd.org: www set sender to bugzilla-noreply@freebsd.org using -f From: bugzilla-noreply@freebsd.org To: bugs@FreeBSD.org Subject: [Bug 224496] mpr and mps drivers seems to have issues with large seagate drives Date: Wed, 06 May 2020 20:21:57 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 11.1-STABLE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: freebsd@diyit.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: bugs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 May 2020 20:21:59 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D224496 --- Comment #41 from Sharad Ahlawat --- (In reply to Sharad Ahlawat from comment #28) Symptom: CAM retry and timeout messages leading to controller aborts and resets Cause: slow disks or using SMR disks Workaround: Increase the CAM timeout defaults =E2=9D=AF sysctl kern.cam.da.default_timeout=3D90 kern.cam.da.default_timeout: 60 -> 90 =E2=9D=AF sysctl kern.cam.ada.default_timeout=3D60 kern.cam.ada.default_timeout: 30 -> 60 And disable NCQ on SMR Seagates: =E2=9D=AF cat cam-tags.sh #!/usr/local/bin/bash #shrinking the command Native Command Queue down to 1 effectively disabling queuing for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d"= )" -f1`; do camcontrol tags $Disk -N 1 ; # camcontrol tags $Disk -v done If you only have SMRs in your setup and use an UPS you could also: =E2=9D=AF sysctl vfs.zfs.cache_flush_disable=3D1 Solution: don't use slow disks and SMRs disks with ZFS The long version: I am obliged to post this update given the driver downgrade workaround I previously posted on this thread before getting to the root cause for these logs in my messages file after upgrading to 12.x Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0 Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0000c817 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 = 00 00 00 40 00 00 00 00 00 00 Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo= ut Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more tries remain Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 310= 00ms (tfd =3D 00000080) Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0 Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs 00000200 tfd 80 serr 00000000 cmd 0000c917 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec = 00 00 00 00 40 00 00 00 00 00 00 Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 mo= re tries remain Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1039 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1357 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1933 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 37 c8 00 00 00 b0 00 00 Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 38 78 00 00 00 58 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a1 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 = SMID 621 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 476 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 321 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1873 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1852 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 1742 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 387 loginfo 31080000 Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t= gt 11 SMID 2104 loginfo 31080000 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a2 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a3 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a4 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a5 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a6 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a7 d8 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 = 00 01 b4 c0 a8 d8 00 00 00 c0 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 37 c0 00 00 00 08 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request completed with an error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri= es remain Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0= 0 01 45 fb 36 c0 00 00 01 00 00 00 Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected) Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense data) Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0 Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs 00400000 tfd 40 serr 00000000 cmd 0000d617 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 6= 0 80 80 44 e8 40 c9 01 00 00 00 00 Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo= ut Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more tries remain Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 310= 00ms (tfd =3D 00000080) =E2=9D=AF smartctl -a /dev/da4 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED=20 WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always=20= =20=20=20=20=20 - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always=20= =20=20=20=20=20 - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline= =20=20=20=20=20 - 0 199 UDMA_CRC_Error_Count=C2=A0 =C2=A0 0x003e =C2=A0 200 =C2=A0 178 =C2=A0 0= 00=C2=A0 =C2=A0 Old_age =C2=A0 Always =C2=A0 =C2=A0 =C2=A0 - =C2=A0 =C2=A0 =C2=A0 419 Note: Every controller ABORTED COMMAND increases the ID# 199 count Messages are correlated with sustained disk operations events: scrubs large file transfers cron jobs kicking off on multiple jails accessing the same pool, et= c. Ruled out any issues with: loose cable cable quality controller hardware, bios and firmware power supply =E2=9D=AF mpsutil show all Adapter: mps0 Adapter: Board Name: SAS9207-4i4e Board Assembly: H3-25434-00H Chip Name: LSISAS2308 Chip Revision: ALL BIOS Revision: 7.39.02.00 Firmware Revision: 20.00.07.00 Integrated RAID: no PhyNum CtlrHandle DevHandle Disabled Speed Min Max Device 0 0001 0009 N 6.0 1.5 6.0 SAS Initiator 1 0002 000a N 6.0 1.5 6.0 SAS Initiator 2 0003 000b N 6.0 1.5 6.0 SAS Initiator 3 0004 000c N 6.0 1.5 6.0 SAS Initiator 4 0005 000d N 6.0 1.5 6.0 SAS Initiator 5 0006 000e N 6.0 1.5 6.0 SAS Initiator 6 0007 000f N 6.0 1.5 6.0 SAS Initiator 7 0008 0010 N 6.0 1.5 6.0 SAS Initiator Devices: B____T SAS Address Handle Parent Device Speed Enc Slot = Wdt 00 10 4433221100000000 0009 0001 SATA Target 6.0 0001 00 = 1 00 09 4433221101000000 000a 0002 SATA Target 6.0 0001 01 = 1 00 11 4433221102000000 000b 0003 SATA Target 6.0 0001 02 = 1 00 07 4433221103000000 000c 0004 SATA Target 6.0 0001 03 = 1 00 13 4433221104000000 000d 0005 SATA Target 6.0 0001 07 = 1 00 04 4433221105000000 000e 0006 SATA Target 6.0 0001 06 = 1 00 14 4433221106000000 000f 0007 SATA Target 6.0 0001 05 = 1 00 12 4433221107000000 0010 0008 SATA Target 6.0 0001 04 = 1 Enclosures: Slots Logical ID SEPHandle EncHandle Type 08 500605b0073400a0 0001 Direct Attached SGPIO Expanders: NumPhys SAS Address DevHandle Parent EncHandle SAS Level =E2=9D=AF camcontrol devlist at scbus0 target 0 lun 0 (ada0,pass0) at scbus1 target 0 lun 0 (ada1,pass1) at scbus2 target 0 lun 0 (ada2,pass2) at scbus3 target 0 lun 0 (ada3,pass3) at scbus4 target 0 lun 0 (ada4,pass4) at scbus5 target 0 lun 0 (ada5,pass5) at scbus6 target 0 lun 0 (ada6,pass6) at scbus7 target 0 lun 0 (ada7,pass7) at scbus8 target 0 lun 0 (ada8,pass8) at scbus9 target 0 lun 0 (ada9,pass9) at scbus10 target 4 lun 0 (pass10,da0) at scbus10 target 7 lun 0 (pass11,da1) at scbus10 target 9 lun 0 (pass12,da2) at scbus10 target 10 lun 0 (pass13,da3) at scbus10 target 11 lun 0 (pass14,da4) at scbus10 target 12 lun 0 (pass15,da5) at scbus10 target 13 lun 0 (pass16,da6) at scbus10 target 14 lun 0 (pass17,da7) Notice: da4 has a different firmware revision and most probably the reason = that I see relatively fewer timeouts errors on da1, da2 and da3, the other disks= in the same ZFS pool. To understand these errors messages: https://en.wikipedia.org/wiki/SCSI_command https://en.wikipedia.org/wiki/SCSI_CDB =E2=9D=AF cat /var/log/messages | grep "SCSI status" Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition =E2=9D=AF cat /var/log/messages | grep "SCSI sense" Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected) https://en.wikipedia.org/wiki/Key_Code_Qualifier B 47 00 Aborted Command - SCSI parity error https://www.t10.org/lists/asc-num.htm#ASC_47 47h/03h DZTPROMAEBKVF INFORMATION UNIT iuCRC ERROR DETECTED CPB: - Byte 0-1 : other - Byte 2-9 : 8 bytes is the logical block address - Byte 10-13 : 4 bytes is the transfer length - Byte 14-5 : other So in my logs we are writing 0x100 =3D 256 bytes of data and throwing error= s, writing to consecutive blocks. The retries generally succeed and subsequent scrubs don't find any errors, though you may see CKSUM errors during a scrub which I associated with the abort and controller reset events. Summary: 12.x FreeBSD and mps driver code is more efficient than 11.x, pushing more = data back and forth from the disks and that has led to these timeout events I also experimented and found that reducing the memory clock speed in the b= ios is another workaround to stop these timeout errors with 12.x. Basically tha= t is what my earlier posted workaround of using 11.x mps drivers with 12.x was effectively doing, slowing down the data transfer rates so the disks could = keep up. Disclaimer: These observations, analysis and workarounds are specific to my setup, plea= se adapt and incorporate accordingly. --=20 You are receiving this mail because: You are the assignee for the bug.=