From owner-freebsd-scsi@freebsd.org Tue Jun 7 23:30:17 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DEAFFB6DF5E for ; Tue, 7 Jun 2016 23:30:17 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x22c.google.com (mail-wm0-x22c.google.com [IPv6:2a00:1450:400c:c09::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 60CC11CA2 for ; Tue, 7 Jun 2016 23:30:17 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x22c.google.com with SMTP id v199so40150582wmv.0 for ; Tue, 07 Jun 2016 16:30:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=; b=mwFWGq6LF1yQl3jaiYSEVO3msN5OeZAZTkhrRiDY9ufKoJY0Y2GgSUXg6xliw5xg0N 5413v2GmCjt8+wUMgBWyy3+WhMIpH/05zVo80z8OmpzDiclm16IpO9sYVspvj5JrActo K/HdTE6uwtfX+FMOekoUJXXe8QMfP1m8vLylbtClaaIyshtUKBmPv5kKLi7Z78Uzqmti orHdB030+tkQG0sruvd3gjcBFX5g62LOenMBtpyRDpqM8HilHHAHkBo9c8i3G6uv/Iqu jqCiX/cQfyWbQxI2ECj7bDaQ2xRQjdXv+1VFyE02OuNkCmd9C2hOhzVpGRkFQ1ABLDoo cFaw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=; b=HZtbFtC0yYzgCvD/Heu8TylN0EV8fpRG+NPBDpPx7J1Gluq+C6XRpvswpCxIRYNIiD X9AwvhmjogVAWURUlj4xEWTRaYWBN4vhQcEmG0jekPRExTfHMJ5XFNHAcCYunNhqm6Fd 4dXlNwI8sQqWUQBPQKpzB5VBJB4/yvsav2FPpnbcm/fVULuerkWFbNghwChlLkLR03aE v+ax9ornPICEFNH0PQ9S0XtZtD+7DWo/tGmgUx6kifbYw5DPexe/eOYORygbjwWvF4qh H5Bj4Tk4g5P2XlTm1MHcuTR6Xm1CdsKq6tSaQqkqYHZ/YyMJg3YsGY4fuo+thHzo7YJ9 isCg== X-Gm-Message-State: ALyK8tLAsBGlvMGzqvBbrEXdUjCU5S+Pfe+V9HaWQlS6PzbXcQ1o3c4oWAMXOhjcLHCo2xig X-Received: by 10.194.123.9 with SMTP id lw9mr1713992wjb.53.1465342215759; Tue, 07 Jun 2016 16:30:15 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id o76sm21768597wme.0.2016.06.07.16.30.14 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 16:30:14 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> From: Steven Hartland Message-ID: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk> Date: Wed, 8 Jun 2016 00:30:19 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 23:30:18 -0000 Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers: hw.mfi.mrsas_enable="1" On 07/06/2016 23:43, list-news wrote: > No, it threw errors on both da6 and da7 and then I stopped it. > > Your last e-mail gave me thoughts though. I have a server with 2008 > controllers (entirely different backplane design, cpu, memory, etc). > I've moved the 4 drives to that and I'm running the test now. > > # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 > # sysctl dev.mps.0 > dev.mps.0.spinup_wait_time: 3 > dev.mps.0.chain_alloc_fail: 0 > dev.mps.0.enable_ssu: 1 > dev.mps.0.max_chains: 2048 > dev.mps.0.chain_free_lowwater: 1176 > dev.mps.0.chain_free: 2048 > dev.mps.0.io_cmds_highwater: 510 > dev.mps.0.io_cmds_active: 0 > dev.mps.0.driver_version: 20.00.00.00-fbsd > dev.mps.0.firmware_version: 17.00.01.00 > dev.mps.0.disable_msi: 0 > dev.mps.0.disable_msix: 0 > dev.mps.0.debug_level: 3 > dev.mps.0.%parent: pci5 > dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 > subdevice=0x3020 class=0x010700 > dev.mps.0.%location: slot=0 function=0 > dev.mps.0.%driver: mps > dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 > > About 1.5 hours has passed at full load, no errors. > > gstat drive busy% seems to hang out around 30-40 instead of ~60-70. > Overall throughput seems to be 20-30% less with my rough benchmarks. > > I'm not sure if this gets us closer to the answer, if this doesn't > time-out on the 2008 controller, it looks like one of these: > 1) The Intel drive firmware is being overloaded somehow when connected > to the 3008. > or > 2) The 3008 firmware or driver has an issue reading drive responses, > sporadically thinking the command timed-out (when maybe it really > didn't). > > Puzzle pieces: > A) Why does setting tags of 1 on drives connected to the 3008 fix the > problem? > B) With tags of 255. Why does postgres (and assuming a large fsync > count), seem to cause the problem within minutes? While running other > heavy i/o commands (zpool scrub, bonnie++, fio), all of which show > similarly high or higher iops take hours to cause the problem (if ever). > > I'll let this continue to run to further test. > > Thanks again for all the help. > > -Kyle > > On 6/7/16 4:22 PM, Steven Hartland wrote: >> Always da6? >> >> On 07/06/2016 21:19, list-news wrote: >>> Sure Steve: >>> >>> # cat /boot/loader.conf | grep trim >>> vfs.zfs.trim.enabled=0 >>> >>> # sysctl vfs.zfs.trim.enabled >>> vfs.zfs.trim.enabled: 0 >>> >>> # uptime >>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >>> >>> # tail -f /var/log/messages: >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm >>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) >>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >>> connector name ( ) >>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 >>> allocated tm 0xfffffe0001322150 >>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 >>> Aborting command 0xfffffe0001375580 >>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from >>> mprsas_send_abort for target ID 16 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 >>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target >>> 16, handle(0x0010) >>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >>> connector name ( ) >>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm >>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: Flags: 1 >>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm >>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm >>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm >>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm >>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 >>> scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 >>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm >>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 >>> b0 00 >>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during >>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 >>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0 >>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort >>> TaskMID 1016 status 0x0 code 0x0 count 5 >>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 >>> finished recovery after aborting TaskMID 1016 >>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command >>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 >>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 >>> Jun 7 15:13:50 s18 kernel: NumPhys: 31 >>> Jun 7 15:13:50 s18 kernel: NumEntries: 1 >>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 >>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 >>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) >>> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange >>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on >>> Event: [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: >>> [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on >>> Event: [1c] >>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: >>> [1c] >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: Flags: 0 >>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on >>> Event: [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: >>> [16] >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI >>> Status Error >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check >>> Condition >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT >>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per >>> sense data) >>> >>> -Kyle >>> >>> On 6/7/16 2:53 PM, Steven Hartland wrote: >>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using >>>> BIO delete but assuming your running ZFS can you set the following >>>> in loader.conf and see how you get on. >>>> vfs.zfs.trim.enabled=0 >>>> >>>> Regards >>>> Steve >>> >>> >>> _______________________________________________ >>> freebsd-scsi@freebsd.org mailing list >>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" >> >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"