From: list-news
To: freebsd-scsi@freebsd.org
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Date: Tue, 7 Jun 2016 14:53:23 -0500
In-Reply-To: <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk>

I don't believe the mainboard has any SATA ports. It does have a PCIe
slot IIRC though, and I may be able to rig something up with another LSI
adapter I have lying around, if I can get it to fit and find a way to
power the drives. Although this seems unlikely unless you are seeing
something I'm not?

With that last test:

If it's the SAS controller, 3 different ones running two different
firmware versions are all causing the issue.

If it's the backplane, I have now tested 3 of them as well, two of which
I can confirm have different revision numbers.

Errors never appear with tags set to 1 for each drive (effectively
eliminating NCQ, as I understand it). My brief understanding is that a
higher tag count allows the SAS adapter to send more commands to the
drive in parallel, allowing the drive to make the decisions about
command ordering. If that is accurate, and the controller firmware were
bad, I assume this would be a far more common bug that would have been
fixed already.

On the other hand, if it only happens during heavy parallel SYNCHRONIZE
CACHE commands on certain Intel SSDs, and only on controllers (maybe
12 Gbps?) that can outrun the drive firmware or cause a race condition
(my suspicion here), it seems far more likely this would have gone
unnoticed by Intel.

-Kyle

On 6/7/16 2:02 PM, Steven Hartland wrote:
> Have you tried direct attaching the drives?
>
> On 07/06/2016 18:09, list-news wrote:
>> The system is a Twin. In the first post I mentioned this but I
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (CPU / memory /
>> mainboard / controller). I still get the errors. The backplane was
>> the original suspect, and that has been RMA'd and replaced -
>> errors continue. I've even swapped out power supplies with another
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> # camcontrol tags daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system errors every few minutes while my application is running.
>> Set tags to -N 1, and everything goes quiet. 16 cores at 100% CPU
>> and drives 80% busy @ ~15k IOPS, for about 5 hours solid before it
>> finishes a batch; no errors are reported with -N set to 1. If I set
>> tags with -N 255 for each device, errors start again within 5
>> minutes, and continue every 2-5 minutes, until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>> Borja.
>>
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
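
For anyone wanting to repeat the per-drive workaround quoted above, it
could be scripted roughly as follows. This is only a minimal sketch: the
set_tags helper, the DRY_RUN switch, and the da0..da2 device names are
hypothetical and not from the thread. It relies on the "tags" subcommand
of camcontrol(8), which takes -N to set the number of tagged openings.
By default it only prints the commands it would run.

```sh
#!/bin/sh
# Sketch: set the tag depth on a hand-supplied list of drives.
# Assumptions (not from the thread): set_tags, DRY_RUN and the device
# names below are illustrative. DRY_RUN defaults to 1, which prints
# the commands instead of running them.

set_tags() {
    depth="$1"
    shift
    for dev in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed.
            echo "camcontrol tags $dev -N $depth"
        else
            # Real run: needs root on a FreeBSD host.
            camcontrol tags "$dev" -N "$depth"
        fi
    done
}

# Example: limit three drives to one outstanding command each.
set_tags 1 da0 da1 da2
```

Running it with DRY_RUN=0 as root would apply the setting; calling
set_tags 255 with the same drives restores the higher depth used in
the failing case.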