From owner-freebsd-scsi@FreeBSD.ORG Tue Nov 1 19:18:07 2011 Return-Path: Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C5846106566C for ; Tue, 1 Nov 2011 19:18:07 +0000 (UTC) (envelope-from peter.maloney@brockmann-consult.de) Received: from mo-p05-ob6.rzone.de (mo-p05-ob6.rzone.de [IPv6:2a01:238:20a:202:53f5::1]) by mx1.freebsd.org (Postfix) with ESMTP id C91908FC18 for ; Tue, 1 Nov 2011 19:18:06 +0000 (UTC) X-RZG-AUTH: :LWIKdA2leu0bPbLmhzXgqn0MTG6qiKEwQRWfNxSw4HzYIwjsnvdDt2oX8drk23mpKMZH7NA= X-RZG-CLASS-ID: mo05 Received: from [192.168.179.42] (hmbg-5f7606d1.pool.mediaWays.net [95.118.6.209]) by post.strato.de (mrclete mo57) (RZmta 26.10 AUTH) with (DHE-RSA-AES128-SHA encrypted) ESMTPA id w01ed2nA1IP5pQ for ; Tue, 1 Nov 2011 20:17:45 +0100 (MET) Message-ID: <4EAEF431.7090108@brockmann-consult.de> Date: Mon, 31 Oct 2011 20:17:05 +0100 From: Peter Maloney User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1 MIME-Version: 1.0 To: freebsd-scsi@freebsd.org References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: Re: mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Nov 2011 19:18:07 -0000 Dear Jason, Am 01.11.2011 19:13, schrieb Jason Wolfe: > Hello, > > I have an issue with the mps driver on 8.2 where running 'smartctl -a' > rarely causes the controller to freak out when disk tags are > 2. I've > confirmed settings the tags to 1 resolves this crash, so that surely is a > clue in the right direction.. I'm using Seagate 1TB SAS drives - > ST91000640SS, and these are SuperMicro X8DTT-H chasis. This happens across > over a thousand servers, so it surely not flaky hardware. It could > obviously be some interoperability with these model drives and the mps > controller, but unfortunately I don't have any other drives deployed on > these cards to test that theory out :/ I get a simlar problem on a system with an LSI 9211-8i with 20 SATA disks attached (2 SSDs and 18 spnning disks). My system doesn't hang, panic, or reset though. I just lose access to one disk, which is then considered FAULTED in my zpool status (with the ZFS file system). If I physically remove the FAULTED disk and run "gpart recover da0", I get a panic. Otherwise, the system keeps running in a degraded state. When I reboot and resilver, some data is found damaged and repaired, not just refreshed with the latest state. The server has 1 HBA and 2 backplanes, and I have the 2 mirrored root disks on different backplanes. Maybe that is why mine runs degraded and yours hang. This happened twice so far (in around a month or two), and both times it was one of the mirrored root disks (SSDs) that faulted. My tags are set to 255. I will try reproducing it as you said, and then if it fails, rebooting and trying again setting tags to 2 as you suggested. And *thank you very much for this information*. This is the last outstanding issue with this server. I hope this workaround helps. # camcontrol tags /dev/da0 (pass0:mps0:0:7:0): device openings: 255 > > Luckily remote syslogging is enabled, so while nothing is kept locally, we > see these messages similar to these transmitted before the server hangs, > requiring a power cycle: > > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 510 > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 713 > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 942 > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 356 > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 492 > (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID > 976 > (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID > 339 > (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID > 746 > (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74 > (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID > 613 > (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16 > (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID > 305 > (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74 > (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID > 594 > > In some cases that would be followed by this, which would usually be the > last transmission, though we don't see this in all cases. It may just be > the system isn't always alive long enough to transmit: > > kernel: mps0: IOC Fault 0x40006003, Resetting > > > I'm able to reproduce fairly easily within a minute or two by heavily > loading the disks up by whatever means, and running smartctl -a in a loop: > > #!/bin/sh -x > > disks=`sysctl -n kern.disks|xargs -n1|grep ^da` > > for disk in $disks; do > camcontrol tags $disk -N 4 > done > > for z in `yes|head -100`; do > for disk in $disks; do > smartctl -s on -a /dev/$disk > done > done > > mps0: port 0xe000-0xe0ff mem > 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4 > mps0: Firmware: 07.00.00.00 > mps0: IOCCapabilities: > 1285c > mps0: [ITHREAD] > da0 at mps0 bus 0 scbus0 target 1 lun 0 > da1 at mps0 bus 0 scbus0 target 2 lun 0 > da2 at mps0 bus 0 scbus0 target 3 lun 0 > da3 at mps0 bus 0 scbus0 target 4 lun 0 > da4 at mps0 bus 0 scbus0 target 5 lun 0 > da5 at mps0 bus 0 scbus0 target 6 lun 0 > da6 at mps0 bus 0 scbus0 target 7 lun 0 > da7 at mps0 bus 0 scbus0 target 8 lun 0 > da8 at mps0 bus 0 scbus0 target 9 lun 0 > da9 at mps0 bus 0 scbus0 target 10 lun 0 > da10 at mps0 bus 0 scbus0 target 11 lun 0 > da11 at mps0 bus 0 scbus0 target 12 lun 0 > ses0 at mps0 bus 0 scbus0 target 13 lun 0 > > mps0@pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02 > hdr=0x00 > vendor = 'LSI Logic (Was: Symbios Logic, NCR)' > class = mass storage > subclass = SAS > > at scbus0 target 1 lun 0 (pass0,da0) > at scbus0 target 2 lun 0 (pass1,da1) > at scbus0 target 3 lun 0 (pass2,da2) > at scbus0 target 4 lun 0 (pass3,da3) > at scbus0 target 5 lun 0 (pass4,da4) > at scbus0 target 6 lun 0 (pass5,da5) > at scbus0 target 7 lun 0 (pass6,da6) > at scbus0 target 8 lun 0 (pass7,da7) > at scbus0 target 9 lun 0 (pass8,da8) > at scbus0 target 10 lun 0 (pass9,da9) > at scbus0 target 11 lun 0 (pass10,da10) > at scbus0 target 12 lun 0 (pass11,da11) > at scbus0 target 13 lun 0 (ses0,pass12) > > Thank you sirs, > > Jason Wolfe > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" And my logs to compare: (note my root, swap, zfs cache, and zfs log is on the disk that fails). ) root@bcnas1:/var/log# swapinfo Device 1K-blocks Used Avail Capacity /dev/gpt/swap0 524288 5840 518448 1% /dev/gpt/swap1 524288 5640 518648 1% Total 1048576 11480 1037096 1% When it starts happening, it looks like this: Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 220 Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 87 Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 795 Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 423 Oct 29 00:02:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 338 Oct 29 02:19:12 bcnas1 kernel: :9:0): SCSI command timeout on device handle 0x0016 SMID 170 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 637 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 335 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 798 Oct 29 02:19:12 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0 state c xfer 0 Oct 29 02:19:12 bcnas1 last message repeated 14 times Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 991 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 4 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 4 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 227 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 227 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 652 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 652 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 125 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 125 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 101 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 1017 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 100 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 1004 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 487 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 487 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 279 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 279 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 929 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 929 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 346 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 346 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 817 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 817 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 170 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 170 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 637 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 637 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 335 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 335 complete Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 798 Oct 29 02:19:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 798 complete Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 757 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 833 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 804 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 464 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 144 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 912 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 753 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 422 Oct 29 02:19:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 241 And then just before I rebooted it, basically looked the same, with the different messages mixed together: ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 807 ct 31 07:52:12 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 807 complete ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 1006 ct 31 07:53:12 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 111 ct 31 07:53:20 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0 state c xfer 0 ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 1006 complete ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 111 ct 31 07:53:20 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 111 complete ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 669 ct 31 07:54:20 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 912 ct 31 07:54:28 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0 state c xfer 0 ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 669 complete ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 912 ct 31 07:54:28 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 912 complete ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 804 ct 31 07:55:29 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 1001 ct 31 07:55:36 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0 state c xfer 0 ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 804 complete ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 1001 ct 31 07:55:36 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 1001 complete ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 389 ct 31 07:56:36 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 885 ct 31 07:56:44 bcnas1 kernel: mps0: (0:9:0) terminated ioc 804b scsi 0 state c xfer 0 ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 389 complete ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x16 SMID 885 ct 31 07:56:44 bcnas1 kernel: swap_pager: I/O error - pageout failed; blkno 131393,size 65536, error 5 ct 31 07:56:44 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 885 complete ct 31 07:57:45 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 442 ct 31 07:57:48 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 442 complete ct 31 07:58:49 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 413 ct 31 07:58:52 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 413 complete ct 31 07:59:53 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 90 ct 31 07:59:56 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 90 complete ct 31 08:00:56 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 504 ct 31 08:01:00 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x16 SMID 504 complete ct 31 08:02:01 bcnas1 kernel: (da10:mps0:0:9:0): SCSI command timeout on device handle 0x0016 SMID 861 ct 31 08:02:04 bcnas1 kernel: mps0: swap_pager: I/O error - pageout failed; blkno 131409,size 49152, error 5mpssas_abort_complete: abort request on handle 0x16 SMID 861 complete