From: Jason Wolfe
To: freebsd-scsi@freebsd.org
Date: Tue, 1 Nov 2011 11:13:17 -0700
Subject: mps/LSI SAS2008 controller crashes when smartctl is run with upped disk tags

Hello,

I have an issue with the mps driver on 8.2 where running 'smartctl -a' on rare occasions causes the controller to freak out when disk tags are > 2. I've confirmed that setting the tags to 1 resolves this crash, so that is surely a clue in the right direction.
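
For reference, the tags=1 workaround can be sketched roughly as below. This is only a minimal sketch, assuming the same kern.disks sysctl and 'camcontrol tags' invocation used in the reproduction script later in this message. The da_disks and set_tags helpers, the DRY_RUN variable and the --apply argument are hypothetical names made up for illustration; they are not part of camcontrol or the mps driver.

```shell
#!/bin/sh
# Sketch of the workaround: drop every da(4) disk to a single tag.
# da_disks, set_tags, DRY_RUN and --apply are illustrative names only.

# Print the da* entries from a whitespace-separated disk list, one per line.
da_disks() {
    for d in $1; do
        case "$d" in
            da[0-9]*) echo "$d" ;;
        esac
    done
}

# Set the tag count on each da disk in the list.
# With DRY_RUN=1 the commands are only printed, not executed.
set_tags() {
    count=$1
    list=$2
    for disk in $(da_disks "$list"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "camcontrol tags $disk -N $count"
        else
            camcontrol tags "$disk" -N "$count"
        fi
    done
}

# Only touch real devices when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
    set_tags 1 "$(sysctl -n kern.disks)"
fi
```

Running it with DRY_RUN=1 first shows which camcontrol commands would be issued before anything is changed on a live box.
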
I'm using Seagate 1TB SAS drives (ST91000640SS) in SuperMicro X8DTT-H chassis. This happens across over a thousand servers, so it is surely not flaky hardware. It could obviously be some interoperability issue between this drive model and the mps controller, but unfortunately I don't have any other drives deployed on these cards to test that theory :/

Luckily remote syslogging is enabled, so while nothing is kept locally, we see messages similar to these transmitted before the server hangs, requiring a power cycle:

    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 510
    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 713
    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 942
    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 356
    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 492
    (da0:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
    (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 339
    (da11:mps0:0:12:0): SCSI command timeout on device handle 0x0015 SMID 746
    (da5:mps0:0:6:0): SCSI command timeout on device handle 0x000f SMID 74
    (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 613
    (da2:mps0:0:3:0): SCSI command timeout on device handle 0x000c SMID 16
    (da10:mps0:0:11:0): SCSI command timeout on device handle 0x0014 SMID 305
    (da1:mps0:0:2:0): SCSI command timeout on device handle 0x000b SMID 74
    (da6:mps0:0:7:0): SCSI command timeout on device handle 0x0010 SMID 594

In some cases that would be followed by this, which would usually be the last transmission, though we don't see this in all cases.
It may just be that the system isn't always alive long enough to transmit it:

    kernel: mps0: IOC Fault 0x40006003, Resetting

I'm able to reproduce this fairly easily, within a minute or two, by heavily loading the disks by whatever means and running 'smartctl -a' in a loop:

    #!/bin/sh -x
    # Collect the da* disks known to the kernel.
    disks=`sysctl -n kern.disks|xargs -n1|grep ^da`
    # Raise the tag count on each disk.
    for disk in $disks; do
        camcontrol tags $disk -N 4
    done
    # Query SMART data on every disk, 100 passes.
    for z in `yes|head -100`; do
        for disk in $disks; do
            smartctl -s on -a /dev/$disk
        done
    done

    mps0: port 0xe000-0xe0ff mem 0xfbd3c000-0xfbd3ffff,0xfbd40000-0xfbd7ffff irq 26 at device 0.0 on pci4
    mps0: Firmware: 07.00.00.00
    mps0: IOCCapabilities: 1285c
    mps0: [ITHREAD]
    da0 at mps0 bus 0 scbus0 target 1 lun 0
    da1 at mps0 bus 0 scbus0 target 2 lun 0
    da2 at mps0 bus 0 scbus0 target 3 lun 0
    da3 at mps0 bus 0 scbus0 target 4 lun 0
    da4 at mps0 bus 0 scbus0 target 5 lun 0
    da5 at mps0 bus 0 scbus0 target 6 lun 0
    da6 at mps0 bus 0 scbus0 target 7 lun 0
    da7 at mps0 bus 0 scbus0 target 8 lun 0
    da8 at mps0 bus 0 scbus0 target 9 lun 0
    da9 at mps0 bus 0 scbus0 target 10 lun 0
    da10 at mps0 bus 0 scbus0 target 11 lun 0
    da11 at mps0 bus 0 scbus0 target 12 lun 0
    ses0 at mps0 bus 0 scbus0 target 13 lun 0

    mps0@pci0:4:0:0: class=0x010700 card=0x040015d9 chip=0x00721000 rev=0x02 hdr=0x00
        vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
        class = mass storage
        subclass = SAS

    at scbus0 target 1 lun 0 (pass0,da0)
    at scbus0 target 2 lun 0 (pass1,da1)
    at scbus0 target 3 lun 0 (pass2,da2)
    at scbus0 target 4 lun 0 (pass3,da3)
    at scbus0 target 5 lun 0 (pass4,da4)
    at scbus0 target 6 lun 0 (pass5,da5)
    at scbus0 target 7 lun 0 (pass6,da6)
    at scbus0 target 8 lun 0 (pass7,da7)
    at scbus0 target 9 lun 0 (pass8,da8)
    at scbus0 target 10 lun 0 (pass9,da9)
    at scbus0 target 11 lun 0 (pass10,da10)
    at scbus0 target 12 lun 0 (pass11,da11)
    at scbus0 target 13 lun 0 (ses0,pass12)

Thank you sirs,
Jason Wolfe