From owner-freebsd-stable@FreeBSD.ORG Tue Jul 9 15:20:46 2013 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 6BCEA160 for ; Tue, 9 Jul 2013 15:20:46 +0000 (UTC) (envelope-from outbackdingo@gmail.com) Received: from mail-oa0-x22f.google.com (mail-oa0-x22f.google.com [IPv6:2607:f8b0:4003:c02::22f]) by mx1.freebsd.org (Postfix) with ESMTP id 39F12177E for ; Tue, 9 Jul 2013 15:20:46 +0000 (UTC) Received: by mail-oa0-f47.google.com with SMTP id m1so8214297oag.34 for ; Tue, 09 Jul 2013 08:20:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=L/2PxcTrKQuRUxZ2YJsciTdTP+zJAeVQfRIfDG5fk8E=; b=bdiy2KWA02enWT0HZ3Zeorh18BZy8r1TAI5s1cn/V85Ay9aGP55fLifhcsBoyRocrJ 5UAB5qzYXD7Vxmy4Z5I99RSrcCszT9gXmkqL4olnx87g36byh36/voT4GrqxbjbeCnnM jEpyxe65DL5ivGa5XOi6GPKlEV+UOf/73jLx97yqlm9RJFh1aQYCIw0q+jSx+G/jkD1t f33bk426o6/ftlcU1iNIVT3TS44s0zalW8qT9LiawwKs2ws4vY+ONkb5Vo6ayg+uqsGC XitE7jLJtCG/A45LUjiMuTxlAPiQG4TO8UslvwiY9NsQQ3Y98F3R7uAAxEIx97L9B7Tu M87w== MIME-Version: 1.0 X-Received: by 10.60.102.41 with SMTP id fl9mr24675124oeb.37.1373383245634; Tue, 09 Jul 2013 08:20:45 -0700 (PDT) Received: by 10.76.90.197 with HTTP; Tue, 9 Jul 2013 08:20:45 -0700 (PDT) In-Reply-To: <20130709144614.GA7538@icarus.home.lan> References: <20130709123900.GA5828@icarus.home.lan> <20130709144614.GA7538@icarus.home.lan> Date: Tue, 9 Jul 2013 11:20:45 -0400 Message-ID: Subject: Re: Stable/9 from today mpssas_scsiio timeouts From: Outback Dingo To: Jeremy Chadwick Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-stable@freebsd.org X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Jul 2013 15:20:46 -0000 On Tue, Jul 9, 2013 at 10:46 AM, Jeremy Chadwick wrote: > On Tue, Jul 09, 2013 at 09:47:01AM -0400, Outback Dingo wrote: > > On Tue, Jul 9, 2013 at 9:44 AM, Outback Dingo >wrote: > > > On Tue, Jul 9, 2013 at 8:39 AM, Jeremy Chadwick > wrote: > > > > > >> On Tue, Jul 09, 2013 at 05:32:39AM -0400, Outback Dingo wrote: > > >> > as of stable today im seeing alot of new mps time outs > > >> > > > >> > 9.1-STABLE FreeBSD 9.1-STABLE #0 r253035M: Mon Jul 8 16:34:28 UTC > 2013 > > >> > root@:/usr/obj/nas/usr/src/sys/ > > >> > > > >> > mps1@pci0:130:0:0: class=0x010700 card=0x30201000 > chip=0x00721000 > > >> > rev=0x03 hdr=0x00 > > >> > vendor = 'LSI Logic / Symbios Logic' > > >> > device = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]' > > >> > class = mass storage > > >> > subclass = SAS > > >> > > > >> > > > >> > mps0: mpssas_scsiio_timeout checking sc 0xffffff8002145000 cm > > >> > 0xffffff80021a6b78 > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 > SMID > > >> 983 > > >> > command timeout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800 > > >> > mps0: mpssas_alloc_tm freezing simq > > >> > mps0: timedout cm 0xffffff80021a6b78 allocated tm 0xffffff80021587b0 > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 > SMID > > >> 983 > > >> > completed timedout cm 0xffffff80021a6b78 ccb 0xfffffe002bb5f800 > during > > >> > recovery ioc 8048 scsi 0 state c xfer 0 > > >> > (noperiph:mps0:0:40:0): SMID 6 abort TaskMID 983 status 0x4a code > 0x0 > > >> count > > >> > 1 > > >> > (noperiph:mps0:0:40:0): SMID 6 finished recovery after aborting > TaskMID > > >> 983 > > >> > mps0: mpssas_free_tm releasing simq > > >> > (probe40:mps0:0:40:0): INQUIRY. CDB: 12 00 00 00 24 00 > > >> > (probe40:mps0:0:40:0): CAM status: Command timeout > > >> > (probe40:mps0:0:40:0): Retrying command > > >> > mps1: mpssas_scsiio_timeout checking sc 0xffffff8002384000 cm > > >> > 0xffffff80023e5b78 > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 > SMID > > >> 983 > > >> > command timeout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800 > > >> > mps1: mpssas_alloc_tm freezing simq > > >> > mps1: timedout cm 0xffffff80023e5b78 allocated tm 0xffffff80023977b0 > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 > SMID > > >> 983 > > >> > completed timedout cm 0xffffff80023e5b78 ccb 0xfffffe002be14800 > during > > >> > recovery ioc 8048 scsi 0 state c xfer 0 > > >> > (noperiph:mps1:0:37:0): SMID 6 abort TaskMID 983 status 0x4a code > 0x0 > > >> count > > >> > 1 > > >> > (noperiph:mps1:0:37:0): SMID 6 finished recovery after aborting > TaskMID > > >> 983 > > >> > mps1: mpssas_free_tm releasing simq > > >> > (probe292:mps1:0:37:0): INQUIRY. CDB: 12 00 00 00 24 00 > > >> > (probe292:mps1:0:37:0): CAM status: Command timeout > > >> > (probe292:mps1:0:37:0): Retrying command > > >> > > >> 1. What revision were you running before (i.e. what were you on prior > to > > >> the upgrade)? > > >> > > > > > > > > > Sorry I was on 252595 from July 3 > > And does rolling back to r252595 resolve the problem for you? > > Because the only commit I see between r253035 and r252595 that might > account for some kind of behavioural change, unless I missed one while > skimming the commit history, is the following: > > r252730 -- http://www.freshbsd.org/commit/freebsd/r252730 > > If at all possible, please try updating to r253037 or newer to see > if that has some effect/improvement. Why I mention that commit: > > r253037 -- http://www.freshbsd.org/commit/freebsd/r253037 > > Because the only mps(4) changes done in recent days are: > > http://svnweb.freebsd.org/base/stable/9/sys/dev/mps/mps_sas.c?view=log > > r253037 > r251899 > r251874 > i can say this its between July 4, and 253048, im rolling back to 252723 to validate a good known working state > > Else I'd say what you're experiencing is legitimate/unrelated to kernel > changes. I can only speculate. > > The messages to me indicate that some part of the kernel is submitting a > SCSI INQUIRY request to the underlying device(s) which results in a CAM > timeout, i.e. the disk attached to the controller did not respond > promptly (while the controller seemed to be alive/well). > > If these disks (which we do not know the type of -- no dmesg provided, > etc.) are SSDs then TRIM behaviour is possibly causing the drive to take > too long to perform its TRIM cleanup, or, the drives themselves are > doing some kind of garbage collection which is taking quite a long time. > > Steven et all may have a different (and almost certainly more accurate) > analysis. > > It would really help if you could provide "dmesg" from the machine, as > well as any details about your setup (if ZFS, "zpool status", etc.), in > addition to (if SSDs) "sysctl -a | grep -i trim". All this matters. > > -- > | Jeremy Chadwick jdc@koitsu.org | > | UNIX Systems Administrator http://jdc.koitsu.org/ | > | Making life hard for others since 1977. PGP 4BD6C0CB | > >