Date: Thu, 18 Feb 2016 09:33:25 -0800
From: Doug Ambrisko
To: Tinker
Cc: freebsd-scsi@freebsd.org
Subject: Re: MRSAS driver/LSI MegaRaid 92XX-93XX admin question: When one of the Raid's physical drives break, how is it reported in the logs?
Message-ID: <20160218173325.GA29200@ambrisko.com>
References: <6a648d421b6d611b4f6f411b66303017@openmailbox.org> <55de137d1ed81930cfdbee579d881d62@openmailbox.org> <20160217000002.GA81916@ambrisko.com>

On Wed, Feb 17, 2016 at 02:38:10PM +0700, Tinker wrote:
| Hi Doug,
|
| Would you mind sharing your kernel patch for that functionality (if I
| understand you right, you patched your kernel to channelize the events
| to the dmesg)?

I need to do some work on mrsas stuff at work, so I plan to sync our changes
to -current etc. I'll send them to you.

Doug A.

| On 2016-02-17 07:00, Doug Ambrisko wrote:
| > On Sun, Feb 14, 2016 at 10:13:31PM +0700, Tinker wrote:
| > | (Will send any followup from now only to freebsd-scsi@ .)
| > |
| > | Did some additional research and found that the disk failure indeed is
| > | reported in MRSAS' "event log".
| > |
| > | So my final question then is, how do you extract it into userland (in
| > | the absence of an "mfiutil" as the MFI driver has)?
| >
| > I have local changes to print the event log in dmesg which gets sysloged.
| > We then watch syslog for issues to report things to our customers
| > automatically. This is similar to mfi(4).
| >
| > Thanks,
| >
| > Doug A.
| > | Details below. Thanks.
| > |
| > | On 2016-02-14 19:59, Tinker wrote:
| > | [...]
| > | > http://www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/3rd-party/lsi/mrsas/userguide/LSI_MR_SAS_SW_UG.pdf
| > | > on page 305, that is section "A.2 Event Messages" - I don't know for
| > | > what LSI chip this document is, but, it does not list particular event
| > | > message very clearly for when an individual underlying disk would have
| > | > broken, I don't even see any event for when a hot spare would be taken
| > | > in use!
| > |
| > |
| > | Wait - this page:
| > |
| > | https://www.schirmacher.de/display/Linux/Replace+failed+disk+in+MegaRAID+array
| > |
| > | (and also
| > | http://serverfault.com/questions/485147/drive-is-failing-but-lsi-megaraid-controller-does-not-detect-it
| > | )
| > |
| > | gives an example of how the host system learns about broken disks:
| > |
| > |
| > | Code: 0x00000051 .. Event Description: State change on VD 00/1 from
| > | OPTIMAL(3) to DEGRADED(2)
| > |
| > |
| > | Code: 0x00000072 .. Event Description: State change on PD 05(e0xfc/s0)
| > | from ONLINE(18) to FAILED(11)
| > |
| > | (unclean disk broken seems to be shown as:)
| > |
| > | Code: 0x00000071 .. Event Description: Unexpected sense: PD 05(e0xfc/s0)
| > | Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense:
| > | b/00/00
| > |
| > |
| > | And this version of the LSI documentation
| > |
| > | http://hwraid.le-vert.net/raw-attachment/wiki/LSIMegaRAIDSAS/megacli_user_guide.pdf
| > |
| > | gives a clearer definition of the physical and virtual drive states in
| > | "1.4.16 Physical Drive States"
| > | and "1.4.17 Virtual Disk States" on pages 1-11 to 1-12.
| > |
| > | So as we see, a physical drive breaking would
| > |
| > | * "FAILED" the physical drive
| > |
| > | * "DEGRADED" the Virtual Drive (that is the logical exported drive)
| > | (from "OPTIMAL")
| > |
| > |
| > | So then, it was indeed the card's "event log" that contains this info.
| > |
| > |
| > |
| > | Last question then would only be then, *where* FreeBSD's MRSAS driver
| > | sends its event log?
| > |
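
A minimal sketch, not taken from the thread, of how the "State change" event descriptions quoted above could be picked out of log text once they reach syslog. The names STATE_CHANGE, ALERT_STATES and find_alerts are hypothetical, and the exact line format emitted by the locally patched mrsas driver is not shown in the thread, so the pattern only looks for the description text as it appears in the quoted examples (joined onto one line). Treating FAILED and DEGRADED as the states worth alerting on is likewise an assumption based on the summary above.

```python
import re

# Matches the event descriptions quoted in the thread, for example:
#   State change on PD 05(e0xfc/s0) from ONLINE(18) to FAILED(11)
# Assumption: state names consist of capitals, underscores and spaces.
STATE_CHANGE = re.compile(
    r"State change on (?P<kind>PD|VD) (?P<dev>\S+) "
    r"from (?P<old>[A-Z_ ]+)\(\d+\) to (?P<new>[A-Z_ ]+)\(\d+\)"
)

# Assumption: these are the new states an administrator wants to hear about.
ALERT_STATES = {"FAILED", "DEGRADED"}


def find_alerts(lines):
    """Return (kind, device, old_state, new_state) for each alerting line."""
    alerts = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group("new") in ALERT_STATES:
            alerts.append(
                (
                    match.group("kind"),
                    match.group("dev"),
                    match.group("old"),
                    match.group("new"),
                )
            )
    return alerts


if __name__ == "__main__":
    # Sample lines copied from the examples quoted in the thread.
    sample = [
        "Code: 0x00000051 .. Event Description: State change on VD 00/1 "
        "from OPTIMAL(3) to DEGRADED(2)",
        "Code: 0x00000072 .. Event Description: State change on PD "
        "05(e0xfc/s0) from ONLINE(18) to FAILED(11)",
        "Code: 0x00000071 .. Event Description: Unexpected sense: PD "
        "05(e0xfc/s0) Path 4433221103000000",
    ]
    for alert in find_alerts(sample):
        print(alert)
```

The "Unexpected sense" line is deliberately not matched here; whether it should raise an alert on its own is a policy decision the thread does not settle.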