Date: Wed, 07 Jun 2017 10:18:14 +0200
From: Harry Schmalzbauer <freebsd@omnilan.de>
To: Stephen Mcconnell <stephen.mcconnell@broadcom.com>
Cc: freebsd-scsi@freebsd.org, Scott Long <scottl@freebsd.org>, "Kenneth D. Merry" <ken@freebsd.org>, Stephen Mcconnell <stephen.mcconnell@broadcom.com>
Subject: Re: sporadic CAM (all devices) outage on 11-stable, mps(4), ahci(4) and bhyve(8) involved. [Was: Re: mps(4) blocks panic-reboot]
Message-ID: <5937B6C6.9020300@omnilan.de>
In-Reply-To: <59306693.6080304@omnilan.de>
References: <592FDE8C.1090609@omnilan.de> <12a36df9eff99c77ec621987efbe75fe@mail.gmail.com> <ff9342e2e1eb541f347d9f683cfc8214@mail.gmail.com> <59303484.1040609@omnilan.de> <e6fe7cc17fb1302caf2122eaa11d10ba@mail.gmail.com> <593056E9.6000807@omnilan.de> <d48587b45e608cd519155d19567d03af@mail.gmail.com> <59305D4F.40707@omnilan.de> <f5bf40b2814ea894abab8bde8acb16bb@mail.gmail.com> <59306693.6080304@omnilan.de>

Regarding Harry Schmalzbauer's message of 01.06.2017 21:10 (localtime):
> Regarding Stephen Mcconnell's message of 01.06.2017 20:55 (localtime):
>> Take a look at PR 212914. Could that be the issue? It was MFC'd to stable/11
>> with r309273 on Nov 28th, 2016.
> Thanks a lot, but that's unrelated.

Unfortunately, a similar lockup occurred today :-(
This is what mps(4) reported:

(da1:mps0:0:3:0): READ(10). CDB: 28 00 06 7e 4d 53 00 00 10 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: SCSI Status Error
(da1:mps0:0:3:0): SCSI status: Check Condition
(da1:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:3:0): Error 6, Retries exhausted
(da1:mps0:0:3:0): Invalidating pack

But it seems all drives were lost again (although the kernel messages
could no longer be printed): in another, still responsive session
(memory-disk rootfs) I could get the zpool status, and zfs reported all
members having outstanding requests:

  pool: cetusPsys
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        cetusPsys               ONLINE     370    13     0
          mirror-0              ONLINE      40    12     0
            gpt/cetusSYSzd1of4  ONLINE       3    26     0
            da2                 ONLINE       3    16     0
          mirror-1              ONLINE     700     9     0
            gpt/cetusSYSzd2of4  ONLINE       3     9     0
            da3                 ONLINE       3    54     0

I'll do anything I can to help track down this problem, since the one
thing I took massive precautions to prevent has now happened:
a freezing hypervisor :-(

Thanks,

-harry

(In case anyone is following any of my other recent PRs: this time, no
passthru-enabled VM was involved. The latter causes some very serious
memory corruption IMHO... This machine is a XEON E3 with ECC; neither MBC
nor MCE reports ECC errors...)