FreeBSD Mail Archives

Date:      Wed, 07 Jun 2017 10:18:14 +0200
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        Stephen Mcconnell <stephen.mcconnell@broadcom.com>
Cc:        freebsd-scsi@freebsd.org, Scott Long <scottl@freebsd.org>, "Kenneth D. Merry" <ken@freebsd.org>, Stephen Mcconnell <stephen.mcconnell@broadcom.com>
Subject:   Re: sporadic CAM (all devices) outage on 11-stable, mps(4), ahci(4) and bhyve(8) involved. [Was: Re: mps(4) blocks panic-reboot]
Message-ID:  <5937B6C6.9020300@omnilan.de>
In-Reply-To: <59306693.6080304@omnilan.de>
References:  <592FDE8C.1090609@omnilan.de> <12a36df9eff99c77ec621987efbe75fe@mail.gmail.com> <ff9342e2e1eb541f347d9f683cfc8214@mail.gmail.com> <59303484.1040609@omnilan.de> <e6fe7cc17fb1302caf2122eaa11d10ba@mail.gmail.com> <593056E9.6000807@omnilan.de> <d48587b45e608cd519155d19567d03af@mail.gmail.com> <59305D4F.40707@omnilan.de> <f5bf40b2814ea894abab8bde8acb16bb@mail.gmail.com> <59306693.6080304@omnilan.de>


 Regarding Harry Schmalzbauer's message of 01.06.2017 21:10 (localtime):
> Regarding Stephen Mcconnell's message of 01.06.2017 20:55 (localtime):
>> Take a look at PR 212914. Could that be the issue? It was MFC'd to stable/11
>> with r309273 on Nov 28th, 2016.
> Thanks a lot, but that's unrelated.

Unfortunately, a similar lockup occurred today :-(

I was informed by mps(4):

(da1:mps0:0:3:0): READ(10). CDB: 28 00 06 7e 4d 53 00 00 10 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: Unrecoverable Host Bus Adapter Error
(da1:mps0:0:3:0): Retrying command
(da1:mps0:0:3:0): WRITE(10). CDB: 2a 00 06 f8 c5 1f 00 00 38 00
(da1:mps0:0:3:0): CAM status: SCSI Status Error
(da1:mps0:0:3:0): SCSI status: Check Condition
(da1:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:3:0): Error 6, Retries exhausted
(da1:mps0:0:3:0): Invalidating pack
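
For anyone who wants to see which blocks the failing commands touched, the CDB bytes in the log above can be decoded by hand. The following is only a minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte big-endian transfer length, control). The helper name decode_cdb10 is made up for illustration and is not part of any FreeBSD tool.

```python
import struct

# Opcodes of the two 10-byte commands that appear in the log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(hex_string):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns a tuple (command name, starting LBA, number of blocks).
    """
    cdb = bytes.fromhex(hex_string)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB, got %d bytes" % len(cdb))
    # Assumed layout: opcode, flags, LBA (big-endian, 4 bytes),
    # group number, transfer length (big-endian, 2 bytes), control.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    if opcode not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) opcode: 0x%02x" % opcode)
    return OPCODES[opcode], lba, blocks


if __name__ == "__main__":
    # The two distinct CDBs from the mps(4) messages above.
    for raw in ("28 00 06 7e 4d 53 00 00 10 00",
                "2a 00 06 f8 c5 1f 00 00 38 00"):
        name, lba, blocks = decode_cdb10(raw)
        print("%-9s LBA 0x%08x (%d), %d blocks" % (name, lba, lba, blocks))
```

Under that assumption, the READ starts at LBA 0x067e4d53 and covers 16 blocks, and the retried WRITE starts at LBA 0x06f8c51f and covers 56 blocks. Both numbers are in logical blocks; converting them to byte offsets requires the sector size of da1, which the log does not show.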

But it seems all drives were lost again (although the kernel messages
could no longer be printed): from another, still responsive session
(memory-disk rootfs) I could get the zpool status, and ZFS reported
all members as having outstanding requests:
  pool: cetusPsys
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        cetusPsys               ONLINE     370    13     0
          mirror-0              ONLINE      40    12     0
            gpt/cetusSYSzd1of4  ONLINE       3    26     0
            da2                 ONLINE       3    16     0
          mirror-1              ONLINE     700     9     0
            gpt/cetusSYSzd2of4  ONLINE       3     9     0
            da3                 ONLINE       3    54     0

I'll do anything I can to help track down this problem, since the one
thing has happened that I took massive precautions to prevent... a
freezing hypervisor :-(

Thanks,

-harry

(In case anyone is following any of my other recent PRs: this time, no
passthru-enabled VM was involved. The latter causes some very serious
memory corruption, IMHO... This machine is a Xeon E3 with ECC; neither
MBC nor MCE reports ECC errors...)

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5937B6C6.9020300>
