FreeBSD Mail Archives

Date:      Sat, 2 Feb 2019 12:02:56 -0600
From:      Karl Denninger <karl@denninger.net>
To:        FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject:   9211 (LSI/SAS) issues on 11.2-STABLE
Message-ID:  <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net>

I recently started having some really oddball things happening under
stress.  This coincided with the machine being updated to 11.2-STABLE
(FreeBSD 11.2-STABLE #1 r342918:) from 11.1.

Specifically, I get "errors" like this:

        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00 length 131072 SMID 269 Aborting command 0xfffffe0001179110
mps0: Sending reset from mpssas_send_abort for target ID 37
        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00 length 131072 SMID 924 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
        (da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00 length 131072 SMID 161 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 37
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bc 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 bb 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: Command timeout
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): READ(10). CDB: 28 00 af 82 ba 08 00 01 00 00
(da12:mps0:0:37:0): CAM status: SCSI Status Error
(da12:mps0:0:37:0): SCSI status: Check Condition
(da12:mps0:0:37:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da12:mps0:0:37:0): Retrying command (per sense data)
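
For reference, the CDB bytes in the log lines above can be decoded by
hand.  The following is only a minimal Python sketch, assuming the
standard READ(10) layout (opcode 0x28, 4-byte big-endian LBA in bytes
2-5, 2-byte transfer length in bytes 7-8) and 512-byte logical blocks.
The block size is an assumption and is not stated in the log; the
function name decode_read10 is hypothetical.

```python
BLOCK_SIZE = 512  # assumption: 512-byte logical blocks (not stated in the log)


def decode_read10(cdb_hex):
    """Return (lba, blocks) for a READ(10) CDB given as hex byte pairs."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    # The three CDBs that appear in the log above.
    for cdb in (
        "28 00 af 82 ba 08 00 01 00 00",
        "28 00 af 82 bb 08 00 01 00 00",
        "28 00 af 82 bc 08 00 01 00 00",
    ):
        lba, blocks = decode_read10(cdb)
        print(f"LBA {lba:#x}  blocks {blocks}  bytes {blocks * BLOCK_SIZE}")
```

Under those assumptions each command is a 256-block read, which at 512
bytes per block is 131072 bytes and matches the "length 131072" the
driver reports.  The three LBAs are 0x100 apart, so these are
back-to-back sequential reads, as would be expected during a scrub.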

The "Unit Attention" implies the drive was reset.  It only occurs on
certain drives under very heavy load (e.g. a scrub.)  I've managed to
provoke it on two different brands of disk across multiple firmware
versions and capacities, however, which tends to point away from a
drive firmware problem.
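
To check which drives are affected, the CAM status lines can be tallied
per device.  This is a rough Python sketch, not a tested tool: the
regular expression is modeled only on the log lines quoted above, the
function name tally_cam_status is hypothetical, and on a real system the
lines would have to come from wherever the kernel messages are logged.

```python
import re
from collections import Counter

# Matches lines such as:
#   (da12:mps0:0:37:0): CAM status: Command timeout
CAM_STATUS = re.compile(
    r"^\s*\((da\d+):mps\d+:\d+:\d+:\d+\): CAM status: (.+?)\s*$"
)


def tally_cam_status(lines):
    """Count (device, CAM status) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = CAM_STATUS.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Sample input copied from the log excerpt above.
SAMPLE = """\
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): Retrying command
(da12:mps0:0:37:0): CAM status: Command timeout
(da12:mps0:0:37:0): CAM status: CCB request completed with an error
(da12:mps0:0:37:0): CAM status: SCSI Status Error
"""

if __name__ == "__main__":
    for (dev, status), n in sorted(tally_cam_status(SAMPLE.splitlines()).items()):
        print(f"{dev}  {n:3d}  {status}")
```

Grouping by device makes it easier to see whether the errors follow
particular drives or are spread across everything behind the expander.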

A look at the pool data shows /no/ errors (e.g. no checksum problems,
etc) and a look at the disk itself (using smartctl) shows no problems
either -- whatever is going on here, the adapter is recovering from it
without any data corruption or loss registered on *either end*!

The configuration is an older SuperMicro Xeon board (X8DTL-IF) and shows:

mps0: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem 0xfbb3c000-0xfbb3ffff,0xfbb40000-0xfbb7ffff irq 30 at device 0.0 on pci3
mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

There is also a SAS expander connected to that with all but the boot
drives on it (the LSI card will not boot from the expander so the boot
mirror is directly connected to the adapter.)

Thinking this might be a firmware/driver compatibility problem, I
flashed the card to 20.00.07.00, which is the latest available.  That
made the situation **MUCH** worse; now instead of getting unit attention
issues I got *controller* resets (!!) which invariably caused some
random device (and sometimes more than one) in one of the pools to get
detached, as the controller didn't come back up fast enough for ZFS and
it declared the device(s) in question "removed".

Needless to say I immediately flashed the card back to 19.00.00.00!

This configuration has been completely stable on 11.1 for upwards of a
year, and only started misbehaving when I updated the OS to 11.2.  I've
pounded the living daylights out of this box for a very long time on a
succession of FreeBSD OS builds and up to 11.1 have never seen anything
like this; if I had a bad drive, it was clearly the drive.

Looking at the commit logs for the mps driver it appears there isn't
much here that *could* be involved, unless there's an interrupt issue
with some of the MSI changes that is interacting with my specific
motherboard line.

Any ideas on running this down would be appreciated; it's not easy to
trigger it on the 19.00.00.00 firmware, but on 20.00.07.00 I can force a
controller reset and detach within a few minutes by running scrubs, so
if there are things I can try (I have a sandbox machine with the same
hardware in it that won't make me cry much if I blow it up) that would
be great.

Thanks!

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/

