FreeBSD Mail Archives

Date:      Sat, 20 Apr 2019 16:26:01 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID:  <e90494e9-9d6d-ce19-05db-3ebb06d00766@denninger.net>
In-Reply-To: <8108da18-2cdd-fa29-983c-3ae7be6be412@multiplay.co.uk>
References:  <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net> <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org> <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net> <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com> <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net> <3d2ad225-b223-e9db-cce8-8250571b92c9@FreeBSD.org> <2bc8a172-6168-5ba9-056c-80455eabc82b@denninger.net> <CACpH0MfmPzEO5BO2kFk8-F1hP9TsXEiXbfa1qxcvB8YkvAjWWw@mail.gmail.com> <2c23c0de-1802-37be-323e-d390037c6a84@denninger.net> <864062ab-f68b-7e63-c3da-539d1e9714f9@denninger.net> <6dc1bad1-05b8-2c65-99d3-61c547007dfe@denninger.net> <758d5611-c3cf-82dd-220f-a775a57bdd0b@multiplay.co.uk> <3f53389a-0cb5-d106-1f64-bbc2123e975c@denninger.net> <8108da18-2cdd-fa29-983c-3ae7be6be412@multiplay.co.uk>



No; I can, but of course that's another ~8 hour (overnight) delay
between swaps.

That's not a bad idea however....

On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for the extra info. The next question would be: have you ruled
> out that the corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
>     Regards
>     Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required lying around to do that. I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>>
>>> I've just set up an old server which has an LSI 2008 running an old
>>> FW (11.0), so I was going to have a go at reproducing this.
>>>
>>> Apart from the disconnect steps below is there anything else needed
>>> e.g. read / write workload during disconnect?
>>
>> Yes.  An attempt to recreate this on my sandbox machine using smaller
>> disks (WD RE-320s) and a decent amount of read/write activity (tens
>> to ~100 gigabytes) on a root mirror of three disks with one taken
>> offline did not succeed.  It *reliably* appears, however, on my
>> backup volumes with every drive swap. The sandbox machine is
>> physically identical other than the physical disks; both are Xeons
>> with ECC RAM in them.
>>
>> The only operational difference is that the backup volume sets have a
>> *lot* of data written to them via zfs send | zfs recv over the
>> intervening period, whereas with "ordinary" I/O activity (which was
>> the case on my sandbox) the I/O pattern is materially different.  The
>> root pool on the sandbox where I tried to reproduce it synthetically
>> *is* using geli (in fact it boots native-encrypted).
>>
>> The "ordinary" resilver on a disk swap typically covers ~2-3 TB and is
>> a ~6-8 hour process.
>>
>> The usual process for the backup pool looks like this:
>>
>> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>>
>> Over the space of a week, the backup script is run daily.  It first
>> imports the pool and then for each zfs filesystem it is backing up
>> (which is not all of them; I have a few volatile ones that I don't
>> care if I lose, such as object directories for builds and such, plus
>> some that are R/O data sets that are backed up separately) it does:
>>
>> If there is no "...@zfs-base":
>>
>>     zfs snapshot -r ...@zfs-base
>>     zfs send -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>>
>> else:
>>
>>     zfs rename -r ...@zfs-base ...@zfs-old
>>     zfs snapshot -r ...@zfs-base
>>     zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP
>>
>> If that succeeds, zfs destroy -vr ...@zfs-old; otherwise print a
>> complaint and stop.
>>
>> When all are complete it then does a "zpool export backup" to detach
>> the pool in order to reduce the risk of "stupid root user" (me)
>> accidents.
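
A minimal sh sketch of the per-filesystem step described above, assuming the dataset name is passed in as an argument. The function name `backup_fs`, the `ZFS` override variable, and the default value of `BACKUP` are illustrative and not taken from the actual script; dataset selection, the wait for destroys to drain, and the final `zpool export` are not shown.

```shell
#!/bin/sh
# ZFS and BACKUP can be overridden so the logic can be exercised
# without touching a real pool (both names are hypothetical).
ZFS="${ZFS:-zfs}"
BACKUP="${BACKUP:-backup}"

backup_fs() {
    fs="$1"
    if ! $ZFS list -H -t snapshot -o name "${fs}@zfs-base" >/dev/null 2>&1; then
        # No base snapshot yet: take one and send a full replication stream.
        $ZFS snapshot -r "${fs}@zfs-base" || return 1
        $ZFS send -R "${fs}@zfs-base" | $ZFS receive -Fuvd "$BACKUP"
    else
        # Rotate the base snapshot, then send everything between old and new.
        $ZFS rename -r "${fs}@zfs-base" "${fs}@zfs-old" || return 1
        $ZFS snapshot -r "${fs}@zfs-base" || return 1
        # The pipeline status is that of the receive side; a failed send
        # is expected to surface there as a truncated stream.
        if $ZFS send -RI "${fs}@zfs-old" "${fs}@zfs-base" | $ZFS receive -Fudv "$BACKUP"; then
            $ZFS destroy -vr "${fs}@zfs-old"
        else
            echo "backup of ${fs} failed; keeping ${fs}@zfs-old" >&2
            return 1
        fi
    fi
}
```

Keeping `@zfs-old` on failure means the next run still has a common snapshot on both sides to send an incremental from.
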
>>
>> In short I send an incremental of the changes since the last backup,
>> which in many cases includes a bunch of automatic snapshots that are
>> taken on a frequent basis out of cron. Typically a week's worth of
>> these accumulate between swaps of the disk to the vault, and the
>> offline'd disk remains that way for a week.
>>
>> I also wait for the zfs destroy on each of the targets to drain
>> before continuing. Not doing so back in the 9.x and 10.x days was a
>> good way to stimulate an instant panic on re-import the next day, due
>> to kernel stack page exhaustion, if the previous operation destroyed
>> hundreds of gigabytes of snapshots. That does routinely happen: part
>> of the backed-up data is Macrium images from PCs, so when a new month
>> comes around the PC's backup routine removes a huge amount of old
>> data from the filesystem.
>>
>> Trying to simulate the checksum errors in a few hours' time thus far
>> has failed.  But every time I swap the disks on a weekly basis I get
>> a handful of checksum errors on the scrub. If I export and re-import
>> the backup mirror after that the counters are zeroed -- the checksum
>> error count does *not* remain across an export/import cycle although
>> the "scrub repaired" line remains.
>>
>> For example after the scrub completed this morning I exported the
>> pool (the script expects the pool exported before it begins) and ran
>> the backup.  When it was complete:
>>
>> root@NewFS:~/backup-zfs # zpool status backup
>>   pool: backup
>>  state: DEGRADED
>> status: One or more devices has been taken offline by the administrator.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Online the device using 'zpool online' or replace the device with
>>         'zpool replace'.
>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
>> config:
>>
>>         NAME                      STATE     READ WRITE CKSUM
>>         backup                    DEGRADED     0     0     0
>>           mirror-0                DEGRADED     0     0     0
>>             gpt/backup61.eli      ONLINE       0     0     0
>>             gpt/backup62-1.eli    ONLINE       0     0     0
>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>
>> errors: No known data errors
>>
>> It knows it fixed the checksums but the error count is zero -- I did
>> NOT "zpool clear".
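
Since the counters do not survive an export/import cycle, one option is to record them before the pool is exported. The following is only a sketch: the function name `cksum_errors` and the log path in the usage comment are hypothetical, and the parsing assumes the column layout of the `zpool status` output shown in this thread.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print "device count" for every
# vdev line whose CKSUM column (5th field) is non-zero.
cksum_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|OFFLINE|FAULTED|REMOVED|UNAVAIL)$/ &&
         $5 != "0" { print $1, $5 }'
}

# Hypothetical usage, run before the export so the counts are kept:
#   zpool status backup | cksum_errors >> /var/log/backup-cksum.log
```
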
>>
>> This may have been present in 11.2; I didn't run that long enough in
>> this environment to know.  It definitely was *not* present in 11.1
>> and before.  The same data structure and backup script have been in
>> use for a very long time without any changes, and this first appeared
>> when I upgraded from 11.1 to 12.0 on this specific machine, with the
>> exact same physical disks in use for over a year (they're currently
>> 6 TB units; the last change-out was ~1.5 years ago, when I went from
>> 4 TB to 6 TB volumes).  I have both HGST-NAS and He-Enterprise disks in
>> the rotation and both show identical behavior, so it doesn't appear
>> to be related to a firmware problem in one disk vs. the other (e.g.
>> firmware that fails to flush the on-drive cache before going to
>> standby even though it was told to).
>>
>>>
>>> mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
>>> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
>>>
>>>     Regards
>>>     Steve
>>>
>>> On 20/04/2019 15:39, Karl Denninger wrote:
>>>> I can confirm that 20.00.07.00 does *not* stop this.
>>>> The previous write/scrub on this device was on 20.00.07.00. It was
>>>> swapped back in from the vault yesterday, resilvered without incident,
>>>> but a scrub says....
>>>>
>>>> root@NewFS:/home/karl # zpool status backup
>>>>   pool: backup
>>>>  state: DEGRADED
>>>> status: One or more devices has experienced an unrecoverable error.  An
>>>>         attempt was made to correct the error.  Applications are unaffected.
>>>> action: Determine if the device needs to be replaced, and clear the errors
>>>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>>>    see: http://illumos.org/msg/ZFS-8000-9P
>>>>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
>>>> config:
>>>>
>>>>         NAME                      STATE     READ WRITE CKSUM
>>>>         backup                    DEGRADED     0     0     0
>>>>           mirror-0                DEGRADED     0     0     0
>>>>             gpt/backup61.eli      ONLINE       0     0     0
>>>>             gpt/backup62-1.eli    ONLINE       0     0    47
>>>>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>>>>
>>>> errors: No known data errors
>>>>
>>>> So this is firmware-invariant (at least between 19.00.00.00 and
>>>> 20.00.07.00); the issue persists.
>>>>
>>>> Again, in my instance these devices are never removed "unsolicited", so
>>>> there can't be (or at least shouldn't be) unflushed data in the
>>>> device or kernel cache.  The procedure is and remains:
>>>>
>>>> zpool offline .....
>>>> geli detach .....
>>>> camcontrol standby ...
>>>>
>>>> Wait a few seconds for the spindle to spin down.
>>>>
>>>> Remove disk.
>>>>
>>>> Then of course on the other side after insertion and the kernel has
>>>> reported "finding" the device:
>>>>
>>>> geli attach ...
>>>> zpool online ....
>>>>
>>>> Wait...
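
The removal half of the procedure above could be combined with the pre-removal scrub suggested elsewhere in this thread, roughly as sketched below. The function names and the three arguments are hypothetical (the original commands elide theirs), and treating a repaired amount of `0` or `0B` as "clean" is an assumption about how the scan line is printed. On these pools the scrub itself takes on the order of nine to ten hours.

```shell
#!/bin/sh
# Extract the "repaired" amount from the scan line of `zpool status` (stdin).
scrub_repaired() {
    sed -n 's/.*scrub repaired \([^ ]*\) in .*/\1/p'
}

# Scrub, wait for completion, and refuse to offline the disk if the scrub
# had to repair anything. All three arguments are supplied by the caller:
# pool name, geli provider as shown in `zpool status`, and the CAM device.
offline_for_vault() {
    pool="$1" provider="$2" disk="$3"
    zpool scrub "$pool" || return 1
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep 60
    done
    repaired=$(zpool status "$pool" | scrub_repaired)
    case "$repaired" in
        0|0B) ;;   # nothing repaired (assumed spellings)
        *) echo "scrub repaired ${repaired:-unknown}; not removing disk" >&2
           return 1 ;;
    esac
    zpool offline "$pool" "$provider" &&
    geli detach "$provider" &&
    camcontrol standby "$disk"
}
```

If the scrub is clean before removal and errors still appear after the disk comes back, that narrows the corruption to the offline period or the resilver.
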
>>>>
>>>> If this is a boogered TXG that's held in the metadata for the
>>>> "offline"'d device (maybe "off by one"?), that's potentially bad: if
>>>> there is an unknown failure in the other mirror component, the
>>>> resilver will complete but data has been irrevocably destroyed.
>>>>
>>>> Granted, this is a very low probability scenario (the area where the
>>>> bad checksums are has to be where the corruption hits, and it has to
>>>> happen between the resilver and access to that data).  Those are long
>>>> odds, but nonetheless a window of "you're hosed" does appear to exist.
>>>>
>>>
>> -- 
>> Karl Denninger
>> karl@denninger.net <mailto:karl@denninger.net>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
