Date:      Thu, 8 Sep 2016 14:46:04 +0300
From:      Ruslan Makhmatkhanov <rm@FreeBSD.org>
To:        Maurizio Vairani <maurizio.vairani@cloverinformatica.it>, Ruslan Makhmatkhanov <rm@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS-8000-8A: assistance needed
Message-ID:  <11f2a660-304d-4a50-bab8-ec2a2737b3a4@FreeBSD.org>
In-Reply-To: <b9edb1ae-b59a-aefc-f547-1fb69e79f0f7@cloverinformatica.it>
References:  <c6e3d35a-d554-a809-4959-ee858c38aca7@FreeBSD.org> <b9edb1ae-b59a-aefc-f547-1fb69e79f0f7@cloverinformatica.it>

Maurizio, many thanks for your answers. They made some things clearer, but 
some still aren't. Please see my comments inline:

Maurizio Vairani wrote on 09/08/2016 10:39:
> Hi Ruslan,
>
> Il 06/09/2016 22:00, Ruslan Makhmatkhanov ha scritto:
>> Hello,
>>
>> I've got something new here and just not sure where to start on
>> solving that. It's on 10.2-RELEASE-p7 amd64.
>>
>> """
>> root:~ # zpool status -xv
>>   pool: storage_ssd
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>     corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>     entire pool from backup.
>>    see: http://illumos.org/msg/ZFS-8000-8A
>>   scan: scrub repaired 0 in 0h26m with 5 errors on Tue Aug 23 00:40:24
>> 2016
>> config:
>>
>>     NAME              STATE     READ WRITE CKSUM
>>     storage_ssd       ONLINE       0     0 59.3K
>>       mirror-0        ONLINE       0     0     0
>>         gpt/drive-06  ONLINE       0     0     0
>>         gpt/drive-07  ONLINE       0     0     9
>>       mirror-1        ONLINE       0     0  119K
>>         gpt/drive-08  ONLINE       0     0  119K
>>         gpt/drive-09  ONLINE       0     0  119K
>>     cache
>>       mfid5           ONLINE       0     0     0
>>       mfid6           ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         <0x1bd0a>:<0x8>
>>         <0x31f23>:<0x8>
>>         /storage_ssd/f262f6ebaf5011e39ca7047d7bb28f4a/disk
>>         /storage_ssd/7ba3f661fa9811e3bd9d047d7bb28f4a/disk
>>         /storage_ssd/2751d305ecba11e3aef0047d7bb28f4a/disk
>>         /storage_ssd/6aa805bd22e911e4b470047d7bb28f4a/disk
>> """
>>
>> The pool looks OK, if I understand correctly, but we have a slowdown
>> in the Xen VMs that are using these disks via iSCSI. So can anybody
>> please explain what exactly that means?
> The OS retries the read and/or write operation and you notice a slowdown.
>>
>> 1. Am I right that we have a hardware failure that led to data
>> corruption?
> Yes.
>> If so, how do I identify the failed disk(s)?
> The disk containing gpt/drive-07, the disk with gpt/drive-08 and the
> disk with gpt/drive-09. With smartctl you can read the SMART status of
> the disks for more info. I use smartd with HDDs and SSDs and it
> usually warns me about a failing disk before ZFS does.

I checked the disks with those labels and all of them seem OK:

gpart show excerpt:
"""
Geom name: mfid2
1. Name: mfid2p1
    label: drive-07
Geom name: mfid3
1. Name: mfid3p1
    label: drive-08
Geom name: mfid4
1. Name: mfid4p1
    label: drive-09
"""

I checked them all with smartctl -a -d sat /dev/pass2[3,4] and got:

"""
SMART Status not supported: ATA return descriptor not supported by 
controller firmware
SMART overall-health self-assessment test result: PASSED
"""

And they also seem OK in mfiutil:
root:~ # mfiutil show volumes
mfi0 Volumes:
   Id     Size    Level   Stripe  State   Cache   Name
  mfid0 (  838G) RAID-0     256K OPTIMAL Disabled
  mfid1 (  838G) RAID-0     256K OPTIMAL Disabled
  mfid2 (  476G) RAID-0     256K OPTIMAL Disabled
  mfid3 (  476G) RAID-0     256K OPTIMAL Disabled
  mfid4 (  476G) RAID-0     256K OPTIMAL Disabled
  mfid5 (  167G) RAID-0      64K OPTIMAL Writes
  mfid6 (  167G) RAID-0      64K OPTIMAL Writes
  mfid7 (  476G) RAID-0      64K OPTIMAL Writes
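
The volumes are all OPTIMAL, but since each of them is a single-drive
RAID-0 I suspect the volume state may not reflect media errors on the
underlying physical drive, so I'll also look at the physical drives and
the controller event log, roughly like this:

"""
# physical drive list and state as seen by the controller
mfiutil show drives
# controller event log
mfiutil show events
"""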

The only drives whose SMART status I wasn't able to read are mfid0/mfid1 
(they are SCSI-6 drives according to mfiutil):

smartctl -d scsi -a /dev/pass0[1]
Standard Inquiry (36 bytes) failed [Input/output error]
Retrying with a 64 byte Standard Inquiry
Standard Inquiry (64 bytes) failed [Input/output error]

But I'm not sure I'm using the correct type in the -d switch (I tried all 
of them). These drives don't seem to participate in that zpool, but I'm 
interested in checking them too.
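
To rule out a mapping mistake on my side, I'll also double-check which
pass device corresponds to which disk before fiddling with -d any
further, e.g.:

"""
# list CAM devices together with their peripheral (passN) names
camcontrol devlist -v
"""

(assuming the controller exposes those drives through CAM at all).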

So, how did you determine that those particular drives (with GPT labels 
drive-07, drive-08 and drive-09) are the failing ones?


>> and how is it possible that data got corrupted on a ZFS mirror?
> If the sectors holding the same data are damaged on both disks.
>> Is there anything I can do to recover except restoring from backup?
> Probably not, but you can check whether the iSCSI disk in the Xen VM is
> still usable.

It is usable, but the system is very slow. I'll try to read it with dd 
to check whether there are any problems.
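
Something like this per affected file, unless there is a better way:

"""
# sequential read of one of the affected backing files; a permanent
# error should surface as an I/O error at the bad offset
dd if=/storage_ssd/f262f6ebaf5011e39ca7047d7bb28f4a/disk of=/dev/null bs=1m
"""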

>>
>> 2. What are the first and second damaged "files", and why are they
>> shown like that?
> ZFS metadata.
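
OK. If I understand the format correctly, those two entries are
<dataset id>:<object id> pairs that could no longer be resolved to a
path. I'll try to map the dataset IDs back to names with zdb (0x1bd0a
is 113930 and 0x31f23 is 204579 in decimal) - something like this,
untested here yet:

"""
# zdb -d lists each dataset together with its numeric ID
zdb -d storage_ssd | grep 'ID 113930'
zdb -d storage_ssd | grep 'ID 204579'
"""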
>>
>> I have this in /var/log/messages, but to me it looks like an iSCSI
>> message that springs up when accessing the damaged files:
>>
>> """
>> kernel: (1:32:0/28): WRITE command returned errno 122
>> """
> Probably in /var/log/messages you can read messages like this:
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): CAM status:
> ATA Status Error
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): ATA status:
> 51 (DRDY SERV ERR), error: 40 (UNC )
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): RES: 51 40 e8
> 0f a6 40 44 00 00 08 00
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): Error 5,
> Retries exhausted
>
> In this message the /dev/ada3 HDD is failing.

Actually, I have no messages like that - only that "kernel: (1:32:0/28): 
WRITE command returned errno 122", with no drive indication. Anyway, if a 
disk is failing, shouldn't I see an indication of that in the mfiutil 
drive status and in zpool status itself? (I'm confused that these disks 
are still listed ONLINE while they are supposedly failing.) I also checked 
the logs in the VMs themselves - there are no complaints about disk I/O 
failures...
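
On the errno 122 part: I'd also like to find out which error code 122
actually is on this box, since it doesn't look like a standard errno to
me; it probably comes from a compat definition somewhere in the ZFS code.
With the sources installed, something like this should turn it up
(ECKSUM is only my guess - it may be something else entirely):

"""
# ECKSUM is my guess for what gets defined as 122 in the compat headers
grep -rn 'ECKSUM' /usr/src/sys/cddl
# and, more generally, anything defined as 122 under the compat tree
grep -rn 'define.*122' /usr/src/sys/cddl/compat
"""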


>> A manual zpool scrub was tried on this pool, to no avail. The pool is
>> only 66% full.
>>
>> Thanks for any hints in advance.
>>
> Maurizio
>
>
>


-- 
Regards,
Ruslan

T.O.S. Of Reality


