Date: Thu, 8 Sep 2016 14:46:04 +0300
From: Ruslan Makhmatkhanov <rm@FreeBSD.org>
To: Maurizio Vairani <maurizio.vairani@cloverinformatica.it>, Ruslan Makhmatkhanov <rm@FreeBSD.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS-8000-8A: assistance needed
Message-ID: <11f2a660-304d-4a50-bab8-ec2a2737b3a4@FreeBSD.org>
In-Reply-To: <b9edb1ae-b59a-aefc-f547-1fb69e79f0f7@cloverinformatica.it>
References: <c6e3d35a-d554-a809-4959-ee858c38aca7@FreeBSD.org> <b9edb1ae-b59a-aefc-f547-1fb69e79f0f7@cloverinformatica.it>
Maurizio, many thanks for your answers. They made some things clearer, but
some still aren't. Please see my comments inline:

Maurizio Vairani wrote on 09/08/2016 10:39:
> Hi Ruslan,
>
> On 06/09/2016 22:00, Ruslan Makhmatkhanov wrote:
>> Hello,
>>
>> I've got something new here and I'm just not sure where to start on
>> solving it. It's on 10.2-RELEASE-p7 amd64.
>>
>> """
>> root:~ # zpool status -xv
>>   pool: storage_ssd
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>         entire pool from backup.
>>    see: http://illumos.org/msg/ZFS-8000-8A
>>   scan: scrub repaired 0 in 0h26m with 5 errors on Tue Aug 23 00:40:24 2016
>> config:
>>
>>         NAME              STATE     READ WRITE CKSUM
>>         storage_ssd       ONLINE       0     0 59.3K
>>           mirror-0        ONLINE       0     0     0
>>             gpt/drive-06  ONLINE       0     0     0
>>             gpt/drive-07  ONLINE       0     0     9
>>           mirror-1        ONLINE       0     0  119K
>>             gpt/drive-08  ONLINE       0     0  119K
>>             gpt/drive-09  ONLINE       0     0  119K
>>         cache
>>           mfid5           ONLINE       0     0     0
>>           mfid6           ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         <0x1bd0a>:<0x8>
>>         <0x31f23>:<0x8>
>>         /storage_ssd/f262f6ebaf5011e39ca7047d7bb28f4a/disk
>>         /storage_ssd/7ba3f661fa9811e3bd9d047d7bb28f4a/disk
>>         /storage_ssd/2751d305ecba11e3aef0047d7bb28f4a/disk
>>         /storage_ssd/6aa805bd22e911e4b470047d7bb28f4a/disk
>> """
>>
>> The pool looks OK, if I understand correctly, but we have a slowdown
>> in the Xen VMs that are using these disks via iSCSI. So can anybody
>> please explain what exactly that means?
> The OS retries the read and/or write operation and you notice a slowdown.
>>
>> 1. Am I right that we have a hardware failure that led to data
>> corruption?
> Yes.
>> If so, how do I identify the failed disk(s)
> The disk containing gpt/drive-07, the disk with gpt/drive-08 and the
> disk with gpt/drive-09. With smartctl you can read the SMART status of
> the disks for more info. I use smartd with HDDs and SSDs and it usually
> warns me about a failing disk before ZFS does.

I checked the disks with those labels and all of them seem OK.

gpart show excerpt:

"""
Geom name: mfid2
1. Name: mfid2p1
   label: drive-07

Geom name: mfid3
1. Name: mfid3p1
   label: drive-08

Geom name: mfid4
1. Name: mfid4p1
   label: drive-09
"""

I checked them all with smartctl -a -d sat /dev/pass2[3,4] and got:

"""
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
"""

And they also seem OK in mfiutil:

root:~ # mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State    Cache     Name
 mfid0 (  838G) RAID-0     256K  OPTIMAL  Disabled
 mfid1 (  838G) RAID-0     256K  OPTIMAL  Disabled
 mfid2 (  476G) RAID-0     256K  OPTIMAL  Disabled
 mfid3 (  476G) RAID-0     256K  OPTIMAL  Disabled
 mfid4 (  476G) RAID-0     256K  OPTIMAL  Disabled
 mfid5 (  167G) RAID-0      64K  OPTIMAL  Writes
 mfid6 (  167G) RAID-0      64K  OPTIMAL  Writes
 mfid7 (  476G) RAID-0      64K  OPTIMAL  Writes

I only wasn't able to read the SMART status of mfid0/mfid1 (they are
SCSI-6 drives according to mfiutil):

smartctl -d scsi -a /dev/pass0[1]
Standard Inquiry (36 bytes) failed [Input/output error]
Retrying with a 64 byte Standard Inquiry
Standard Inquiry (64 bytes) failed [Input/output error]

But I'm not sure I'm using the correct type in the -d switch (I tried all
of them). These drives don't seem to participate in that zpool, but I'm
interested in checking them too.
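[One way that might work for reaching per-drive SMART data behind an
mfi(4) controller is smartmontools' MegaRAID pass-through device type.
This is only a sketch: whether it works depends on the smartmontools
build and the controller firmware, and the drive numbers below are
placeholders, not values taken from this system.]

"""
# list the physical drives behind the controller (slot, state, size)
mfiutil show drives
mfiutil show config

# controller-logged events sometimes record media errors per drive
mfiutil show events

# SMART via the MegaRAID pass-through; N is the controller's physical
# drive number (0 and 1 here are placeholders - adjust to the real numbering)
smartctl -a -d megaraid,0 /dev/mfi0
smartctl -a -d megaraid,1 /dev/mfi0
"""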
So, I'm interested in how you determined that those particular drives
(with GPT labels drive-07, drive-08 and drive-09) are failing.

>> and how is it possible that data is corrupted on a ZFS mirror?
> If in both disks the sectors with the same data are damaged.
>> Is there anything I can do to recover except restoring from backup?
> Probably no, but you can check the iSCSI disk in the Xen VM if it is
> usable.

It is usable, but the system is very slow. I'll try to read it with dd
to check if there are any troubles (a sketch of that check is appended
below the signature).

>>
>> 2. What are the first and second damaged "files" and why are they
>> shown like that?
> ZFS metadata.
>>
>> I have this in /var/log/messages, but to me it looks like an iSCSI
>> message that springs up when accessing the damaged files:
>>
>> """
>> kernel: (1:32:0/28): WRITE command returned errno 122
>> """
> Probably in /var/log/messages you can read messages like this:
>
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): CAM status:
> ATA Status Error
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): ATA status:
> 51 (DRDY SERV ERR), error: 40 (UNC )
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): RES: 51 40 e8
> 0f a6 40 44 00 00 08 00
> Aug 27 03:02:19 clover-nas2 kernel: (ada3:ahcich15:0:0:0): Error 5,
> Retries exhausted
>
> In these messages the /dev/ada3 HDD is failing.

Actually, I have no messages like that - only that "kernel: (1:32:0/28):
WRITE command returned errno 122" with no drive indication. Anyway, if a
disk is failing, shouldn't I see an indication of that in the mfiutil
drive status and in zpool status itself? (I'm confused that these disks
are still listed ONLINE while they are failing.) I also checked the logs
inside the VMs themselves - there are no complaints about disk I/O
failures...

>> A manual zpool scrub was tried on this pool to no avail. The pool
>> capacity is only 66% full.
>>
>> Thanks for any hints in advance.
>>
> Maurizio
>
>

-- 
Regards,
Ruslan

T.O.S. Of Reality
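[A sketch of the dd read check and follow-up scrub mentioned above. The
path is one of the affected files from the zpool status output; the block
size and flags are just reasonable defaults, not specific to this setup.]

"""
# read one of the affected file-backed extents end to end; blocks that
# fail the ZFS checksum should surface as I/O errors, and conv=noerror
# lets dd keep going past them
dd if=/storage_ssd/f262f6ebaf5011e39ca7047d7bb28f4a/disk of=/dev/null bs=1m conv=noerror

# after the affected files have been restored or recreated from backup,
# reset the error counters and scrub again to confirm the pool is clean
zpool clear storage_ssd
zpool scrub storage_ssd
zpool status -v storage_ssd
"""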