Date: Tue, 8 Aug 2017 12:23:28 +0500
From: "Eugene M. Zheganin" <emz@norma.perm.ru>
To: freebsd-fs@freebsd.org
Cc: freebsd-stable@FreeBSD.org
Subject: Re: a strange and terrible saga of the cursed iSCSI ZFS SAN
Message-ID: <ea43806f-95fc-177b-cd6f-183cb6d012d8@norma.perm.ru>
In-Reply-To: <1bd10b1e-0583-6f44-297e-3147f6daddc5@norma.perm.ru>
References: <1bd10b1e-0583-6f44-297e-3147f6daddc5@norma.perm.ru>

On 05.08.2017 22:08, Eugene M. Zheganin wrote:
> Hi,
>
>
> I got a problem that I cannot solve just by myself. I have a iSCSI zfs
> SAN system that crashes, corrupting it's data. I'll be short, and try
> to describe it's genesis shortly:
>
> 1) autumn 2016, SAN is set up, supermicro server, external JBOD,
> sandisk ssds, several redundant pools, FreeBSD 11.x (probably release,
> don't really remember - see below).
>
> 2) this is working just fine until early spring 2017
>
> 3) system starts to crash (various panics):
>
> panic: general protection fault
> panic: page fault
> panic: Solaris(panic): zfs: allocating allocated
> segment(offset=6599069589504 size=81920)
> panic: page fault
> panic: page fault
> panic: Solaris(panic): zfs: allocating allocated
> segment(offset=8245779054592 size=8192)
> panic: page fault
> panic: page fault
> panic: page fault
> panic: Solaris(panic): zfs: allocating allocated
> segment(offset=1792100934656 size=46080)
>
> 4) we memtested it immidiately, no problems found.
>
> 5) we switch sandisks to toshibas, we switch also the server to an
> identical one, JBOD to an identical one, leaving same cables.
>
> 6) crashes don't stop.
>
> 7) we found that field engineers physically damaged (sic!) the SATA
> cables (main one and spare ones), and that 90% of the disks show ICRC
> SMART errors.
>
> 8) we replaced the cable (brand new HP one).
>
> 9) ATA SMART errors stopped increasing.
>
> 10) crashes continue.
>
> 11) we decided that probably when ZFS was moved over damaged cables
> between JBODs it was somehow damaged too, so now it's panicking
> because of that. so we wiped the data completely, reinitialized the
> SAN system and put it back into the production. we even dd'ed each
> disk with zeroes (!) - just in case. Important note: the data was
> imported using zfs send from another, stable system that is runing in
> production in another DC.
>
> 12) today we got another panic.
>
> btw the pools look now like this:
>
>
> # zpool status -v
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0    62
>           raidz1-0  ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>           raidz1-1  ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>           raidz1-2  ONLINE       0     0    62
>             da12    ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         data/userdata/worker208:<0x1>
>
>   pool: userdata
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: none requested
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         userdata           ONLINE       0     0  216K
>           mirror-0         ONLINE       0     0  432K
>             gpt/userdata0  ONLINE       0     0  432K
>             gpt/userdata1  ONLINE       0     0  432K
>
> errors: Permanent errors have been detected in the following files:
>
>         userdata/worker36:<0x1>
>         userdata/worker30:<0x1>
>         userdata/worker31:<0x1>
>         userdata/worker35:<0x1>
>
> 12) somewhere between p.5 and p.10 the pool became deduplicated (not
> directly connected to the problem, just for production reasons).
>
>
> So, concluding: we had bad hardware, we replaced EACH piece (server,
> adapter, JBOD, cable, disks), and crashes just don't stop. We have 5
> another iSCSI SAN systems, almost fully identical that don't crash.
> Crashes on this particular system began when it was running same set
> of versions that stable systems.
>

So far my leading theory is that something broke in the iSCSI+ZFS stack somewhere between r310734 (the most recent revision that works on my SAN systems) and r320056 (possibly earlier, but r320056 is the first revision with a documented crash). So I downgraded back to r310734 (from 11.1-RELEASE, which is affected, if I'm right).

Several things support this theory:

- the system was stable before spring 2017, before the upgrade happened
- ZFS corruption happens _only_ on the pools that iSCSI is serving from; no corruption happens on the ZFS pools that have nothing to do with providing zvols as iSCSI targets (and this seems to be the most convincing point).
- the faulty hardware was replaced. It was replaced with identical hardware, BUT I have the very same set of identical hardware working in an almost identical environment under r310734 in another DC.

So far I'm not sure, because only 20 hours have passed since the downgrade. However, if the system stays stable for more than a week (it was never stable that long on recent revisions), that will prove I'm right and I'll file the PR.

Thanks.
Eugene.