Date:        Fri, 13 Oct 2017 18:58:30 +0200
From:        Ben RUBSON <ben.rubson@gmail.com>
To:          Freebsd fs <freebsd-fs@freebsd.org>
Subject:     Re: ZFS stalled after some mirror disks were lost
Message-ID:  <13AF09F5-3ED9-4C81-A5E2-94BA770E991B@gmail.com>
In-Reply-To: <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>
References:  <4A0E9EB8-57EA-4E76-9D7E-3E344B2037D2@gmail.com>
             <DDCFAC80-2D72-4364-85B2-7F4D7D70BCEE@gmail.com>
             <82632887-E9D4-42D0-AC05-3764ABAC6B86@gmail.com>
             <20171007150848.7d50cad4@fabiankeil.de>
             <DFD0528D-549E-44C9-A093-D4A8837CB499@gmail.com>
             <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>
On 12 Oct 2017 22:52, InterNetX - Juergen Gotteswinter wrote:

> Ben started a discussion about his setup a few months ago, where he
> described what he is going to do. And, at least my (and i am pretty sure
> there where others, too) prophecy was that it will end up in a pretty
> unreliable setup (gremlins and other things are included!) which is far
> far away from being helpful in term of HA. A single node setup, with
> reliable hardware configuration and as little as possible moving parts,
> whould be way more reliable and flawless.

First, thank you for your answer Juergen, I do appreciate it, of course,
as well as the help you propose below. So thank you ! :)

Yes, the discussion, more than one year ago now on this list, was named
"HAST + ZFS + NFS + CARP", but it quickly moved towards iSCSI when I
initiated the idea :)

I must say, after one year of production, that I'm rather happy with this
setup. It works flawlessly (apart from the issue currently discussed, of
course), and I have switched from one node to the other several times,
successfully.

The main purpose of the previous discussion was to have a second chassis
able to host the pool in case of a failure of the first one.

> (...) if the underlying block device works like expected a
> error should be returned to zfs, but noone knows what happens in this
> setup during failure. maybe its some switch issue, or network driver bug
> which prevents this and stalls the pool.

The issue only happens when I disconnect the iSCSI drives, it does not
occur suddenly by itself. So I would say the issue is on the FreeBSD
side, not in the network hardware :)

Two distinct behaviours/issues :
- 1 : when I disconnect the iSCSI drives from the server running the pool
  (iscsictl -Ra), some iSCSI drives remain on the system, leaving ZFS
  stalled ;
- 2 : when I disconnect the iSCSI drives from the target side (shut the
  NIC down / shutdown ctld), the server running the pool sometimes panics
  (traces in my previous mail, 06/10).

> @ben
>
> can you post your iscsi network configuration including ctld.conf and so
> on? is your iscsi setup using multipath, lacp or is it just single pathed?

Sure. So I use single-pathed iSCSI.

### Target side :

# /etc/ctl.conf (all targets are equally configured) :
target iqn.............:hm1 {
	portal-group pg0 au0
	alias G1207FHRDP2SThm
	lun 0 {
		path /dev/gpt/G1207FHRDP2SThm
		serial G1207FHRDP2SThm
	}
}

So, each target has its GPT label to clearly & quickly identify it
(location, serial). A best practice, which I find very useful, taken from
the storage/ZFS books by Michael W Lucas & Allan Jude.

### Initiator side :

# /etc/iscsi.conf (all targets are equally configured) :
hm1 {
	InitiatorName = iqn.............
	TargetAddress = 192.168.2.2
	TargetName    = iqn.............:hm1
	HeaderDigest  = CRC32C
	DataDigest    = CRC32C
}

Then, each disk is geom-labeled (glabel label), so that the example disk
above appears on the initiator side as :
/dev/label/G1207FHRDP2SThm

Still the same naming best practice, which allows me to identify a disk
wherever it is, without mistake.

ZFS thus uses the /dev/label/ paths :

	NAME                       STATE     READ WRITE CKSUM
	home                       ONLINE       0     0     0
	  mirror-0                 ONLINE       0     0     0
	    label/G1203_serial_hm  ONLINE       0     0     0
	    label/G1204_serial_hm  ONLINE       0     0     0
	    label/G2203_serial_hm  ONLINE       0     0     0
	    label/G2204_serial_hm  ONLINE       0     0     0
	  mirror-1                 ONLINE       0     0     0
	    label/G1205_serial_hm  ONLINE       0     0     0
	    label/G1206_serial_hm  ONLINE       0     0     0
	    label/G2205_serial_hm  ONLINE       0     0     0
	    label/G2206_serial_hm  ONLINE       0     0     0
	cache
	  label/G2200_serial_ch    ONLINE       0     0     0
	  label/G2201_serial_ch    ONLINE       0     0     0

(G2* are local disks, G1* are iSCSI disks)
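
For completeness, here is roughly how one of these disks is attached and
labeled on the initiator side. This is only a sketch : the da10 device
name below is an example, not the real one on my systems.

	# Attach the target, using its nickname from /etc/iscsi.conf :
	iscsictl -A -n hm1
	# Say the new disk shows up as da10 (example name only), label it
	# with the same name as the GPT label used on the target side :
	glabel label G1207FHRDP2SThm da10
	# From then on the disk is only ever referenced through its label :
	ls /dev/label/G1207FHRDP2SThm
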
> overall...
>
> ben is stacking up way too much layers which prevent root cause diagnostic.
>
> lets see, i am trying to describe what i see here (please correct me if
> this setup is different from my thinking). i think, to debug this mess,
> its absolutely necessary to see all involved components
>
> - physical machine(s), exporting single raw disks via iscsi to
> "frontends" (please provide exact configuration, software versions, and
> built in hardware -> especially nic models, drivers, firmware)
>
> - frontend boxes, importing the raw iscsi disks for a zpool (again,
> exact hardware configuration, network configuration, driver / firmware
> versions and so on)

Exact same hardware and software on both sides :
FreeBSD 11.0-RELEASE-p12
SuperMicro motherboard
ECC RAM
NIC Mellanox ConnectX-3 40G, fw 2.36.5000
HBA LSI SAS 2008 9211-8i, fw 20.00.07.00-IT
SAS-only disks (no SATA)

> - switch infrastructure (brand, model, firmware version, line speed,
> link aggregation in use? if yes, lacp or whatever is in use here?)
>
> - single switch or stacked setup?
>
> did one already check the switch logs / error counts?

No, as the issue seems to come from FreeBSD (it is easily reproducible
with the 2 scenarios I gave above).

> another thing which came to my mind is, if has zfs ever been designed to
> be used on top of iscsi block devices? my thoughts so far where that zfs
> loves native disks, without any layer between (no volume manager, no
> partitions, no nothing). most ha setups i have seen so far where using
> rock solid cross over cabled sas jbods with on demand activated paths in
> case of failure. theres not that much that can cause voodoo in such
> setups, compared to iscsi ha however failover scenarios with tons of
> possible problematic components in between.

We analyzed this in the previous topic :
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023503.html
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023527.html

>>> Did you try to reproduce the problem without iSCSI?
>
> i bet the problem wont occur anymore on native disks. which should NOT
> mean that zfs cant be used on iscsi devices, i am pretty sure it will
> work fine... as long as:
>
> - iscsi target behavior is doing well, which includes that no strange
> bugs start partying on your san network
> (...)

Andriy, who took many debug traces from my system, managed to reproduce
the first issue locally, using a 3-way ZFS mirror with one local disk
plus two iSCSI disks. It sounds like there is a deadlock issue on the
iSCSI initiator side (of course Andriy, feel free to correct me if I'm
wrong).

Regarding the second issue, I'm not able to reproduce it if I don't use
geom labels. There may then be an issue on the geom-label side (which
could then also affect fully-local ZFS pools using geom labels).

Thank you again,

Ben
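
PS : for reference, a rough sketch of how the first issue shows up, close
to what Andriy reproduced (one local disk plus two iSCSI disks in a 3-way
mirror). The device names (ada0, da10, da11) and the pool name are only
examples, not my real ones :

	# 3-way mirror, one local disk and two iSCSI disks :
	zpool create test mirror ada0 da10 da11
	# Keep some I/O running on the pool :
	dd if=/dev/zero of=/test/file bs=1m &
	# Drop all iSCSI sessions on the initiator :
	iscsictl -Ra
	# Some iSCSI disks remain on the system, and from here ZFS
	# commands touching the pool stall :
	zpool status test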