Date: Thu, 12 Oct 2017 22:52:34 +0200
From: InterNetX - Juergen Gotteswinter <juergen.gotteswinter@internetx.com>
To: freebsd-fs@freebsd.org
Subject: Re: ZFS stalled after some mirror disks were lost
Message-ID: <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>
In-Reply-To: <DFD0528D-549E-44C9-A093-D4A8837CB499@gmail.com>
References: <4A0E9EB8-57EA-4E76-9D7E-3E344B2037D2@gmail.com>
 <DDCFAC80-2D72-4364-85B2-7F4D7D70BCEE@gmail.com>
 <82632887-E9D4-42D0-AC05-3764ABAC6B86@gmail.com>
 <20171007150848.7d50cad4@fabiankeil.de>
 <DFD0528D-549E-44C9-A093-D4A8837CB499@gmail.com>
On 07.10.2017 at 15:57, Ben RUBSON wrote:
>> Indeed. In the face of other types of errors as well, though.
>>
>>>> Essentially, each logical i/o request obtains a configuration lock
>>>> of type 'zio' in shared mode to prevent certain configuration
>>>> changes from happening while there are any outstanding zio-s.
>>>> If a zio is lost, then this lock is leaked.
>>>> Then, the code that deals with vdev failures tries to take this
>>>> lock in exclusive mode while holding a few other configuration
>>>> locks also in exclusive mode, so any other thread needing those
>>>> locks would block.
>>>> And there are code paths where a configuration lock is taken while
>>>> spa_namespace_lock is held.
>>>> And when spa_namespace_lock is never dropped then the system is
>>>> close to toast, because all pool lookups would get stuck.
>>>> I don't see how this can be fixed in ZFS.
>>
>> While I haven't used iSCSI for a while now, over the years I've seen
>> lots of similar issues with ZFS pools located on external USB disks
>> and ggate devices (backed by systems with patches for the known data
>> corruption issues).

ben started a discussion about his setup a few months ago, where he
described what he was going to do. and my prophecy (i am pretty sure
there were others, too) was that it would end up a pretty unreliable
setup (gremlins and other things included!) which is far, far away
from being helpful in terms of HA. a single node setup, with a
reliable hardware configuration and as few moving parts as possible,
would be way more reliable and flawless.

but anyhow, i hate (no i don't, in this case) to say "told you so".

it's like using tons of external usb disks hooked up to flaky consumer
grade controllers, creating a raidz on top of it, and then looking
surprised when the pool starts going crazy. sorry for being an ironic
dick, it's frustrating...

> There's no mention of a code revision in this thread.
> It finishes with a message from Alexander Motin:
> "(...) I've got to conclusion that ZFS in many places
> written in a way that simply does not expect errors. In such cases it
> just stucks, waiting for disk to reappear and I/O to complete. (...)"

yep, nothing new. if the underlying block device works as expected, an
error should be returned to zfs, but nobody knows what happens in this
setup during a failure. maybe it's some switch issue, or a network
driver bug which prevents this and stalls the pool. who knows which
errors already got masked by the additional iscsi layer between the
physical disks and zfs. lots of fun included for future issues.

for an ha setup, it's funny to add as many components as possible just
to somehow make it look like ha. it's the complete opposite, no matter
what one calls it.

>> I'm not claiming that the patch or other workarounds I'm aware of
>> would actually help with your ZFS stalls at all, but it's not obvious
>> to me that your problems can actually be blamed on the iSCSI code
>> either.

i would guess that it's not a direct issue of ctld, but that's just a
guess...

@ben: can you post your iscsi network configuration, including
ctld.conf and so on? is your iscsi setup using multipath, lacp, or is
it just single pathed?

overall... ben is stacking up way too many layers, which prevents root
cause diagnosis.
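to be clear about the level of detail i mean, roughly something like
this (a made-up minimal ctld.conf; every name, address and disk in it
is invented):

    portal-group san0 {
        discovery-auth-group no-authentication
        listen 10.0.0.1
    }

    target iqn.2017-10.com.example:disk1 {
        auth-group no-authentication
        portal-group san0
        lun 0 {
            path /dev/da1
        }
    }

plus the matching initiator side on the frontend boxes (iscsi.conf /
iscsictl), the ifconfig of the san interfaces, and whether gmultipath
or lagg sits on top.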
lets see, i am trying to describe what i see here (please correct me
if this setup is different from my thinking). i think, to debug this
mess, it's absolutely necessary to see all involved components:

- physical machine(s), exporting single raw disks via iscsi to the
  "frontends" (please provide the exact configuration, software
  versions, and built-in hardware -> especially nic models, drivers,
  firmware)
- switch infrastructure (brand, model, firmware version, line speed,
  link aggregation in use? if yes, lacp or whatever is in use here?)
- single switch or stacked setup?
- frontend boxes, importing the raw iscsi disks for a zpool (again,
  exact hardware configuration, network configuration, driver /
  firmware versions and so on)

did anyone already check the switch logs / error counters? (a few
commands for the freebsd side are in the p.s. below)

another thing which came to my mind: has zfs ever been designed to be
used on top of iscsi block devices? my thinking so far was that zfs
loves native disks, without any layer in between (no volume manager,
no partitions, no nothing). most ha setups i have seen so far were
using rock solid cross-over cabled sas jbods, with paths activated on
demand in case of failure. there's not that much that can cause voodoo
in such setups, compared to iscsi ha/failover scenarios with tons of
possibly problematic components in between.

>> Did you try to reproduce the problem without iSCSI?

i bet the problem won't occur anymore on native disks. which should
NOT mean that zfs can't be used on iscsi devices, i am pretty sure it
will work fine... as long as:

- the iscsi target behaves well, which includes that no strange bugs
  start partying on your san network
- rock solid stable networking: no hiccups, no retransmits, no loss,
  no nothing (don't forget to 100% separate san traffic from other
  traffic; better go for completely dedicated switches which only
  handle san traffic)
- no nic driver / firmware issues
- no switch firmware issues

the point is, with this kind of setup you bring so many components
into the game that it's nearly impossible to figure out where the
voodoo comes from. it gets even worse with ha setups on such an
infrastructure, which need to be debugged while staying in production.

>> Anyway, good luck with your ZFS-on-iscsi issue(s).

good luck from me, too. would be very interesting to see what caused
this issue. until the next one pops up...

> Thank you very much Fabian for your help and contribution,
> I really hope we'll find the root cause of this issue,
> as it's quite annoying in a HA-expected production environment :/

the point with ha setups is... planning, planning, planning, testing,
testing, praying & hoping that nothing unexpected happens to your
setup. it's always somehow a gamble, and there will still be enough
situations where a well planned ha setup fails anyway. but usually
it's a pretty good starting point to keep things as simple as possible
when designing such setups. extra layers like iscsi in this case are
nothing which should be seen as "keeping things simple"; these things
are a good way to prepare an ha setup to fail with whatever obscure
issues. debugging is a bit... overall, please forgive my ironic
writing.

> Ben
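p.s.: regarding error counters, a few things i would start with on the
freebsd side, on both the target and the frontend boxes (pure
examples, adapt interface names to your hardware):

    # per-interface error / drop counters
    netstat -i
    # tcp retransmit statistics; a clean san should show close to zero
    netstat -s -p tcp | grep -i retrans
    # link flaps or driver complaints
    dmesg | grep -i -e link -e error

and of course the error counters on the switch ports themselves (crc
errors, drops, pause frames), which only the switch cli / logs will
show you.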
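p.p.s.: for anyone wondering about the lock leak andriy described at
the top: stripped of all zfs specifics, the failure mode is simply a
leaked shared lock blocking an exclusive taker forever. a minimal
standalone sketch with plain pthreads (NOT actual zfs code, just the
pattern):

    /* leak.c: build with cc -o leak leak.c -lpthread */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_rwlock_t config_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* stands in for a zio: takes the config lock shared, then gets
     * "lost" (the disk never answers), so the hold is never dropped */
    static void *lost_zio(void *arg)
    {
        (void)arg;
        pthread_rwlock_rdlock(&config_lock);
        pause(); /* the i/o never completes */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, lost_zio, NULL);
        sleep(1); /* let the zio grab its shared hold */

        /* the vdev-failure path wants the lock exclusive; with the
         * hold leaked it would block forever. trywrlock shows the
         * state without hanging the demo. */
        if (pthread_rwlock_trywrlock(&config_lock) == EBUSY)
            printf("exclusive taker blocks forever -> pool stuck\n");
        return 0;
    }

and once the stuck exclusive taker additionally holds something like
spa_namespace_lock, every pool lookup queues up behind it. that is
exactly the "stalled" picture from this thread.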