Subject: Re: ZFS stalled after some mirror disks were lost
From: Ben RUBSON <ben.rubson@gmail.com>
Date: Fri, 13 Oct 2017 18:58:30 +0200
To: Freebsd fs <freebsd-fs@freebsd.org>
In-Reply-To: <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>
Message-Id: <13AF09F5-3ED9-4C81-A5E2-94BA770E991B@gmail.com>
References: <4A0E9EB8-57EA-4E76-9D7E-3E344B2037D2@gmail.com>
 <82632887-E9D4-42D0-AC05-3764ABAC6B86@gmail.com>
 <20171007150848.7d50cad4@fabiankeil.de>
 <6d1c80df-7e9f-c891-31ae-74dad3f67985@internetx.com>

On 12 Oct 2017 22:52, InterNetX - Juergen Gotteswinter wrote:

> Ben started a discussion about his setup a few months ago, where he
> described what he is going to do. And, at least, my prophecy (and I am
> pretty sure there were others, too) was that it would end up in a pretty
> unreliable setup (gremlins and other things included!) which is far, far
> away from being helpful in terms of HA. A single-node setup, with a
> reliable hardware configuration and as few moving parts as possible,
> would be way more reliable and flawless.

First, thank you for your answer Juergen, I do appreciate it, of course,
as well as the help you propose below. So thank you ! :)

Yes, the discussion, more than one year ago now, on this list, was named
"HAST + ZFS + NFS + CARP", but quickly moved to iSCSI when I initiated
the idea :)
I must say, after one year of production, that I'm rather happy with this
setup. It works flawlessly (apart from the issue currently discussed, of
course), and I have switched from one node to the other several times,
successfully.
The main purpose of the previous discussion was to have a second chassis
able to host the pool in case of a failure of the first one.

> (...) if the underlying block device works as expected, an error should
> be returned to zfs, but no one knows what happens in this setup during a
> failure. maybe it's some switch issue, or a network driver bug, which
> prevents this and stalls the pool.

The issue only happens when I disconnect the iSCSI drives, it does not
occur suddenly by itself. So I would say the issue is on the FreeBSD side,
not in the network hardware :)

2 distinct behaviours/issues :
- 1 : when I disconnect the iSCSI drives from the server running the pool
  (iscsictl -Ra), some iSCSI drives remain on the system, leaving ZFS
  stalled ;
- 2 : when I disconnect the iSCSI drives from the target side (shut the
  NIC down / shutdown ctld), the server running the pool sometimes panics
  (traces in my previous mail, 06/10).

> @ben
>
> can you post your iscsi network configuration including ctld.conf and so
> on? is your iscsi setup using multipath, lacp or is it just single pathed?

Sure. So I use single-pathed iSCSI.

### Target side :

# /etc/ctl.conf (all targets are equally configured) :
target iqn.............:hm1 {
        portal-group pg0 au0
        alias G1207FHRDP2SThm
        lun 0 {
                path /dev/gpt/G1207FHRDP2SThm
                serial G1207FHRDP2SThm
        }
}

So, each target has its GPT label to clearly & quickly identify it
(location, serial).
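For reference, such a GPT label would typically be applied on the target
box with gpart(8), something like the following sketch (the partition
index and disk name are only examples, not my exact devices) :

# label GPT partition 1 of disk da2 with its location/serial name
gpart modify -i 1 -l G1207FHRDP2SThm da2
# (the label could also be set at partition creation time, gpart add -l)

The partition then shows up as /dev/gpt/G1207FHRDP2SThm, which is the
path used in ctl.conf above.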
This naming best practice, which I find very useful, is taken from the
storage/ZFS books by Michael W. Lucas & Allan Jude.

### Initiator side :

# /etc/iscsi.conf (all targets are equally configured) :
hm1 {
        InitiatorName = iqn.............
        TargetAddress = 192.168.2.2
        TargetName    = iqn.............:hm1
        HeaderDigest  = CRC32C
        DataDigest    = CRC32C
}

Then, each disk is geom-labeled (glabel label) so that the previous
example disk appears on the initiator side as :
/dev/label/G1207FHRDP2SThm
Still the same naming best practice, which allows me to identify a disk
wherever it is, without mistake.

ZFS thus uses the /dev/label/ paths :

        NAME                       STATE     READ WRITE CKSUM
        home                       ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            label/G1203_serial_hm  ONLINE       0     0     0
            label/G1204_serial_hm  ONLINE       0     0     0
            label/G2203_serial_hm  ONLINE       0     0     0
            label/G2204_serial_hm  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            label/G1205_serial_hm  ONLINE       0     0     0
            label/G1206_serial_hm  ONLINE       0     0     0
            label/G2205_serial_hm  ONLINE       0     0     0
            label/G2206_serial_hm  ONLINE       0     0     0
        cache
          label/G2200_serial_ch    ONLINE       0     0     0
          label/G2201_serial_ch    ONLINE       0     0     0

(G2* are local disks, G1* are iSCSI disks)

> overall...
>
> ben is stacking up way too many layers, which prevents root-cause
> diagnostics.
>
> lets see, i am trying to describe what i see here (please correct me if
> this setup is different from my thinking). i think, to debug this mess,
> it's absolutely necessary to see all involved components :
>
> - physical machine(s), exporting single raw disks via iscsi to
>   "frontends" (please provide the exact configuration, software versions,
>   and built-in hardware -> especially nic models, drivers, firmware)
>
> - frontend boxes, importing the raw iscsi disks for a zpool (again,
>   exact hardware configuration, network configuration, driver / firmware
>   versions and so on)

Exact same hardware and software on both sides :
- FreeBSD 11.0-RELEASE-p12
- SuperMicro motherboard
- ECC RAM
- NIC Mellanox ConnectX-3 40G, fw 2.36.5000
- HBA LSI 9211-8i (SAS 2008), fw 20.00.07.00-IT
- SAS-only disks (no SATA)

> - switch infrastructure (brand, model, firmware version, line speed,
>   link aggregation in use? if yes, lacp or whatever is in use here?)
>
> - single switch or stacked setup?
>
> did one already check the switch logs / error counts?

No, as the issue seems to come from FreeBSD : it is easily reproducible
with the 2 scenarios I gave above, see the sketch just below.
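To be clear, the two scenarios boil down to something like this (the
target-side NIC name is only an example, not necessarily my exact
interface) :

# scenario 1, on the server running the pool (initiator side) :
iscsictl -Ra            # drop all iSCSI sessions at once
                        # -> some iSCSI disks stay behind, ZFS stalls

# scenario 2, on the target box :
service ctld stop       # stop the iSCSI target daemon...
ifconfig mlxen0 down    # ...or simply shut the NIC down (example name)
                        # -> the server running the pool sometimes panics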
> another thing which came to my mind is, has zfs ever been designed to
> be used on top of iscsi block devices? my thoughts so far were that zfs
> loves native disks, without any layer in between (no volume manager, no
> partitions, no nothing). most ha setups i have seen so far were using
> rock-solid cross-over-cabled sas jbods with on-demand activated paths in
> case of failure. there's not that much that can cause voodoo in such
> setups, compared to iscsi ha / failover scenarios with tons of possibly
> problematic components in between.

We analyzed this in the previous topic :
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023503.html
https://lists.freebsd.org/pipermail/freebsd-fs/2016-July/023527.html

>>> Did you try to reproduce the problem without iSCSI?
>
> i bet the problem won't occur anymore on native disks. which should NOT
> mean that zfs can't be used on iscsi devices, i am pretty sure it will
> work fine... as long as :
>
> - iscsi target behavior is doing well, which includes that no strange
>   bugs start partying on your san network
> (...)

Andriy, who took many debug traces from my system, managed to reproduce
the first issue locally, using a 3-way ZFS mirror made of one local disk
plus two iSCSI disks.
It sounds like there is a deadlock issue on the iSCSI initiator side (of
course Andriy, feel free to correct me if I'm wrong).

Regarding the second issue, I'm not able to reproduce it if I don't use
geom-labels. There may then be an issue on the geom-label side (which
could then also affect fully-local ZFS pools using geom-labels).

Thank you again,

Ben