From owner-freebsd-fs@freebsd.org Thu Aug 11 11:49:25 2016
Date: Thu, 11 Aug 2016 13:49:20 +0200
From: Julien Cigar <julien@perdition.city>
To: Borja Marcos
Cc: freebsd-fs@freebsd.org, Jordan Hubbard
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <20160811114919.GP70364@mordor.lan>

On Thu, Aug 11, 2016 at 01:22:05PM +0200, Borja Marcos wrote:
> 
> > On 11 Aug 2016, at 13:02, Julien Cigar wrote:
> > 
> > On Thu, Aug 11, 2016 at 12:15:39PM +0200, Julien Cigar wrote:
> >> On Thu, Aug 11, 2016 at 11:24:40AM +0200, Borja Marcos wrote:
> >>> 
> >>>> On 11 Aug 2016, at 11:10, Julien Cigar wrote:
> >>>> 
> >>>> As I said in a previous post, I tested the zfs send/receive approach
> >>>> (with zrep) and it works (more or less) perfectly, so I concur with
> >>>> everything you said, especially about off-site replication and
> >>>> synchronous replication.
> >>>> 
> >>>> Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the
> >>>> moment. I'm still in the early tests and haven't done any heavy
> >>>> writes yet, but ATM it works as expected: I haven't managed to
> >>>> corrupt the zpool.
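
(For the archives: the zrep approach above boils down to periodic
incremental snapshot replication. A minimal hand-rolled sketch, where
the dataset name and snapshot labels are hypothetical and filer2 is
the standby:

    # one-time seeding of the dataset on the standby
    zfs snapshot data/nfs@seed
    zfs send data/nfs@seed | ssh filer2 zfs receive -F data/nfs

    # afterwards, ship incremental snapshots from cron or a loop
    zfs snapshot data/nfs@2016-08-11T13:00
    zfs send -i data/nfs@seed data/nfs@2016-08-11T13:00 | \
        ssh filer2 zfs receive data/nfs

zrep essentially automates the above, plus snapshot rotation and the
master/standby role switch.)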

> >>> 
> >>> I must be too old school, but I don't quite like the idea of using
> >>> an essentially unreliable transport (Ethernet) for low-level
> >>> filesystem operations.
> >>> 
> >>> In case something went wrong, that approach could risk corrupting a
> >>> pool. Although, frankly,
> 
> > Now I'm thinking of the following scenario:
> > - filer1 is the MASTER, filer2 the BACKUP
> > - on filer1, a zpool "data" mirrored over loc1, loc2, rem1 and rem2
> >   (where rem1 and rem2 are iSCSI disks)
> > - the pool is mounted on the MASTER
> > 
> > Now imagine that the replication interface corrupts packets silently,
> > but data are still written on rem1 and rem2. Will ZFS immediately
> > detect that the blocks written on rem1 and rem2 are corrupted?
> 
> As far as I know ZFS does not read after write. It can detect silent
> corruption when reading a file or a metadata block, but that will
> happen only when requested (file), when needed (metadata) or in a
> scrub. It doesn't do preemptive read-after-write, I think. Or I don't
> recall having read it.

Nope, ZFS doesn't read after write. So in theory your pool can become
corrupted in the following case:

T1: a zpool scrub is run, everything is OK
T2: the replication interface starts to silently corrupt packets
T3: corrupted data blocks are written on the two iSCSI disks while
    valid data blocks are written on the two local disks
T4: the corrupted blocks are never read back, so ZFS does not notice
T5: the MASTER dies before another zpool scrub is run
T6: failover happens, the BACKUP becomes the new MASTER and tries to
    import the pool -> corruption -> fail >:O

Although very very unlikely, this scenario is in theory possible (more
on narrowing that window below).

BTW, any idea whether some sort of payload checksum is done in the
iSCSI protocol?
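
(Partially answering my own question: iSCSI (RFC 3720) does define
optional HeaderDigest and DataDigest keys, a CRC32C over the PDU header
and payload, negotiated at login and disabled by default. On the FreeBSD
initiator I believe enabling them looks something like the entry below
in /etc/iscsi.conf, but I haven't tested it and the target address and
name are made up:

    rem1 {
            TargetAddress   = 192.168.200.2
            TargetName      = iqn.2016-08.city.perdition:rem1
            HeaderDigest    = CRC32C
            DataDigest      = CRC32C
    }

That would only protect the wire, of course, not a misbehaving
controller.)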
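
(Also, the T1 -> T5 window above can at least be narrowed by scrubbing
often. A sketch, using the pool name from the scenario and a purely
illustrative schedule:

    # /etc/crontab on the MASTER: scrub the pool every Sunday at 04:00
    0   4   *   *   0   root    /sbin/zpool scrub data

    # report pools with errors, e.g. from the daily periodic mail
    zpool status -x

A scrub reads every allocated block on all four legs of the mirror, so
corrupted copies on rem1/rem2 would be spotted, and repaired from
loc1/loc2, while the MASTER is still alive.)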

> 
> Silent corruption can be overcome by ZFS as long as it isn't too much.
> In my case with the evil HBA it was something like one block operation
> error in an hour of intensive I/O. In normal operation it could be a
> block error in a week or so. With that error rate, the chance of a
> random I/O error corrupting the same block in three different devices
> (it's a raidz2 vdev) is really remote.
> 
> But again, I won't push more at the risk of annoying you to death. Just
> keep in mind that your I/O throughput will be bound by your network and
> iSCSI performance anyway ;)
> 
> Borja.
> 
> P.D: I forgot to reply to this before:
> 
> >> Yeah.. although you could have silent data corruption with any
> >> broken hardware too. Some years ago I suffered silent data
> >> corruption due to a broken RAID card, and had to restore from
> >> backups..
> 
> Ethernet hardware is designed with the assumption that the loss of a
> packet is not such a big deal. Shit happens on SAS and other
> specialized storage networks of course, but you should expect it to be
> at least a bit less. ;)

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0

No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.