Date:      Thu, 11 Aug 2016 13:49:20 +0200
From:      Julien Cigar <julien@perdition.city>
To:        Borja Marcos <borjam@sarenet.es>
Cc:        freebsd-fs@freebsd.org, Jordan Hubbard <jkh@ixsystems.com>
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <20160811114919.GP70364@mordor.lan>
In-Reply-To: <F46B3811-52E3-4D31-AA19-5D0D2E023D3A@sarenet.es>
References:  <61283600-A41A-4A8A-92F9-7FAFF54DD175@ixsystems.com> <20160704183643.GI41276@mordor.lan> <AE372BF0-02BE-4BF3-9073-A05DB4E7FE34@ixsystems.com> <20160704193131.GJ41276@mordor.lan> <E7D42341-D324-41C7-B03A-2420DA7A7952@sarenet.es> <20160811091016.GI70364@mordor.lan> <1AA52221-9B04-4CF6-97A3-D2C2B330B7F9@sarenet.es> <20160811101539.GM70364@mordor.lan> <20160811110235.GN70364@mordor.lan> <F46B3811-52E3-4D31-AA19-5D0D2E023D3A@sarenet.es>

On Thu, Aug 11, 2016 at 01:22:05PM +0200, Borja Marcos wrote:
>
> > On 11 Aug 2016, at 13:02, Julien Cigar <julien@perdition.city> wrote:
> >
> > On Thu, Aug 11, 2016 at 12:15:39PM +0200, Julien Cigar wrote:
> >> On Thu, Aug 11, 2016 at 11:24:40AM +0200, Borja Marcos wrote:
> >>>
> >>>> On 11 Aug 2016, at 11:10, Julien Cigar <julien@perdition.city> wrote:
> >>>>
> >>>> As I said in a previous post I tested the zfs send/receive approach
> >>>> (with zrep) and it works (more or less) perfectly, so I concur with
> >>>> everything you said, especially about off-site replication and
> >>>> synchronous replication.
> >>>>
> >>>> Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the
> >>>> moment. I'm still in the early tests and haven't done any heavy
> >>>> writes yet, but at the moment it works as expected and I haven't
> >>>> managed to corrupt the zpool.
> >>>
> >>> I must be too old school, but I don’t quite like the idea of using
> >>> an essentially unreliable transport (Ethernet) for low-level
> >>> filesystem operations.
> >>>
> >>> In case something went wrong, that approach could risk corrupting a
> >>> pool. Although, frankly,
> >
> > Now I'm thinking of the following scenario:
> > - filer1 is the MASTER, filer2 the BACKUP
> > - on filer1 a zpool "data" is mirrored over loc1, loc2, rem1 and rem2
> >   (where rem1 and rem2 are iSCSI disks)
> > - the pool is mounted on the MASTER
> >
> > Now imagine that the replication interface corrupts packets silently,
> > but the data is still written to rem1 and rem2. Will ZFS detect
> > immediately that the blocks written to rem1 and rem2 are corrupted?
>
> As far as I know ZFS does not read after write. It can detect silent
> corruption when reading a file or a metadata block, but that will
> happen only when requested (file), when needed (metadata) or in a
> scrub. It doesn’t do preemptive read-after-write, I think. Or I don’t
> recall having read it.
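
For clarity, the pool layout described above would be created roughly
like this (device names are placeholders: da0/da1 local, da2/da3 the
iSCSI disks exported by filer2):

    # single 4-way mirror: every block goes to both local disks and to
    # both iSCSI-backed disks
    zpool create data mirror da0 da1 da2 da3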

Nope, ZFS doesn't read after write, so in theory your pool can become
corrupted in the following case:

T1: a zpool scrub runs, everything is OK
T2: the replication interface starts to silently corrupt packets
T3: corrupted data blocks are written to the two iSCSI disks while
    valid data blocks are written to the two local disks
T4: the corrupted blocks are never read back, so ZFS does not notice
T5: the MASTER dies before another zpool scrub is run
T6: failover happens, the BACKUP becomes the new MASTER and tries to
    import the pool -> corruption -> fail >:O

Although very unlikely, this scenario is possible in theory.
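
It only narrows the window, of course, but scrubbing often (and checking
pool health between scrubs) at least bounds how long the T1 -> T5 gap
can be. A minimal sketch, assuming the pool is called "data" and running
from cron on the MASTER (I believe the daily_scrub_zfs_enable /
daily_status_zfs_enable knobs in periodic.conf do much the same on
FreeBSD):

    # weekly, from cron on the MASTER:
    zpool scrub data

    # daily, from cron on the MASTER: complain if ZFS has already
    # noticed checksum or I/O errors on any pool
    zpool status -x | grep -q "all pools are healthy" || \
        zpool status data | mail -s "zpool problem on filer1" root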

BTW, any idea whether the iSCSI protocol applies some sort of checksum
to the payload?
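
(Partly answering myself: the iSCSI protocol does define optional header
and data digests (CRC32C), negotiated at login, but they are often left
disabled. On the initiator side it should be a matter of something like
the snippet below in /etc/iscsi.conf -- the target name is made up and
the HeaderDigest/DataDigest keywords are from iscsi.conf(5) as I
remember it, so to be double-checked:

    # hypothetical session entry for one of the remote disks, with
    # CRC32C digests requested for both headers and data payload
    rem1 {
            TargetAddress = filer2
            TargetName    = iqn.2016-08.city.perdition:rem1
            HeaderDigest  = CRC32C
            DataDigest    = CRC32C
    }

and then iscsictl -An rem1 to bring the session up.)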

>
> Silent corruption can be overcome by ZFS as long as there isn’t too
> much of it. In my case with the evil HBA it was something like one
> block operation error per hour of intensive I/O. In normal operation
> it could be one block error per week or so. With that error rate, the
> chance of a random I/O error corrupting the same block on three
> different devices (it’s a raidz2 vdev) is really remote.
>
> But again, I won’t push more at the risk of annoying you to death.
> Just keep in mind that your I/O throughput will be bound by your
> network and iSCSI performance anyway ;)
>
> Borja.
>
> P.S.: I forgot to reply to this before:
>
> >> Yeah.. although you could have silent data corruption with any
> >> broken hardware too. Some years ago I suffered a silent data
> >> corruption due to a broken RAID card, and had to restore from
> >> backups..
>
> Ethernet hardware is designed with the assumption that losing a packet
> is not such a big deal. Shit happens on SAS and other specialized
> storage networks too, of course, but you should expect it to be at
> least a bit less frequent. ;)

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.
