From owner-freebsd-fs@freebsd.org Sun Jun 19 23:29:26 2016
Date: Sun, 19 Jun 2016 19:29:12 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jordan Hubbard
Cc: Chris Watson, freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin
Message-ID: <1845469514.159182764.1466378952929.JavaMail.zimbra@uoguelph.ca>
References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca>
 <7E27FA25-E18F-41D3-8974-EAE1EACABF38@gmail.com>
Subject: Re: pNFS server Plan B

Jordan Hubbard wrote:
> 
> > On Jun 18, 2016, at 6:14 PM, Chris Watson wrote:
> > 
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs' thoughts here.
> > I know about a year ago he was asked on an "after hours" video chat
> > hosted by Matt Ahrens about a feature he would really like to see and he
> > mentioned he would really like, in a universe filled with time and money
> > I'm sure, to work on a native clustering solution for FreeBSD. I don't
> > know if he is subscribed to the list, and I'm certainly not throwing him
> > under the bus by bringing his name up, but I know he has at least been
> > thinking about this for some time and probably has some value to add here.
> 
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
> 
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level? If you Google for “High-availability ZFS” you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/), but it’s not
> entirely clear how these technologies work; they simply claim to “scale-out
> ZFS” or “cluster ZFS” (which can be done within ZFS or one level above and
> still probably pass the Marketing Test for what people are willing to put
> on a web page).
> 
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
> 
> 3. Are we talking about clustering for data objects, in a fashion which
> does not necessarily provide filesystem semantics (a sharding database
> which can store arbitrary BLOBs would qualify)?
> 
For the pNFS use case I am looking at, I would say #2.

I suspect #1 sits at a low enough level that redirecting I/O via the pNFS
layouts isn't practical, since ZFS is taking care of block allocations, etc.

I see #3 as a separate problem space, since NFS deals with files and not
objects. However, GlusterFS maps file objects on top of the POSIX-like FS,
so I suppose that could be done at the client end. (What glusterfs.org calls
SwiftOnFile, I think?) It is also possible to map POSIX files onto file
objects, but that sounds like more work, which would need to be done under
the NFS service.

> For all of the above: Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
> 
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3-compatible
> object store with horizontal scaling, or if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
> 
I tend to think of this last case as the target for any pNFS server. The
basic idea is to redirect the I/O operations to wherever the data is
actually stored, so that I/O performance doesn't degrade with scale.

If redundancy is a necessary feature, then maybe Plan A is preferable to
Plan B, since GlusterFS does provide for redundancy and resilvering of lost
copies, at least from my understanding of the docs on gluster.org.
I'd also like to see how GlusterFS performs on a typical Linux setup.
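(To make the "redirect the I/O to where the data lives" idea above a little
more concrete, here is a rough, compilable C sketch of the sort of information
a pNFS file layout hands the client so it can do READ/WRITE directly against
the data servers. The names here - file_layout, ds_for_offset, the addresses -
are made up for illustration; this is not the RFC 5661 XDR and not the actual
nfsd structures, just the flavour of the thing:

/*
 * Illustrative sketch only (made-up names, not the RFC 5661 XDR and not
 * the FreeBSD nfsd structures): roughly the information a pNFS file
 * layout gives the client, so it can send READ/WRITE straight to the
 * data server(s) instead of funnelling everything through the MDS.
 */
#include <stdint.h>
#include <stdio.h>

#define FH_MAXSIZE 128	/* generous file handle size, for the sketch */
#define MAX_DS     4	/* data servers a file might be striped across */

struct ds_fh {
	uint32_t len;
	uint8_t  data[FH_MAXSIZE];
};

struct file_layout {		/* hypothetical per-file layout */
	uint64_t offset;	/* byte range this layout covers */
	uint64_t length;
	uint32_t stripe_unit;	/* bytes sent to a DS before moving on */
	uint32_t num_ds;
	char	 ds_addr[MAX_DS][64];	/* where the client sends I/O */
	struct ds_fh ds_fh[MAX_DS];	/* file handle to use on each DS */
};

/* Which data server holds a given file offset under simple striping. */
static uint32_t
ds_for_offset(const struct file_layout *lo, uint64_t off)
{

	return ((uint32_t)((off / lo->stripe_unit) % lo->num_ds));
}

int
main(void)
{
	struct file_layout lo = {
		.offset = 0,
		.length = UINT64_MAX,
		.stripe_unit = 1024 * 1024,
		.num_ds = 2,
		.ds_addr = { "192.0.2.10", "192.0.2.11" },	/* example */
	};

	/* A read at offset 5MB bypasses the MDS and goes to this DS. */
	printf("offset 5MB -> DS %s\n",
	    lo.ds_addr[ds_for_offset(&lo, 5 * 1024 * 1024)]);
	return (0);
}

Once the client holds something like that, the bulk READ/WRITE traffic goes
straight to the data servers and only the metadata operations stay with the
MDS, which is why the metadata latency I mention below matters.)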
Even without having the nfsd use FUSE, access of GlusterFS via FUSE results
in crossing user (syscall on mount) --> kernel --> user (glusterfs daemon)
within the client machine, if I understand how GlusterFS works. Then the
gluster brick server's glusterfsd daemon does file system syscall(s) to get
at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there are already a lot of user<->kernel boundary crossings.
I wonder how much delay is added by the extra nfsd step for metadata?
- I can't say much about the performance of Plan A yet, but metadata
  operations are slow and latency seems to be the issue. (I actually seem
  to get better performance by disabling SMP, for example.)

> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
> 
This would probably work with the extant nfsd and wouldn't have a use for
pNFS. I also agree that this sounds pretty ambitious.

rick

> - Jordan