From: Jordan Hubbard <jkh@ixsystems.com>
To: Rick Macklem
Cc: freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin
Date: Sat, 18 Jun 2016 13:50:29 -0700
Subject: Re: pNFS server Plan B

> On Jun 13, 2016, at 3:28 PM, Rick Macklem wrote:
>
> You may have already heard of Plan A, which sort of worked
> and you could test by following the instructions here:
>
> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>
> However, it is very slow for metadata operations (everything other than
> read/write) and I don't think it is very useful.

Hi guys,

I finally got a chance to catch up and bring up Rick’s pNFS setup on a couple of test machines.
He’s right, obviously - the “plan A” approach is a bit convoluted and, not at all surprisingly, slow. With all of those transits twixt kernel and userland, not to mention glusterfs itself, which has not really been tuned for our platform (there are a number of papers on this we probably haven’t even all read yet), we’re obviously still in the “first make it work” stage.

That said, I think there are probably more possible plans than just A and B here, and we should give the broader topic of “what does FreeBSD want to do in the Enterprise / Cloud computing space?” at least some consideration at the same time, since there are more than a few goals running in parallel here.

First, let’s talk about our story around clustered filesystems + associated command-and-control APIs in FreeBSD. There is something of an embarrassment of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS, RiakCS, moose, etc. All or most of them offer different pros and cons, and all offer more than just the ability to store files and scale “elastically”. They also have ReST APIs for configuring and monitoring the health of the cluster, some offer object as well as file storage, and Riak offers a distributed KVS for storing information *about* file objects in addition to the objects themselves (and when your application involves storing and managing several million photos, for example, the idea of distributing the index as well as the files in a fault-tolerant fashion is also compelling). Some, if not most, of them are also far better supported under Linux than FreeBSD (I don’t think we even have a working ceph port yet). I’m not saying we need to blindly follow the herd and do all the same things others are doing here, either; I’m just saying that it’s a much bigger problem space than simply “parallelizing NFS”, and if we can kill multiple birds with one stone on the way to doing that, we should certainly consider doing so.

Why? Because pNFS was first introduced as a draft RFC (RFC 5661) in 2005. The Linux folks have been working on it since 2006. Ten years is a long time in this business, and when I raised the topic of pNFS at the recent SNIA DSI conference (where storage developers gather to talk about trends and things), the most prevalent reaction I got was “people are still using pNFS?!” This is clearly one of those technologies that may still have some runway left, but it’s been rapidly overtaken by other approaches to solving more or less the same problems in coherent, distributed filesystem access, and if we want to get mindshare for this, we should at least have an answer ready for the “why did you guys do pNFS that way rather than just shimming it on top of ${someNewerHotness}??” argument. I’m not suggesting pNFS is dead - hell, even AFS still appears to be somewhat alive - but there’s a difference between appealing to an increasingly narrow niche and trying to solve the sorts of problems most DevOps folks working At Scale these days are running into.

That is also why I am not sure I would totally embrace the idea of a central MDS being a Real Option.
Sure, the risks can be mitigated (as you say, by mirroring it), but even saying the words “central MDS” (or central anything) may be such a turn-off to those very same DevOps folks, folks who have been burned so many times by SPOFs and scaling bottlenecks in large environments, that we’ll lose the audience the minute they hear the trigger phrase. Even if it means signing up for Other Problems later, it’s a lot easier to “sell” the concept of completely distributed mechanisms where, if there is any notion of centralization at all, it’s at least the result of a quorum election and the DevOps folks don’t have to do anything manually to cause it to happen - the cluster is “resilient” and “self-healing” and they are happy with being able to say those buzzwords to the CIO, who nods knowingly and tells them they’re doing a fine job!

Let’s get back, however, to the notion of downing multiple avians with the same semi-spherical kinetic projectile: What seems to be The Rage at the moment, and I don’t know how well it actually scales since I’ve yet to be at the pointy end of such a real-world deployment, is the idea of clustering the storage (“somehow”) underneath and then providing NFS and SMB protocol access entirely in userland, usually with both of those services cooperating with the same lock manager and even the same ACL translation layer. Our buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core team guys at SNIA and he indicated that this was increasingly common, with the team having helped here and there when approached by different vendors with the same idea. We (iXsystems) also get a lot of requests to be able to make the same file(s) available via both NFS and SMB at the same time, and customers don’t much like being told “but that’s dangerous - don’t do that! Your file contents and permissions models are not guaranteed to survive such an experience!” They really want to do it, because the rest of the world lives in heterogeneous environments and that’s just the way it is.

Even the object storage folks, like OpenStack’s Swift project, are spending significant amounts of mental energy on the topic of how to re-export their object stores as shared filesystems over NFS and SMB, the single consistent and distributed object store being, of course, Their Thing. They wish, of course, that the rest of the world would just fall into line and use their object system for everything, but they also get that the “legacy stuff” just won’t go away and needs some sort of attention if they’re to remain players at the standards table.
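As a small aside, to make the idea of a shared, userland ACL translation layer slightly more concrete, here is a minimal Python sketch of the sort of impedance-matching involved: mapping plain POSIX mode bits into a simplified, NFSv4/SMB-flavored allow-ACE list that both protocol front-ends could consume from one place. Everything in it (the Ace tuple, the permission names, the mapping) is an invented illustration, not any real Ganesha, Samba, or FreeBSD interface, and real ACL models are of course far richer than this.

from collections import namedtuple

# Simplified permission names; real NFSv4/SMB ACLs are much richer.
READ, WRITE, EXECUTE = "READ", "WRITE", "EXECUTE"

# who: "OWNER@", "GROUP@", or "EVERYONE@"; allow: list of permission names.
Ace = namedtuple("Ace", ["who", "allow"])

def bits_to_perms(bits):
    """Map one rwx triplet (0-7) to a list of coarse permission names."""
    perms = []
    if bits & 4:
        perms.append(READ)
    if bits & 2:
        perms.append(WRITE)
    if bits & 1:
        perms.append(EXECUTE)
    return perms

def mode_to_aces(mode):
    """Translate the rwx bits of a POSIX mode into simplified allow ACEs."""
    return [
        Ace("OWNER@", bits_to_perms((mode >> 6) & 7)),
        Ace("GROUP@", bits_to_perms((mode >> 3) & 7)),
        Ace("EVERYONE@", bits_to_perms(mode & 7)),
    ]

if __name__ == "__main__":
    # 0o640: owner rw-, group r--, other ---
    for ace in mode_to_aces(0o640):
        print(ace)

The hard part in practice is going the other direction as well (NFSv4 and NT ACLs back onto whatever the underlying filesystem actually supports) and doing it consistently for both protocols, which is part of what the userland NFS + SMB stacks mentioned above are after.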
So anyway, that’s the view I have from the perspective of someone who actually sells storage solutions for a living, and while I could certainly “sell some pNFS” to various customers who just want to add a dash of steroids to their current NFS infrastructure, or who need to use NFS but also need to store far more data in a single namespace than any one box will accommodate, I also know that offering even more elastic solutions will be a necessary part of serving the growing contingent of folks who are not tied to any existing storage infrastructure and have various non-greybearded folks shouting in their ears about object this and cloud that. Might there not be some compromise solution which allows us to put more of this in userland with fewer context switches in and out of the kernel, also giving us the option of presenting a more united front to multiple protocols that require more ACL and lock impedance-matching than we’d ever want to put in the kernel anyway?

- Jordan