Date:      Sat, 18 Jun 2016 19:05:41 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Jordan Hubbard <jkh@ixsystems.com>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin <mav@freebsd.org>
Subject:   Re: pNFS server Plan B
Message-ID:  <2021361361.156197173.1466291141156.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com>
References:  <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com>

Jordan Hubbard wrote:
>
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other than
> > read/write) and I don't think it is very useful.
>
I am going to respond to a few of the comments, but I hope that people who
actually run server farms and might be users of a fairly large/inexpensive
storage cluster will comment.

Put another way, I'd really like to hear a "user" perspective.

> Hi guys,
>
> I finally got a chance to catch up and bring up Rick's pNFS setup on a
> couple of test machines.  He's right, obviously - The "plan A" approach
> is a bit convoluted and not at all surprisingly slow.  With all of those
> transits twixt kernel and userland, not to mention glusterfs itself which
> has not really been tuned for our platform (there are a number of papers
> on this we probably haven't even all read yet), we're obviously still in
> the "first make it work" stage.
>
> That said, I think there are probably more possible plans than just A and
> B here, and we should give the broader topic of "what does FreeBSD want
> to do in the Enterprise / Cloud computing space?" at least some
> consideration at the same time, since there are more than a few goals
> running in parallel here.
>
>=20
> First, let's talk about our story around clustered filesystems +
> associated command-and-control APIs in FreeBSD.  There is something of
> an embarrassment of riches in the industry at the moment - glusterfs,
> ceph, Hadoop HDFS, RiakCS, moose, etc.  All or most of them offer
> different pros and cons, and all offer more than just the ability to
> store files and scale "elastically".  They also have ReST APIs for
> configuring and monitoring the health of the cluster, some offer object
> as well as file storage, and Riak offers a distributed KVS for storing
> information *about* file objects in addition to the objects themselves
> (and when your application involves storing and managing several million
> photos, for example, the idea of distributing the index as well as the
> files in a fault-tolerant fashion is also compelling).  Some, if not
> most, of them are also far better supported under Linux than FreeBSD (I
> don't think we even have a working ceph port yet).  I'm not saying we
> need to blindly follow the herds and do all the same things others are
> doing here, either; I'm just saying that it's a much bigger problem
> space than simply "parallelizing NFS" and if we can kill multiple birds
> with one stone on the way to doing that, we should certainly consider
> doing so.
>
> Why?  Because pNFS was first introduced as a draft RFC (RFC5661
> <https://datatracker.ietf.org/doc/rfc5661/>) in 2005.  The Linux folks
> have been working on it
> <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf>
> since 2006.  Ten years is a long time in this business, and when I
> raised the topic of pNFS at the recent SNIA DSI conference (where
> storage developers gather to talk about trends and things), the most
> prevalent reaction I got was "people are still using pNFS?!"
Actually, I would have worded this as "will anyone ever use pNFS?".

Although 10 years is a long time in this business, it doesn't seem to be long
at all in the standards world where the NFSv4 protocols are being developed.
- You note that the Linux folk started development in 2006.
  I will note that RFC5661 (the RFC that describes pNFS) is dated 2010.
  I will also note that I believe the first server that supported pNFS
  shipped sometime after the RFC was published.
  - I could be wrong, but I'd guess that Netapp's clustered Filers were the
    first to ship, about 4 years ago.

To this date, very few vendors have actually shipped working pNFS servers
as far as I am aware. Other than Netapp, the only ones I know of that have
shipped are the large EMC servers (not Isilon).
I am not sure if Oracle/Solaris has ever shipped a pNFS server to customers
yet. Same goes for Panasas. I am not aware of a Linux-based pNFS server
usable in a production environment, although Ganesha-NFS might be shipping
with pNFS support now.
- If others are aware of other pNFS servers that are shipping to customers,
  please correct me. (I haven't been to an NFSv4.1 testing event for 3 years,
  so my info is definitely dated.)

Note that the "Flex Files" layout I used for the Plan A experiment is only an
Internet draft at this time and hasn't even made it to the RFC stage.

--> As such, I think it is very much an open question whether this protocol
    will become widely used or will end up as yet another forgotten standard.
    I also suspect that some storage vendors that have invested considerable
    resources in NFSv4.1/pNFS development might ask the same question
    in-house. ;-)

>   This is clearly one of those
> technologies that may still have some runway left, but it's been rapidly
> overtaken by other approaches to solving more or less the same problems
> in coherent, distributed filesystem access and if we want to get
> mindshare for this, we should at least have an answer ready for the "why
> did you guys do pNFS that way rather than just shimming it on top of
> ${someNewerHotness}??" argument.  I'm not suggesting pNFS is dead - hell,
> even AFS <https://www.openafs.org/> still appears to be somewhat alive,
> but there's a difference between appealing to an increasingly narrow
> niche and trying to solve the sorts of problems most DevOps folks
> working At Scale these days are running into.
>
> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option.  Sure, the risks can be mitigated (as
> you say, by mirroring it), but even saying the words "central MDS" (or
> central anything) may be such a turn-off to those very same DevOps
> folks, folks who have been burned so many times by SPOFs and scaling
> bottlenecks in large environments, that we'll lose the audience the
> minute they hear the trigger phrase.  Even if it means signing up for
> Other Problems later, it's a lot easier to "sell" the concept of
> completely distributed mechanisms where, if there is any notion of
> centralization at all, it's at least the result of a quorum election and
> the DevOps folks don't have to do anything manually to cause it to
> happen - the cluster is "resilient" and "self-healing" and they are
> happy with being able to say those buzzwords to the CIO, who nods
> knowingly and tells them they're doing a fine job!
>
I'll admit that I'm a bits and bytes guy. I have a hunch about how difficult
it is to get "resilient" and "self-healing" to really work. I also know it is
way beyond what I am capable of.

> Let's get back, however, to the notion of downing multiple avians with
> the same semi-spherical kinetic projectile:  What seems to be The Rage
> at the moment, and I don't know how well it actually scales since I've
> yet to be at the pointy end of such a real-world deployment, is the idea
> of clustering the storage ("somehow") underneath and then providing NFS
> and SMB protocol access entirely in userland, usually with both of those
> services cooperating with the same lock manager and even the same ACL
> translation layer.  Our buddies at Red Hat do this with glusterfs at the
> bottom and NFS Ganesha + Samba on top - I talked to one of the Samba
> core team guys at SNIA and he indicated that this was increasingly
> common, with the team having helped here and there when approached by
> different vendors with the same idea.  We (iXsystems) also get a lot of
> requests to be able to make the same file(s) available via both NFS and
> SMB at the same time and they don't much at all like being told "but
> that's dangerous - don't do that!  Your file contents and permissions
> models are not guaranteed to survive such an experience!"  They really
> want to do it, because the rest of the world lives in heterogeneous
> environments and that's just the way it is.
>
If you want to make SMB and NFS work together on the same underlying file
systems, I suspect it is doable, although messy. To do this with the current
FreeBSD nfsd, it would require someone with Samba/Windows knowledge to point
out what Samba needs in order to interact with NFSv4, and those hooks could
probably be implemented.
(I know nothing about Samba/Windows, so I'd need someone else doing that
side of it.)
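
To give a feel for what that coordination amounts to (this is only an
illustrative sketch I'm making up, not code from nfsd or Samba, and the
names are invented): NFSv4 expresses opens as access/deny bit pairs, and
Windows/SMB share modes map onto the same idea, so a shared lock manager
would apply a single conflict rule to opens arriving from either protocol:

/*
 * Hypothetical cross-protocol open/share-reservation conflict check.
 * The bit values mirror the NFSv4 OPEN share access and deny bitmaps
 * (RFC 5661); SMB share modes can be mapped onto the same bits.
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define SHARE_ACCESS_READ   0x1
#define SHARE_ACCESS_WRITE  0x2
#define SHARE_DENY_READ     0x1
#define SHARE_DENY_WRITE    0x2

struct xproto_open {
    unsigned access;    /* what this opener wants to do */
    unsigned deny;      /* what this opener denies to others */
};

/* Two opens conflict if either one denies what the other accesses. */
static bool
opens_conflict(const struct xproto_open *a, const struct xproto_open *b)
{
    return ((a->deny & b->access) != 0 || (b->deny & a->access) != 0);
}

int
main(void)
{
    /* An SMB opener that denies writers vs. an NFSv4 writer. */
    struct xproto_open smb = { SHARE_ACCESS_READ, SHARE_DENY_WRITE };
    struct xproto_open nfs = { SHARE_ACCESS_WRITE, 0 };

    printf("conflict: %s\n", opens_conflict(&smb, &nfs) ? "yes" : "no");
    return (0);
}

The hard part, of course, isn't the rule itself; it is getting both daemons
to see the same open/lock/ACL state for the same files at the same time.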

I actually mentioned Ganesha-NFS at the little talk/discussion I gave.
At this time, they have ripped a FreeBSD port out of their sources and they
use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed on
    FreeBSD. Maybe a good project, but it needs some person/group dedicating
    resources to get it to happen.
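
(As one small illustration of the kind of portability work involved - this
is just a sketch I'm making up, not code from the Ganesha tree - even thread
naming differs: glibc has pthread_setname_np() while FreeBSD spells it
pthread_set_name_np() and keeps it in <pthread_np.h>, so a port ends up
carrying little wrappers like this for each Linux-ism. Build with something
like "cc -pthread thread_name.c".)

/*
 * Illustrative portability shim for one Linux-ism: naming a thread.
 * A Linux-centric daemon would call glibc's pthread_setname_np() directly;
 * on FreeBSD the equivalent lives in <pthread_np.h> under a different name.
 */
#define _GNU_SOURCE     /* glibc declares pthread_setname_np() with this */
#include <pthread.h>
#ifdef __FreeBSD__
#include <pthread_np.h>
#endif

static void
set_thread_name(pthread_t tid, const char *name)
{
#ifdef __FreeBSD__
    pthread_set_name_np(tid, name);
#else
    (void)pthread_setname_np(tid, name);
#endif
}

int
main(void)
{
    set_thread_name(pthread_self(), "worker");
    return (0);
}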

> Even the object storage folks, like Openstack's Swift project, are
> spending significant amounts of mental energy on the topic of how to
> re-export their object stores as shared filesystems over NFS and SMB,
> the single consistent and distributed object store being, of course,
> Their Thing.  They wish, of course, that the rest of the world would
> just fall into line and use their object system for everything, but they
> also get that the "legacy stuff" just won't go away and needs some sort
> of attention if they're to remain players at the standards table.
>
> So anyway, that's the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could
> certainly "sell some pNFS" to various customers who just want to add a
> dash of steroids to their current NFS infrastructure, or need to use NFS
> but also need to store far more data into a single namespace than any
> one box will accommodate, I also know that offering even more elastic
> solutions will be a necessary part of offering solutions to the growing
> contingent of folks who are not tied to any existing storage
> infrastructure and have various non-greybearded folks shouting in their
> ears about object this and cloud that.  Might there not be some
> compromise solution which allows us to put more of this in userland with
> fewer context switches in and out of the kernel, also giving us the
> option of presenting a more united front to multiple protocols that
> require more ACL and lock impedance-matching than we'd ever want to put
> in the kernel anyway?
>
For SMB + NFS in userland, the combination of Samba and Ganesha is probably
your main open-source choice, as far as I am aware.

I am one guy who does this as a spare-time retirement hobby. As such, doing
something like a Ganesha port is probably beyond what I am interested in.
When saying this, I don't want to imply that it isn't a good approach.

You sent me the URL for an abstract of a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use it,
such as:
- How do their client servers access it? (NFS, FUSE, or ???)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
  transitioning to?
Put another way, they might have some insight into whether or not NFS in
userland via Ganesha works well.
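
For what it's worth, besides a FUSE mount or the NFS path, GlusterFS also has
a userland client library (libgfapi); it is what the Ganesha Gluster backend
uses to reach a volume without any kernel mount, and applications can link
against it directly. Roughly (an untested sketch; the volume and server names
are made up):

/*
 * Rough sketch of accessing a GlusterFS volume via libgfapi, bypassing
 * both FUSE and the kernel NFS client.  Error handling is minimal and
 * the volume/server names are placeholders.
 * Build with something like: cc gfapi_hello.c -lgfapi
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

int
main(void)
{
    glfs_t *fs = glfs_new("testvol");           /* volume name (placeholder) */
    if (fs == NULL)
        return (1);
    glfs_set_volfile_server(fs, "tcp", "gluster1.example.com", 24007);
    if (glfs_init(fs) != 0) {
        fprintf(stderr, "glfs_init failed\n");
        return (1);
    }

    glfs_fd_t *fd = glfs_creat(fs, "/hello.txt", O_WRONLY, 0644);
    if (fd != NULL) {
        const char msg[] = "hello from libgfapi\n";
        glfs_write(fd, msg, strlen(msg), 0);
        glfs_close(fd);
    }
    glfs_fini(fs);
    return (0);
}

If they are going through libgfapi or Ganesha rather than FUSE, that would be
exactly the sort of detail I'd like to hear about.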

Hopefully some "users" of this stuff will respond, rick
ps: Maybe this could be reposted in a place they are likely to read it.

> - Jordan
>


