From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jordan Hubbard
Cc: freebsd-fs, Alexander Motin
Subject: Re: pNFS server Plan B
Date: Sat, 18 Jun 2016 19:05:41 -0400 (EDT)

Jordan Hubbard wrote:
>
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem wrote:
> >
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations
> > (everything other than read/write) and I don't think it is very useful.
>

I am going to respond to a few of the comments, but I hope that people who
actually run server farms and might be users of a fairly large/inexpensive
storage cluster will comment. Put another way, I'd really like to hear a
"user" perspective.

> Hi guys,
>
> I finally got a chance to catch up and bring up Rick's pNFS setup on a
> couple of test machines. He's right, obviously - the "plan A" approach is
> a bit convoluted and, not at all surprisingly, slow. With all of those
> transits twixt kernel and userland, not to mention glusterfs itself, which
> has not really been tuned for our platform (there are a number of papers
> on this we probably haven't even all read yet), we're obviously still in
> the "first make it work" stage.
>
> That said, I think there are probably more possible plans than just A and
> B here, and we should give the broader topic of "what does FreeBSD want to
> do in the Enterprise / Cloud computing space?" at least some consideration
> at the same time, since there are more than a few goals running in
> parallel here.
>
> First, let's talk about our story around clustered filesystems + associated
> command-and-control APIs in FreeBSD. There is something of an embarrassment
> of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS,
> RiakCS, moose, etc. All or most of them offer different pros and cons, and
> all offer more than just the ability to store files and scale "elastically".
> They also have ReST APIs for configuring and monitoring the health of the
> cluster, some offer object as well as file storage, and Riak offers a
> distributed KVS for storing information *about* file objects in addition
> to the objects themselves (and when your application involves storing and
> managing several million photos, for example, the idea of distributing the
> index as well as the files in a fault-tolerant fashion is also compelling).
> Some, if not most, of them are also far better supported under Linux than
> FreeBSD (I don't think we even have a working ceph port yet). I'm not
> saying we need to blindly follow the herds and do all the same things
> others are doing here, either; I'm just saying that it's a much bigger
> problem space than simply "parallelizing NFS", and if we can kill multiple
> birds with one stone on the way to doing that, we should certainly
> consider doing so.
>
> Why? Because pNFS was first introduced as a draft RFC (RFC5661) in 2005.
> The Linux folks have been working on it since 2006. Ten years is a long
> time in this business, and when I raised the topic of pNFS at the recent
> SNIA DSI conference (where storage developers gather to talk about trends
> and things), the most prevalent reaction I got was "people are still using
> pNFS?!"

Actually, I would have worded this as "will anyone ever use pNFS?".
Although 10 years is a long time in this business, it doesn't seem to be
long at all in the standards world where the NFSv4 protocols are being
developed.
- You note that the Linux folks started development in 2006. I will note
  that RFC5661 (the RFC that describes pNFS) is dated 2010. I will also
  note that I believe the first vendor to ship a server that supported
  pNFS did so sometime after the RFC was published.
- I could be wrong, but I'd guess that Netapp's clustered Filers were the
  first to ship, about 4 years ago.
To date, very few vendors have actually shipped working pNFS servers as far
as I am aware. Other than Netapp, the only ones I know of that have shipped
are the large EMC servers (not Isilon).
I am not sure if Oracle/Solaris has ever shipped a pNFS server to customers
yet. Same goes for Panasas. I am not aware of a Linux-based pNFS server
usable in a production environment, although Ganesha-NFS might be shipping
with pNFS support now.
- If others are aware of other pNFS servers that are shipping to customers,
  please correct me. (I haven't been to an NFSv4.1 testing event for 3
  years, so my info is definitely dated.)

Note that the "Flex Files" layout I used for the Plan A experiment is only
an Internet draft at this time and hasn't even made it to the RFC stage.
--> As such, I think it is very much an open question whether this protocol
    will become widely used or end up as yet another forgotten standard.
    I also suspect that some storage vendors that have invested considerable
    resources in NFSv4.1/pNFS development might ask the same question
    in-house. ;-)

> This is clearly one of those technologies that may still have some runway
> left, but it's been rapidly overtaken by other approaches to solving more
> or less the same problems in coherent, distributed filesystem access, and
> if we want to get mindshare for this, we should at least have an answer
> ready for the "why did you guys do pNFS that way rather than just shimming
> it on top of ${someNewerHotness}??" argument. I'm not suggesting pNFS is
> dead - hell, even AFS still appears to be somewhat alive, but there's a
> difference between appealing to an increasingly narrow niche and trying to
> solve the sorts of problems most DevOps folks working At Scale these days
> are running into.
>
> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option. Sure, the risks can be mitigated (as you
> say, by mirroring it), but even saying the words "central MDS" (or central
> anything) may be such a turn-off to those very same DevOps folks, folks
> who have been burned so many times by SPOFs and scaling bottlenecks in
> large environments, that we'll lose the audience the minute they hear the
> trigger phrase. Even if it means signing up for Other Problems later, it's
> a lot easier to "sell" the concept of completely distributed mechanisms
> where, if there is any notion of centralization at all, it's at least the
> result of a quorum election and the DevOps folks don't have to do anything
> manually to cause it to happen - the cluster is "resilient" and
> "self-healing" and they are happy with being able to say those buzzwords
> to the CIO, who nods knowingly and tells them they're doing a fine job!

I'll admit that I'm a bits and bytes guy. I have a hunch how difficult it is
to get "resilient" and "self-healing" to really work. I also know it is way
beyond what I am capable of.
> Let's get back, however, to the notion of downing multiple avians with the
> same semi-spherical kinetic projectile: What seems to be The Rage at the
> moment, and I don't know how well it actually scales since I've yet to be
> at the pointy end of such a real-world deployment, is the idea of
> clustering the storage ("somehow") underneath and then providing NFS and
> SMB protocol access entirely in userland, usually with both of those
> services cooperating with the same lock manager and even the same ACL
> translation layer. Our buddies at Red Hat do this with glusterfs at the
> bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core
> team guys at SNIA and he indicated that this was increasingly common, with
> the team having helped here and there when approached by different vendors
> with the same idea. We (iXsystems) also get a lot of requests to be able
> to make the same file(s) available via both NFS and SMB at the same time,
> and they don't much at all like being told "but that's dangerous - don't
> do that! Your file contents and permissions models are not guaranteed to
> survive such an experience!" They really want to do it, because the rest
> of the world lives in heterogeneous environments and that's just the way
> it is.

If you want to make SMB and NFS work together on the same underlying file
systems, I suspect it is doable, although messy. To do this with the current
FreeBSD nfsd, it would require someone with Samba/Windows knowledge pointing
out what Samba needs to interact with NFSv4, and those hooks could probably
be implemented.
(I know nothing about Samba/Windows, so I'd need someone else doing that
side of it.)
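Just so "hooks" isn't completely hand-waving, here is a rough, purely
hypothetical sketch (plain C; no such interface exists in FreeBSD, nfsd or
Samba today, and every name below is made up for illustration) of the kind
of arbitration point I have in mind: something the NFSv4 open/delegation
path in the kernel could consult before granting an open, and that a
userland SMB server could back with its share-mode/oplock state for the
same underlying file.

/*
 * Hypothetical only -- all identifiers here are invented for illustration.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Access/deny bits, loosely modelled on NFSv4 OPEN4_SHARE_ACCESS/DENY. */
#define XPROT_ACCESS_READ   0x01
#define XPROT_ACCESS_WRITE  0x02
#define XPROT_DENY_READ     0x04
#define XPROT_DENY_WRITE    0x08

struct xprot_open_req {
        uint64_t    fileid;     /* stand-in for a file handle/vnode */
        uint32_t    access;     /* XPROT_ACCESS_* bits requested */
        uint32_t    deny;       /* XPROT_DENY_* bits requested */
        const char  *owner;     /* opaque owner id, for diagnostics */
};

/*
 * The registered arbiter decides whether an open may be admitted.  A real
 * implementation would also need upcalls to break oplocks/delegations.
 */
struct xprot_arbiter {
        bool    (*admit_open)(const struct xprot_open_req *req);
        void    (*file_closed)(uint64_t fileid, const char *owner);
};

static const struct xprot_arbiter *xprot_hook;

/* The SMB side would register its callbacks when it starts serving. */
void
xprot_register(const struct xprot_arbiter *a)
{
        xprot_hook = a;
}

/* What the nfsd OPEN path would call before granting an open/delegation. */
bool
xprot_nfsd_open_ok(const struct xprot_open_req *req)
{
        if (xprot_hook == NULL || xprot_hook->admit_open == NULL)
                return (true);  /* no SMB server registered; nothing to check */
        return (xprot_hook->admit_open(req));
}

Samba (or a shim in front of it) would supply admit_open()/file_closed()
from its share-mode and oplock state, and it would make the mirror-image
check before granting an SMB open. Whether anything this simple comes close
to what Samba actually needs is exactly the Samba-side knowledge I don't
have.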
I actually mentioned Ganesha-NFS at the little talk/discussion I gave.
At this time, they have ripped the FreeBSD port out of their sources and
they use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed on
    FreeBSD. Maybe a good project, but it needs some person/group dedicating
    resources to get it to happen.

> Even the object storage folks, like Openstack's Swift project, are
> spending significant amounts of mental energy on the topic of how to
> re-export their object stores as shared filesystems over NFS and SMB, the
> single consistent and distributed object store being, of course, Their
> Thing. They wish, of course, that the rest of the world would just fall
> into line and use their object system for everything, but they also get
> that the "legacy stuff" just won't go away and needs some sort of
> attention if they're to remain players at the standards table.
>
> So anyway, that's the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could certainly
> "sell some pNFS" to various customers who just want to add a dash of
> steroids to their current NFS infrastructure, or need to use NFS but also
> need to store far more data into a single namespace than any one box will
> accommodate, I also know that offering even more elastic solutions will be
> a necessary part of offering solutions to the growing contingent of folks
> who are not tied to any existing storage infrastructure and have various
> non-greybearded folks shouting in their ears about object this and cloud
> that. Might there not be some compromise solution which allows us to put
> more of this in userland with fewer context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching than
> we'd ever want to put in the kernel anyway?

For SMB + NFS in userland, the combination of Samba and Ganesha is probably
your main open source choice, from what I am aware of.
I am one guy who does this as a spare-time retirement hobby. As such, doing
something like a Ganesha port, etc., is probably beyond what I am interested
in. When saying this, I don't want to imply that it isn't a good approach.

You sent me the URL for an abstract of a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use
it, such as:
- How do their client servers access it? (NFS, FUSE, or ???)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
  transitioning to.
Put another way, they might have some insight into whether or not NFS in
userland via Ganesha works well.

Hopefully some "users" for this stuff will respond,
rick
ps: Maybe this could be reposted in a place they are likely to read it.

> - Jordan