Date: Mon, 20 Jun 2016 11:01:40 +0100
From: Doug Rabson <dfr@rabson.org>
To: Jordan Hubbard <jkh@ixsystems.com>
Cc: Rick Macklem <rmacklem@uoguelph.ca>, freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin <mav@freebsd.org>
Subject: Re: pNFS server Plan B
Message-ID: <CACA0VUibM1giAkJdNNkn1_m8QqqLzdNC86hFhRxMmY7gMb1nvg@mail.gmail.com>
In-Reply-To: <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com>
References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com>
On 18 June 2016 at 21:50, Jordan Hubbard <jkh@ixsystems.com> wrote:

>
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other than
> > read/write) and I don't think it is very useful.
>
> Hi guys,
>
> I finally got a chance to catch up and bring up Rick’s pNFS setup on a
> couple of test machines. He’s right, obviously - The “plan A” approach is
> a bit convoluted and not at all surprisingly slow. With all of those
> transits twixt kernel and userland, not to mention glusterfs itself which
> has not really been tuned for our platform (there are a number of papers on
> this we probably haven’t even all read yet), we’re obviously still in the
> “first make it work” stage.
>
> That said, I think there are probably more possible plans than just A and
> B here, and we should give the broader topic of “what does FreeBSD want to
> do in the Enterprise / Cloud computing space?” at least some consideration
> at the same time, since there are more than a few goals running in parallel
> here.
>
> First, let’s talk about our story around clustered filesystems +
> associated command-and-control APIs in FreeBSD. There is something of an
> embarrassment of riches in the industry at the moment - glusterfs, ceph,
> Hadoop HDFS, RiakCS, moose, etc. All or most of them offer different pros
> and cons, and all offer more than just the ability to store files and scale
> “elastically”.
> They also have ReST APIs for configuring and monitoring the
> health of the cluster, some offer object as well as file storage, and Riak
> offers a distributed KVS for storing information *about* file objects in
> addition to the object themselves (and when your application involves
> storing and managing several million photos, for example, the idea of
> distributing the index as well as the files in a fault-tolerant fashion is
> also compelling). Some, if not most, of them are also far better supported
> under Linux than FreeBSD (I don’t think we even have a working ceph port
> yet). I’m not saying we need to blindly follow the herds and do all the
> same things others are doing here, either, I’m just saying that it’s a much
> bigger problem space than simply “parallelizing NFS” and if we can kill
> multiple birds with one stone on the way to doing that, we should certainly
> consider doing so.
>
> Why? Because pNFS was first introduced as a draft RFC (RFC5661
> <https://datatracker.ietf.org/doc/rfc5661/>) in 2005. The linux folks
> have been working on it
> <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf>
> since 2006. Ten years is a long time in this business, and when I raised
> the topic of pNFS at the recent SNIA DSI conference (where storage
> developers gather to talk about trends and things), the most prevalent
> reaction I got was “people are still using pNFS?!” This is clearly one of
> those technologies that may still have some runway left, but it’s been
> rapidly overtaken by other approaches to solving more or less the same
> problems in coherent, distributed filesystem access and if we want to get
> mindshare for this, we should at least have an answer ready for the “why
> did you guys do pNFS that way rather than just shimming it on top of
> ${someNewerHotness}??” argument.
> I’m not suggesting pNFS is dead - hell,
> even AFS <https://www.openafs.org/> still appears to be somewhat alive,
> but there’s a difference between appealing to an increasingly narrow niche
> and trying to solve the sorts of problems most DevOps folks working At
> Scale these days are running into.
>
> That is also why I am not sure I would totally embrace the idea of a
> central MDS being a Real Option. Sure, the risks can be mitigated (as you
> say, by mirroring it), but even saying the words “central MDS” (or central
> anything) may be such a turn-off to those very same DevOps folks, folks who
> have been burned so many times by SPOFs and scaling bottlenecks in large
> environments, that we'll lose the audience the minute they hear the trigger
> phrase. Even if it means signing up for Other Problems later, it’s a lot
> easier to “sell” the concept of completely distributed mechanisms where, if
> there is any notion of centralization at all, it’s at least the result of a
> quorum election and the DevOps folks don’t have to do anything manually to
> cause it to happen - the cluster is “resilient” and “self-healing” and they
> are happy with being able to say those buzzwords to the CIO, who nods
> knowingly and tells them they’re doing a fine job!
>

My main reason for liking NFS is that it has decent client support in
upstream Linux. One reason I started working on pNFS was that at $work our
existing cluster filesystem product, which uses a proprietary client
protocol, caused us to delay OS upgrades for months while we waited for
$vendor to port their client code to RHEL7. The NFS protocol is well
documented, with several accessible reference implementations, and pNFS
gives enough flexibility to support a distributed filesystem at an
interesting scale.

You mention a 'central MDS' as being an issue.
I'm not going to go through your list, but at least HDFS also has this
'issue' and it doesn't seem to be a problem for many users storing >100 PB
across >10^5 servers. In practice, the MDS would be replicated for
redundancy - there are lots of approaches for this, my preference being
Paxos, but Raft would work just as well. Google's GFS also followed this
model and was an extremely reliable large-scale filesystem.

I am building an MDS as a layer on top of a key/value database. That offers
the possibility of moving the backing store to some kind of distributed
key/value store in future, which would remove the scaling and reliability
concerns.

>
> Let’s get back, however, to the notion of downing multiple avians with the
> same semi-spherical kinetic projectile: What seems to be The Rage at the
> moment, and I don’t know how well it actually scales since I’ve yet to be
> at the pointy end of such a real-world deployment, is the idea of
> clustering the storage (“somehow”) underneath and then providing NFS and
> SMB protocol access entirely in userland, usually with both of those
> services cooperating with the same lock manager and even the same ACL
> translation layer. Our buddies at Red Hat do this with glusterfs at the
> bottom and NFS Ganesha + Samba on top - I talked to one of the Samba core
> team guys at SNIA and he indicated that this was increasingly common, with
> the team having helped here and there when approached by different vendors
> with the same idea. We (iXsystems) also get a lot of requests to be able
> to make the same file(s) available via both NFS and SMB at the same time
> and they don’t much at all like being told “but that’s dangerous - don’t do
> that!
> Your file contents and permissions models are not guaranteed to
> survive such an experience!” They really want to do it, because the rest
> of the world lives in Heterogeneous environments and that’s just the way it
> is.
>
> Even the object storage folks, like OpenStack’s Swift project, are
> spending significant amounts of mental energy on the topic of how to
> re-export their object stores as shared filesystems over NFS and SMB, the
> single consistent and distributed object store being, of course, Their
> Thing. They wish, of course, that the rest of the world would just fall
> into line and use their object system for everything, but they also get
> that the “legacy stuff” just won’t go away and needs some sort of attention
> if they’re to remain players at the standards table.
>
> So anyway, that’s the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could certainly
> “sell some pNFS” to various customers who just want to add a dash of
> steroids to their current NFS infrastructure, or need to use NFS but also
> need to store far more data into a single namespace than any one box will
> accommodate, I also know that offering even more elastic solutions will be
> a necessary part of offering solutions to the growing contingent of folks
> who are not tied to any existing storage infrastructure and have various
> non-greybearded folks shouting in their ears about object this and cloud
> that. Might there not be some compromise solution which allows us to put
> more of this in userland with fewer context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching than
> we’d ever want to put in the kernel anyway?
>

I can agree with this - everything I'm working on is in userland.
Given that I'm not trying to export a local filesystem, most of the reasons
for wanting a kernel implementation disappear. Adding support for NFS over
RDMA removes all the network context switching, and frequently accessed
data would typically be served out of a userland cache, which removes the
rest of the context switches.
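
To make the "MDS as a layer on top of a key/value database" idea a bit more concrete, here is a minimal sketch of what such a layering might look like. It is not the implementation discussed in this thread: the `KVStore` interface, the `MetadataServer` class, the key layout (`attr/<fileid>`, `dirent/<parent>/<name>`) and the trivial data-server placement rule are all hypothetical names chosen for illustration. The in-memory store stands in for whatever local or distributed key/value store would sit underneath, and replication (Paxos, Raft) and transactional updates are deliberately left out.

```python
# Hypothetical sketch: an MDS namespace layered on a key/value store.
# KVStore, MetadataServer and the key layout are illustrative assumptions,
# not taken from any real pNFS server.
import json
from typing import Dict, Iterator, List, Optional, Tuple

ROOT_ID = 1  # assumed file id of the root directory


class KVStore:
    """Minimal key/value interface; this in-memory version is a stand-in."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def scan(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        # Yield matching entries in key order.
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]


class MetadataServer:
    """Stores the namespace and per-file attributes as key/value pairs."""

    def __init__(self, kv: KVStore, data_servers: List[str]) -> None:
        self.kv = kv
        self.data_servers = data_servers
        if kv.get(b"next_id") is None:
            # First use: create the root directory and the id counter.
            kv.put(b"next_id", str(ROOT_ID + 1).encode())
            kv.put(self._attr_key(ROOT_ID), json.dumps({"type": "dir"}).encode())

    @staticmethod
    def _attr_key(fileid: int) -> bytes:
        return b"attr/%016x" % fileid

    @staticmethod
    def _dirent_prefix(parent: int) -> bytes:
        return b"dirent/%016x/" % parent

    def _alloc_id(self) -> int:
        fileid = int(self.kv.get(b"next_id").decode())
        self.kv.put(b"next_id", str(fileid + 1).encode())
        return fileid

    def create(self, parent: int, name: str, ftype: str = "file") -> int:
        key = self._dirent_prefix(parent) + name.encode()
        if self.kv.get(key) is not None:
            raise FileExistsError(name)
        fileid = self._alloc_id()
        attrs = {"type": ftype}
        if ftype == "file":
            # Trivial placement rule (an assumption): pick a data server by id.
            attrs["data_server"] = self.data_servers[fileid % len(self.data_servers)]
        # A real server would write these two keys in one transaction.
        self.kv.put(self._attr_key(fileid), json.dumps(attrs).encode())
        self.kv.put(key, str(fileid).encode())
        return fileid

    def lookup(self, parent: int, name: str) -> Optional[int]:
        raw = self.kv.get(self._dirent_prefix(parent) + name.encode())
        return int(raw.decode()) if raw is not None else None

    def get_attrs(self, fileid: int) -> dict:
        raw = self.kv.get(self._attr_key(fileid))
        if raw is None:
            raise FileNotFoundError(fileid)
        return json.loads(raw.decode())

    def readdir(self, parent: int) -> List[str]:
        prefix = self._dirent_prefix(parent)
        return [key[len(prefix):].decode() for key, _ in self.kv.scan(prefix)]
```

Because the server only ever talks to `get`, `put` and `scan`, swapping the in-memory store for a distributed one would not change the namespace logic, which is the property the message above is pointing at.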