Date: Sun, 19 Jun 2016 19:29:12 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jordan Hubbard <jkh@ixsystems.com>
Cc: Chris Watson <bsdunix44@gmail.com>, freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin <mav@freebsd.org>
Subject: Re: pNFS server Plan B
Message-ID: <1845469514.159182764.1466378952929.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <B2907C1F-D32A-48FB-8E58-209E6AF1E86D@ixsystems.com>
References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com> <7E27FA25-E18F-41D3-8974-EAE1EACABF38@gmail.com> <B2907C1F-D32A-48FB-8E58-209E6AF1E86D@ixsystems.com>
Jordan Hubbard wrote:
> 
> > On Jun 18, 2016, at 6:14 PM, Chris Watson <bsdunix44@gmail.com> wrote:
> > 
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs' thoughts here. I know about a year ago he was asked on an "after
> > hours" video chat hosted by Matt Ahrens about a feature he would really
> > like to see, and he mentioned he would really like, in a universe filled
> > with time and money I'm sure, to work on a native clustering solution for
> > FreeBSD. I don't know if he is subscribed to the list, and I'm certainly
> > not throwing him under the bus by bringing his name up, but I know he has
> > at least been thinking about this for some time and probably has some
> > value to add here.
> 
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
> 
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level? If you Google for "High-availability ZFS" you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/), but it's not
> entirely clear how these technologies work; they simply claim to "scale-out
> ZFS" or "cluster ZFS" (which can be done within ZFS or one level above and
> still probably pass the Marketing Test for what people are willing to put
> on a web page).
> 
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
> 
> 3. Are we talking about clustering for data objects, in a fashion which does
> not necessarily provide filesystem semantics (a sharding database which can
> store arbitrary BLOBs would qualify)?
> 
For the pNFS use case I am looking at, I would say #2.
I suspect #1 sits at a low enough level that redirecting I/O via the pNFS
layouts isn't practical, since ZFS is taking care of block allocations, etc.
I see #3 as a separate problem space, since NFS deals with files and not
objects. However, GlusterFS maps file objects on top of a POSIX-like FS, so I
suppose that could be done at the client end. (What glusterfs.org calls
SwiftOnFile, I think?) It is also possible to map POSIX files onto file
objects, but that sounds like more work, which would need to be done under
the NFS service.

> For all of the above: Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
> 
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for "some or any sort of clustering,
> don't care which or how it works" - they ask if I can offer an S3-compatible
> object store with horizontal scaling, or if they can use NFS in some
> clustered fashion where there's a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
> 
I tend to think of this last case as the target for any pNFS server. The
basic idea is to redirect the I/O operations to wherever the data is actually
stored, so that I/O performance doesn't degrade with scale.
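To make that redirection concrete, here is a minimal sketch of the striping
arithmetic a pNFS file layout implies: the metadata server hands the client a
stripe unit plus a list of data servers, and the client computes which data
server each I/O should go to. The stripe unit and addresses below are made-up
example values, not anything the real RFC 5661 layout structures hand out in
this form.

/*
 * Sketch of pNFS file-layout style striping: map a file offset to the
 * data server (DS) that holds that stripe.  The stripe unit and DS
 * list are made-up example values, not real protocol state.
 */
#include <stdio.h>
#include <stdint.h>

#define	STRIPE_UNIT	(128 * 1024)	/* bytes per stripe (example value) */

static const char *ds_addrs[] = {	/* hypothetical data servers */
	"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"
};
#define	DS_COUNT	(sizeof(ds_addrs) / sizeof(ds_addrs[0]))

/* Stripe index for a byte offset, as in the RFC 5661 file layout. */
static size_t
ds_for_offset(uint64_t offset)
{

	return ((size_t)((offset / STRIPE_UNIT) % DS_COUNT));
}

int
main(void)
{
	uint64_t off;

	for (off = 0; off < 8ULL * STRIPE_UNIT; off += STRIPE_UNIT)
		printf("offset %8ju -> DS %s\n", (uintmax_t)off,
		    ds_addrs[ds_for_offset(off)]);
	return (0);
}

Once the client holds the layout, reads and writes for different parts of the
file fan out across the data servers without going through the metadata
server, which is where the scaling is supposed to come from.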
If redundancy is a necessary feature, then maybe Plan A is preferable to Plan
B, since GlusterFS does provide for redundancy and resilvering of lost
copies, at least from my understanding of the docs on gluster.org.
I'd also like to see how GlusterFS performs on a typical Linux setup.

Even without having the nfsd use FUSE, access of GlusterFS via FUSE results
in crossing user (syscall on mount) --> kernel --> user (glusterfs daemon)
within the client machine, if I understand how GlusterFS works. Then the
glusterfsd daemon on the gluster brick server does file system syscall(s) to
get at the actual file on the underlying FS (XFS or ZFS or ...). As such,
there are already a lot of user<->kernel boundary crossings. I wonder how
much delay is added by the extra nfsd step for metadata?
- I can't say much about the performance of Plan A yet, but metadata
operations are slow and latency seems to be the issue. (I actually seem to
get better performance by disabling SMP, for example.) A crude stat() loop,
like the one sketched at the end of this message, is one way to see the
per-operation latency.

> I'd be interested in what Justin had in mind when he asked Matt about this.
> Being able to "attach ZFS pools to one another" in such a fashion that all
> clients just see One Big Pool and ZFS's own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it's implemented.
> 
This would probably work with the extant nfsd and wouldn't have a use for
pNFS. I also agree that this sounds pretty ambitious.

rick

> - Jordan
> 
> 
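PS: For the metadata latency mentioned above, this is the sort of crude
stat() loop I mean. It is only a sketch: the default path is a placeholder,
and client-side attribute caching will hide most of the latency unless it is
disabled or a different file is stat'd on each iteration.

/*
 * Crude metadata-latency probe: time a loop of stat(2) calls against a
 * path on the mount being tested.  The default path and count are
 * placeholders; pass a file on the GlusterFS (or NFS) mount instead.
 */
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
	struct timespec start, end;
	struct stat sb;
	const char *path;
	double elapsed;
	long i, count;

	path = (argc > 1) ? argv[1] : "/mnt/test/afile";	/* placeholder */
	count = (argc > 2) ? strtol(argv[2], NULL, 10) : 10000;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < count; i++) {
		if (stat(path, &sb) != 0) {
			perror("stat");
			exit(1);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%ld stat calls in %.3fs (%.1f us/op)\n", count, elapsed,
	    1e6 * elapsed / count);
	return (0);
}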