From owner-freebsd-fs@freebsd.org Sun Jun 19 23:29:26 2016
Date: Sun, 19 Jun 2016 19:29:12 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jordan Hubbard
Cc: Chris Watson, freebsd-fs <freebsd-fs@freebsd.org>, Alexander Motin
Message-ID: <1845469514.159182764.1466378952929.JavaMail.zimbra@uoguelph.ca>
References: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca>
 <7E27FA25-E18F-41D3-8974-EAE1EACABF38@gmail.com>
Subject: Re: pNFS server Plan B

Jordan Hubbard wrote:
> 
> > On Jun 18, 2016, at 6:14 PM, Chris Watson wrote:
> > 
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs' thoughts here.
> > I know about a year ago he was asked on an "after hours" video chat
> > hosted by Matt Ahrens about a feature he would really like to see and he
> > mentioned he would really like, in a universe filled with time and money
> > I'm sure, to work on a native clustering solution for FreeBSD. I don't
> > know if he is subscribed to the list, and I'm certainly not throwing him
> > under the bus by bringing his name up, but I know he has at least been
> > thinking about this for some time and probably has some value to add here.
> 
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
> 
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level? If you Google for “High-availability ZFS” you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/), but it’s not
> entirely clear how these technologies work; they simply claim to “scale-out
> ZFS” or “cluster ZFS” (which can be done within ZFS or one level above and
> still probably pass the Marketing Test for what people are willing to put
> on a web page).
> 
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
> 
> 3. Are we talking about clustering for data objects, in a fashion which
> does not necessarily provide filesystem semantics (a sharding database
> which can store arbitrary BLOBs would qualify)?
> 
For the pNFS use case I am looking at, I would say #2.

I suspect #1 sits at a low enough level that redirecting I/O via the pNFS
layouts isn't practical, since ZFS is taking care of block allocations, etc.

I see #3 as a separate problem space, since NFS deals with files and not
objects. However, GlusterFS maps file objects on top of the POSIX-like FS,
so I suppose that could be done at the client end. (What glusterfs.org calls
SwiftOnFile, I think?) It is also possible to map POSIX files onto file
objects, but that sounds like more work, which would need to be done under
the NFS service.

> For all of the above: Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
> 
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3-compatible
> object store with horizontal scaling, or if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
> 
I tend to think of this last case as the target for any pNFS server. The
basic idea is to redirect the I/O operations to wherever the data is
actually stored, so that I/O performance doesn't degrade with scale.

If redundancy is a necessary feature, then maybe Plan A is preferable to
Plan B, since GlusterFS does provide for redundancy and resilvering of lost
copies, at least from my understanding of the docs on gluster.org.
I'd also like to see how GlusterFS performs on a typical Linux setup.
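(To make the "redirect the I/O to where the data lives" idea above a little
more concrete, here is a rough, compilable C sketch of the sort of information
a pNFS file layout hands the client so it can do READ/WRITE directly against
the data servers. The names here - file_layout, ds_for_offset, the addresses -
are made up for illustration; this is not the RFC 5661 XDR and not the actual
nfsd structures, just the flavour of the thing:

/*
 * Illustrative sketch only (made-up names, not the RFC 5661 XDR and not
 * the FreeBSD nfsd structures): roughly the information a pNFS file
 * layout gives the client, so it can send READ/WRITE straight to the
 * data server(s) instead of funnelling everything through the MDS.
 */
#include <stdint.h>
#include <stdio.h>

#define FH_MAXSIZE 128	/* generous file handle size, for the sketch */
#define MAX_DS     4	/* data servers a file might be striped across */

struct ds_fh {
	uint32_t len;
	uint8_t  data[FH_MAXSIZE];
};

struct file_layout {		/* hypothetical per-file layout */
	uint64_t offset;	/* byte range this layout covers */
	uint64_t length;
	uint32_t stripe_unit;	/* bytes sent to a DS before moving on */
	uint32_t num_ds;
	char	 ds_addr[MAX_DS][64];	/* where the client sends I/O */
	struct ds_fh ds_fh[MAX_DS];	/* file handle to use on each DS */
};

/* Which data server holds a given file offset under simple striping. */
static uint32_t
ds_for_offset(const struct file_layout *lo, uint64_t off)
{

	return ((uint32_t)((off / lo->stripe_unit) % lo->num_ds));
}

int
main(void)
{
	struct file_layout lo = {
		.offset = 0,
		.length = UINT64_MAX,
		.stripe_unit = 1024 * 1024,
		.num_ds = 2,
		.ds_addr = { "192.0.2.10", "192.0.2.11" },	/* example */
	};

	/* A read at offset 5MB bypasses the MDS and goes to this DS. */
	printf("offset 5MB -> DS %s\n",
	    lo.ds_addr[ds_for_offset(&lo, 5 * 1024 * 1024)]);
	return (0);
}

Once the client holds something like that, the bulk READ/WRITE traffic goes
straight to the data servers and only the metadata operations stay with the
MDS, which is why the metadata latency I mention below matters.)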
Even without having the nfsd use FUSE, access of GlusterFS via FUSE results
in crossing user (syscall on mount) --> kernel --> user (glusterfs daemon)
within the client machine, if I understand how GlusterFS works. Then the
gluster brick server's glusterfsd daemon does file system syscall(s) to get
at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there are already a lot of user<->kernel boundary crossings.
I wonder how much delay is added by the extra nfsd step for metadata?
- I can't say much about the performance of Plan A yet, but metadata
  operations are slow and latency seems to be the issue. (I actually seem
  to get better performance by disabling SMP, for example.)

> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
> 
This would probably work with the extant nfsd and wouldn't have a use for
pNFS. I also agree that this sounds pretty ambitious.

rick

> - Jordan