Date: Tue, 14 Jun 2016 18:34:13 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Doug Rabson
Cc: freebsd-fs, Jordan Hubbard, Alexander Motin
Subject: Re: pNFS server Plan B

Doug Rabson wrote:
> As I mentioned to Rick, I have been working on similar lines to put
> together a pNFS implementation. Comments embedded below.
>
> On 13 June 2016 at 23:28, Rick Macklem wrote:
>
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other
> > than read/write) and I don't think it is very useful.
> >
> > After my informal talk at BSDCan, here are some thoughts I have:
> > - I think the slowness is related to latency w.r.t. all the messages
> > being passed between the nfsd, GlusterFS via Fuse and between the
> > GlusterFS daemons. As such, I don't think faster hardware is likely
> > to help a lot w.r.t. performance.
> > - I have considered switching to MooseFS, but I would still be using
> > Fuse.
> > *** MooseFS uses a centralized metadata store, which would imply only
> > a single Metadata Server (MDS) could be supported, I think?
> > (More on this later...)
> > - dfr@ suggested that avoiding Fuse and doing everything in userspace
> > might help.
> > - I thought of porting the nfsd to userland, but that would be quite
> > a bit of work, since it uses the kernel VFS/VOP interface, etc.
>
> I ended up writing everything from scratch as userland code rather than
> consider porting the kernel code. It was quite a bit of work :)
>
> > All of the above has led me to Plan B.
> > It would be limited to a single MDS, but as you'll see
> > I'm not sure that is as large a limitation as I thought it would be.
> > (If you aren't interested in details of this Plan B design, please
> > skip to "Single Metadata server..." for the issues.)
> >
> > Plan B:
> > - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd
> > would be used for both the MDS and Data Server (DS).)
> > - One FreeBSD server running nfsd would be the MDS. It would build a
> > file system tree that looks exactly like it would without pNFS,
> > except that the files would be empty. (size == 0)
> > --> As such, all the current nfsd code would do metadata operations
> > on this file system exactly like the nfsd does now.
> > - When a new file is created (an Open operation on NFSv4.1), the file
> > would be created exactly like it is now for the MDS.
> > - Then DS(s) would be selected and the MDS would do a Create of a
> > data storage file on these DS(s).
> > (This algorithm could become interesting later, but initially it
> > would probably just pick one DS at random or similar.)
> > - These file(s) would be in a single directory on the DS(s) and would
> > have a file name which is simply the File Handle for this file on
> > the MDS (an FH is 28bytes->48bytes of Hex in ASCII).
>
> I have something similar but using a directory hierarchy to try to
> avoid any one directory being excessively large.

I thought of that, but since no one will be doing an "ls" of it, I wasn't
going to bother doing multiple dirs initially. However, now that I think
of it, the Create and Remove RPCs will end up doing VOP_LOOKUP()s, so
breaking these up into multiple directories sounds like a good idea.
(I may just hash the FH and let the hash choose a directory.)
Good suggestion, thanks.
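Just to make that a bit more concrete, here's the sort of thing I have in
mind for generating a data file's path on a DS (untested sketch only; the
function name, the "dsNN" subdirectory layout and NDSDIRS are all made up,
nothing is implemented yet):

/*
 * Sketch: build the path of a data storage file on a DS from the MDS FH.
 * The file name is just the MDS FH as hex ASCII and a trivial hash of
 * the FH picks one of NDSDIRS subdirectories, so that no one directory
 * gets excessively large.
 */
#include <stdint.h>
#include <stdio.h>

#define	NDSDIRS		64	/* number of subdirectories on a DS */

static void
pnfs_ds_path(const uint8_t *fh, size_t fhlen, char *path, size_t pathlen)
{
	uint32_t hash = 0;
	size_t i, len;

	/* Trivial additive hash of the FH bytes chooses the subdirectory. */
	for (i = 0; i < fhlen; i++)
		hash += fh[i];
	len = (size_t)snprintf(path, pathlen, "ds%u/",
	    (unsigned int)(hash % NDSDIRS));

	/* The file name is the MDS FH in hex ASCII (28->48 bytes of FH). */
	for (i = 0; i < fhlen && len + 2 < pathlen; i++) {
		snprintf(path + len, pathlen - len, "%02x", fh[i]);
		len += 2;
	}
}

Since the name (and the subdirectory) is derived entirely from the MDS FH,
the MDS never needs to remember anything extra to find or Remove the data
file later.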
> > - Extended attributes would be added to the Metadata file for:
> >   - The data file's actual size.
> >   - The DS(s) the data file is on.
> >   - The File Handle for these data files on the DS(s).
> > This would add some overhead to the Open/create, which would be one
> > Create RPC for each DS the data file is on.
>
> An alternative here would be to store the extra metadata in the file
> itself rather than use extended attributes.

Yep. I'm not sure if there is any performance advantage to storing it as
file data vs. extended attributes?

> > *** Initially there would only be one file on one DS. Mirroring for
> > redundancy can be added later.
>
> The scale of filesystem I want to build more or less requires the extra
> redundancy of mirroring so I added this at the start. It does add quite
> a bit of complexity to the MDS to keep track of which DS should have
> which piece of data and to handle DS failures properly, re-silvering
> data etc.
>
> > Now, the layout would be generated from these extended attributes for
> > any NFSv4.1 client that asks for it.
> >
> > If I/O operations (read/write/setattr_of_size) are performed on the
> > Metadata server, it would act as a proxy and do them on the DS using
> > the extended attribute information (doing an RPC on the DS for the
> > client).
> >
> > When the file is removed on the Metadata server (link cnt --> 0), the
> > Metadata server would do Remove RPC(s) on the DS(s) for the data
> > file(s). (This requires the file name, which is just the Metadata FH
> > in ASCII.)
>
> Currently I have a non-nfs control protocol for this but strictly
> speaking it isn't necessary as you note.
>
> > The only addition that the nfsd for the DS(s) would need would be a
> > callback to the MDS done whenever a client (not the MDS) does a write
> > to the file, notifying the Metadata server the file has been modified
> > and is now Size=K, so the Metadata server can keep the attributes up
> > to date for the file. (It can identify the file by the MDS FH.)
>
> I don't think you need this - the client should perform LAYOUTCOMMIT
> rpcs which will inform the MDS of the last write position and last
> modify time. This can be used to update the file metadata. The Linux
> client does this before the CLOSE rpc on the client as far as I can
> tell.

When I developed the NFSv4.1_Files layout client, I had three servers to
test against.
- The Netapp filer just returned EOPNOTSUPP for LayoutCommit.
- The Linux test server (had MDS and DS on the same Linux system)
  accepted the LayoutCommit, but didn't do anything for it, so doing it
  had no effect.
- The only pNFS server I've ever tested against that needed LayoutCommit
  was Oracle/Solaris and the Oracle folk never explained why their server
  required it or what would break if you didn't do it.
  (I don't recall attributes being messed up when I didn't do it
  correctly.)
As such, I've never been sure what it is used for. I need to read the
LayoutCommit stuff in the RFC and Flex Files draft again.
It would be nice if the DS->MDS calls could be avoided for every write.
Doing one when the DS receives a Commit RPC wouldn't be too bad.
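While I'm on it, here is roughly what I'm imagining that extended
attribute would hold (again, just a sketch; the struct and field names
are invented and it only covers the initial one-file-on-one-DS case, no
mirroring):

/*
 * Sketch of the per-file metadata the MDS would keep in an extended
 * attribute on the (empty) Metadata file.  Nothing like this exists in
 * the nfsd yet; names are for illustration only.
 */
#include <stdint.h>

#define	PNFS_FHMAXLEN	48		/* an FH is 28->48 bytes */

struct pnfs_dsattr {
	uint64_t	dsa_size;		/* actual size of the data file */
	uint32_t	dsa_dsid;		/* which DS the data file is on */
	uint16_t	dsa_fhlen;		/* length of the data file's FH */
	uint8_t		dsa_fh[PNFS_FHMAXLEN];	/* FH of the data file on the DS */
};

The layout handed out to a client and the proxied I/O done by the MDS
would both be generated from this, and dsa_size is what would get updated
when the MDS hears about a new size (whether that ends up being via a DS
callback, a Commit or LayoutCommit).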
> > All of this is a relatively small amount of change to the FreeBSD
> > nfsd, so it shouldn't be that much work (I'm a lazy guy looking for a
> > minimal solution;-).
> >
> > Single Metadata server...
> > The big limitation to all of the above is the "single MDS" limitation.
> > I had thought this would be a serious limitation to the design scaling
> > up to large stores. However, I'm not so sure it is a big limitation??
> > 1 - Since the files on the MDS are all empty, the file system is only
> > i-nodes, directories and extended attribute blocks. As such, I hope it
> > can be put on fast storage.
> > *** I don't know anything about current and near term future SSD
> > technologies.
> > Hopefully others can suggest how large/fast a store for the MDS could
> > be built easily?
> > --> I am hoping that it will be possible to build an MDS that can
> > handle a lot of DS/storage this way?
> > (If anyone has access to hardware and something like SpecNFS, they
> > could test an RPC load with almost no Read/Write RPCs and this would
> > probably show about what the metadata RPC limits are for one of
> > these.)
>
> I think a single MDS can scale up to petabytes of storage easily. It
> remains to be seen how far it can scale for TPS. I will note that
> Google's GFS filesystem (you can find a paper describing it at
> http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)
> uses effectively a single MDS, replicated for redundancy but still
> serving just from one master MDS at a time. That filesystem scaled
> pretty well for both data size and transactions so I think the approach
> is viable.
>
> > 2 - Although it isn't quite having multiple MDSs, the directory tree
> > could be split up with an MDS for each subtree. This would allow some
> > scaling beyond one MDS.
> > (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
> > basically an NFS server driven "automount" that redirects the NFSv4.1
> > client to a different server for a subtree. This might be a useful
> > tool for splitting off subtrees to different MDSs?)
> >
> > If you actually read this far, any comments on this would be welcome.
> > In particular, if you have an opinion w.r.t. this single MDS
> > limitation and/or how big an MDS could be built, that would be
> > appreciated.
> >
> > Thanks for any comments, rick
>
> My back-of-envelope calculation assumed a 10 PB filesystem containing
> mostly large files which would be striped in 10 MB pieces. Guessing
> that we need 200 bytes of metadata per piece, that gives around 200 GB
> of metadata which is very reasonable. Even for file sets containing
> much smaller files, a single server should have no trouble storing the
> metadata.

Thanks for all the good comments, rick

ps: Good luck with your pNFS server. Maybe someday it will be available
    for FreeBSD?
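pps: Doug's back-of-envelope numbers, spelled out for my own benefit
     (his assumptions, my arithmetic):
	10PB / 10MB per piece          = 10^9 pieces
	10^9 pieces * 200 bytes apiece = 2 * 10^11 bytes =~ 200GB
     so the metadata for even a very large store should fit comfortably
     on a single MDS.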