Date: Tue, 14 Jun 2016 18:34:13 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Doug Rabson
Cc: freebsd-fs, Jordan Hubbard, Alexander Motin
Subject: Re: pNFS server Plan B

Doug Rabson wrote:
> As I mentioned to Rick, I have been working on similar lines to put
> together a pNFS implementation. Comments embedded below.
>
> On 13 June 2016 at 23:28, Rick Macklem wrote:
>
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other
> > than read/write) and I don't think it is very useful.
> >
> > After my informal talk at BSDCan, here are some thoughts I have:
> > - I think the slowness is related to latency w.r.t. all the messages
> > being passed between the nfsd, GlusterFS via Fuse and between the
> > GlusterFS daemons. As such, I don't think faster hardware is likely
> > to help a lot w.r.t. performance.
> > - I have considered switching to MooseFS, but I would still be using
> > Fuse.
> > *** MooseFS uses a centralized metadata store, which would imply only
> > a single Metadata Server (MDS) could be supported, I think?
> > (More on this later...)
> > - dfr@ suggested that avoiding Fuse and doing everything in userspace
> > might help.
> > - I thought of porting the nfsd to userland, but that would be quite
> > a bit of work, since it uses the kernel VFS/VOP interface, etc.
>
> I ended up writing everything from scratch as userland code rather than
> consider porting the kernel code. It was quite a bit of work :)
>
> > All of the above has led me to Plan B.
> > It would be limited to a single MDS, but as you'll see
> > I'm not sure that is as large a limitation as I thought it would be.
> > (If you aren't interested in details of this Plan B design, please
> > skip to "Single Metadata server..." for the issues.)
> >
> > Plan B:
> > - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd
> > would be used for both the MDS and Data Server (DS).)
> > - One FreeBSD server running nfsd would be the MDS. It would build a
> > file system tree that looks exactly like it would without pNFS,
> > except that the files would be empty. (size == 0)
> > --> As such, all the current nfsd code would do metadata operations
> > on this file system exactly like the nfsd does now.
> > - When a new file is created (an Open operation on NFSv4.1), the file
> > would be created exactly like it is now for the MDS.
> > - Then DS(s) would be selected and the MDS would do a Create of a
> > data storage file on these DS(s).
> > (This algorithm could become interesting later, but initially it
> > would probably just pick one DS at random or similar.)
> > - These file(s) would be in a single directory on the DS(s) and would
> > have a file name which is simply the File Handle for this file on
> > the MDS (an FH is 28bytes->48bytes of Hex in ASCII).
>
> I have something similar but using a directory hierarchy to try to
> avoid any one directory being excessively large.

I thought of that, but since no one will be doing an "ls" of it, I wasn't
going to bother doing multiple dirs initially. However, now that I think
of it, the Create and Remove RPCs will end up doing VOP_LOOKUP()s, so
breaking these up into multiple directories sounds like a good idea.
(I may just hash the FH and let the hash choose a directory.)
Good suggestion, thanks.
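Just to make that a bit more concrete, here's the sort of thing I have in
mind for generating a data file's path on a DS (untested sketch only; the
function name, the "dsNN" subdirectory layout and NDSDIRS are all made up,
nothing is implemented yet):

/*
 * Sketch: build the path of a data storage file on a DS from the MDS FH.
 * The file name is just the MDS FH as hex ASCII and a trivial hash of
 * the FH picks one of NDSDIRS subdirectories, so that no one directory
 * gets excessively large.
 */
#include <stdint.h>
#include <stdio.h>

#define	NDSDIRS		64	/* number of subdirectories on a DS */

static void
pnfs_ds_path(const uint8_t *fh, size_t fhlen, char *path, size_t pathlen)
{
	uint32_t hash = 0;
	size_t i, len;

	/* Trivial additive hash of the FH bytes chooses the subdirectory. */
	for (i = 0; i < fhlen; i++)
		hash += fh[i];
	len = (size_t)snprintf(path, pathlen, "ds%u/",
	    (unsigned int)(hash % NDSDIRS));

	/* The file name is the MDS FH in hex ASCII (28->48 bytes of FH). */
	for (i = 0; i < fhlen && len + 2 < pathlen; i++) {
		snprintf(path + len, pathlen - len, "%02x", fh[i]);
		len += 2;
	}
}

Since the name (and the subdirectory) is derived entirely from the MDS FH,
the MDS never needs to remember anything extra to find or Remove the data
file later.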
> > - Extended attributes would be added to the Metadata file for:
> >   - The data file's actual size.
> >   - The DS(s) the data file is on.
> >   - The File Handle for these data files on the DS(s).
> > This would add some overhead to the Open/create, which would be one
> > Create RPC for each DS the data file is on.
>
> An alternative here would be to store the extra metadata in the file
> itself rather than use extended attributes.

Yep. I'm not sure if there is any performance advantage to storing it as
file data vs. extended attributes?

> > *** Initially there would only be one file on one DS. Mirroring for
> > redundancy can be added later.
>
> The scale of filesystem I want to build more or less requires the extra
> redundancy of mirroring so I added this at the start. It does add quite
> a bit of complexity to the MDS to keep track of which DS should have
> which piece of data and to handle DS failures properly, re-silvering
> data etc.
>
> > Now, the layout would be generated from these extended attributes for
> > any NFSv4.1 client that asks for it.
> >
> > If I/O operations (read/write/setattr_of_size) are performed on the
> > Metadata server, it would act as a proxy and do them on the DS using
> > the extended attribute information (doing an RPC on the DS for the
> > client).
> >
> > When the file is removed on the Metadata server (link cnt --> 0), the
> > Metadata server would do Remove RPC(s) on the DS(s) for the data
> > file(s). (This requires the file name, which is just the Metadata FH
> > in ASCII.)
>
> Currently I have a non-nfs control protocol for this but strictly
> speaking it isn't necessary as you note.
>
> > The only addition that the nfsd for the DS(s) would need would be a
> > callback to the MDS done whenever a client (not the MDS) does a write
> > to the file, notifying the Metadata server the file has been modified
> > and is now Size=K, so the Metadata server can keep the attributes up
> > to date for the file. (It can identify the file by the MDS FH.)
>
> I don't think you need this - the client should perform LAYOUTCOMMIT
> rpcs which will inform the MDS of the last write position and last
> modify time. This can be used to update the file metadata. The Linux
> client does this before the CLOSE rpc on the client as far as I can
> tell.

When I developed the NFSv4.1_Files layout client, I had three servers to
test against.
- The Netapp filer just returned EOPNOTSUPP for LayoutCommit.
- The Linux test server (had MDS and DS on the same Linux system)
  accepted the LayoutCommit, but didn't do anything for it, so doing it
  had no effect.
- The only pNFS server I've ever tested against that needed LayoutCommit
  was Oracle/Solaris and the Oracle folk never explained why their server
  required it or what would break if you didn't do it.
  (I don't recall attributes being messed up when I didn't do it
  correctly.)
As such, I've never been sure what it is used for. I need to read the
LayoutCommit stuff in the RFC and Flex Files draft again.
It would be nice if the DS->MDS calls could be avoided for every write.
Doing one when the DS receives a Commit RPC wouldn't be too bad.
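While I'm on it, here is roughly what I'm imagining that extended
attribute would hold (again, just a sketch; the struct and field names
are invented and it only covers the initial one-file-on-one-DS case, no
mirroring):

/*
 * Sketch of the per-file metadata the MDS would keep in an extended
 * attribute on the (empty) Metadata file.  Nothing like this exists in
 * the nfsd yet; names are for illustration only.
 */
#include <stdint.h>

#define	PNFS_FHMAXLEN	48		/* an FH is 28->48 bytes */

struct pnfs_dsattr {
	uint64_t	dsa_size;		/* actual size of the data file */
	uint32_t	dsa_dsid;		/* which DS the data file is on */
	uint16_t	dsa_fhlen;		/* length of the data file's FH */
	uint8_t		dsa_fh[PNFS_FHMAXLEN];	/* FH of the data file on the DS */
};

The layout handed out to a client and the proxied I/O done by the MDS
would both be generated from this, and dsa_size is what would get updated
when the MDS hears about a new size (whether that ends up being via a DS
callback, a Commit or LayoutCommit).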
> > All of this is a relatively small amount of change to the FreeBSD
> > nfsd, so it shouldn't be that much work (I'm a lazy guy looking for a
> > minimal solution;-).
> >
> > Single Metadata server...
> > The big limitation to all of the above is the "single MDS" limitation.
> > I had thought this would be a serious limitation to the design scaling
> > up to large stores. However, I'm not so sure it is a big limitation??
> > 1 - Since the files on the MDS are all empty, the file system is only
> > i-nodes, directories and extended attribute blocks. As such, I hope it
> > can be put on fast storage.
> > *** I don't know anything about current and near term future SSD
> > technologies.
> > Hopefully others can suggest how large/fast a store for the MDS could
> > be built easily?
> > --> I am hoping that it will be possible to build an MDS that can
> > handle a lot of DS/storage this way?
> > (If anyone has access to hardware and something like SpecNFS, they
> > could test an RPC load with almost no Read/Write RPCs and this would
> > probably show about what the metadata RPC limits are for one of
> > these.)
>
> I think a single MDS can scale up to petabytes of storage easily. It
> remains to be seen how far it can scale for TPS. I will note that
> Google's GFS filesystem (you can find a paper describing it at
> http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)
> uses effectively a single MDS, replicated for redundancy but still
> serving just from one master MDS at a time. That filesystem scaled
> pretty well for both data size and transactions so I think the approach
> is viable.
>
> > 2 - Although it isn't quite having multiple MDSs, the directory tree
> > could be split up with an MDS for each subtree. This would allow some
> > scaling beyond one MDS.
> > (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
> > basically an NFS server driven "automount" that redirects the NFSv4.1
> > client to a different server for a subtree. This might be a useful
> > tool for splitting off subtrees to different MDSs?)
> >
> > If you actually read this far, any comments on this would be welcome.
> > In particular, if you have an opinion w.r.t. this single MDS
> > limitation and/or how big an MDS could be built, that would be
> > appreciated.
> >
> > Thanks for any comments, rick
>
> My back-of-envelope calculation assumed a 10 PB filesystem containing
> mostly large files which would be striped in 10 MB pieces. Guessing
> that we need 200 bytes of metadata per piece, that gives around 200 GB
> of metadata which is very reasonable. Even for file sets containing
> much smaller files, a single server should have no trouble storing the
> metadata.

Thanks for all the good comments, rick

ps: Good luck with your pNFS server. Maybe someday it will be available
    for FreeBSD?
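pps: Doug's back-of-envelope numbers, spelled out for my own benefit
     (his assumptions, my arithmetic):
	10PB / 10MB per piece          = 10^9 pieces
	10^9 pieces * 200 bytes apiece = 2 * 10^11 bytes =~ 200GB
     so the metadata for even a very large store should fit comfortably
     on a single MDS.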