Date: Mon, 13 Jun 2016 18:28:45 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: freebsd-fs <freebsd-fs@freebsd.org>
Cc: Jordan Hubbard <jkh@ixsystems.com>, Doug Rabson <dfr@rabson.org>, Alexander Motin <mav@freebsd.org>
Subject: pNFS server Plan B
Message-ID: <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca>

You may have already heard of Plan A, which sort of worked and you could
test by following the instructions here:

http://people.freebsd.org/~rmacklem/pnfs-setup.txt

However, it is very slow for metadata operations (everything other than
read/write) and I don't think it is very useful.

After my informal talk at BSDCan, here are some thoughts I have:
- I think the slowness is related to latency w.r.t. all the messages
  being passed between the nfsd, GlusterFS via Fuse and between the
  GlusterFS daemons. As such, I don't think faster hardware is likely to
  help a lot w.r.t. performance.
- I have considered switching to MooseFS, but I would still be using Fuse.
  *** MooseFS uses a centralized metadata store, which would imply only
      a single Metadata Server (MDS) could be supported, I think?
      (More on this later...)
- dfr@ suggested that avoiding Fuse and doing everything in userspace
  might help.
- I thought of porting the nfsd to userland, but that would be quite a
  bit of work, since it uses the kernel VFS/VOP interface, etc.

All of the above has led me to Plan B.
It would be limited to a single MDS, but as you'll see I'm not sure that
is as large a limitation as I thought it would be.
(If you aren't interested in details of this Plan B design, please skip
to "Single Metadata server..." for the issues.)

Plan B:
- Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
  be used for both the MDS and Data Server (DS).)
- One FreeBSD server running nfsd would be the MDS. It would build a file
  system tree that looks exactly like it would without pNFS, except that
  the files would be empty. (size == 0)
  --> As such, all the current nfsd code would do metadata operations on
      this file system exactly like the nfsd does now.
- When a new file is created (an Open operation on NFSv4.1), the file
  would be created exactly like it is now for the MDS.
- Then DS(s) would be selected and the MDS would do a Create of a data
  storage file on these DS(s).
  (This algorithm could become interesting later, but initially it would
  probably just pick one DS at random or similar.)
- These file(s) would be in a single directory on the DS(s) and would
  have a file name which is simply the File Handle for this file on the
  MDS (an FH is 28bytes->48bytes of Hex in ASCII).
- Extended attributes would be added to the Metadata file for:
  - The data file's actual size.
  - The DS(s) the data file is on.
  - The File Handle for these data files on the DS(s).
  This would add some overhead to the Open/create, which would be one
  Create RPC for each DS the data file is on.
  *** Initially there would only be one file on one DS. Mirroring for
      redundancy can be added later.

Now, the layout would be generated from these extended attributes for any
NFSv4.1 client that asks for it.

If I/O operations (read/write/setattr_of_size) are performed on the
Metadata server, it would act as a proxy and do them on the DS using the
extended attribute information (doing an RPC on the DS for the client).

When the file is removed on the Metadata server (link cnt --> 0), the
Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
(This requires the file name, which is just the Metadata FH in ASCII.)

The only addition that the nfsd for the DS(s) would need would be a
callback to the MDS done whenever a client (not the MDS) does a write to
the file, notifying the Metadata server the file has been modified and is
now Size=K, so the Metadata server can keep the attributes up to date for
the file. (It can identify the file by the MDS FH.)

All of this is a relatively small amount of change to the FreeBSD nfsd,
so it shouldn't be that much work (I'm a lazy guy looking for a minimal
solution;-).

Single Metadata server...

The big limitation to all of the above is the "single MDS" limitation.
I had thought this would be a serious limitation to the design scaling
up to large stores.
However, I'm not so sure it is a big limitation??
1 - Since the files on the MDS are all empty, the file system is only
    i-nodes, directories and extended attribute blocks.
    As such, I hope it can be put on fast storage.
    *** I don't know anything about current and near term future SSD
        technologies. Hopefully others can suggest how large/fast a
        store for the MDS could be built easily?
    --> I am hoping that it will be possible to build an MDS that can
        handle a lot of DS/storage this way?
    (If anyone has access to hardware and something like SpecNFS, they
    could test an RPC load with almost no Read/Write RPCs and this would
    probably show about what the metadata RPC limits are for one of
    these.)
2 - Although it isn't quite having multiple MDSs, the directory tree
    could be split up with an MDS for each subtree. This would allow
    some scaling beyond one MDS.
    (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
    basically an NFS server driven "automount" that redirects the
    NFSv4.1 client to a different server for a subtree. This might be a
    useful tool for splitting off subtrees to different MDSs?)

If you actually read this far, any comments on this would be welcome.
In particular, if you have an opinion w.r.t. this single MDS limitation
and/or how big an MDS could be built, that would be appreciated.

Thanks for any comments, rick
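
Appended sketch (not part of the nfsd, and not code from the original
message): a toy in-memory model of the Plan B bookkeeping described
above, in case it helps to see the steps side by side. It assumes one DS
per file, chosen at random, with the data file named by the MDS File
Handle in hex. All class, method and key names here (DataServer,
MetadataServer, open_create, write_callback, "size", "ds", "ds_fh") are
made up for illustration, and plain dicts stand in for extended
attributes and for the RPCs between the MDS and DS.

```python
import random
from dataclasses import dataclass, field


def ds_file_name(mds_fh: bytes) -> str:
    """Data file name on a DS: the MDS file handle as hex ASCII."""
    return mds_fh.hex()


@dataclass
class DataServer:
    """Hypothetical stand-in for a DS: one flat directory of data files."""
    name: str
    files: dict = field(default_factory=dict)  # file name -> bytearray

    def create(self, fname: str):
        # Stand-in for the Create RPC the MDS would issue to the DS.
        self.files[fname] = bytearray()
        return (self.name, fname)  # placeholder for the DS file handle

    def remove(self, fname: str):
        # Stand-in for the Remove RPC the MDS would issue to the DS.
        self.files.pop(fname, None)


@dataclass
class MetadataServer:
    """Hypothetical stand-in for the MDS; xattrs is keyed by MDS FH."""
    data_servers: list
    xattrs: dict = field(default_factory=dict)

    def open_create(self, mds_fh: bytes, rng=random):
        # Pick one DS at random and create the data file there, then
        # record size, DS and DS file handle on the metadata file.
        ds = rng.choice(self.data_servers)
        ds_fh = ds.create(ds_file_name(mds_fh))
        self.xattrs[mds_fh] = {"size": 0, "ds": ds.name, "ds_fh": ds_fh}

    def layout(self, mds_fh: bytes):
        # The layout handed to a client is derived from the xattrs.
        x = self.xattrs[mds_fh]
        return {"ds": x["ds"], "ds_fh": x["ds_fh"]}

    def write_callback(self, mds_fh: bytes, new_size: int):
        # DS -> MDS callback after a client write: update the size.
        self.xattrs[mds_fh]["size"] = new_size

    def remove(self, mds_fh: bytes):
        # Link count reached 0: remove the data file by its FH name.
        x = self.xattrs.pop(mds_fh)
        ds = next(d for d in self.data_servers if d.name == x["ds"])
        ds.remove(ds_file_name(mds_fh))
```

Usage would be along the lines of MetadataServer([DataServer("ds0"),
DataServer("ds1")]), then open_create(fh), layout(fh),
write_callback(fh, size) and finally remove(fh). Mirroring, the proxy
I/O path and real RPC error handling are left out on purpose.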