Date: Fri, 7 Aug 2015 20:28:20 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Julian Elischer <julian@freebsd.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: changes to export whole FS hierarhy to mount it with one command on client? [changed subject]
Message-ID: <1970606860.13124005.1438993700838.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <55C037D0.1000606@freebsd.org>
References: <795246861.20150801140429@serebryakov.spb.ru>
 <1363497421.7238055.1438428070047.JavaMail.zimbra@uoguelph.ca>
 <1593307781.20150801143052@serebryakov.spb.ru>
 <55BEE668.3080303@freebsd.org>
 <67101638.8226696.1438604713620.JavaMail.zimbra@uoguelph.ca>
 <55BFC58C.6030802@freebsd.org>
 <987522757.8576059.1438636467059.JavaMail.zimbra@uoguelph.ca>
 <55C037D0.1000606@freebsd.org>
Julian Elischer wrote:
> On 8/4/15 5:14 AM, Rick Macklem wrote:
> > Julian Elischer wrote:
> >> On 8/3/15 8:25 PM, Rick Macklem wrote:
> >>> Julian Elischer wrote:
> >>>> On 8/1/15 7:30 PM, Lev Serebryakov wrote:
> >>>>> Hello Rick,
> >>>>>
> >>>>> Saturday, August 1, 2015, 2:21:10 PM, you wrote:
> >>>>>
> >>>>>> To mount multiple file systems as one mount, you'll need to use
> >>>>>> NFSv4. I believe you will have to have a separate export entry in
> >>>>>> the server for each of the file systems.
> >>>>> So, /etc/exports needs to have BOTH v3-style exports & a V4: root
> >>>>> of tree line?
> >>>> OR you can have a non-standard patch that pjd wrote to do recursive
> >>>> mounts of sub-filesystems. It is not supposed to happen according
> >>>> to the standard, but we have found it useful. Unfortunately it is
> >>>> written against the old NFS server.
> >>>>
> >>>> Rick, if I gave you the original pjd patch for the old server, could
> >>>> you integrate it into the new server as an option?
> >>>>
> >>> A patch like this basically inserts the file system volume identifier
> >>> in the high-order bits of the fileid# (inode# if you prefer), so that
> >>> duplicate fileid#s don't show up in a "consolidated file system" (for
> >>> want of a better term). It also replies with the same "fake" fsid for
> >>> all volumes involved.
> >>>
> >>> I see certain issues w.r.t. this:
> >>> 1 - What happens when the exported volumes are disjoint and don't
> >>>     form one tree? (I think any such option should be restricted to
> >>>     volumes that form a tree, but I don't know an easy way to enforce
> >>>     that restriction.)
> >>> 2 - It would be fine at this point to use the high-order bits of the
> >>>     fileid#, since NFSv3 defines it as 64 bits and FreeBSD's ino_t is
> >>>     32 bits. However, I believe FreeBSD is going to have to increase
> >>>     ino_t to 64 bits soon. (I hope such a patch will be in FreeBSD 11.)
> >>>     Once ino_t is 64 bits, this option would have to assume that
> >>>     some # of the high-order bits of the fileid# are always 0.
> >>>     Something like "the high-order 24 bits are always 0" would work
> >>>     ok for a while; then someone would build a file system large
> >>>     enough to overflow the 40-bit field (I know that's a lot, but
> >>>     some are already exceeding 32 bits for # of fileids) and cause
> >>>     trouble.
> >>> 3 - You could get weird behaviour when the tree includes exports
> >>>     with different export options. This discussion includes just
> >>>     that, and NFSv3 clients don't expect things to change within a
> >>>     mount. (An example would be having part of this consolidated
> >>>     tree require Kerberos authentication. Another might be having
> >>>     parts of the consolidated tree use different uid mapping for
> >>>     AUTH_SYS.)
> >>> 4 - Some file systems (msdosfs, i.e. FAT) have limited capabilities
> >>>     w.r.t. what the NFS server can do to the file system. If one of
> >>>     these was embedded in the consolidated tree, then it could cause
> >>>     confusion similar to #3.
> >>>
> >>> All in all, the "hack" is relatively easy to do, if:
> >>> you use one kind of file system (for example ZFS) and make
> >>> everything you are exporting one directory tree which is all
> >>> exported in a compatible way. You also "know" that all the fileid#s
> >>> in the underlying file systems will fit in the low-order K bits of
> >>> the 64-bit fileid#.
> >>>
> >>> My biggest concern is #2, once ino_t becomes 64 bits.
> >>>
> >>> If the collective thinks this is a good idea despite the issues
> >>> above and can propose a good way to do it, fine. (Maybe an export
> >>> flag for all the volumes that will participate in the "consolidated
> >>> file system"? The exports(5) man page could then try to clearly
> >>> explain the limitations of its use, etc.
> >>> Even with that, I suspect some would misuse the option and cause
> >>> themselves grief.)
> >>>
> >>> Personally, since NFSv4 does this correctly, I don't see a need to
> >>> "hack it" for NFSv3, but I'll leave it up to the collective.
> >>>
> >>> rick
> >>> ps: Julian, you might want to repost this under a separate subject
> >>>     line, so people not interested in how ZFS can export multiple
> >>>     volumes without separate entries will read it.
> >>>
> >> In our environment we need to export V3 (and maybe even V2) in a
> >> single hierarchy, even though it's multiple ZFS filesystems.
> >> It's not dissimilar to having a separate ZFS for each user, except
> >> in this case it's a separate ZFS for each site.
> >> The "modified ZFS" filesystems have very special characteristics. We
> >> are only having our very first nibbles (questions) about NFSv4.
> >> Until now it's all NFSv3. Possibly we'd only have to support it for
> >> NFSv3 if V4 can use its native mechanisms.
> >>
> > Sure. You have a particular environment where it is useful and you
> > understand how to use it in that situation. I could do it here in
> > about 10 minutes and would do so if I needed it myself. The trick is
> > that I understand what is going on and the limitations w.r.t. doing
> > it.
> >
> > If you know your file systems are all in one directory hierarchy
> > (tree), all are ZFS, none of them ever generate fileid#s that don't
> > fit in 32 bits, and you are exporting them all in the same way, it's
> > pretty easy to do. Unfortunately, that isn't what generic NFS server
> > support for FreeBSD does. (If this is done, I think it has to be
> > somehow restricted to the above, or at least documented that it only
> > works for the above cases.)
> >
> > Since an NFSv2 fileid# is 32 bits, I don't believe this is practical
> > for NFSv2, and I don't think anyone would care.
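[Editor's note: the fileid# packing Rick describes above can be sketched as follows. This is an illustrative sketch only, not the pjd patch itself; the function and constant names are invented for the example.]

```python
# Illustrative sketch (NOT the actual patch): build a unique 64-bit NFSv3
# fileid by packing a small per-filesystem index into the high-order bits
# and the filesystem's native 32-bit ino_t into the low-order bits.

INO_BITS = 32                  # FreeBSD's ino_t width at the time
INO_MASK = (1 << INO_BITS) - 1

def consolidated_fileid(fs_index, ino):
    """Return a 64-bit fileid unique across the consolidated export."""
    if ino != (ino & INO_MASK):
        # The check Rick suggests: complain loudly if the underlying
        # filesystem ever hands back a fileid with high-order bits set.
        raise ValueError("fileid %#x does not fit in %d bits" % (ino, INO_BITS))
    return (fs_index << INO_BITS) | ino

# Two member volumes can reuse the same inode number, yet the packed
# fileids stay distinct across the "consolidated file system".
id_a = consolidated_fileid(0, 12345)
id_b = consolidated_fileid(1, 12345)
```

This is also where Rick's issue #2 shows up: once ino_t itself is 64 bits, there are no spare high-order bits left unless you assume the underlying filesystem never uses them.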
> > Since NFSv4 does this "out of the box", I think the question is
> > whether or not it should be done for NFSv3?
> >
> > The challenge would be to put it in FreeBSD in a way that people who
> > don't necessarily understand what is "behind the curtain" can use it
> > effectively and not run into problems. (An example being the previous
> > thread, where the different file systems are being created with
> > different characteristics for different users. That could be
> > confusing if the sysadmin thought it was "one volume".)
> > I'll leave whether or not to do this up to the collective. (This is
> > yet another one of those "easy to code" but hard to tell if it is the
> > correct thing to do situations.)
> > If done, I'd suggest:
> > - Restricted to one file system type (ZFS or UFS or ...). The code
> >   would probably have file system specifics in it. The correct way
> >   to do this will be different for ZFS than UFS, I think?
> > - A check for any fileid# that has high-order bits set, which would
> >   syslog an error.
> > - Enabled by an export option, so it doesn't automatically apply to
> >   all file systems on the server. This also provides a place for it
> >   to be documented, including limitations.
Oh, I remembered another issue related to this:
- Until FreeBSD changes ino_t to 64 bits, any "virtual consolidated
  file system" from an NFSv3 server needs to provide fileid#s that are
  unique in the low-order 32 bits so that it won't break the FreeBSD
  client.
  --> I think this implies that it shouldn't go into head until after
  ino_t becomes 64 bits, at the earliest. Even then, this option would
  break older FreeBSD clients. (The documentation for the export option
  would need to mention this.) Without 64-bit ino_t, the fileid the
  server puts on the wire would need to be something like 24 bits from
  the file system on the server + 8 bits for which file system it is
  (to uniquify the 32-bit value).
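[Editor's note: the 32-bit-safe layout Rick outlines (24 bits of fileid + 8 bits of filesystem index) can be illustrated like this. The constants and function name are invented for the example, not taken from any patch.]

```python
# Hedged illustration: until clients have 64-bit ino_t, the on-the-wire
# fileid must stay unique within its low-order 32 bits, e.g. 24 bits
# from the underlying filesystem plus an 8-bit index identifying which
# member filesystem it came from.

FILEID_BITS = 24   # fileid bits taken from the underlying filesystem
FS_BITS = 8        # uniquifies at most 256 member volumes

def wire_fileid32(fs_index, ino):
    """Pack a 32-bit on-the-wire fileid, failing loudly on overflow."""
    if ino >= (1 << FILEID_BITS):
        # Exactly the restriction Rick calls too tight below: the
        # underlying fileid must fit in only 24 bits.
        raise OverflowError("fileid needs more than 24 bits")
    if fs_index >= (1 << FS_BITS):
        raise OverflowError("more than 256 member filesystems")
    return (fs_index << FILEID_BITS) | ino
```

The 24-bit ceiling is the weak point: a filesystem with more than ~16.7 million fileids overflows the scheme, which motivates Rick's objection immediately below.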
Expecting the fileid# to fit in 24 bits on the underlying file system
seems too restrictive to me.

rick

> Well obviously I would like it, because we need it and I don't want to
> have to maintain patches going forward.
> If it is in the tree YOU work on, then it would automatically get
> updated as needed. The mount option is nice, but at the moment we just
> have it wired on, and only export a single (PZFS) hierarchy. (PZFS is
> our own heavily modified version of ZFS that uses Amazon storage(*) as
> a block backend in parallel with the local drives (which are more a
> cache.. the cloud is authoritative).
>
> (*) a gross simplification.
>
> Different parts of the hierarchy may actually be different cloud
> 'buckets', (e.g. theoretically some could be Amazon and some could be
> Google cloud storage). These sub-filesystems are unified as a
> hierarchy of ZFS filesystems into a single storage hierarchy via PZFS
> and exported to the user via NFS and CIFS/Samba.
>
> If I need to maintain a separate set of changes for the option then
> that's life, but it's of course preferable to me to have it upstreamed.
>
> p.s. to any filesystem types.. yes, we are hiring FreeBSD filesystem
> people..
> http://panzura.com/company/careers-panzura/senior-software-engineer/..
> resumes via me for fast track :-) ..
>
> > Anyhow, if anyone has an opinion on whether or not this should be in
> > FreeBSD, please post, rick
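[Editor's note: for readers following Lev's question at the top of the thread, the answer (both per-filesystem v3-style lines and a single V4: root line) looks roughly like this in /etc/exports. Paths and the network are hypothetical, for illustration only.]

```
# Hypothetical /etc/exports fragment.
# NFSv3: one export line per server file system:
/export/site1 -ro -network 192.168.1.0 -mask 255.255.255.0
/export/site2 -ro -network 192.168.1.0 -mask 255.255.255.0
# NFSv4: the V4: line sets the root of the tree; NFSv4 clients can then
# mount /export once and cross into both file systems:
V4: /export -network 192.168.1.0 -mask 255.255.255.0
```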