Date:      Sat, 22 Dec 2018 01:46:06 +0100
From:      Peter Eriksson <peter@ifm.liu.se>
To:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: Suggestion for hardware for ZFS fileserver
Message-ID:  <D0E7579B-2768-46DB-94CF-DBD23259E74B@ifm.liu.se>
In-Reply-To: <YQBPR01MB038805DBCCE94383219306E1DDB80@YQBPR01MB0388.CANPRD01.PROD.OUTLOOK.COM>
References:  <CAEW+ogZnWC07OCSuzO7E4TeYGr1E9BARKSKEh9ELCL9Zc4YY3w@mail.gmail.com> <C839431D-628C-4C73-8285-2360FE6FFE88@gmail.com> <CAEW+ogYWKPL5jLW2H_UWEsCOiz=8fzFcSJ9S5k8k7FXMQjywsw@mail.gmail.com> <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net> <3160F105-85C1-4CB4-AAD5-D16CF5D6143D@ifm.liu.se> <YQBPR01MB038805DBCCE94383219306E1DDB80@YQBPR01MB0388.CANPRD01.PROD.OUTLOOK.COM>



> On 22 Dec 2018, at 00:49, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Peter Eriksson wrote:
> [good stuff snipped]
>> This has caused some interesting problems…
>>
>> First thing we noticed was that booting would take forever… Mounting
>> the 20-100k filesystems _and_ enabling them to be shared via NFS is
>> not done efficiently at all (for each filesystem it re-reads
>> /etc/zfs/exports (a couple of times) before appending one line to the
>> end). Repeat 20-100,000 times… Not to mention the big kernel lock for
>> NFS ("hold all NFS activity while we flush and reinstall all sharing
>> information per filesystem") being done by mountd…
> Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> file systems would have been a large server. Scaling to 10,000 or more
> file systems wasn't even conceivable back then.

Yeah, for a normal user with non-silly numbers of filesystems this is a
non-issue. Anyway, it's the kind of issue I like thinking about how to
solve. It's fun :-)


>> Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the
>> horribly slow /etc/zfs/exports text file.
>> Wish list item #2: A reimplementation of mountd and the kernel
>> interface to allow a "diff" between the contents of the DB-based
>> sharetab above to be fed into the kernel, instead of the brute-force
>> way it's done now.
> The parser in mountd for /etc/exports is already an ugly beast and I
> think implementing a "diff" version will be difficult, especially
> figuring out what needs to be deleted.

Yeah, I tried to decode it (this summer) and I think I sort of got the
hang of it eventually.


> I do have a couple of questions related to this:
> 1 - Would your case work if there was an "add these lines to /etc/exports"?
>     (Basically adding entries for file systems, but not trying to delete
>      anything previously exported. I am not a ZFS guy, but I think ZFS
>      just generates another exports file and then gets mountd to export
>      everything again.)

Yeah, the ZFS library that the zfs commands use just reads and updates the
separate /etc/zfs/exports text file (and has mountd read both /etc/exports
and /etc/zfs/exports). The problem is that what it does when you tell it
to "zfs mount -a" (mount all filesystems in all zpools) is basically this
big loop (pseudocode):

for P in $(zpool list -H -o name); do
  for Z in $(zfs list -H -o name -t filesystem,snapshot -r "$P"); do
    zfs mount "$Z"
    if [ "$(zfs get -H -o value sharenfs "$Z")" != "off" ]; then
      # open /etc/zfs/exports and scan it line by line: if a line
      # matching $Z is found, replace it with the new options,
      # otherwise append one; then close /etc/zfs/exports
      kill -HUP "$(cat /var/run/mountd.pid)"
      # (mountd then re-opens /etc/exports and /etc/zfs/exports
      # and does its magic)
    fi
  done
done

All wrapped up in a Solaris compatibility layer in libzfs. Actually, I
think it even reads the /etc/zfs/exports file twice per loop iteration
due to some abstractions. So with N filesystems the exports file gets
reread on the order of N times, i.e. a full "zfs mount -a" does roughly
N² line reads of it. Btw, things got really "fun" when the hourly
snapshots we were taking (adding 10-20k new snapshots every hour, and we
didn't expire them fast enough in the beginning) triggered the code above
and it took longer than an hour to execute - mountd was 100% busy being
signalled, rereading, flushing and reinstalling exports into the kernel
all the time - and basically never finished. Luckily we didn't have any
NFS clients accessing the servers at that time :-)

This summer I wrote some code that instead uses a B-tree BerkeleyDB file,
and modified the libzfs code and the mountd daemon to use that database
for much faster lookups (no need to read the whole /etc/zfs/exports file
all the time) and additions. Worked pretty well actually, and wasn't that
hard to add. I also wanted to add the possibility of specifying "exports"
arguments Solaris-style, so one could say things like:

	/export/staff	vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3

But I never finished that (Solaris-style exports options) part…
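(For comparison, and just as an untested sketch: expressing roughly that
same policy in today's exports(5) syntax takes several lines, with the
vers=4 part handled by a separate V4: line rather than a per-line option.
Something like:

	V4: /export -sec=krb5:krb5i:krb5p,sys
	/export/staff -sec=krb5:krb5i:krb5p -network 130.236.0.0/16
	/export/staff -sec=sys -ro -network 130.236.160.0/24
	/export/staff -sec=sys -ro 10.1.2.3

The Solaris-style syntax condenses all of that into one entry per path.)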

We've lately been toying with putting the NFS sharing stuff into a
separate "private" ZFS attribute (separate from the official "sharenfs"
one) and having another tool read those instead and generate another
"exports" file, so that the file can be generated in "one go" and mountd
signalled just once, after all filesystems have been mounted.
Unfortunately that would mean nothing is shared until after all of them
have been mounted, but we think it would take less time all in all.
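Roughly like this (a minimal sketch; the property name
"se.liu.ifm:sharenfs", the /etc/exports.zfs path, and the mountpoint
assumption are all made up for illustration):

	#!/bin/sh
	# Filesystems get tagged once with a private user property, e.g.
	#   zfs set se.liu.ifm:sharenfs='-sec=krb5:krb5i:krb5p -network 130.236.0.0/16' export/staff
	# After "zfs mount -a" has finished, rebuild the exports file in
	# one pass and signal mountd exactly once.
	zfs get -H -t filesystem -o name,value -s local,inherited \
	        se.liu.ifm:sharenfs |
	while read -r fs opts; do
	        # assumes the mountpoint is /<dataset name>; otherwise
	        # look it up with "zfs get -H -o value mountpoint $fs"
	        printf '/%s\t%s\n' "$fs" "$opts"
	done > /etc/exports.zfs
	kill -HUP "$(cat /var/run/mountd.pid)"  # one reread instead of 20-100k

mountd(8) takes a list of export files as arguments, so the generated
file can be handed to it alongside /etc/exports.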

We also modified the FreeBSD boot scripts so that we first mount the most
important ZFS filesystems needed on the boot disks (not just /), and then
mount (and share via NFS) the rest in the background, so we can log in to
the machine as root early (no need for everything to have been mounted
before we get a login prompt).
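The ordering trick looks roughly like this (dataset names invented; the
real change lives in the rc scripts):

	# Mount what an interactive root login depends on first...
	for fs in zroot/usr zroot/var zroot/tmp; do
	        zfs mount "$fs"
	done
	# ...then mount and export the remaining tens of thousands of
	# filesystems in the background while logins already work.
	( zfs mount -a && kill -HUP "$(cat /var/run/mountd.pid)" ) &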

(Right now a reboot of the bigger servers takes an hour or two before all
filesystems are mounted and exported.)

> 2 - Are all (or maybe most) of these ZFS file systems exported with the
>      same arguments?
>      - Here I am thinking that a "default-for-all-ZFS-filesystems" line
>         could be put in /etc/exports that would apply to all ZFS file
>         systems not exported by explicit lines in the exports file(s).
>      This would be fairly easy to implement and would avoid trying to
>      handle 1000s of entries.

For us most have exactly the same exports arguments. (We set the options
on the top-level filesystems (/export/staff, /export/students etc.) and
all the home dirs below them inherit those.)
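For example (the option values here are illustrative, not our exact
setup; on FreeBSD the sharenfs value is just a set of exports(5) options):

	# Set the options once at the top level...
	zfs set sharenfs='-sec=krb5:krb5i:krb5p -network 130.236.0.0/16' export/staff
	# ...and every child filesystem (each home dir) inherits them:
	zfs get -r -o name,value,source sharenfs export/staff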

> In particular, #2 above could be easily implemented on top of what is
> already there, using a new type of line in /etc/exports and handling
> that as a special case by the NFS server code, when no specific export
> for the file system to the client is found.
>
>> (I've written some code that implements item #1 above and it helps
>> quite a bit. Nothing near production quality yet though. I have looked
>> at item #2 a bit too but not done anything about it.)
> [more good stuff snipped]
> Btw, although I put the questions here, I think a separate thread
> discussing how to scale to 10000+ file systems might be useful. (On
> freebsd-fs@ or freebsd-current@. The latter sometimes gets the attention
> of more developers.)

Yeah, probably a good idea!

- Peter

> rick



