Date:      Fri, 21 Dec 2018 19:53:58 +0100
From:      Peter Eriksson <peter@ifm.liu.se>
To:        freebsd-fs@freebsd.org
Subject:   Re: Suggestion for hardware for ZFS fileserver
Message-ID:  <3160F105-85C1-4CB4-AAD5-D16CF5D6143D@ifm.liu.se>
In-Reply-To: <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net>
References:  <CAEW%2BogZnWC07OCSuzO7E4TeYGr1E9BARKSKEh9ELCL9Zc4YY3w@mail.gmail.com> <C839431D-628C-4C73-8285-2360FE6FFE88@gmail.com> <CAEW%2BogYWKPL5jLW2H_UWEsCOiz=8fzFcSJ9S5k8k7FXMQjywsw@mail.gmail.com> <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net>

Just to add a few pointers based on our experience from our FreeBSD-based filers.

> 1. RAM *must* be ECC.  No wiggle room here.  Undetected RAM corruption

100% agreed. ECC is definitely the way to go!

> 2. More RAM is better, up to a point, in that cache is faster than disk
> I/O in all cases as operations are avoided.  HOWEVER, there are
> pathologies in both the FreeBSD VFS and the ARC when considered as a
> group.  I and others have tried to eliminate the pathological behavior
> under certain workloads (and some of us have had long-running debates on
> same.)  Therefore, your workload must be considered -- simply saying
> "more is better" may not be correct for your particular circumstances.

Yes, our servers all have 256GB RAM and we’ve been having some performance issues every now and then that have forced us to adjust a couple of kernel settings in order to minimise the impact on the users. I’m sure we’ll run into more in the future.

Our setup:

A number of Dell servers with 256GB RAM each, =E2=80=9CHBA330=E2=80=9D =
(LSI 3008) controllers and 14 10TB 7200rpm SAS drives, dual SLOG SSDs =
and dual L2ARC SSDs connected to the network via dual 10Gbps ethernet, =
serving users via SMB (Samba), NFS and SFTP. Managed via Puppet.

Every user gets their own ZFS filesystem with a refquota set -> about 20,000 ZFS filesystems per frontend server. We have around 110K users (students & staff) and around 3,000 of them active at the same time currently (around 500 per server), mostly SMB for now, but NFS is growing. LZ4 compression is enabled on all of them.
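
For illustration, creating one such per-user filesystem looks roughly like this (pool name, username and quota size are made-up examples, not our actual layout):

  # hypothetical pool/user names; sizes are examples
  zfs create -o refquota=25G -o compression=lz4 tank/home/jdoe
  zfs get refquota,compression tank/home/jdoe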

Every filesystem gets a snapshot taken every hour (accessible via Windows “previous versions”).
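
As a sketch, the hourly snapshot job plus the Samba side of “previous versions” can look something like this (snapshot naming, dataset and share names are made-up, and the shadow_copy2 settings are the generic way of wiring this up, not necessarily our exact config):

  # root crontab: snapshot every user filesystem once an hour
  0 * * * * /sbin/zfs list -H -o name -r tank/home | xargs -I {} /sbin/zfs snapshot {}@hourly-`date +\%Y\%m\%d-\%H`

  # smb.conf fragment exposing the snapshots as "previous versions"
  [homes]
      vfs objects = shadow_copy2
      shadow: snapdir = .zfs/snapshot
      shadow: format = hourly-%Y%m%d-%H
      shadow: sort = desc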

First-level backups are done via rsync to secondary servers (HPs with big SAS disk cabinets, 70 disks per cabinet), so around 100K filesystems on the biggest one right now. And snapshots on those too. No users have direct access to them.
We decided against using zfs send/recv since we wanted better “fault” isolation between the primary and secondary servers in case of ZFS corruption on the primary frontend servers. Considering the panic-causing bugs with zfs send+recv that have been reported, this was probably a good choice :-)
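
A rough sketch of one such backup pass, per user filesystem (hostnames, datasets and rsync flags are placeholders, not our exact job):

  # pull from the primary onto a matching dataset on the backup server
  rsync -aHAX --numeric-ids --delete filer01:/home/jdoe/ /backup/filer01/home/jdoe/
  # then snapshot the backup copy
  zfs snapshot backup/filer01/home/jdoe@`date +%Y%m%d-%H`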


This has caused some interesting problems…

First thing we noticed was that booting would take forever… Mounting the 20-100K filesystems _and_ enabling them to be shared via NFS is not done efficiently at all: for each filesystem, /etc/zfs/exports is re-read (a couple of times) before one line is appended to the end. Repeat 20-100,000 times… Not to mention the big kernel lock for NFS (“hold all NFS activity while we flush and reinstall all sharing information per filesystem”) being taken by mountd…
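
To make the cost concrete, the per-filesystem sharing step behaves roughly like the loop below (an illustration of the observed behaviour only - the real work happens inside libzfs and mountd, and the export line is a placeholder):

  for fs in `zfs list -H -o name`; do
      cat /etc/zfs/exports > /dev/null          # whole exports file re-read...
      echo "<mountpoint of $fs> <options>" >> /etc/zfs/exports   # ...to append one line
      kill -HUP `cat /var/run/mountd.pid`       # mountd flushes and reinstalls
  done                                          # every export, every time

With N filesystems that is O(N^2) work on the exports file alone: at 100K filesystems, on the order of 5 billion line reads plus 100K full reloads of the export list.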

Wish list item #1: A BerkeleyDB-based ’sharetab’ that replaces the horribly slow /etc/zfs/exports text file.
Wish list item #2: A reimplementation of mountd and the kernel interface so that a “diff” against the contents of the DB-based sharetab above can be fed into the kernel, instead of the brute-force way it’s done now.

(I’ve written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)


And then we have the Puppet “facter” process that does an inventory of the systems, doing things like “zfs list” (to list all filesystems and then try to upload the result to PuppetDB - and fail due to too much data) and “zfs upgrade” (to get the first line of output with the ZFS version - which has the side effect of also doing a recursive walk through all filesystems, taking something like 6 hours on the main backup server)… Solved that one with some binary patching and a wrapper script around /sbin/zfs :-)

Wish list item #3: A “zfs version” command that just prints the ZFS version, which Puppet “facter” could use instead.
Wish list item #4: A better way to disable the ZFS filesystem enumeration that facter does (we currently binary-patch /sbin/zfs into /lbin/zfs (which doesn’t exist) in the libfacter shared library)…
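
One way to combine the binary patch with the wrapper-script idea is sketched below (the /lbin/zfs path is the one from our hack, the “zfs upgrade” banner text is approximate, and the whole thing is an example rather than a recommendation):

  #!/bin/sh
  # Installed as /lbin/zfs; libfacter is binary-patched to call this
  # instead of /sbin/zfs, so the expensive subcommands can be short-circuited.
  case "$1" in
      list)
          exit 0 ;;        # don't let facter enumerate 20-100K filesystems
      upgrade)
          # print only a version banner, skip the recursive walk
          echo "This system is currently running ZFS filesystem version 5."
          exit 0 ;;
      *)
          exec /sbin/zfs "$@" ;;
  esac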

Another thing we noticed was that when the rsync backups were running, lots of things would start to go really slow. Things one at first didn’t think would be affected - like everything that stat()s /etc/nsswitch.conf (or other files/devices) on the (separate) root disks (mirrored ZFS). Access times would inflate 100x or more. Turns out the rsync processes that stat() all the users’ files would fill up the kernel ’vnode’ table, and older entries would get flushed out - like the vnode entry for /etc/nsswitch.conf (and basically everything else). So we increased the kern.maxvnodes setting a number of times… Currently we’re running with a vnode table size of 20 million (but we probably should increase it even more). It uses a number of GBs of RAM but for us it’s worth it.
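
For reference, the knob in question is just a sysctl (the value below is the one we run with right now):

  # live change
  sysctl kern.maxvnodes=20000000
  # persistent across reboots, in /etc/sysctl.conf
  kern.maxvnodes=20000000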

Wish list item #5: Separate vnode tables per ZFS pool, or some way to “protect” “important” vnodes...


Another thing we noticed recently was that the ARC - which had a default cap of about 251GB - would use all of that, and then we would have a three-way fight over memory between the ARC, the vnode table and the 500+ ’smbd’ user processes (and all the others), each of which uses quite a lot of RAM - causing the kernel pagedaemon to work a lot, which in turn caused the Samba smbd & winbindd daemons to become really slow (causing multi-second login times for users).

Solved that one by capping the ARC to 128GB… (Can probably increase it a bit, but 128GB seems to be plenty for us right now.) Now the ARC gets 50% of the machine and everything else has plenty of RAM to play with.
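
The cap itself is a single loader tunable, with the value given in bytes (128GB = 137438953472):

  # /boot/loader.conf
  vfs.zfs.arc_max="137438953472"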

Dunno what to wish for here though :-)



> 4. On FreeBSD I prefer GELI on the base partition to which ZFS is then
> pointed as a pool member for encryption at the present time.  It's
> proven, uses AES hardware acceleration on modern processors and works.
>
> Read the documentation carefully and understand your options for keying
> (e.g. password only, password + key file, etc) and how you will manage
> the security of the key component(s).

Yep, we use GELI+ZFS too on one server. Works fine!
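
For anyone curious, the basic GELI-under-ZFS pattern is something like this (device name, key handling and pool layout are placeholders - read geli(8) before copying anything):

  geli init -e AES-XTS -l 256 -s 4096 -K /root/da2.key /dev/da2
  geli attach -k /root/da2.key /dev/da2
  zpool create tank /dev/da2.eli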

- Peter



