From: Peter Eriksson <peter@ifm.liu.se>
Subject: Re: Suggestion for hardware for ZFS fileserver
Date: Fri, 21 Dec 2018 19:53:58 +0100
To: freebsd-fs@freebsd.org

Just to add a few pointers based on our experience from our FreeBSD-based filers.

> 1. RAM *must* be ECC. No wiggle room here. Undetected RAM corruption

100% agreed. ECC is definitely the way to go!

> 2. More RAM is better, up to a point, in that cache is faster than disk
> I/O in all cases as operations are avoided. HOWEVER, there are
> pathologies in both the FreeBSD VFS and the ARC when considered as a
> group. I and others have tried to eliminate the pathological behavior
> under certain workloads (and some of us have had long-running debates on
> same.) Therefore, your workload must be considered -- simply saying
> "more is better" may not be correct for your particular circumstances.

Yes, our servers all have 256GB RAM and we've been having some performance issues every now and then that have forced us to adjust a couple of kernel settings in order to minimise the impact on the users. I'm sure we'll run into more in the future.

Our setup:

A number of Dell servers with 256GB RAM each, "HBA330" (LSI 3008) controllers and 14 x 10TB 7200rpm SAS drives, dual SLOG SSDs and dual L2ARC SSDs, connected to the network via dual 10Gbps ethernet and serving users via SMB (Samba), NFS and SFTP. Managed via Puppet.

Every user gets their own ZFS filesystem with a refquota set, which works out to about 20,000 ZFS filesystems per frontend server. We have around 110K users (students & staff), of which around 3,000 are active at the same time (around 500 per server) currently, mostly over SMB for now, but NFS is growing. LZ4 compression is enabled on all of them.

Every filesystem gets a snapshot taken every hour (accessible via Windows "previous versions").
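For illustration, creating one of those per-user filesystems boils down to something like this (the pool/dataset names, the quota and the snapshot naming here are made-up examples, not our actual layout):

  # Per-user dataset with a refquota and LZ4 compression
  # (assumes the parent dataset tank/home already exists)
  zfs create -o refquota=25G -o compression=lz4 \
      -o mountpoint=/export/home/alice tank/home/alice

  # Hourly snapshot (run from cron); the snapshots under
  # .zfs/snapshot are what Samba can expose to Windows clients
  # as "Previous Versions" (e.g. via its shadow_copy2 VFS module)
  zfs snapshot tank/home/alice@auto-$(date +%Y-%m-%d_%H%M)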
First-level backups are done via rsync to secondary servers (HPs with big SAS disk cabinets, 70 disks per cabinet), so there are around 100K filesystems on the biggest one right now, with snapshots on those too. No users have direct access to them.

We decided against using zfs send/recv since we wanted some better "fault" isolation between the primary and secondary servers in case of ZFS corruption on the primary frontend servers. Considering the panic-causing bugs with zfs send+recv that have been reported, this was probably a good choice :-)

This has caused some interesting problems…

The first thing we noticed was that booting would take forever… Mounting the 20-100K filesystems _and_ enabling them to be shared via NFS is not done efficiently at all: for each filesystem, /etc/zfs/exports is re-read (a couple of times) before one line is appended to the end. Repeat 20,000-100,000 times… Not to mention the big kernel lock for NFS ("hold all NFS activity while we flush and reinstall all sharing information per filesystem") being taken by mountd…

Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the horribly slow /etc/zfs/exports text file.

Wish list item #2: A reimplementation of mountd and the kernel interface to allow a "diff" between the contents of the DB-based sharetab above to be fed into the kernel, instead of the brute-force way it's done now.

(I've written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)

And then we have the Puppet "facter" process that does an inventory of the systems. It does things like "zfs list" (to list all filesystems and then try to upload them to PuppetDB - which fails due to too much data) and "zfs upgrade" (to get the first line of its output, the ZFS version - which has the side effect of also doing a recursive walk through all filesystems, taking something like 6 hours on the main backup server). Solved that one with some binary patching and a wrapper script around /sbin/zfs :-)

Wish list item #3: A "zfs version" command that just prints the ZFS version, which Puppet 'facter' could use instead.

Wish list item #4: A better way to disable the ZFS filesystem enumeration that facter does (we currently binary-patch /sbin/zfs into /lbin/zfs, which doesn't exist, in the libfacter shared library)…

Another thing we noticed was that when the rsync backups were running, lots of things would start to go really slow - things one at first didn't think would be affected, like everything that stat()s /etc/nsswitch.conf (or other files/devices) on the (separate) mirrored ZFS root disks. Access times would inflate 100x or more. It turns out the rsync processes stat()ing all the user files would fill up the kernel vnode table, so older entries would get flushed out - like the vnode entry for /etc/nsswitch.conf (and basically everything else). So we increased the kern.maxvnodes setting a number of times… Currently we're running with a vnode table size of 20 million (but we probably should increase it even more). It uses a number of GBs of RAM, but for us it's worth it.

Wish list item #5: Separate vnode tables per ZFS pool, or some way to "protect" "important" vnodes...

Another thing we noticed recently was that the ARC - which had a default cap of about 251GB - would use all of that, and then we would have a three-way fight over memory between the ARC, the vnode table and the 500+ 'smbd' user processes (which use quite a lot of RAM per process), plus everything else. That caused the kernel pagedaemon to work a lot, which made the Samba smbd & winbindd daemons really slow (multi-second login times for users).

Solved that one by capping the ARC to 128GB… (We can probably increase it a bit, but 128GB seems to be plenty for us right now.) Now the ARC gets 50% of the machine and the rest has plenty of RAM to play with.

Dunno what to wish for here though :-)
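For completeness, the two knobs involved are set roughly like this (128GB ARC and 20 million vnodes are the values mentioned above; where exactly you set them - loader.conf vs. sysctl.conf - and what values make sense will depend on your system and FreeBSD version):

  # /boot/loader.conf - cap the ARC at 128GB (137438953472 bytes)
  vfs.zfs.arc_max="137438953472"

  # /etc/sysctl.conf - enlarge the vnode table to 20 million entries
  kern.maxvnodes=20000000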
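And just to illustrate the facter workaround mentioned earlier: the wrapper-script approach is basically something like the sketch below, dropped in wherever the patched facter looks for "zfs". The intercepted subcommands, paths and output text are illustrative - this is a simplified sketch, not our production script:

  #!/bin/sh
  # Pass-through wrapper for /sbin/zfs that short-circuits the
  # expensive subcommands facter runs on boxes with ~100K filesystems.
  REAL=/sbin/zfs

  case "$1" in
      upgrade)
          # Skip the recursive walk over all filesystems; just print
          # a version line (output format is approximate).
          echo "This system is currently running ZFS filesystem version 5."
          exit 0
          ;;
      list)
          # Don't enumerate every filesystem for an inventory run.
          exit 0
          ;;
      *)
          exec "$REAL" "$@"
          ;;
  esac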
> 4. On FreeBSD I prefer GELI on the base partition to which ZFS is then
> pointed as a pool member for encryption at the present time. It's
> proven, uses AES hardware acceleration on modern processors and works.
> Read the documentation carefully and understand your options for keying
> (e.g. password only, password + key file, etc) and how you will manage
> the security of the key component(s).

Yep, we use GELI+ZFS too on one server. Works fine!

- Peter