Date: Fri, 21 Dec 2018 19:53:58 +0100
From: Peter Eriksson <peter@ifm.liu.se>
To: freebsd-fs@freebsd.org
Subject: Re: Suggestion for hardware for ZFS fileserver
Message-ID: <3160F105-85C1-4CB4-AAD5-D16CF5D6143D@ifm.liu.se>
In-Reply-To: <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net>
References: <CAEW%2BogZnWC07OCSuzO7E4TeYGr1E9BARKSKEh9ELCL9Zc4YY3w@mail.gmail.com>
 <C839431D-628C-4C73-8285-2360FE6FFE88@gmail.com>
 <CAEW%2BogYWKPL5jLW2H_UWEsCOiz=8fzFcSJ9S5k8k7FXMQjywsw@mail.gmail.com>
 <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net>
Just to add a few pointers based on our experience from our FreeBSD-based filers.

> 1. RAM *must* be ECC. No wiggle room here. Undetected RAM corruption

100% agreed. ECC is definitely the way to go!

> 2. More RAM is better, up to a point, in that cache is faster than disk
> I/O in all cases as operations are avoided. HOWEVER, there are
> pathologies in both the FreeBSD VFS and the ARC when considered as a
> group. I and others have tried to eliminate the pathological behavior
> under certain workloads (and some of us have had long-running debates on
> same.) Therefore, your workload must be considered -- simply saying
> "more is better" may not be correct for your particular circumstances.

Yes, our servers all have 256GB RAM and we've been having performance issues every now and then that have forced us to adjust a couple of kernel settings in order to minimise the impact on the users. I'm sure we'll run into more in the future.

Our setup:

A number of Dell servers with 256GB RAM each, "HBA330" (LSI 3008) controllers and 14 10TB 7200rpm SAS drives, dual SLOG SSDs and dual L2ARC SSDs, connected to the network via dual 10Gbps ethernet and serving users via SMB (Samba), NFS and SFTP. Managed via Puppet.

Every user gets their own ZFS filesystem with a refquota set, which works out to about 20,000 ZFS filesystems per frontend server. We have around 110K users (students & staff), of whom around 3,000 are active at the same time currently (around 500 per server), mostly over SMB for now, but NFS is growing. LZ4 compression is enabled on all of them.

Every filesystem gets a snapshot taken every hour (accessible via Windows "Previous Versions").

First-level backups are done via rsync to secondary servers (HPs with big SAS disk cabinets, 70 disks per cabinet), so around 100K filesystems on the biggest one right now. Snapshots are kept on those too. No users have direct access to them.

We decided against using zfs send/recv since we wanted better "fault" isolation between the primary and secondary servers in case of ZFS corruption on the primary frontend servers. Considering the panic-causing bugs with zfs send+recv that have been reported, this was probably a good choice :-)

This has caused some interesting problems...

The first thing we noticed was that booting would take forever. Mounting the 20-100K filesystems _and_ enabling them to be shared via NFS is not done efficiently at all: for each filesystem, /etc/zfs/exports is re-read (a couple of times) before one line is appended to the end. Repeat 20,000-100,000 times... Not to mention the big kernel lock for NFS ("hold all NFS activity while we flush and reinstall all sharing information per filesystem") taken by mountd...

Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the horribly slow /etc/zfs/exports text file.

Wish list item #2: A reimplementation of mountd and the kernel interface to allow a "diff" between the contents of the DB-based sharetab above to be fed into the kernel, instead of the brute-force way it's done now.

(I've written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet, though. I have looked at item #2 a bit too, but not done anything about it.)
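(To get a feel for the exports problem, here is a rough stand-in -- not the actual ZFS or mountd code -- for what "re-read the whole file, then append one line, once per filesystem" does once you have tens of thousands of filesystems. The file name, the count and the export options below are made up for the demo:)

    # Each iteration re-reads the growing file before appending one line,
    # so the total amount of work grows quadratically with the number of
    # filesystems. Purely illustrative.
    /usr/bin/time sh -c '
        f=/tmp/exports.demo
        : > "$f"
        i=1
        while [ "$i" -le 20000 ]; do
            wc -l < "$f" > /dev/null                # stand-in for the re-read
            echo "/export/home/user$i -maproot=root" >> "$f"
            i=$((i + 1))
        done
    '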
And then we have the Puppet "facter" process that does an inventory of the systems. It does things like "zfs list" (to list all filesystems and then try to upload the result to the PuppetDB, which fails due to too much data) and "zfs upgrade" (to get the first line of output with the ZFS version, which has the side effect of also doing a recursive walk through all filesystems, taking something like 6 hours on the main backup server). Solved that one with some binary patching and a wrapper script around /sbin/zfs :-)

Wish list item #3: A "zfs version" command that just prints the ZFS version, which Puppet 'facter' could use instead.

Wish list item #4: A better way to disable the ZFS filesystem enumeration that facter does. (We currently binary-patch the string /sbin/zfs into /lbin/zfs, which doesn't exist, in the libfacter shared library...)

Another thing we noticed was that when the rsync backups were running, lots of things would start to go really slow -- things one at first wouldn't expect to be affected, like everything that stat:s /etc/nsswitch.conf (or other files/devices) on the separate, mirrored ZFS root disks. Access times would inflate 100x or more. It turns out the rsync processes, which stat() all the users' files, would fill the kernel 'vnode' table, and older entries would get flushed out -- like the vnode entry for /etc/nsswitch.conf (and basically everything else). So we increased the kern.maxvnodes setting a number of times. Currently we're running with a vnode table size of 20 million (but we probably should increase it even more). It uses a number of GBs of RAM, but for us it's worth it.

Wish list item #5: Separate vnode tables per ZFS pool, or some way to "protect" "important" vnodes...

Another thing we noticed recently was that the ARC -- which had a default cap of about 251GB -- would use all of that, and then we would have a three-way fight for memory between the ARC, the vnode table and the 500+ 'smbd' user processes (which use quite a lot of RAM per process), plus everything else. That caused the kernel pagedaemon to work a lot, which made the Samba smbd & winbindd daemons really slow (multi-second login times for users).

Solved that one by capping the ARC to 128GB... (We can probably increase it a bit, but 128GB seems to be plenty for us right now.) Now the ARC gets 50% of the machine and the rest has plenty of RAM to play with.

Dunno what to wish for here though :-)
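(For reference, a rough sketch of the two knobs mentioned above on a stock FreeBSD box. The values are just what we landed on for 256GB machines, not a recommendation -- pick your own:)

    # The ARC cap is set as a loader tunable in /boot/loader.conf (newer
    # FreeBSD also lets you change it at runtime via sysctl):
    #   vfs.zfs.arc_max="137438953472"    # 128 GiB
    #
    # kern.maxvnodes can be raised on a running system (put it in
    # /etc/sysctl.conf to make it persistent across reboots):
    sysctl kern.maxvnodes=20000000
    # vfs.numvnodes shows how much of the vnode table is currently in use:
    sysctl vfs.numvnodes kern.maxvnodes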
> 4. On FreeBSD I prefer GELI on the base partition to which ZFS is then
> pointed as a pool member for encryption at the present time. It's
> proven, uses AES hardware acceleration on modern processors and works.
> Read the documentation carefully and understand your options for keying
> (e.g. password only, password + key file, etc) and how you will manage
> the security of the key component(s).

Yep, we use GELI+ZFS too on one server. Works fine!

- Peter