From: Peter Eriksson <peter@ifm.liu.se>
Subject: Re: Suggestion for hardware for ZFS fileserver
Date: Fri, 21 Dec 2018 19:53:58 +0100
To: freebsd-fs@freebsd.org

Just to add a few pointers based on our experience from our FreeBSD-based filers.

> 1. RAM *must* be ECC. No wiggle room here. Undetected RAM corruption

100% agreed. ECC is definitely the way to go!

> 2. More RAM is better, up to a point, in that cache is faster than disk
> I/O in all cases as operations are avoided. HOWEVER, there are
> pathologies in both the FreeBSD VFS and the ARC when considered as a
> group. I and others have tried to eliminate the pathological behavior
> under certain workloads (and some of us have had long-running debates on
> same.) Therefore, your workload must be considered -- simply saying
> "more is better" may not be correct for your particular circumstances.

Yes, our servers all have 256GB RAM and we've been having some performance issues every now and then that have forced us to adjust a couple of kernel settings in order to minimise the impact on the users. I'm sure we'll run into more in the future.

Our setup:

A number of Dell servers with 256GB RAM each, "HBA330" (LSI 3008) controllers and 14 x 10TB 7200rpm SAS drives, dual SLOG SSDs and dual L2ARC SSDs, connected to the network via dual 10Gbps ethernet and serving users via SMB (Samba), NFS and SFTP. Managed via Puppet.

Every user gets their own ZFS filesystem with a refquota set, which works out to about 20,000 ZFS filesystems per frontend server. We have around 110K users (students & staff), of which around 3,000 are active at the same time (around 500 per server) currently, mostly over SMB for now, but NFS is growing. LZ4 compression is enabled on all of them.

Every filesystem gets a snapshot taken every hour (accessible via Windows "previous versions").
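For illustration, creating one of those per-user filesystems boils down to something like this (the pool/dataset names, the quota and the snapshot naming here are made-up examples, not our actual layout):

  # Per-user dataset with a refquota and LZ4 compression
  # (assumes the parent dataset tank/home already exists)
  zfs create -o refquota=25G -o compression=lz4 \
      -o mountpoint=/export/home/alice tank/home/alice

  # Hourly snapshot (run from cron); the snapshots under
  # .zfs/snapshot are what Samba can expose to Windows clients
  # as "Previous Versions" (e.g. via its shadow_copy2 VFS module)
  zfs snapshot tank/home/alice@auto-$(date +%Y-%m-%d_%H%M)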
First-level backups are done via rsync to secondary servers (HPs with big SAS disk cabinets, 70 disks per cabinet), so there are around 100K filesystems on the biggest one right now, with snapshots on those too. No users have direct access to them.

We decided against using zfs send/recv since we wanted some better "fault" isolation between the primary and secondary servers in case of ZFS corruption on the primary frontend servers. Considering the panic-causing bugs with zfs send+recv that have been reported, this was probably a good choice :-)

This has caused some interesting problems…

The first thing we noticed was that booting would take forever… Mounting the 20-100K filesystems _and_ enabling them to be shared via NFS is not done efficiently at all: for each filesystem, /etc/zfs/exports is re-read (a couple of times) before one line is appended to the end. Repeat 20,000-100,000 times… Not to mention the big kernel lock for NFS ("hold all NFS activity while we flush and reinstall all sharing information per filesystem") being taken by mountd…

Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the horribly slow /etc/zfs/exports text file.

Wish list item #2: A reimplementation of mountd and the kernel interface to allow a "diff" between the contents of the DB-based sharetab above to be fed into the kernel, instead of the brute-force way it's done now.

(I've written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)

And then we have the Puppet "facter" process that does an inventory of the systems. It does things like "zfs list" (to list all filesystems and then try to upload them to PuppetDB - which fails due to too much data) and "zfs upgrade" (to get the first line of its output, the ZFS version - which has the side effect of also doing a recursive walk through all filesystems, taking something like 6 hours on the main backup server). Solved that one with some binary patching and a wrapper script around /sbin/zfs :-)

Wish list item #3: A "zfs version" command that just prints the ZFS version, which Puppet 'facter' could use instead.

Wish list item #4: A better way to disable the ZFS filesystem enumeration that facter does (we currently binary-patch /sbin/zfs into /lbin/zfs, which doesn't exist, in the libfacter shared library)…

Another thing we noticed was that when the rsync backups were running, lots of things would start to go really slow - things one at first didn't think would be affected, like everything that stat()s /etc/nsswitch.conf (or other files/devices) on the (separate) mirrored ZFS root disks. Access times would inflate 100x or more. It turns out the rsync processes stat()ing all the user files would fill up the kernel vnode table, so older entries would get flushed out - like the vnode entry for /etc/nsswitch.conf (and basically everything else). So we increased the kern.maxvnodes setting a number of times… Currently we're running with a vnode table size of 20 million (but we probably should increase it even more). It uses a number of GBs of RAM, but for us it's worth it.

Wish list item #5: Separate vnode tables per ZFS pool, or some way to "protect" "important" vnodes...

Another thing we noticed recently was that the ARC - which had a default cap of about 251GB - would use all of that, and then we would have a three-way fight over memory between the ARC, the vnode table and the 500+ 'smbd' user processes (which use quite a lot of RAM per process), plus everything else. That caused the kernel pagedaemon to work a lot, which made the Samba smbd & winbindd daemons really slow (multi-second login times for users).

Solved that one by capping the ARC to 128GB… (We can probably increase it a bit, but 128GB seems to be plenty for us right now.) Now the ARC gets 50% of the machine and the rest has plenty of RAM to play with.

Dunno what to wish for here though :-)
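For completeness, the two knobs involved are set roughly like this (128GB ARC and 20 million vnodes are the values mentioned above; where exactly you set them - loader.conf vs. sysctl.conf - and what values make sense will depend on your system and FreeBSD version):

  # /boot/loader.conf - cap the ARC at 128GB (137438953472 bytes)
  vfs.zfs.arc_max="137438953472"

  # /etc/sysctl.conf - enlarge the vnode table to 20 million entries
  kern.maxvnodes=20000000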
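And just to illustrate the facter workaround mentioned earlier: the wrapper-script approach is basically something like the sketch below, dropped in wherever the patched facter looks for "zfs". The intercepted subcommands, paths and output text are illustrative - this is a simplified sketch, not our production script:

  #!/bin/sh
  # Pass-through wrapper for /sbin/zfs that short-circuits the
  # expensive subcommands facter runs on boxes with ~100K filesystems.
  REAL=/sbin/zfs

  case "$1" in
      upgrade)
          # Skip the recursive walk over all filesystems; just print
          # a version line (output format is approximate).
          echo "This system is currently running ZFS filesystem version 5."
          exit 0
          ;;
      list)
          # Don't enumerate every filesystem for an inventory run.
          exit 0
          ;;
      *)
          exec "$REAL" "$@"
          ;;
  esac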
> 4. On FreeBSD I prefer GELI on the base partition to which ZFS is then
> pointed as a pool member for encryption at the present time. It's
> proven, uses AES hardware acceleration on modern processors and works.
> Read the documentation carefully and understand your options for keying
> (e.g. password only, password + key file, etc) and how you will manage
> the security of the key component(s).

Yep, we use GELI+ZFS too on one server. Works fine!

- Peter