From owner-freebsd-fs@freebsd.org Sun Dec 23 01:52:01 2018
From: Peter Eriksson <peter@ifm.liu.se>
To: freebsd-fs@freebsd.org
Subject: Re: Suggestion for hardware for ZFS fileserver
Date: Sun, 23 Dec 2018 02:51:02 +0100
Message-Id: <3F3EC02F-B969-43E3-B9B5-342504ED0962@ifm.liu.se>
References: <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net> <3160F105-85C1-4CB4-AAD5-D16CF5D6143D@ifm.liu.se>

Can't really give you any generic recommendations, but on our Dell R730xd and R740xd servers we use the Dell HBA330 SAS HBA card, also known as "Dell Storage Controller 12GB-SASHBA", which uses the "mpr" device driver. This is an LSI3008-based controller and works really well. Only internal drives on those Dell servers though (the 730xd and 740xd). Beware that this is not the same as the "H330" RAID controller that Dell normally sells you.
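A quick way to double-check after boot that the card really attached via the "mpr" driver (the unit number 0 below is just an example):

  # sysctl dev.mpr.0.%desc
  # dmesg | grep '^mpr'

If the controller shows up under the mfi/mrsas drivers instead, you have got the RAID variant rather than the plain HBA.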
We had to do a "special" in order to get the 10TB drives with 4K sectors together with the HBA330 controller, since at the time we bought them Dell would only sell us the 10TB drives together with the H330 controller. So we bought the HBA330s separately and swapped them ourselves… And then we had to do a low-level reformat of all the disks, since Dell by default delivers them formatted with a nonstandard sector size (4160 bytes I think, or perhaps 4112) and "Protection Information" enabled (used and understood by the H330 controller, but not by FreeBSD when using HBAs). But that's easy to fix (it just takes an hour or so per drive):

  # sg_format --size=4096 --fmtpinfo=0 /dev/da0

On our HP servers we use the HP Smart HBA H241 controller in HBA mode (set via the BIOS configuration page) connected to external HP D6030 SAS shelves (70 disks per shelf). This is an HP-special controller that uses the "ciss" driver. Also works fine.

- Peter


> On 22 Dec 2018, at 15:49, Sami Halabi wrote:
> 
> Hi,
> 
> What SAS HBA card do you recommend for 16/24 internal ports and 2 external that is
> recognized and works well with FreeBSD ZFS?
> Sami
> 
> On Sat, 22 Dec 2018, 2:48, Peter Eriksson wrote:
> 
> 
> > On 22 Dec 2018, at 00:49, Rick Macklem wrote:
> > 
> > Peter Eriksson wrote:
> > [good stuff snipped]
> >> This has caused some interesting problems…
> >> 
> >> First thing we noticed was that booting would take forever… Mounting the 20-100k
> >> filesystems _and_ enabling them to be shared via NFS is not done efficiently at all
> >> (for each filesystem it re-reads /etc/zfs/exports (a couple of times) before appending
> >> one line to the end; repeat 20-100,000 times…). Not to mention the big kernel lock for
> >> NFS ("hold all NFS activity while we flush and reinstall all sharing information per
> >> filesystem") being done by mountd…
> > Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> > file systems would have been a large server. Scaling to 10,000 or more file
> > systems wasn't even conceivable back then.
> 
> Yeah, for a normal user with non-silly amounts of filesystems this is a non-issue.
> Anyway, it's the kind of issue that I kind of like to think about how to solve. It's fun :-)
> 
> 
> >> Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the horribly slow
> >> /etc/zfs/exports text file.
> >> Wish list item #2: A reimplementation of mountd and the kernel interface to allow
> >> a "diff" between the contents of the DB-based sharetab above to be input into the
> >> kernel, instead of the brute-force way it's done now.
> > The parser in mountd for /etc/exports is already an ugly beast and I think
> > implementing a "diff" version will be difficult, especially figuring out what needs
> > to be deleted.
> 
> Yeah, I tried to decode it (this summer) and I think I sort of got the hang of it eventually.
> 
> 
> > I do have a couple of questions related to this:
> > 1 - Would your case work if there was an "add these lines to /etc/exports"?
> >     (Basically adding entries for file systems, but not trying to delete anything
> >     previously exported.
> >     I am not a ZFS guy, but I think ZFS just generates another
> >     exports file and then gets mountd to export everything again.)
> 
> Yeah, the ZFS library that the zfs commands use just reads and updates the separate
> /etc/zfs/exports text file (and has mountd read both /etc/exports and /etc/zfs/exports).
> The problem is that what it basically does when you tell it to "zfs mount -a" (mount all
> filesystems in all zpools) is a big (pseudocode):
> 
>   For P in ZPOOLS; do
>     For Z in ZFILESYSTEMS-AND-SNAPSHOTS in $P; do
>       Mount $Z
>       If $Z has the "sharenfs" option; Then
>         Open /etc/zfs/exports
>         Read until you find a matching line and replace it with the options,
>           else if not found, append the options
>         Close /etc/zfs/exports
>         Signal mountd
>           (which then opens /etc/exports and /etc/zfs/exports and does its magic)
>       End
>     End
>   End
> 
> All wrapped up in a Solaris compatibility layer in libzfs. Actually I think it even reads
> the /etc/zfs/exports file twice for each loop iteration due to some abstractions. Btw,
> things got really "fun" when the hourly snapshots we were taking (adding 10-20k new
> snapshots every hour, and we didn't expire them fast enough in the beginning) triggered
> the code above and that code took longer than 1 hour to execute - mountd was 100% busy
> getting signalled and rereading, flushing and reinstalling exports into the kernel all
> the time, and basically never finished. Luckily we didn't have any NFS clients accessing
> the servers at that time :-)
> 
> This summer I wrote some code to instead use a Btree BerkeleyDB file, and modified the
> libzfs code and the mountd daemon to use that database for much faster lookups (no need
> to read the whole /etc/zfs/exports file all the time) and additions. Worked pretty well
> actually and wasn't that hard to add. I also wanted to add the possibility to give
> "exports" arguments "Solaris"-style, so one could say things like:
> 
>   /export/staff  vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3
> 
> But I never finished that (Solaris-style exports options) part…
> 
> We've lately been toying with putting the NFS sharing stuff into a separate "private"
> ZFS attribute (separate from the official "sharenfs" one) and having another tool read
> those instead and generate another "exports" file, so that the file can be generated in
> "one go" and mountd signalled just once after all filesystems have been mounted (rough
> sketch below). Unfortunately that would mean that they won't be shared until after all
> of them have been mounted, but we think it would take less time all in all.
> 
> We also modified the FreeBSD boot scripts so that we first mount the most important ZFS
> filesystems needed on the boot disks (not just /) and then mount (and share via NFS) the
> rest in the background, so we can log in to the machine as root early (no need for
> everything to have been mounted before we get a login prompt).
> 
> (Right now a reboot of the bigger servers takes an hour or two before all filesystems
> are mounted and exported.)
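> 
> For illustration, a minimal sketch of that "one go" idea - not the actual tool, and the
> "local:sharenfs" user property name and the exports line format are just placeholders:
> 
>   #!/bin/sh
>   # Collect the private share options for every filesystem in a single pass,
>   # rebuild the exports file once, then signal mountd once instead of per filesystem.
>   zfs list -H -t filesystem -o mountpoint,local:sharenfs | \
>       awk '$1 != "none" && $1 != "legacy" && $2 != "-" { print $1 " " $2 }' \
>       > /etc/zfs/exports.new
>   mv /etc/zfs/exports.new /etc/zfs/exports
>   # mountd re-reads /etc/exports and /etc/zfs/exports on SIGHUP
>   kill -HUP `cat /var/run/mountd.pid`
> 
> One rewrite of the file and one SIGHUP at the end instead of 20-100k of them.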
> 
> 
> > 2 - Are all (or maybe most) of these ZFS file systems exported with the same
> >     arguments?
> >     - Here I am thinking that a "default-for-all-ZFS-filesystems" line could be
> >       put in /etc/exports that would apply to all ZFS file systems not exported
> >       by explicit lines in the exports file(s).
> >     This would be fairly easy to implement and would avoid trying to handle
> >     1000s of entries.
> 
> For us most have exactly the same exports arguments. (We set options on the top-level
> filesystems (/export/staff, /export/students etc) and then all the home dirs inherit
> those.)
> 
> > In particular, #2 above could be easily implemented on top of what is already
> > there, using a new type of line in /etc/exports and handling that as a special
> > case by the NFS server code, when no specific export for the file system to the
> > client is found.
> > 
> >> (I've written some code that implements item #1 above and it helps quite a bit.
> >> Nothing near production quality yet though. I have looked at item #2 a bit too
> >> but not done anything about it.)
> > [more good stuff snipped]
> > Btw, although I put the questions here, I think a separate thread discussing
> > how to scale to 10000+ file systems might be useful. (On freebsd-fs@ or
> > freebsd-current@. The latter sometimes gets the attention of more developers.)
> 
> Yeah, probably a good idea!
> 
> - Peter
> 
> > rick
> 
> 
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"