From: Rick Macklem
To: Peter Eriksson, "freebsd-fs@freebsd.org"
Subject: Re: Suggestion for hardware for ZFS fileserver
Date: Fri, 28 Dec 2018 00:20:08 +0000

I wrote:
>Peter Eriksson wrote:
>[good stuff snipped]
>>This has caused some interesting problems…
>>
>>First thing we noticed was that booting would take forever… Mounting the
>>20-100k filesystems _and_ enabling them to be shared via NFS is not done
>>efficiently at all (for each filesystem it re-reads /etc/zfs/exports
>>(a couple of times) before appending one line to the end. Repeat 20-100,000
>>times… Not to mention the big kernel lock for NFS “hold all NFS activity
>>while we flush and reinstall all sharing information per filesystem” being
>>done by mountd…
>Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
>file systems would have been a large server. Scaling to 10,000 or more file
>systems wasn't even conceivable back then.
>Wish list item #1: A BerkeleyDB-based ‘sharetab’ that replaces the horribly
>slow /etc/zfs/exports text file.
>Wish list item #2: A reimplementation of mountd and the kernel interface to
>allow a “diff” between the contents of the DB-based sharetab above to be
>input into the kernel instead of the brute-force way it’s done now.
>The parser in mountd for /etc/exports is already an ugly beast and I think
>implementing a "diff" version will be difficult, especially figuring out
>what needs to be deleted.
>
>I do have a couple of questions related to this:
>1 - Would your case work if there was an "add these lines to /etc/exports"?
>    (Basically adding entries for file systems, but not trying to delete
>    anything previously exported. I am not a ZFS guy, but I think ZFS just
>    generates another exports file and then gets mountd to export
>    everything again.)
>2 - Are all (or maybe most) of these ZFS file systems exported with the
>    same arguments?
>    - Here I am thinking that a "default-for-all-ZFS-filesystems" line
>      could be put in /etc/exports that would apply to all ZFS file systems
>      not exported by explicit lines in the exports file(s).
>      This would be fairly easy to implement and would avoid trying to
>      handle 1000s of entries.
>
>In particular, #2 above could be easily implemented on top of what is
>already there, using a new type of line in /etc/exports and handling that
>as a special case by the NFS server code, when no specific export for the
>file system to the client is found.
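
To illustrate the lookup order that #2 implies, here is a rough sketch in
Python. It is only a model of the idea, not code from mountd or the kernel:
the names ExportTable, Export, zfs_default and lookup are all made up for
this example, the per-client matching is left out, and the option strings
are just carried around as opaque text.

```python
# Hypothetical sketch: none of these names exist in mountd or the kernel.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Export:
    options: str  # export options, kept as an opaque string here


class ExportTable:
    """Exports keyed by mount point, plus one optional ZFS-wide default."""

    def __init__(self, zfs_default: Optional[Export] = None) -> None:
        self.explicit: Dict[str, Export] = {}  # mount point -> Export
        self.zfs_default = zfs_default

    def add(self, mount_point: str, export: Export) -> None:
        self.explicit[mount_point] = export

    def lookup(self, mount_point: str, fstype: str) -> Optional[Export]:
        # An explicit line in the exports file always wins.
        found = self.explicit.get(mount_point)
        if found is not None:
            return found
        # Otherwise fall back to the single default, but only for ZFS.
        if fstype == "zfs":
            return self.zfs_default
        return None


table = ExportTable(zfs_default=Export("-ro"))
table.add("/export/special", Export("-maproot=root"))
```

The table is keyed by mount point only, so one default for every ZFS file
system is a single extra check, with no per-file-system entries to load.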

Unfortunately, it doesn't sound like #2 above would be useful for Peter.
Although it is easy to implement a single default export for all ZFS file
systems not already exported, it would not be easy to say "export all file
systems below /foo/bar this way", since the kernel code basically doesn't
know the directory structure. It has vnodes for file objects and mount
points to work with. (The kernel exports hang off of the mount points.)

>>(I’ve written some code that implements item #1 above and it helps quite a
>>bit. Nothing near production quality yet though. I have looked at item #2
>>a bit too but not done anything about it.)
Btw, this "item #2" is not the #2 I am referring to above.

[more good stuff snipped]

rick