From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: how to fix an interesting issue with mountd?
Date: Tue, 2 Jun 2020 15:12:40 +0200
To: freebsd-net@freebsd.org
Cc: "Rodney W. Grimes", Mark Johnston, "patrykkotlowski@gmail.com", Rick Macklem
Message-Id: <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>

I once reported that we had a server with many thousands (typically 23000 or so per server) of ZFS filesystems (and 300+ snapshots per filesystem) where mountd was 100% busy reading and updating the kernel (and, while doing that, holding the NFS lock for a very long time) every hour (when we took snapshots of all
the filesystems - the code in the zfs commands sends a lot of SIGHUPs to mountd, it seems)… (Causing NFS users to complain quite a bit.)

I have also seen the effect that when there are a lot of updates to filesystems, some exports can get “missed” if mountd is bombarded with multiple SIGHUPs - but with the new incremental update code in mountd this window (for SIGHUPs to get lost) is much smaller (and I now also have a Nagios check that verifies that all exports in /etc/zfs/exports are also visible in the kernel).

But while we had this problem I also investigated moving to a DB-based exports “file” in order to make the code in the “zfs” commands that reads and updates /etc/zfs/exports a lot faster too. As Rick says, there is room for _huge_ improvements there.

For every change of “sharenfs” per filesystem it would open, read and parse, line by line, /etc/zfs/exports *two* times and then rewrite the whole file. Now imagine doing that recursively for 23000 filesystems… My change to the zfs code simply opened a DB file and did a “put” of a record for the filesystem (and then sent mountd a SIGHUP).

(And even worse - when doing the boot-time “zfs share -a”, for each filesystem it would open /etc/zfs/exports, read it line by line to check that the filesystem isn’t already in the file, then open a temp file, write out all the old filesystems plus the new one, rename it to /etc/zfs/exports, and send a SIGHUP) and then go on to the next one. Repeat. Pretty fast for 1-10 filesystems, not so fast for 20000+ ones… And it tests the boot disk I/O a bit :-)

I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit in this regard, and I think for Linux they are going the route of updating the kernel directly instead of going via some external updater (like mountd).
Updating the kernel directly would probably be an even better way (for ZFS), but a DB-based exports file might be useful anyway. It’s a very simple change (especially in mountd - it just opens the DB file and reads the records sequentially instead of the text file).

- Peter

> On 2 Jun 2020, at 06:30, Rick Macklem wrote:
>
> Rodney Grimes wrote:
>>> Hi,
>>>
>>> I'm posting this one to freebsd-net@ since it seems vaguely similar
>>> to a network congestion problem and thought that network types
>>> might have some ideas w.r.t. fixing it?
>>>
>>> PR#246597 - Reports a problem (which, if I understand it, is) where a sighup
>>> is posted to mountd and then another sighup is posted to mountd while
>>> it is reloading exports, and the exports are not reloaded again.
>>> --> The simple patch in the PR fixes the above problem, but I think it will
>>> aggravate another one.
>>> For some NFS servers, it can take minutes to reload the exports file(s).
>>> (I believe Peter Eriksson has a server with 80000+ file systems exported.)
>>> r348590 reduced the time taken, but it is still minutes, if I recall correctly.
> Actually, my recollection w.r.t. the times was way off.
> I just looked at the old PR#237860 and, without r348590, it was 16 seconds
> (aka seconds, not minutes) and with r348590 that went down to a fraction
> of a second (there was no exact number in the PR, but I noted milliseconds in
> the commit log entry).
>
> I still think there is a risk in doing the reloads repeatedly.
>
>>> --> If you apply the patch in the PR and sighups are posted to mountd as
>>> often as it takes to reload the exports file(s), it will simply reload the
>>> exports file(s) over and over and over again, instead of processing
>>> Mount RPC requests.
>>>
>>> So, finally to the interesting part...
>>> - It seems that the code needs to be changed so that it won't "forget"
>>> sighup(s) posted to it, but it should not reload the exports file(s) too frequently.
>>> --> My thoughts are something like:
>>> - Note that sighup(s) were posted while reloading the exports file(s) and
>>> do the reload again, after some minimum delay.
>>> --> The minimum delay might only need to be 1 second, to allow some
>>> RPCs to be processed before the reload happens again.
>>> Or
>>> --> The minimum delay could be some fraction of how long a reload takes.
>>> (The code could time the reload and use that to calculate how long to
>>> delay before doing the reload again.)
>>>
>>> Any ideas or suggestions? rick
>>> ps: I've actually known about this for some time, but since I didn't have a good
>>> solution...
>>
>> Build a system that allows adding and removing entries from the
>> in-mountd exports data so that you do not have to do a full
>> reload every time one is added or removed?
>>
>> Build a system that uses 2 exports tables, the active one and the
>> one that is being loaded, so that you can process RPCs and reloads
>> at the same time.
> Well, r348590 modified mountd so that it built a new set of linked-list
> structures from the modified exports file(s) and then compared them with
> the old ones, only doing updates to the kernel exports for changes.
>
> It still processes the entire exports file each time, to produce the
> in-mountd-memory linked lists (using hash tables and a binary tree).
>
> Peter did send me a patch to use a db frontend, but he felt the only
> performance improvements would be related to ZFS.
> Since ZFS is something I avoid like the plague, I never pursued it.
> (If anyone willing to do ZFS stuff wants to pursue this,
> just email me and I can send you the patch.)
> Here's a snippet of what he said about it:
>> It looks like a very simple patch to create, and even though it wouldn’t really
>> improve the speed for the work that mountd does, it would make possible really
>> drastic speed improvements in the zfs commands.
>> They (the zfs commands) currently read through the text-based exports file
>> multiple times when you do work with zfs filesystems (mounting/sharing/changing
>> share options etc). Using a db-based exports file for the zfs exports
>> (b-tree based, probably) would allow the zfs code to be much faster.
>
> At this point, I am just interested in fixing the problem in the PR, rick
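For what it's worth, the "don't forget sighups, but rate-limit the reloads" idea Rick describes above could be modeled like this. mountd itself is C; this is just an illustrative Python sketch with made-up names: the signal handler only records that a reload was requested, and the RPC service loop performs at most one reload per min_delay, so a burst of SIGHUPs collapses into one pending request that is never lost.

```python
# Illustrative model (not mountd's actual code) of coalescing SIGHUPs:
# the handler only sets a "pending" flag; the service loop reloads at
# most once per min_delay seconds, so requests are never forgotten but
# reloads cannot starve Mount RPC processing.
import time

class ReloadGate:
    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay   # e.g. 1 second, or a fraction of
        self.clock = clock           # the measured reload time
        self.pending = False         # set by the SIGHUP handler
        self.last = float("-inf")    # time of the last reload

    def request(self):
        # Called from the SIGHUP handler: just remember the request.
        self.pending = True

    def should_reload(self):
        # Polled from the RPC service loop between requests.
        if self.pending and self.clock() - self.last >= self.min_delay:
            self.pending = False
            self.last = self.clock()
            return True
        return False
```

The variant where the delay is a fraction of the reload time just means timing each reload and storing, say, half of that into min_delay afterwards; either way, RPCs keep being served between reloads.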