From: Rick Macklem <rmacklem@uoguelph.ca>
To: Peter Eriksson, freebsd-net@freebsd.org
Cc: "Rodney W. Grimes", Mark Johnston, patrykkotlowski@gmail.com
Subject: Re: how to fix an interesting issue with mountd?
Date: Wed, 3 Jun 2020 00:50:09 +0000
In-Reply-To: <7E0A7D8E-72E6-4D32-B2A7-C4CE4127DDEF@lysator.liu.se>
Peter Eriksson wrote:
>I once reported that we had a server with many thousands (typically 23000
>or so per server) of ZFS filesystems (and 300+ snapshots per filesystem)
>where mountd was 100% busy reading and updating the kernel (and, while
>doing that, holding the NFS lock for a very long time) every hour, when we
>took snapshots of all the filesystems - the code in the zfs commands seems
>to send a lot of SIGHUPs to mountd...
>
>(Causing NFS users to complain quite a bit.)
>
>I have also seen the effect that, when there are a lot of updates to
>filesystems, some exports can get "missed" if mountd is bombarded with
>multiple SIGHUPs - but with the new incremental update code in mountd this
>window (for SIGHUPs to get lost) is much smaller. (I now also have a
>Nagios check that verifies that all exports in /etc/zfs/exports are also
>visible in the kernel.)
I just put a patch up in PR#246597, which you might want to try.

>But while we had this problem I also investigated going to a DB based
>exports "file" in order to make the code in the "zfs" commands that reads
>and updates /etc/zfs/exports a lot faster too.
>As Rick says, there is room for _huge_ improvements there.
>
>For every change of "sharenfs" per filesystem it would open, read, and
>parse /etc/zfs/exports, line by line, *two* times and then rewrite the
>whole file. Now imagine doing that recursively for 23000 filesystems...
>My change to the zfs code simply opened a DB file and just did a "put" of
>a record for the filesystem (and then sent mountd a SIGHUP).
Just to clarify: if someone else can put Peter's patch in ZFS, I am willing
to put the required changes in mountd.

>
>(And even worse - when doing the boot-time "zfs share -a", for each
>filesystem it would open /etc/zfs/exports, read it line by line to check
>that the filesystem isn't already in the file, then open a tmp file, write
>out all the old filesystems plus the new one, rename it to
>/etc/zfs/exports, send a SIGHUP, and then go on to the next one. Repeat.
>Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it
>tests the boot disk I/O a bit :-)
>
>
>I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit
>regarding this, and I think for Linux they are going the route of directly
>updating the kernel instead of going via some external updater (like
>mountd).
The problem here is NFSv3, where something (currently mountd) needs to know
about this stuff, so it can do the Mount protocol (used for NFSv3 mounting
and done with Mount RPCs, not NFS ones).

>That probably would be an even better way (for ZFS), but a DB based
>exports file might be useful anyway.
>It's a very simple change (especially in mountd - it just opens the DB
>file and reads the records sequentially instead of the text file).
I think what you have, which puts the info in a db file and then SIGHUPs
mountd, is a good start.
Again, if someone else can get this into ZFS, I can put the bits in mountd.

Thanks for posting this, rick
ps: Do you happen to know how long a reload of exports in mountd is
currently taking, with the patches done to it last year?

- Peter

On 2 Jun 2020, at 06:30, Rick Macklem wrote:

Rodney Grimes wrote:
Hi,

I'm posting this one to freebsd-net@ since it seems vaguely similar
to a network congestion problem and thought that network types
might have some ideas w.r.t. fixing it?

PR#246597 - Reports a problem (which, if I understand it, is) where a
SIGHUP is posted to mountd and then another SIGHUP is posted to mountd
while it is reloading exports, and the exports are not reloaded again.
--> The simple patch in the PR fixes the above problem, but I think it
    will aggravate another one.
For some NFS servers, it can take minutes to reload the exports file(s).
(I believe Peter Eriksson has a server with 80000+ file systems exported.)
r348590 reduced the time taken, but it is still minutes, if I recall
correctly.
Actually, my recollection w.r.t.
the times was way off.
I just looked at the old PR#237860 and, without r348590, it was 16 seconds
(aka seconds, not minutes), and with r348590 that went down to a fraction
of a second (there was no exact number in the PR, but I noted milliseconds
in the commit log entry).

I still think there is a risk of doing the reloads repeatedly.

--> If you apply the patch in the PR and SIGHUPs are posted to mountd as
    often as it takes to reload the exports file(s), it will simply reload
    the exports file(s) over and over and over again, instead of
    processing Mount RPC requests.

So, finally to the interesting part...
- It seems that the code needs to be changed so that it won't "forget"
  SIGHUP(s) posted to it, but it should not reload the exports file(s) too
  frequently.
--> My thoughts are something like:
    - Note that SIGHUP(s) were posted while reloading the exports file(s)
      and do the reload again, after some minimum delay.
      --> The minimum delay might only need to be 1 second, to allow some
          RPCs to be processed before the reload happens again.
      Or
      --> The minimum delay could be some fraction of how long a reload
          takes. (The code could time the reload and use that to calculate
          how long to delay before doing the reload again.)

Any ideas or suggestions?
rick
ps: I've actually known about this for some time, but since I didn't have
a good solution...

Build a system that allows adding and removing entries from the
in-mountd exports data, so that you do not have to do a full
reload every time one is added or removed?

Build a system that uses 2 exports tables, the active one and the
one that is being loaded, so that you can process RPCs and reloads
at the same time.
Well, r348590 modified mountd so that it built a new set of linked-list
structures from the modified exports file(s) and then compared them with
the old ones, only doing updates to the kernel exports for changes.

It still processes the entire exports file each time, to produce the
in-mountd memory linked lists (using hash tables and a binary tree).

Peter did send me a patch to use a db frontend, but he felt the only
performance improvements would be related to ZFS.
Since ZFS is something I avoid like the plague, I never pursued it.
(If anyone willing to do ZFS stuff wants to pursue this,
just email me and I can send you the patch.)
Here's a snippet of what he said about it:
> It looks like a very simple patch to create, and even though it wouldn't
> really improve the speed for the work that mountd does, it would make
> possible really drastic speed improvements in the zfs commands. They
> (zfs commands) currently read through the text-based exports file
> multiple times when you do work with zfs filesystems
> (mounting/sharing/changing share options etc).
> Using a db based exports file for the zfs exports (b-tree based,
> probably) would allow the zfs code to be much faster.

At this point, I am just interested in fixing the problem in the PR, rick

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"