From: Rick Macklem
To: J David
CC: "freebsd-fs@freebsd.org"
Subject: Re: Major issues with nfsv4
Date: Fri, 11 Dec 2020 23:28:30 +0000

J David wrote:
>Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
>resolve our issue. But I've narrowed down the problem to a harmful
>interaction between NFSv4 and nullfs.
I am afraid I know nothing about nullfs and jails. I suspect it will be
something related to when file descriptors in the NFS client mount
get closed.

The NFSv4 Open is a Windows Open lock and has nothing to do with
a POSIX open. Since only one of these can exist for each
tuple, the NFSv4 Close must be delayed until
all POSIX Opens on the file have been closed, including open file
descriptors inherited by child processes.

Someone else recently reported problems using nullfs and vnet jails.

>These FreeBSD NFS clients form a pool of application servers that run
>jobs for the application. A given job needs read-write access to its
>data and read-only access to the set of binaries it needs to run.
>
>The job data is horizontally partitioned across a set of directory
>trees spread over one set of NFS servers.
>A separate set of NFS servers store the read-only binary roots.
>
>The jobs are assigned to these machines by a scheduler. A job might
>take five milliseconds or five days.
>
>Historically, we have mounted the job data trees and the various
>binary roots on each application server over NFSv3. When a job
>starts, its setup binds the needed data and binaries into a jail via
>nullfs, then runs the job in the jail. This approach has worked
>perfectly for 10+ years.
Well, NFSv3 is not going away any time soon, so if you don't need
any of the additional features NFSv4 offers...

>After I switched a server to NFSv4.1 to test that recommendation, it
>started having the same load problems as NFSv4. As a test, I altered
>it to mount NFS directly in the jails for both the data and the
>binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
>started, the load and CPU usage started to fall dramatically.
Good work isolating the problem. I may try playing with NFSv4/nullfs
someday soon and see if I can break it.

>The critical problem with this approach is that privileged TCP ports
>are a finite resource. At two per job, this creates two issues.
>
>First, there's a hard limit on both simultaneous jobs per server
>inconsistent with the hardware's capabilities. Second, due to
>TIME_WAIT, it places a hard limit on job throughput. In practice,
>these limits also interfere with each other; the more simultaneous
>long jobs are running, the more impact TIME_WAIT has on short job
>throughput.
>
>While it's certainly possible to configure NFS not to require reserved
>ports, the slightest possibility of a non-root user establishing a
>session to the NFS server kills that as an option.
Personally, I've never thought the reserved port# requirement provided
any real security for most situations.
Unless you set "vfs.usermount=1"
only root can do the mount. For non-root to mount the NFS server
when "vfs.usermount=0", a user would have to run their own custom hacked
userland NFS client. Although doable, I have never heard of it being done.

rick

Turning down TIME_WAIT helps, though the ability to do that only on
the interface facing the NFS server would be more palatable than doing
it globally.

Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
code at sys/nfs/krpc_subr.c correctly uses ports between
IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
and ipport_lowlastauto. But is that the correct place to look for
NFSv4.1?

How explosive would adding SO_REUSEADDR to the NFS client be? It's
not a full solution, but it would handle the TIME_WAIT side of the
issue.

Even so, there may be no workaround for the simultaneous mount limit
as long as reserved ports are required. Solving the negative
interaction with nullfs seems like the only long-term fix.

What would be a good next step there?

Thanks!
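
To put rough numbers on the two reserved-port limits described above
(simultaneous jobs and short-job throughput), here is a minimal sketch.
It assumes the client draws from ports IPPORT_RESERVED/2 up to
IPPORT_RESERVED - 1, uses two connections per job as stated in the
thread, and that a closed connection keeps its port in TIME_WAIT for
60 seconds. The 60-second figure, the exact range boundaries, and the
function names are illustrative assumptions, not values measured on
these servers.

```python
# Back-of-the-envelope model of the reserved-port limits.
# All names here are hypothetical; the numbers are assumptions.

IPPORT_RESERVED = 1024  # ports below this are privileged


def reserved_port_count(low=IPPORT_RESERVED // 2, high=IPPORT_RESERVED):
    """Number of ports in the half-open range [low, high)."""
    return high - low


def max_simultaneous_jobs(ports, ports_per_job=2):
    """Hard cap on concurrent jobs when each job holds ports_per_job ports."""
    return ports // ports_per_job


def max_job_rate(ports, busy_ports, job_seconds,
                 time_wait_seconds=60.0, ports_per_job=2):
    """Upper bound on short-job starts per second.

    Each short job ties up ports_per_job ports for its run time plus
    the TIME_WAIT period, so in steady state
    rate * ports_per_job * hold_time cannot exceed the free ports.
    busy_ports is the number of ports pinned by long-running jobs.
    """
    free = ports - busy_ports
    if free <= 0:
        return 0.0
    hold_time = job_seconds + time_wait_seconds
    return free / (ports_per_job * hold_time)


if __name__ == "__main__":
    ports = reserved_port_count()
    print("reserved ports:", ports)
    print("max simultaneous jobs:", max_simultaneous_jobs(ports))
    # 5 ms jobs, first with no long jobs, then with 100 long jobs running.
    print("short jobs/sec, idle:", max_job_rate(ports, 0, 0.005))
    print("short jobs/sec, 100 long jobs:", max_job_rate(ports, 200, 0.005))
```

The model shows why the two limits interfere: every port pinned by a
long job is subtracted from the pool that short jobs cycle through, and
for very short jobs the hold time is almost entirely TIME_WAIT.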