From: J David
Date: Fri, 11 Dec 2020 16:52:16 -0500
Subject: Re: Major issues with nfsv4
To: Rick Macklem
Cc: "freebsd-fs@freebsd.org"

Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not resolve our issue. But I've narrowed down the problem to a harmful interaction between NFSv4 and nullfs.

These FreeBSD NFS clients form a pool of application servers that run jobs for the application. A given job needs read-write access to its data and read-only access to the set of binaries it needs to run.

The job data is horizontally partitioned across a set of directory trees spread over one set of NFS servers. A separate set of NFS servers stores the read-only binary roots.

The jobs are assigned to these machines by a scheduler. A job might take five milliseconds or five days.

Historically, we have mounted the job data trees and the various binary roots on each application server over NFSv3. When a job starts, its setup binds the needed data and binaries into a jail via nullfs, then runs the job in the jail. This approach has worked perfectly for 10+ years.

After I switched a server to NFSv4.1 to test that recommendation, it started having the same load problems as NFSv4. As a test, I altered it to mount NFS directly in the jails for both the data and the binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs started, the load and CPU usage started to fall dramatically.

The critical problem with this approach is that privileged TCP ports are a finite resource. At two per job, this creates two issues. First, there's a hard limit on simultaneous jobs per server that is inconsistent with the hardware's capabilities. Second, due to TIME_WAIT, it places a hard limit on job throughput.
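
To put rough numbers on those two limits, here is a back-of-the-envelope sketch. It assumes the client draws source ports from roughly IPPORT_RESERVED/2 up to IPPORT_RESERVED - 1 (512 to 1023), that each job holds two connections, and that TIME_WAIT lasts 60 seconds (twice a 30 second MSL, which I believe is the default for net.inet.tcp.msl). The function names are made up for illustration, and the model pessimistically treats a port in TIME_WAIT as unusable for any new connection.

```python
# Back-of-the-envelope port budget for per-job NFS mounts.
# All defaults are assumptions for illustration, not measured values.

IPPORT_RESERVED = 1024  # first non-privileged port


def usable_ports(low=IPPORT_RESERVED // 2, high=IPPORT_RESERVED - 1):
    """Count of source ports in the inclusive range [low, high]."""
    return high - low + 1


def max_simultaneous_jobs(ports, ports_per_job=2):
    """Jobs that can hold their mounts open at the same time."""
    return ports // ports_per_job


def max_job_starts_per_second(ports, long_jobs=0, ports_per_job=2,
                              time_wait_seconds=60.0):
    """Upper bound on short-job starts per second.

    Ports held by long-running jobs are unavailable; every remaining port
    is assumed to sit in TIME_WAIT for time_wait_seconds after each use.
    """
    free_ports = max(ports - long_jobs * ports_per_job, 0)
    return free_ports / (ports_per_job * time_wait_seconds)


if __name__ == "__main__":
    ports = usable_ports()
    print("usable ports:", ports)
    print("max simultaneous jobs:", max_simultaneous_jobs(ports))
    for long_jobs in (0, 100, 200):
        rate = max_job_starts_per_second(ports, long_jobs=long_jobs)
        print(f"{long_jobs} long jobs -> {rate:.2f} short job starts/s")
```

Under these assumed numbers the ceiling is 256 simultaneous jobs per server, and the short-job start rate drops as long-running jobs pin more of the port range.
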
In practice, these limits also interfere with each other; the more simultaneous long jobs are running, the more impact TIME_WAIT has on short job throughput.

While it's certainly possible to configure NFS not to require reserved ports, the slightest possibility of a non-root user establishing a session to the NFS server kills that as an option.

Turning down TIME_WAIT helps, though the ability to do that only on the interface facing the NFS server would be more palatable than doing it globally.

Adjusting net.inet.ip.portrange.lowlast does not seem to help. The code at sys/nfs/krpc_subr.c correctly uses ports between IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto and ipport_lowlastauto. But is that the correct place to look for NFSv4.1?

How explosive would adding SO_REUSEADDR to the NFS client be? It's not a full solution, but it would handle the TIME_WAIT side of the issue.

Even so, there may be no workaround for the simultaneous mount limit as long as reserved ports are required. Solving the negative interaction with nullfs seems like the only long-term fix. What would be a good next step there?

Thanks!