From: Alan Somers <asomers@gmail.com>
To: J David
Cc: Rick Macklem, "freebsd-fs@freebsd.org"
Date: Fri, 11 Dec 2020 16:08:10 -0700
Subject: Re: Major issues with nfsv4
List-Id: Filesystems <freebsd-fs@freebsd.org>

On Fri, Dec 11, 2020 at 2:52 PM J David wrote:

> Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> resolve our issue. But I've narrowed down the problem to a harmful
> interaction between NFSv4 and nullfs.
>
> These FreeBSD NFS clients form a pool of application servers that run
> jobs for the application. A given job needs read-write access to its
> data and read-only access to the set of binaries it needs to run.
>
> The job data is horizontally partitioned across a set of directory
> trees spread over one set of NFS servers. A separate set of NFS
> servers stores the read-only binary roots.
>
> The jobs are assigned to these machines by a scheduler. A job might
> take five milliseconds or five days.
>
> Historically, we have mounted the job data trees and the various
> binary roots on each application server over NFSv3. When a job
> starts, its setup binds the needed data and binaries into a jail via
> nullfs, then runs the job in the jail. This approach has worked
> perfectly for 10+ years.
>
> After I switched a server to NFSv4.1 to test that recommendation, it
> started having the same load problems as NFSv4. As a test, I altered
> it to mount NFS directly in the jails for both the data and the
> binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
> started, the load and CPU usage started to fall dramatically.
>
> The critical problem with this approach is that privileged TCP ports
> are a finite resource. At two per job, this creates two issues.
>
> First, there's a hard limit on simultaneous jobs per server that is
> inconsistent with the hardware's capabilities. Second, due to
> TIME_WAIT, it places a hard limit on job throughput. In practice,
> these limits also interfere with each other; the more simultaneous
> long jobs are running, the more impact TIME_WAIT has on short-job
> throughput.
>
> While it's certainly possible to configure NFS not to require
> reserved ports, the slightest possibility of a non-root user
> establishing a session to the NFS server kills that as an option.
>
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than
> doing it globally.
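For scale, the ceiling implied by two reserved ports per job can be
sketched as follows, assuming the client draws source ports from the
conventional IPPORT_RESERVED/2 through IPPORT_RESERVED-1 range (the
numbers are illustrative, not measured on the machines in question):

```shell
# Reserved (privileged) source ports for NFS conventionally span
# IPPORT_RESERVED/2 (512) through IPPORT_RESERVED-1 (1023).
nports=$(( 1024 - 512 ))      # ~512 usable reserved ports
jobs_cap=$(( nports / 2 ))    # two reserved ports (data + binaries) per job
echo "approximate simultaneous-job ceiling per client: ${jobs_cap}"

# TIME_WAIT pressure and its duration can be inspected on FreeBSD with
# (shown for reference, not run here):
#   netstat -an -p tcp | grep TIME_WAIT | wc -l
#   sysctl net.inet.tcp.msl    # TIME_WAIT lasts 2*MSL
```

And TIME_WAIT makes the effective ceiling lower still, since a port
from a finished job stays unusable for 2*MSL after the connection
closes.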
>
> Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto. But is that the correct place to look for
> NFSv4.1?
>
> How explosive would adding SO_REUSEADDR to the NFS client be? It's
> not a full solution, but it would handle the TIME_WAIT side of the
> issue.
>
> Even so, there may be no workaround for the simultaneous mount limit
> as long as reserved ports are required. Solving the negative
> interaction with nullfs seems like the only long-term fix.
>
> What would be a good next step there?
>
> Thanks!

That's some good information. However, it must not be the whole story.
I've been nullfs mounting my NFS mounts for years. For example, right
now on a FreeBSD 12.2-RC2 machine:

> sudo nfsstat -m
Password:
192.168.0.2:/home on /usr/home
nfsv4,minorversion=1,tcp,resvport,soft,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
> mount | grep home
192.168.0.2:/home on /usr/home (nfs, nfsv4acls)
/usr/home on /iocage/jails/rustup2/root/usr/home (nullfs)

Are you using any mount options with nullfs? It might be worth trying
to make the read-only mount read-write, to see if that helps. And what
does "jls -n" show?

-Alan
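P.S. To make sure we're comparing the same arrangement, here is a
minimal sketch of the nullfs-over-NFS setup as I understand it from
your description. All hostnames and paths are hypothetical
placeholders, and these commands need root and real servers, so they
are a config sketch rather than something runnable as-is:

```shell
# Hypothetical hosts/paths, for illustration only.
# The host mounts each NFS tree once:
mount -t nfs -o nfsv4,minorversion=1 data1:/jobs /net/jobs
mount -t nfs -o nfsv4,minorversion=1,ro bin1:/roots /net/roots

# Per-job setup then nullfs-binds the needed slices into the jail:
mount -t nullfs /net/jobs/job123 /jails/job123/root/data
mount -t nullfs -o ro /net/roots/app-2.4 /jails/job123/root/app
```

If your setup differs from this (extra nullfs options, deeper nesting,
etc.), that difference may be where the NFSv4 interaction comes from.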