Date:      Wed, 24 Nov 2021 09:08:04 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 260011] Unresponsive NFS mount on AWS EFS
Message-ID:  <bug-260011-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260011

            Bug ID: 260011
           Summary: Unresponsive NFS mount on AWS EFS
           Product: Base System
           Version: 13.0-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: ale@FreeBSD.org

I'm experiencing annoying issues with an AWS EFS mountpoint on FreeBSD 13
EC2 instances. The filesystem is mounted by 3 instances (2 with the same
access patterns, 1 with a different one).

Initially I had the /etc/fstab entry configured with:

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,rsize=1048576,wsize=1048576,timeo=600,oneopenown`

and after a few days this left my Java application with all threads blocked
in `stat64` system calls that never returned; the process could not be
killed even with kill -9.

After some digging, this seems to be the normal behavior for hard mount
points, even if I fail to understand why one would prefer to have the system
completely frozen when the NFS mount point is not responding.

So I later changed the configuration to:

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,intr,soft,retrans=2,rsize=1048576,wsize=1048576,timeo=600,oneopenown`

by adding `intr,soft,retrans=2`.

Btw, I think there is a typo in mount_nfs(8): it says to set `retrycnt`
instead of `retrans` for the `soft` option. Can you confirm?

After the change `nfsstat -m` reports:
`nfsv4,minorversion=1,oneopenown,tcp,resvport,soft,intr,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2`

I wonder why the `timeo`, `rsize` and `wsize` values seem to have been
ignored, but this is irrelevant to the issue.
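
For reference, a minimal sketch of how the requested and effective options
can be compared. The `diff_opts` helper is hypothetical (not part of any
FreeBSD tool); the two option strings are the NFS-specific part of the
fstab entry and the `nfsstat -m` line quoted above. It only does an exact
text match, so an option that is reported under a different name (`timeo`
vs `timeout`) is also listed.

```shell
#!/bin/sh
# Hypothetical helper: print every requested mount option that does not
# appear verbatim (same name and same value) in the effective option list.
# Both arguments are comma-separated option strings.
diff_opts() {
    requested=$1
    effective=$2
    old_ifs=$IFS
    IFS=,
    for opt in $requested; do
        case ",$effective," in
            *",$opt,"*) ;;          # present with an identical value
            *) echo "$opt" ;;       # missing, renamed, or value differs
        esac
    done
    IFS=$old_ifs
}

# NFS-specific options from the fstab entry.
requested='nfsv4,minorversion=1,intr,soft,retrans=2,rsize=1048576,wsize=1048576,timeo=600,oneopenown'
# Options line as reported by `nfsstat -m`.
effective='nfsv4,minorversion=1,oneopenown,tcp,resvport,soft,intr,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2'

diff_opts "$requested" "$effective"
```

With these two strings it lists `rsize=1048576`, `wsize=1048576` and
`timeo=600`, i.e. the three settings that did not take effect as written.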

After a few days, however, the application on the two similar EC2 instances
stopped working again. Any command accessing the mounted EFS filesystem
(ls, df, umount, etc.) didn't complete in reasonable time, but this time I
could kill the processes. The only way to recover was to reboot the
instances.

On one of them I've seen the following kernel messages, but they were
generated only when I tried to debug the issue hours later, and only on one
EC2 instance, so I'm not sure if they are relevant or helpful:

```
kernel: newnfs: server 'fs-xxx.efs.us-east-1.amazonaws.com' error: fileid changed. fsid 0:0: expected fileid 0x4d2369b89a58a920, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
kernel: nfs server fs-xxx.efs.us-east-1.amazonaws.com:/: not responding
```

The third EC2 instance survived and was still able to access the filesystem,
but I think it wasn't accessing the filesystem when the network/NFS issue
that affected the other two occurred.
