Date:      Sun, 20 Jan 2019 17:19:08 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Martin Birgmeier <d8zNeCFG@aon.at>
Cc:        Eugene Grosbein <eugen@grosbein.net>, net@freebsd.org
Subject:   Re: [Bug 235031] [em] em0: poor NFS performance, strange behavior
Message-ID:  <20190120163533.K1342@besplex.bde.org>
In-Reply-To: <20190120145627.X1077@besplex.bde.org>
References:  <bug-235031-7501@https.bugs.freebsd.org/bugzilla/> <bug-235031-7501-goXNmp3zVl@https.bugs.freebsd.org/bugzilla/> <20190119204156.D929@besplex.bde.org> <3e407ee7-54e3-a6ac-5535-d11aceca9558@grosbein.net> <20190120061258.X3312@besplex.bde.org> <16ce1832-13da-d7bb-cce2-6682e058b5a6@aon.at> <20190120145627.X1077@besplex.bde.org>

On Sun, 20 Jan 2019, Bruce Evans wrote:

> [iflib_media_change() is missing iflib_stop(), like iflib_resume() was]
> 
> I don't know what the media was after the broken resume.  Its reported
> result can't be trusted anyway.  To recover from the broken resume, it
> usually worked to repeat down/up a few times.  This is consistent with
> bug -- eventually, previous down/up's change the state to close enough
> to stopped.  But using the interface in any way (including pinging it
> to see if it is still broken) makes it not so close to being stopped.

Further debugging after restoring the bug in resume:
- I use mainly zzz to suspend
- the bug usually doesn't break the interface if I copy zzz from nfs to
   non-nfs and use the copy.  This explains why almost no one except me
   noticed the bug -- zzz is usually not on nfs, and other nfs activity
   is usually lighter than mine too.  (Suspend apparently doesn't do enough
   stopping or syncing generally.  It should fsync() all files ...)
- the bug usually does break the interface if zzz is on nfs
- when the bug breaks the interface:
   - the media is reported as unchanged
   - after DUPs starting with a delay of many seconds and reducing by the
     ping interval of 1 second for each until the delay is less than 1
     second, the ping latency stabilizes at quite different values after
     each suspend/resume.  These values tend to be higher than for media
     change (several hundred ms instead of 76 ms).
   - my ifconfig executable is one of several under /sbin which is not on nfs,
     but my ifconfig is actually a shell script in $HOME/bin; the script
     selects the correct version of ifconfig for the current kernel; it is
     on nfs, and uses utilities on nfs.  I sometimes forget this, and then
     running plain ifconfig to attempt to recover takes too long, and if I
     wait then the nfs activity for finding ifconfig not on nfs tends to
     propagate the broken interface (like zzz on nfs breaks it).
     Manually selecting the correct version of ifconfig under /sbin and using
     it tends to work right (like zzz not on nfs).
   - even an mtu change is enough to recover.  This is not surprising, since
     it does slightly more than down/up as an implementation detail.  This
     shows that the reported media value is at least used by the reinit for
     the mtu change.
   - pinging the interface didn't make it active enough to stop the recovery
     from usually working.
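
The version-selecting wrapper mentioned above might look roughly like the following.  This is only a minimal sketch, not the actual script: the ifconfig-<release> naming scheme, the select_ifconfig function name and the fallback to plain ifconfig are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a wrapper that picks an ifconfig binary matching
# the running kernel.  The "ifconfig-<release>" file naming is an assumption.

select_ifconfig() {
    # $1: kernel release string, e.g. the output of `uname -r`
    # $2: directory holding the candidate binaries (ideally not on nfs)
    release=$1
    dir=$2
    candidate="$dir/ifconfig-$release"
    if [ -x "$candidate" ]; then
        # A binary built for this kernel release exists; use it.
        printf '%s\n' "$candidate"
    else
        # Otherwise fall back to the unversioned binary in the same directory.
        printf '%s\n' "$dir/ifconfig"
    fi
}

# Intended use as a wrapper (left commented out so the sketch can be
# sourced without running anything):
# exec "$(select_ifconfig "$(uname -r)" /sbin)" "$@"
```

If a wrapper like this and everything it runs lived off nfs, a recovery attempt such as "ifconfig em0 down" followed by "ifconfig em0 up" would presumably not generate nfs traffic on the broken interface first.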

Bruce
