From owner-freebsd-fs@freebsd.org  Wed Jul  8 14:21:07 2015
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8685A9964AC
 for <freebsd-fs@mailman.ysv.freebsd.org>; Wed,  8 Jul 2015 14:21:07 +0000 (UTC)
 (envelope-from email.ahmedkamal@googlemail.com)
Received: from mail-wi0-x235.google.com (mail-wi0-x235.google.com
 [IPv6:2a00:1450:400c:c05::235])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id EF5801C69;
 Wed,  8 Jul 2015 14:21:06 +0000 (UTC)
 (envelope-from email.ahmedkamal@googlemail.com)
Received: by wifm2 with SMTP id m2so91158379wif.1;
 Wed, 08 Jul 2015 07:21:05 -0700 (PDT)
X-Received: by 10.194.192.33 with SMTP id hd1mr20709204wjc.96.1436365265462;
 Wed, 08 Jul 2015 07:21:05 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.28.6.143 with HTTP; Wed, 8 Jul 2015 07:20:45 -0700 (PDT)
In-Reply-To: <CANzjMX4MzqtBD-myifpT6i_HM97FVQ31vWjh7fiMsLJBe7Bh0w@mail.gmail.com>
References: <CANzjMX45QaC8yZx2nHPAohJRvQjmUOHuhMQWP9nX+srJs707Hg@mail.gmail.com>
 <1022558302.2863702.1435838360534.JavaMail.zimbra@uoguelph.ca>
 <CANzjMX5eN1FsnHMf6KGZe_b3vwxxF=dy3fJUHxeGO4BXuNzfPA@mail.gmail.com>
 <791936587.3443190.1435873993955.JavaMail.zimbra@uoguelph.ca>
 <CANzjMX427XNQJ1o6Wh2CVy1LF1ivspGcfNeRCmv+OyApK2UhJg@mail.gmail.com>
 <CANzjMX5xyUz6OkMKS4O-MrV2w58YT9ricOPLJWVtAR5Ci-LMew@mail.gmail.com>
 <2010996878.3611963.1435884702063.JavaMail.zimbra@uoguelph.ca>
 <CANzjMX6EoPOcY9V5EQeu5KO1WhwFxxo7-mYRhccVvKiaDW8nGQ@mail.gmail.com>
 <1463698530.4486572.1436135333962.JavaMail.zimbra@uoguelph.ca>
 <CANzjMX4MzqtBD-myifpT6i_HM97FVQ31vWjh7fiMsLJBe7Bh0w@mail.gmail.com>
From: Ahmed Kamal <email.ahmedkamal@googlemail.com>
Date: Wed, 8 Jul 2015 16:20:45 +0200
Message-ID: <CANzjMX7bvh3_+EBBRn6A-PeC_1tnh9FOPeOuN0x=Rr6fGCa-SA@mail.gmail.com>
Subject: Re: Linux NFSv4 clients are getting (bad sequence-id error!)
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Julian Elischer <julian@freebsd.org>, freebsd-fs@freebsd.org,
 Xin LI <d@delphij.net>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.20
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 08 Jul 2015 14:21:07 -0000

Another note: when the Linux boxes have hung processes, they also have
a process (rpciod) taking 10-15% CPU.

On Wed, Jul 8, 2015 at 4:18 PM, Ahmed Kamal <email.ahmedkamal@googlemail.com> wrote:

> Hi folks,
>
> I have tested Xin's patches .. Unfortunately the problem didn't go away :/
> Many users are still reporting hung processes. If it would help, can you
> show me how to dump a network trace that would help you identify the issue?
>
> Also, is there any way to have my trusted NFSv3 handle the case where
> every ZFS /home folder is its own dataset?
>
> On Mon, Jul 6, 2015 at 12:28 AM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
>
>> Ahmed Kamal wrote:
>> > Hi folks,
>> >
>> > Just a quick update. I did not test Xin's patches yet .. What I did so far
>> > was increase the tcp highwater tunable and increase nfsd threads to 60.
>> > Today (a working day) I noticed I only got one bad sequence error message!
>> > Check this:
>> >
>> > # grep 'bad sequence' messages* | awk '{print $1 $2}' | uniq -c
>> >       1 messages:Jul5
>> >      39 messages.1:Jun28
>> >      15 messages.1:Jun29
>> >       4 messages.1:Jun30
>> >       9 messages.1:Jul1
>> >      23 messages.1:Jul2
>> >       1 messages.1:Jul4
>> >       1 messages.2:Jun28
>> >
>> > So there seems to be an improvement! Not sure whether the Linux NFSv4
>> > client is able to somehow recover from those bad-sequence situations or
>> > not .. I did get some user complaints that running "ls -l" is sometimes
>> > slow and takes a couple of seconds to finish.
>> >
>> > One final question .. Do you folks think NFSv4.1 is more reliable in
>> > general than NFSv4? I've only ever used NFSv3 (I guess it can't work here
>> > with /home/* being separate ZFS filesystems) .. So should I go through
>> > the pain of upgrading a few servers to RHEL 6 to try out NFSv4.1?
>> > Basically, do you expect the protocol to be more solid? I know it's a
>> > fluffy question, just give me your thoughts. Thanks a lot!
>> >
>> All I can say is that the "bad seqid" errors should not occur with
>> NFSv4.1, since NFSv4.1 doesn't use the seqid#s to order RPCs.
>>
>> Also I would say that a correctly implemented NFSv4.1 protocol should
>> function "more correctly", since all RPCs are performed "exactly once".
>> (How much effect this will have in practice, I can't say.)
>>
>> On the other hand, NFSv4.1 is a newer protocol (with an RFC of over
>> 500 pages), so it is hard to say how mature the implementations are.
>> I think only testing will give you the answer.
>>
>> I would suggest that you test Xin Li's patch that allows the "seqid + 2"
>> case and see if that makes the "bad seqid" errors go away. (Even though I
>> think this would indicate a client bug, adding this in a way that it can
>> be enabled via a sysctl seems reasonable.)
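For readers following the seqid argument, here is a minimal C sketch of the kind of check being discussed, assuming a server that remembers the seqid of the last op it performed for each open-owner, and assuming plain modulo-2^32 wraparound. The `allow_skip_by_one` flag is a hypothetical stand-in for the sysctl mentioned above; `seqid_check`, `owner_state` and the result names are illustrative only, not the FreeBSD nfsd code or Xin Li's actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Possible outcomes for an incoming NFSv4.0 seqid (illustrative names). */
enum seqid_result {
    SEQID_REPLAY, /* same as the last op performed: resend cached reply */
    SEQID_NEXT,   /* the next op in order: perform it                   */
    SEQID_BAD     /* anything else: reply NFS4ERR_BAD_SEQID             */
};

/* Hypothetical per-open-owner state: the seqid of the last op done (N). */
struct owner_state {
    uint32_t last_seqid;
};

/*
 * Classify an incoming seqid against N without changing any state.
 * uint32_t arithmetic wraps modulo 2^32, so the op after UINT32_MAX
 * is 0. If the hypothetical allow_skip_by_one knob is set, N + 2 is
 * also accepted, roughly the tolerance discussed in this thread.
 */
enum seqid_result
seqid_check(const struct owner_state *st, uint32_t seqid,
            bool allow_skip_by_one)
{
    assert(st != NULL);
    uint32_t n = st->last_seqid;

    if (seqid == n)
        return SEQID_REPLAY;
    if (seqid == (uint32_t)(n + 1u))
        return SEQID_NEXT;
    if (allow_skip_by_one && seqid == (uint32_t)(n + 2u))
        return SEQID_NEXT;
    return SEQID_BAD;
}
```

With `allow_skip_by_one` left false this enforces the strict N + 1 rule; turning it on trades that strictness for compatibility with a client that skips a seqid.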
>>
>> Btw, I haven't seen any additional posts from nfsv4@ietf.org on this,
>> rick
>>
>> >
>> >
>> > On Fri, Jul 3, 2015 at 2:51 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> >
>> > > Ahmed Kamal wrote:
>> > > > PS: Today (after adjusting tcp.highwater) I didn't get any screaming
>> > > > reports from users about hung VNC sessions. So maybe, just maybe, the
>> > > > Linux clients are able to somehow recover from these bad sequence
>> > > > errors. I could still see the bad sequence error message in the logs,
>> > > > though.
>> > > >
>> > > > Why isn't the highwater tunable set to something better by default? I
>> > > > mean, this server is certainly not under a high or unusual load (it's
>> > > > only 40 PCs mounting from it).
>> > > >
>> > > > On Fri, Jul 3, 2015 at 1:15 AM, Ahmed Kamal <email.ahmedkamal@googlemail.com> wrote:
>> > > >
>> > > > > Thanks all .. I understand now that we're doing the "right thing".
>> > > > > Although if mounting keeps wedging, I will have to solve it somehow,
>> > > > > either using Xin's patch or upgrading RHEL to 6.x and using NFSv4.1.
>> > > > >
>> > > > > Regarding Xin's patch, is it possible to build the patched nfsd code
>> > > > > as a kernel module? I'm looking to minimize my delta to upstream.
>> > > > >
>> > > Yes, you can build the nfsd as a module. If your kernel config does not
>> > > include "options NFSD", the module will get loaded/used. It is also
>> > > possible to replace the module without rebooting, but you need to kill
>> > > off the nfsd daemon, then kldunload nfsd.ko and replace nfsd.ko (in
>> > > /boot/<kernel-name>) with the new one.
>> > >
>> > > > > Also, would adopting Xin's patch and hiding it behind a
>> > > > > kern.nfs.allow_linux_broken_client sysctl be an option? (I'm probably
>> > > > > not the last person on earth to hit this.)
>> > > > >
>> > > If it fixes your problem, I think this is reasonable.
>> > > I'm also hoping that someone who works on the Linux client reports
>> > > if/when this was changed.
>> > >
>> > > rick
>> > >
>> > > > > Thanks a lot for all the help!
>> > > > >
>> > > > > On Thu, Jul 2, 2015 at 11:53 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> > > > >
>> > > > >> Ahmed Kamal wrote:
>> > > > >> > Appreciating the fruitful discussion! Can someone please explain to
>> > > > >> > me what would happen in the current situation (the Linux client
>> > > > >> > doing this skip-by-1 thing, and FreeBSD not accepting it)? What is
>> > > > >> > the effect of that?
>> > > > >> Well, as you've seen, the Linux client doesn't function correctly
>> > > > >> against the FreeBSD server (and probably others that don't support
>> > > > >> this "skip-by-1" case).
>> > > > >>
>> > > > >> > What do users see? Any chances of data loss?
>> > > > >> Hmm. Mostly it will cause Opens to fail, but I can't guess what the
>> > > > >> Linux client behaviour is after receiving NFS4ERR_BAD_SEQID. You're
>> > > > >> the guy observing it.
>> > > > >>
>> > > > >> >
>> > > > >> > Also, I find it strange that NetApp have acknowledged this is a bug
>> > > > >> > on their side, which has been fixed since then!
>> > > > >> Yeah, I think NetApp screwed up. For some reason their server allowed
>> > > > >> this, then was fixed to not allow it, and then someone decided that
>> > > > >> was broken and reversed it.
>> > > > >>
>> > > > >> > I also find it strange that I'm the first to hit this :) Is no one
>> > > > >> > running nfs4 yet?!
>> > > > >> >
>> > > > >> Well, it seems to be slowly catching on. I suspect that the Linux
>> > > > >> client mounting a NetApp is the most common use of it. Since it
>> > > > >> appears that they flip-flopped w.r.t. whose bug this is, it has
>> > > > >> probably persisted.
>> > > > >>
>> > > > >> It may turn out that the Linux client has been fixed or it may turn
>> > > > >> out that most servers allowed this "skip-by-1" even though David
>> > > > >> Noveck (one of the main authors of the protocol) seems to agree with
>> > > > >> me that it should not be allowed.
>> > > > >>
>> > > > >> It is possible that others have bumped into this, but it wasn't
>> > > > >> isolated (I wouldn't have guessed it, so it was good you pointed to
>> > > > >> the RedHat discussion) and they worked around it by reverting to
>> > > > >> NFSv3 or similar.
>> > > > >> The protocol is rather complex in this area and changed completely
>> > > > >> for NFSv4.1, so many have also probably moved on to NFSv4.1, where
>> > > > >> this won't be an issue.
>> > > > >> (NFSv4.1 uses sessions to provide exactly once RPC semantics and
>> > > > >> doesn't use these seqid fields.)
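To make the point about sessions more concrete, here is a minimal C sketch of the per-slot check that NFSv4.1 uses in place of open-owner seqids, loosely following the SEQUENCE rules in RFC 5661: a request that repeats the slot's sequence ID is a retransmission and gets the cached reply, one that is exactly one greater is new, and anything else is misordered. The slot-table size, the assumption that a fresh slot starts at 0, and all struct and function names are illustrative assumptions, not any real server's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative slot-table size; a real server negotiates this at
 * session creation. */
#define NSLOTS 4

enum slot_result {
    SLOT_NEW,        /* seqid is one greater: execute, cache the reply */
    SLOT_REPLAY,     /* seqid repeats: retransmission, resend cache    */
    SLOT_MISORDERED, /* anything else (cf. NFS4ERR_SEQ_MISORDERED)     */
    SLOT_BADSLOT     /* slot id out of range (cf. NFS4ERR_BADSLOT)     */
};

struct slot {
    uint32_t seqid;  /* sequence id of the last request in this slot */
    bool has_reply;  /* true once a reply has been cached            */
};

struct session {
    struct slot slots[NSLOTS];
};

/* Apply the per-slot check; for a new request, advance the slot. */
enum slot_result
session_sequence(struct session *s, uint32_t slotid, uint32_t seqid)
{
    assert(s != NULL);
    if (slotid >= NSLOTS)
        return SLOT_BADSLOT;

    struct slot *sl = &s->slots[slotid];
    if (seqid == sl->seqid)
        return sl->has_reply ? SLOT_REPLAY : SLOT_MISORDERED;
    if (seqid == (uint32_t)(sl->seqid + 1u)) {
        sl->seqid = seqid;    /* execute the request here, then cache it */
        sl->has_reply = true;
        return SLOT_NEW;
    }
    return SLOT_MISORDERED;
}
```

Because ordering and replay detection live in the slot rather than in each open-owner, there is no per-owner seqid for a client to skip.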
>> > > > >>
>> > > > >> This is all just mho, rick
>> > > > >>
>> > > > >> > On Thu, Jul 2, 2015 at 1:59 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> > > > >> >
>> > > > >> > > Julian Elischer wrote:
>> > > > >> > > > On 7/2/15 9:09 AM, Rick Macklem wrote:
>> > > > >> > > > > I am going to post to nfsv4@ietf.org to see what they say.
>> > > > >> > > > > Please let me know if Xin Li's patch resolves your problem,
>> > > > >> > > > > even though I don't believe it is correct except for the
>> > > > >> > > > > UINT32_MAX case. Good luck with it, rick
>> > > > >> > > > and please keep us all in the loop as to what they say!
>> > > > >> > > >
>> > > > >> > > > the general N+2 bit sounds like bullshit to me.. it's always
>> > > > >> > > > N+1 in a number field that has a bit of slack at wrap time
>> > > > >> > > > (probably due to some ambiguity in the original spec).
>> > > > >> > > >
>> > > > >> > > Actually, since N is the lock op already done, N + 1 is the next
>> > > > >> > > lock operation in order. Since lock ops need to be strictly
>> > > > >> > > ordered, allowing N + 2 (which means N + 2 would be done before
>> > > > >> > > N + 1) makes no sense.
>> > > > >> > >
>> > > > >> > > I think the author of the RFC meant that N + 2 or greater fails,
>> > > > >> > > but it was poorly worded.
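The ordering argument can be illustrated with a toy simulation, assuming a single lock-owner whose last performed op is N and whose next two requests arrive reordered (N + 2 before N + 1). Under the strict rule only N + 1 runs; if N + 2 is tolerated, it runs first and the late N + 1 is then refused. `run_ops` and its parameters are made up for this illustration, and the sketch deliberately ignores client retries and server replay caching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of one lock-owner: `last` is N, the seqid of the last lock
 * op performed. Each incoming op runs only if its seqid is N + 1 (or
 * N + 2 when allow_gap is set); otherwise it is refused, as a server
 * would with NFS4ERR_BAD_SEQID. Executed seqids are written to `ran`
 * in execution order and their count is returned.
 */
size_t
run_ops(uint32_t last, const uint32_t *incoming, size_t n, bool allow_gap,
        uint32_t *ran)
{
    assert(n == 0 || (incoming != NULL && ran != NULL));
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t s = incoming[i];
        bool next = (s == (uint32_t)(last + 1u));
        bool gap = allow_gap && (s == (uint32_t)(last + 2u));

        if (next || gap) {
            ran[count++] = s; /* the op executes and becomes the new N */
            last = s;
        }
    }
    return count;
}
```

With N = 5 and requests arriving as 7 then 6, the strict run executes only op 6, while the tolerant run executes op 7 and refuses op 6, so the second lock op is performed ahead of the first.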
>> > > > >> > >
>> > > > >> > > I will pass along whatever I get from nfsv4@ietf.org. (There is an
>> > > > >> > > archive of it somewhere, but I can't remember where. ;-)
>> > > > >> > >
>> > > > >> > > rick
>> > > > >> > > _______________________________________________
>> > > > >> > > freebsd-fs@freebsd.org mailing list
>> > > > >> > > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> > > > >> > > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>