From owner-freebsd-fs@FreeBSD.ORG  Tue Feb  5 04:45:40 2013
From: YongHyeon PYUN <pyunyh@gmail.com>
Date: Tue, 5 Feb 2013 13:45:22 +0900
To: Christian Gusenbauer <c47g@gmx.at>
Subject: Re: [SOLVED] Re: 9.1-stable crashes while copying data from a NFS
 mounted directory
Message-ID: <20130205044522.GA1439@michelle.cdnetworks.com>
References: <201301241805.57623.c47g@gmx.at>
 <20130124212212.GM2522@kib.kiev.ua> <201301241721.51102.jhb@freebsd.org>
 <201302041705.31461.c47g@gmx.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201302041705.31461.c47g@gmx.at>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-fs@freebsd.org, yongari@freebsd.org
Reply-To: pyunyh@gmail.com

On Mon, Feb 04, 2013 at 05:05:31PM +0100, Christian Gusenbauer wrote:
> On Thursday 24 January 2013 23:21:50 John Baldwin wrote:
> > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
> > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
> > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > > > > > > > Hi!
> > > > > > > > > 
> > > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the panic
> > > > > > > > > below if I execute the following commands (as single user):
> > > > > > > > > 
> > > > > > > > > # swapon -a
> > > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > > # mount -u /
> > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > > > 
> > > > > > > > > then the system panics almost immediately. I'll attach the
> > > > > > > > > stack trace.
> > > > > > > > > 
> > > > > > > > > > Note that I'm using jumbo frames (6144 bytes) on a 1Gbit
> > > > > > > > > > network; maybe that's the cause of the panic, because the
> > > > > > > > > > bcopy (see stack frame #15) fails.
> > > > > > > > > 
> > > > > > > > > Any clues?
> > > > > > > > 
> > > > > > > > I tried a similar operation with the nfs mount of rsize=32768
> > > > > > > > and mtu 6144, but the machine runs HEAD and em instead of age.
> > > > > > > > I was unable to reproduce the panic on the copy of the 5GB
> > > > > > > > file from nfs mount.
> > > > > > 
> > > > > > Hmmm, I did a quick test. If I do not change the MTU, so just
> > > > > > configuring age0 with
> > > > > > 
> > > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > > > 
> > > > > > then I can copy all files from the mounted directory without any
> > > > > > problems, too. So it's probably age0 related?
> > > > > 
> > > > > > From your backtrace and the buffer printout, I see something
> > > > > > strange. The buffer data address is 0xffffff8171418000, while the
> > > > > > kernel faulted at the attempt to write at 0xffffff8171413000, which
> > > > > > is lower than the buffer data pointer, during the bcopy to the
> > > > > > buffer.
> > > > > 
> > > > > > The other data suggests that there was no overflow of the data from
> > > > > > the server response. So it might be that mbuf_len(mp) returned a
> > > > > > negative number? I am not sure whether that is possible at all.
> > > > > 
> > > > > Try this debugging patch, please. You need to add INVARIANTS etc to
> > > > > the kernel config.
> > > > > 
> > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > index efc0786..9a6bda5 100644
> > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > > > >  				}
> > > > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > > > >  				len = mbuf_len(mp);
> > > > > > +				KASSERT(len > 0, ("len %d", len));
> > > > > >  			}
> > > > > >  			xfer = (left > len) ? len : left;
> > > > > >  #ifdef notdef
> > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > > > >  			uiop->uio_resid -= xfer;
> > > > > >  		}
> > > > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > > > +			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > > > +			    uiop->uio_iovcnt));
> > > > > >  			uiop->uio_iovcnt--;
> > > > > >  			uiop->uio_iov++;
> > > > > >  		} else {
> > > > > > 
> > > > > > I thought that the server had returned too long a response, but
> > > > > > that seems not to be the case from your data. Still, I think the
> > > > > > patch below might be due.
> > > > > 
> > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > index be0476a..a89b907 100644
> > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
> > > > > >  			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
> > > > > >  			eof = fxdr_unsigned(int, *tl);
> > > > > >  		}
> > > > > > -		NFSM_STRSIZ(retlen, rsize);
> > > > > > +		NFSM_STRSIZ(retlen, len);
> > > > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > > > >  		if (error)
> > > > > >  			goto nfsmout;
> > > > 
> > > > I applied your patches and now I get a
> > > > 
> > > > panic: len -4
> > > > cpuid = 1
> > > > KDB: enter: panic
> > > > Dumping 377 out of 6116
> > > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> > > 
> > > > This means that the age driver either produced a corrupted mbuf chain
> > > > or filled a wrong negative value into the mbuf len field. I am quite
> > > > certain that the issue is in the driver.
> > > > 
> > > > I added net@ to Cc:; hopefully you can get help there.
> > 
> > And I've cc'd Pyun who has written most of this driver and is likely the
> > one most familiar with its handling of jumbo frames.
> 
> Hi All!
> 
> I was in contact with Pyun. We quickly found out that it is indeed a driver 
> problem. Pyun solved it and will commit the fix within the next few days.
> 

Committed in r246341.
Thanks for reporting and testing!

> There's only one (minor) issue open, and I cannot tell whether it really is
> one: Konstantin sent me an initial patch for the NFS code where he added a
> KASSERT(uiop->uio_iovcnt > 1) which triggers even with Pyun's fix. Without
> that assert my tests show no problem at all. So is this a problem?
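
The uio_iovcnt question above can be looked at with a small userspace
model. This is only a sketch, not the kernel code: struct sketch_iov,
struct sketch_uio and sketch_consume() are hypothetical names, and the
loop mirrors just the iovec-advance branch from the quoted nfsm_mbufuio()
hunk. It assumes the read is issued with a single iovec, which is not
confirmed anywhere in this thread. Under that assumption the branch is
entered with uio_iovcnt == 1 whenever the data exactly fills the last
iovec, so an assert of uio_iovcnt > 1 would fire there even though the
counter simply drops to 0 as the copy finishes.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct iovec and struct uio. */
struct sketch_iov {
	size_t iov_len;
};

struct sketch_uio {
	struct sketch_iov *uio_iov;
	int uio_iovcnt;
};

/*
 * Consume siz bytes from the uio, following the shape of the quoted hunk.
 * No data is copied; only the iovec bookkeeping is modelled.
 * Returns how many times the iovec-advance branch was entered with
 * uio_iovcnt <= 1, i.e. how often KASSERT(uio_iovcnt > 1) would fire.
 */
int
sketch_consume(struct sketch_uio *uiop, size_t siz)
{
	int would_fire = 0;

	while (siz > 0 && uiop->uio_iovcnt > 0) {
		size_t left = uiop->uio_iov->iov_len;

		if (left > siz)
			left = siz;
		if (uiop->uio_iov->iov_len <= siz) {
			/* Current iovec is fully used: move to the next one. */
			if (uiop->uio_iovcnt <= 1)
				would_fire++;
			uiop->uio_iovcnt--;
			uiop->uio_iov++;
		} else {
			/* Current iovec is only partly used: shrink it. */
			uiop->uio_iov->iov_len -= left;
		}
		siz -= left;
	}
	return (would_fire);
}
```

Called with one 32768-byte iovec and siz = 32768, sketch_consume()
returns 1 and leaves uio_iovcnt at 0; with siz = 1000 it returns 0 and
uio_iovcnt stays at 1. Whether the real read path behaves the same way
depends on how the caller builds its uio.
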
> 
> Thanks guys (especially Pyun) for helping & fixing!
> 
> Ciao,
> Christian.