From owner-freebsd-fs@FreeBSD.ORG  Mon Jan 28 15:44:56 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id E83479B7
 for <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 15:44:56 +0000 (UTC)
 (envelope-from c47g@gmx.at)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.18])
 by mx1.freebsd.org (Postfix) with ESMTP id 96A85680
 for <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 15:44:56 +0000 (UTC)
Received: from mailout-de.gmx.net ([10.1.76.12]) by mrigmx.server.lan
 (mrigmx002) with ESMTP (Nemesis) id 0MTdbK-1UQLx92daS-00QQGP for
 <freebsd-fs@freebsd.org>; Mon, 28 Jan 2013 16:44:55 +0100
Received: (qmail invoked by alias); 28 Jan 2013 15:44:55 -0000
Received: from cm56-168-232.liwest.at (EHLO bones.gusis.at) [86.56.168.232]
 by mail.gmx.net (mp012) with SMTP; 28 Jan 2013 16:44:55 +0100
X-Authenticated: #9978462
X-Provags-ID: V01U2FsdGVkX1+RNOEagoaR8Aj+CRm8gJ6VegtnFdM0y+SAncGTcB
 2AVQOib+LqwOIQ
From: Christian Gusenbauer <c47g@gmx.at>
To: pyunyh@gmail.com
Subject: Re: 9.1-stable crashes while copying data from a NFS mounted directory
Date: Mon, 28 Jan 2013 16:46:43 +0100
User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; )
References: <201301241805.57623.c47g@gmx.at> <201301251809.50929.c47g@gmx.at>
 <20130128063531.GC1447@michelle.cdnetworks.com>
In-Reply-To: <20130128063531.GC1447@michelle.cdnetworks.com>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201301281646.43551.c47g@gmx.at>
X-Y-GMX-Trusted: 0
Cc: freebsd-fs@freebsd.org, net@freebsd.org, yongari@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2013 15:44:57 -0000

On Monday 28 January 2013 07:35:31 YongHyeon PYUN wrote:
> On Fri, Jan 25, 2013 at 06:09:50PM +0100, Christian Gusenbauer wrote:
> > On Friday 25 January 2013 05:50:48 YongHyeon PYUN wrote:
> > > On Fri, Jan 25, 2013 at 01:30:43PM +0900, YongHyeon PYUN wrote:
> > > > On Thu, Jan 24, 2013 at 05:21:50PM -0500, John Baldwin wrote:
> > > > > On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > > > > > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer wrote:
> > > > > > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > > > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer wrote:
> > > > > > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov wrote:
> > > > > > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin Belousov wrote:
> > > > > > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian Gusenbauer wrote:
> > > > > > > > > > > > Hi!
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm using 9.1-STABLE, svn revision 245605, and I get
> > > > > > > > > > > > the panic below if I execute the following commands
> > > > > > > > > > > > (in single-user mode):
> > > > > > > > > > > > 
> > > > > > > > > > > > # swapon -a
> > > > > > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > > > > > # mount -u /
> > > > > > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > > > > > > 
> > > > > > > > > > > > then the system panics almost immediately. I'll
> > > > > > > > > > > > attach the stack trace.
> > > > > > > > > > > > 
> > > > > > > > > > > > Note that I'm using jumbo frames (6144 bytes) on a
> > > > > > > > > > > > 1 Gbit network; maybe that's the cause of the panic,
> > > > > > > > > > > > because the bcopy (see stack frame #15) fails.
> > > > > > > > > > > > 
> > > > > > > > > > > > Any clues?
> > > > > > > > > > > 
> > > > > > > > > > > I tried a similar operation with an NFS mount with
> > > > > > > > > > > rsize=32768 and MTU 6144, but my machine runs HEAD and
> > > > > > > > > > > uses em instead of age. I was unable to reproduce the
> > > > > > > > > > > panic while copying a 5GB file from the NFS mount.
> > > > > > > > > 
> > > > > > > > > Hmmm, I did a quick test. If I do not change the MTU, i.e.
> > > > > > > > > just configure age0 with
> > > > > > > > > 
> > > > > > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > > > > > > 
> > > > > > > > > then I can copy all files from the mounted directory
> > > > > > > > > without any problems, too. So it's probably age0 related?
> > > > > > > > 
> > > > > > > > From your backtrace and the buffer printout, I see a somewhat
> > > > > > > > strange thing. The buffer data address is 0xffffff8171418000,
> > > > > > > > while the kernel faulted when attempting to write to
> > > > > > > > 0xffffff8171413000, which is lower than the buffer data
> > > > > > > > pointer, during the bcopy to the buffer.
> > > > > > > > 
> > > > > > > > The other data suggests that there was no overflow of the
> > > > > > > > data from the server response. So it might be that
> > > > > > > > mbuf_len(mp) returned a negative number? I am not sure that
> > > > > > > > is possible at all.
> > > > > > > > 
> > > > > > > > Try this debugging patch, please. You need to add INVARIANTS
> > > > > > > > etc. to the kernel config.
> > > > > > > > 
> > > > > > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > index efc0786..9a6bda5 100644
> > > > > > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > > > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > > > > > >  				}
> > > > > > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > > > > > >  				len = mbuf_len(mp);
> > > > > > > > +				KASSERT(len > 0, ("len %d", len));
> > > > > > > >  			}
> > > > > > > >  			xfer = (left > len) ? len : left;
> > > > > > > >  #ifdef notdef
> > > > > > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd, struct uio *uiop, int siz)
> > > > > > > >  			uiop->uio_resid -= xfer;
> > > > > > > >  		}
> > > > > > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > > > > > +			KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > > > > > +			    uiop->uio_iovcnt));
> > > > > > > >  			uiop->uio_iovcnt--;
> > > > > > > >  			uiop->uio_iov++;
> > > > > > > >  		} else {
> > > > > > > > 
> > > > > > > > I thought that the server had returned too long a response,
> > > > > > > > but that does not seem to be the case from your data. Still,
> > > > > > > > I think the patch below might be warranted.
> > > > > > > > 
> > > > > > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > index be0476a..a89b907 100644
> > > > > > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > > > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio *uiop, struct ucred *cred,
> > > > > > > >  			NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
> > > > > > > >  			eof = fxdr_unsigned(int, *tl);
> > > > > > > >  		}
> > > > > > > > -		NFSM_STRSIZ(retlen, rsize);
> > > > > > > > +		NFSM_STRSIZ(retlen, len);
> > > > > > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > > > > > >  		if (error)
> > > > > > > >  			goto nfsmout;
> > > > > > > 
> > > > > > > I applied your patches and now I get a
> > > > > > > 
> > > > > > > panic: len -4
> > > > > > > cpuid = 1
> > > > > > > KDB: enter: panic
> > > > > > > Dumping 377 out of 6116 MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> > > > > > 
> > > > > > This means that the age driver either produced a corrupted mbuf
> > > > > > chain or put a wrong, negative value into the mbuf len field. I
> > > > > > am quite certain that the issue is in the driver.
> > > > > > 
> > > > > > I added net@ to Cc:; hopefully you can get help there.
> > > > > 
> > > > > And I've cc'd Pyun who has written most of this driver and is
> > > > > likely the one most familiar with its handling of jumbo frames.
> > > > 
> > > > Try the attached one and let me know how it goes.
> > > > Note that I don't have age(4) hardware anymore, so it wasn't tested at all.
> > > 
> > > Sorry, ignore the previous patch and use this one (age.diff2) instead.
> > 
> > Thanks for the patch! I ignored the first and applied only the second
> > one, but unfortunately that did not change anything. I still get the
> > "panic: len -4"
> > 
> > :-(.
> 
> Ok, I contacted QAC and got a hint about its descriptor usage, and I
> realized the controller does not work as I initially expected!
> When I wrote age(4) for the controller, the hardware was available
> only for a couple of weeks, so I may not have had enough time to test
> it.  Sorry about that.
> I'll let you know when an experimental patch is available. Due to the
> lack of hardware, it will take more time than it used to.
> 
> Thanks for reporting!

Thanks for investing your time! I'm looking forward to testing your next
patch(es) :-)!

Ciao,
Christian.