From owner-freebsd-bugs  Mon Apr 21 12:50:49 1997
Return-Path: <owner-bugs>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id MAA09963
          for bugs-outgoing; Mon, 21 Apr 1997 12:50:49 -0700 (PDT)
Received: from dg-rtp.dg.com (dg-rtp.rtp.dg.com [128.222.1.2])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id MAA09953
          for <freebsd-bugs@freefall.FreeBSD.org>; Mon, 21 Apr 1997 12:50:44 -0700 (PDT)
Received: by dg-rtp.dg.com (5.4R3.10/dg-rtp-v02)
	id AA20155; Mon, 21 Apr 1997 15:50:05 -0400
Received: from ponds by dg-rtp.dg.com.rtp.dg.com; Mon, 21 Apr 1997 15:50 EDT
Received: from lakes.water.net (lakes [10.0.0.3]) by ponds.water.net (8.8.3/8.7.3) with ESMTP id OAA27131; Mon, 21 Apr 1997 14:53:27 -0400 (EDT)
Received: (from rivers@localhost) by lakes.water.net (8.8.3/8.6.9) id OAA02323; Mon, 21 Apr 1997 14:59:53 -0400 (EDT)
Date: Mon, 21 Apr 1997 14:59:53 -0400 (EDT)
From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
Message-Id: <199704211859.OAA02323@lakes.water.net>
To: ponds!nlsystems.com!dfr, ponds!lakes.water.net!rivers
Subject: Re: kern/3304: NFS V2 readdir hangs
Cc: ponds!freefall.cdrom.com!freebsd-bugs
Content-Type: text
Sender: owner-bugs@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> 
> What appears to be happening is that numb is making a 4096byte sized
> readdir request for the first block of the large directory.  You can see
> this in the trace as request id b6cff051 (btw. you may find it useful to
> grep the log for nfs to separate the wood from the trees; next time we
> should add 'port nfs' to the tcpdump command).  The reply is sent but for
> some reason it never makes it into soreceive.
> 
> You can see that numb retries the request with the same xid several times
> but never receives the reply.  My guess is that something between numb and
> sundog has corrupted the packet and it is failing the checksum in
> udp_input.  What we need to do is find out how far up the protocol stack
> the packet goes.  I suggest adding printfs to udp_input and ip_input where
> they drop packets with bad checksums (line 154 in udp_usrreq.c).  You
> should also be able to see it with 'netstat -p udp' and 'netstat -p ip'.

 Here's the output of those netstat commands:

Script started on Mon Apr 21 14:11:18 1997
# netstat -p udp
udp:
	129 datagrams received
	0 with incomplete header
	0 with bad data length field
	0 with bad checksum
	0 dropped due to no socket
	13 broadcast/multicast datagrams dropped due to no socket
	5 dropped due to full socket buffers
	0 not for hashed pcb
	111 delivered
	116 datagrams output
# netstat -p ip
ip:
	180 total packets received
	0 bad header checksums
	0 with size smaller than minimum
	0 with data size < data length
	0 with header length < data size
	0 with data length < header length
	0 with bad options
	0 with incorrect version number
	15 fragments received
	0 fragments dropped (dup or out of space)
	0 fragments dropped after timeout
	5 packets reassembled ok
	130 packets for this host
	0 packets for unknown/unsupported protocol
	0 packets forwarded
	40 packets not forwardable
	0 redirects sent
	116 packets sent from this host
	0 packets sent with fabricated ip header
	0 output packets dropped due to no bufs, etc.
	0 output packets discarded due to no route
	0 output datagrams fragmented
	0 fragments created
	0 datagrams that can't be fragmented
# exit

Script done on Mon Apr 21 14:11:25 1997

No checksum problems, but I do notice the "5 dropped due to full socket
buffers" line.  Could that be the reason?
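
As a quick sanity check on the UDP counters, here is a minimal C sketch
that adds up every disposition netstat reports and compares the total
with "datagrams received".  The struct and the helper udp_accounted()
are hypothetical names made up for illustration; the numbers are copied
by hand from the netstat output above.  It assumes that "not for hashed
pcb" is a lookup statistic rather than a drop, so that counter is left
out of the sum.

```c
#include <assert.h>
#include <stddef.h>

/* Counters copied by hand from the 'netstat -p udp' output. */
struct udp_counts {
	unsigned long received;        /* datagrams received */
	unsigned long bad_header;      /* with incomplete header */
	unsigned long bad_length;      /* with bad data length field */
	unsigned long bad_checksum;    /* with bad checksum */
	unsigned long no_socket;       /* dropped due to no socket */
	unsigned long no_socket_bcast; /* broadcast/multicast, no socket */
	unsigned long full_sockbuf;    /* dropped due to full socket buffers */
	unsigned long delivered;       /* delivered */
};

/*
 * Sum of every way a received datagram can end up.
 * Hypothetical helper for illustration, not a kernel routine.
 */
static unsigned long
udp_accounted(const struct udp_counts *c)
{
	assert(c != NULL);
	return c->bad_header + c->bad_length + c->bad_checksum +
	    c->no_socket + c->no_socket_bcast + c->full_sockbuf +
	    c->delivered;
}

/* The values reported in this message. */
static const struct udp_counts sample = {
	129, 0, 0, 0, 0, 13, 5, 111
};
```

With these numbers udp_accounted(&sample) is 13 + 5 + 111 = 129, which
matches "datagrams received".  So every datagram is accounted for, and
the 5 full-socket-buffer drops are the only losses that are not
broadcast/multicast traffic.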

> 
> You might also try this (untested) hack which should limit readdirs to
> smaller bites:
> 
> Index: nfs_vfsops.c
> ===================================================================
> RCS file: /home/smp/sys/nfs/nfs_vfsops.c,v
> retrieving revision 1.1.1.5
> diff -u -r1.1.1.5 nfs_vfsops.c
> --- nfs_vfsops.c	1997/04/18 07:09:39	1.1.1.5
> +++ nfs_vfsops.c	1997/04/21 17:19:58
> @@ -748,6 +748,7 @@
>  	}
>  	if (nmp->nm_readdirsize > maxio)
>  		nmp->nm_readdirsize = maxio;
> +	nmp->nm_readdirsize = 1024; /* XXX */
>  
>  	if ((argp->flags & NFSMNT_MAXGRPS) && argp->maxgrouplist >= 0 &&
>  		argp->maxgrouplist <= NFS_MAXGRPS)
> 

 Yes!  This particular change does work around the problem.  I'm
able to run my "ls -lR" and have it complete successfully, although
there are some strange lags every now and then.  I've been running it
continuously for a few minutes now with no hangs.
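
 One possible reason the smaller readdirsize helps is IP fragmentation.
The following is a minimal C sketch, not kernel code, that estimates how
many IP fragments one UDP datagram needs.  It assumes an Ethernet MTU of
1500, a 20-byte IP header without options, an 8-byte UDP header, and a
sender that cuts every fragment at the largest multiple of 8 bytes that
fits.  The helper udp_fragments() is a hypothetical name, and the payload
sizes ignore the RPC header that a real readdir reply also carries.

```c
#include <assert.h>

/*
 * Estimate the number of IP fragments for one UDP datagram that
 * carries 'payload' bytes of data over a link with the given MTU.
 * Hypothetical helper for illustration, not a kernel routine.
 */
static unsigned int
udp_fragments(unsigned int payload, unsigned int mtu)
{
	const unsigned int ip_hdr = 20;  /* assumed: no IP options */
	const unsigned int udp_hdr = 8;
	unsigned int per_frag, total;

	/* The MTU must leave room for at least 8 data bytes. */
	assert(mtu >= ip_hdr + 8);

	/* Fragment data is cut at a multiple of 8 bytes. */
	per_frag = ((mtu - ip_hdr) / 8) * 8;

	/* The UDP header travels inside the first fragment's data. */
	total = payload + udp_hdr;

	/* Ceiling division: a partly filled last fragment still counts. */
	return (total + per_frag - 1) / per_frag;
}
```

Under those assumptions udp_fragments(4096, 1500) is 3 and
udp_fragments(1024, 1500) is 1.  That is at least consistent with the
"netstat -p ip" counters, which show 15 fragments received and 5 packets
reassembled, or 3 fragments per datagram.  It does not prove that the
reassembled datagrams are the same 5 that were dropped for full socket
buffers.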

 Now the good question, which you asked, is: why are those packets
getting blocked?

 Another question I have is why this worked with 2.1.5.  Did it
always have a lower readdirsize, or is another problem in 2.2.1 simply
masked by lowering the readdirsize?

 I'm happy to investigate this further - and *overjoyed* that NFS
seems to be working for me...  let me know what I can do at this end.

	 - Thanks! -
	- Dave Rivers -