From owner-freebsd-atm  Wed Nov 12 17:23:58 1997
Return-Path: <owner-freebsd-atm>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id RAA28928
          for atm-outgoing; Wed, 12 Nov 1997 17:23:58 -0800 (PST)
          (envelope-from owner-freebsd-atm)
Received: from eden.dei.uc.pt (eden.dei.uc.pt [193.136.212.3])
          by hub.freebsd.org (8.8.7/8.8.7) with SMTP id RAA28921
          for <freebsd-atm@FreeBSD.ORG>; Wed, 12 Nov 1997 17:23:50 -0800 (PST)
          (envelope-from aalves@dei.uc.pt)
Received: from zorg.dei.uc.pt by eden.dei.uc.pt (5.65v3.2/1.1.10.5/28Jun97-0144PM)
	id AA01698; Thu, 13 Nov 1997 01:30:04 GMT
Received: from localhost (aalves@localhost)
	by zorg.dei.uc.pt (8.8.5/8.8.5) with SMTP id BAA01437;
	Thu, 13 Nov 1997 01:24:06 GMT
Date: Thu, 13 Nov 1997 01:24:06 +0000 (WET)
From: Antonio Luis Alves <aalves@dei.uc.pt>
To: Kenjiro Cho <kjc@csl.sony.co.jp>
Cc: freebsd-atm@FreeBSD.ORG
Subject: Re: ATM driver & udp stream 
In-Reply-To: <199711120445.NAA24120@hotaka.csl.sony.co.jp>
Message-Id: <Pine.BSF.3.96.971113002305.1378A-100000@zorg.dei.uc.pt>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-atm@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk


On Wed, 12 Nov 1997, Kenjiro Cho wrote:

> 
> >> All was working well until I made a UDP stream test. The sender was a
> >> Pentium 233mmx and on the receive side a Pentium 166. The test is a normal
> >> netperf UDP_STREAM with default values for the datagram, which is the max
> >> udp datagram 9216. The driver crashes at the pdu receive routine when
> >> accessing the segments which form the pdu. With the debugger I can see that
> >> the mbuf handles are correct but the data on the mbuf is not. Some times
> >> the m_type is MT_FREE and m->next as got a valid pointer on it, and other
> >> times all the members of the mbuf are zero. On a normal situation the
> >> mbufs would be of type MT_DATA, and correctly initialized with the
> >> pointers to the external free and reference function, m_len, ext_size and
> >> ext_buffer.
> >> From the firmware guide it says the card never touches the mbuf, it just
> >> pass the mbuf handle received from the supply routine to the segments on
> >> the received pdu descriptor.
> 
> It sounds familiar to me..., so you might want to take a look at the
> following possibility. 
> 
> I had a similar experience with Chuck Cranor's ATM driver long time
> ago.  It turns out that, under heavy load, a circular list holding
> pointers to mbufs gets completely full and wrapped around.  As a
> result, a wrong mbuf in use gets freed.  The system crashes after a
> while when it refers to the misplaced mbuf.  
> 
> IMO, it's a very bad idea to put a free list chain at the top of a
> mbuf cluster data area when a mbuf is on the free list.  It is very
> hard to debug if something goes wrong.
> 

After reading your mail, my first thought was to add a periodic test that
checks the mbufs in the circular list of the buffer supply protocol. That
way I could probably get closer to the problem and try to trace it. At the
moment this is the only place I can think of that could cause the problem,
if the mbufs were being freed while still owned by the card.
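
A minimal sketch of what such a periodic check could look like, in C. The
struct fake_mbuf, the ring layout, RING_SIZE and the function names are all
hypothetical stand-ins for illustration, not the driver's or the kernel's
actual structures. The supply routine also refuses to wrap onto a slot the
card still owns, which is the failure described in the quoted mail.

```c
#include <stddef.h>

/* Hypothetical stand-ins; real kernel mbufs carry many more fields. */
enum { MT_FREE = 0, MT_DATA = 1 };

struct fake_mbuf {
    int    m_type;      /* expected to be MT_DATA while the card owns it */
    void  *ext_buf;     /* external buffer pointer */
    size_t ext_size;    /* external buffer size */
};

#define RING_SIZE 8     /* assumed size; one slot is always kept empty */

struct supply_ring {
    struct fake_mbuf *slot[RING_SIZE];
    unsigned head;      /* next slot the driver will fill */
    unsigned tail;      /* oldest slot still owned by the card */
};

/* Number of handles currently outstanding (owned by the card). */
static unsigned ring_count(const struct supply_ring *r)
{
    return (r->head + RING_SIZE - r->tail) % RING_SIZE;
}

/* Supply one mbuf; returns -1 rather than overwriting a live slot. */
static int ring_supply(struct supply_ring *r, struct fake_mbuf *m)
{
    if (ring_count(r) == RING_SIZE - 1)
        return -1;                      /* ring full, do not wrap */
    r->slot[r->head] = m;
    r->head = (r->head + 1) % RING_SIZE;
    return 0;
}

/*
 * Walk every outstanding handle and sanity-check the mbuf behind it.
 * Returns the index of the first suspicious slot, or -1 if all look sane.
 */
static int ring_check(const struct supply_ring *r)
{
    unsigned i;

    for (i = r->tail; i != r->head; i = (i + 1) % RING_SIZE) {
        const struct fake_mbuf *m = r->slot[i];

        if (m == NULL || m->m_type != MT_DATA ||
            m->ext_buf == NULL || m->ext_size == 0)
            return (int)i;
    }
    return -1;
}
```

Called from a timer, ring_check() would flag a slot whose mbuf has gone to
MT_FREE or been zeroed before the receive routine ever touches it, which
should narrow down when the corruption happens.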

However, because the supply protocol used in the driver has not changed
much from the original distribution, which has been running on other
architectures for quite some time, I also suspected that the problem could
come from the kernel mbuf handling code misbehaving under heavy load.

Anyway, today I found that there is new reassembly code in ip_input.c in
release 2.2.5, and because the problem only showed up when there was
reassembly at the IP layer, I decided to give it a try and upgraded the
Pentium 166 machine to this release. It turns out that the driver does not
crash anymore under 2.2.5. I only did this a few hours ago, but the
various tests I have run so far did not crash the machine. I rebooted the
machine several times and ran around 10 UDP stream tests each time
without any problem.

So it seems the old reassembly code was the key to the problem. However, it
is still not clear to me whether there is a bug in the driver that shows up
only under heavy load with the old reassembly code, or whether the old code
was responsible for trashing the mbufs. I say this because I don't
remember seeing anyone on hackers report problems with the reassembly
code. Right now I am going to upgrade all our machines to 2.2.5 and see
whether any problems appear in the next few weeks.


Antonio Alves