From owner-freebsd-current@FreeBSD.ORG  Sat Jul 10 22:25:57 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9D7D616A4CE; Sat, 10 Jul 2004 22:25:57 +0000 (GMT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 2297A43D3F; Sat, 10 Jul 2004 22:25:57 +0000 (GMT)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.11/8.12.11) with ESMTP id i6AMPmhw015583;
	Sat, 10 Jul 2004 15:25:53 -0700 (PDT)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200407102225.i6AMPmhw015583@gw.catspoiler.org>
Date: Sat, 10 Jul 2004 15:25:48 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: dl@leo.org
In-Reply-To: <20040710105017.GA61243@atrbg11.informatik.tu-muenchen.de>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: rwatson@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: panic: m_copym, length > size of mbuf chain

On 10 Jul, Daniel Lang wrote:
> Hi Robert,
> 
> Robert Watson wrote on Wed, Jul 07, 2004 at 12:24:59PM -0400:
> [..]
>> Just to try ruling out possibilities -- have you run an extensive set of
>> hardware diagnostics?  Most server class hardware ships with a decent
>> diagnostics disk, and I'm sure we can find some for you in the event your
>> hardware didn't come with some.  While it's quite possibly a software
>> problem, tracking hardware problems using software symptoms constitutes
>> undesirable pain and so it wouldn't hurt to give that a spin.  I remember
>> seeing your earlier e-mails about running with WITNESS increasing the
>> chances of pain -- this could be a bug in WITNESS as you suggest, or it
>> could be that WITNESS increases the opportunities for a variety of locking
>> related races by increasing the cost of lock/unlock operations.
> [..]
> 
> So I come back to the issue. As I already wrote, I guess I can
> rule out hardware problems now. I did a very thorough test with
> the Dell diagnosis utilities which showed no problems.
> 
> Also, after John's patch I did not see any WITNESS related
> problems (so far) again. But I had the m_copy panic again
> (see subject). This time I did file a PR and did some more detailed
> gdb analysis. It is all documented at:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
> 
> I am puzzled, because the stack frame on entering m_copym has
> 0x0 as first argument (m), however in the previous frame
> when m_copy() is called, the struct mbuf* argument is valid.

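m_copym() overwrites its first and third arguments (m and len) as it
walks the mbuf chain, so the m=0x0 shown in that frame is the value at
the time of the panic, not the value that was passed in.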

struct mbuf *
m_copym(struct mbuf *m, int off0, int len, int wait)
{
[snip]
	while (off > 0) {
		KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
[snip]
	while (len > 0) {
		if (m == NULL) {
			KASSERT(len == M_COPYALL, 
			    ("m_copym, length > size of mbuf chain"));
			break;
		}
[snip]
		if (len != M_COPYALL)
			len -= n->m_len;
		off = 0;
		m = m->m_next;
		np = &n->m_next;
	}
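
As a rough illustration of why the frame shows m=0x0, here is a
user-space sketch of the bookkeeping in those two loops.  It is not the
kernel code: struct fake_mbuf, struct walk_result and walk_chain() are
hypothetical names, no data is copied, M_COPYALL is not modeled, the
walk simply stops where the kernel would trip the KASSERT, and the
per-step byte count is an assumption about the snipped part of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf; only the fields the walk needs. */
struct fake_mbuf {
	struct fake_mbuf *m_next;
	int m_len;
};

/* What the local copies of m and len look like once the walk stops. */
struct walk_result {
	const struct fake_mbuf *m;
	int len;
};

static struct walk_result
walk_chain(const struct fake_mbuf *m, int off, int len)
{
	struct walk_result r;
	int n;

	assert(off >= 0 && len >= 0);

	/* Skip whole mbufs until off falls inside the current one. */
	while (off > 0 && m != NULL) {
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}

	/* Consume len bytes; the kernel would hit the KASSERT when m runs out. */
	while (len > 0 && m != NULL) {
		/* Assumption: each step takes min(len, m_len - off) bytes. */
		n = m->m_len - off;
		if (n > len)
			n = len;
		len -= n;
		off = 0;
		m = m->m_next;
	}

	r.m = m;
	r.len = len;
	return (r);
}
```

With a made-up chain of two mbufs holding 100 and 50 bytes,
walk_chain(&first, 120, 200) stops with m == NULL and len == 170: the
pointer has been walked off the end of the chain and len has already
been reduced by the 30 bytes that were available past the offset.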


The interesting bits would seem to be in stack frame 11, tcp_output().
Check the arguments being passed to m_copym():

#10 0xc0551805 in m_copym (m=0x0, off0=737, len=1222, wait=1)
    at /usr/src/sys/kern/uipc_mbuf.c:380

We don't know the original value of len that was passed to m_copym(),
because it could have been decremented if m_copym() iterated a few times
before it panicked, but it was at least 1222.  If we add that to off0,
then the length of the original mbuf chain passed to m_copym() should
have been at least 1959.

Now take a look at the call to m_copy():

#11 0xc059ed5a in tcp_output (tp=0xc3f50000)
    at /usr/src/sys/netinet/tcp_output.c:748
748                             m->m_next = m_copy(so->so_snd.sb_mb, off, (int) len);

It would be interesting to see the value of len in stack frame 11, so
that we know the original value passed to m_copym().

The contents of *so are also interesting.

(kgdb) p *so
[snip]
    sb_cc = 975, sb_hiwat = 33580, sb_mbcnt = 1536, sb_mbmax = 262144,

I'm not sure if sb_cc or sb_mbcnt is the important member, but I think
it is sb_cc.  I think this means that the mbuf chain contains 975 bytes
of data but tcp_output() is telling m_copy() to copy (at least) 1222
bytes of data starting at offset 737.

It looks to me like tcp_output() is passing a bogus len value to
m_copy().
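
The mismatch can be put in numbers with a small helper.  This is only a
sketch: copy_overrun() is a hypothetical name, not a kernel function,
and it assumes that sb_cc really is the number of data bytes in the
send buffer.

```c
/*
 * Return how many bytes the range [off, off + len) extends past a
 * buffer holding cc bytes of data, or 0 if the range fits.
 */
static long
copy_overrun(long cc, long off, long len)
{
	long end = off + len;

	return (end > cc ? end - cc : 0);
}
```

With the values from the dump, copy_overrun(975, 737, 1222) returns 984:
the request ends at byte 1959 while the buffer holds only 975, and since
1222 is a lower bound on the original len, the real overrun may be
larger.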