From owner-freebsd-current@FreeBSD.ORG Sat Jul 10 22:25:57 2004
Message-Id: <200407102225.i6AMPmhw015583@gw.catspoiler.org>
Date: Sat, 10 Jul 2004 15:25:48 -0700 (PDT)
From: Don Lewis
To: dl@leo.org
In-Reply-To: <20040710105017.GA61243@atrbg11.informatik.tu-muenchen.de>
cc: rwatson@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: panic: m_copym, length > size of mbuf chain

On 10 Jul, Daniel Lang wrote:
> Hi Robert,
>
> Robert Watson wrote on Wed, Jul 07, 2004 at 12:24:59PM -0400:
> [..]
>> Just to try ruling out possibilities -- have you run an extensive set of
>> hardware diagnostics?  Most server class hardware ships with a decent
>> diagnostics disk, and I'm sure we can find some for you in the event your
>> hardware didn't come with some.  While it's quite possibly a software
>> problem, tracking hardware problems using software symptoms constitutes
>> undesirable pain and so it wouldn't hurt to give that a spin.
>> I remember seeing your earlier e-mails about running with WITNESS
>> increasing the chances of pain -- this could be a bug in WITNESS as you
>> suggest, or it could be that WITNESS increases the opportunities for a
>> variety of locking related races by increasing the cost of lock/unlock
>> operations.
> [..]
>
> So I come back to the issue. As I already wrote, I guess I can
> rule out hardware problems now. I did a very thorough test with
> the Dell diagnosis utilities which showed no problems.
>
> Also, after John's patch I did not see any WITNESS related
> problems (so far) again. But I had the m_copy panic again
> (see subject). This time I did file a PR and did some more detailed
> gdb analysis. It is all documented at:
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
>
> I am puzzled, because the stack frame on entering m_copym has
> 0x0 as first argument (m), however in the previous frame
> when m_copy() is called, the struct mbuf* argument is valid.

m_copym() overwrites its first and third arguments as it walks the mbuf
chain.

struct mbuf *
m_copym(struct mbuf *m, int off0, int len, int wait)
{
[snip]
	while (off > 0) {
		KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}
[snip]
	while (len > 0) {
		if (m == NULL) {
			KASSERT(len == M_COPYALL,
			    ("m_copym, length > size of mbuf chain"));
			break;
		}
[snip]
		if (len != M_COPYALL)
			len -= n->m_len;
		off = 0;
		m = m->m_next;
		np = &n->m_next;
	}

The interesting bits would seem to be in stack frame 11, tcp_output().
Check the arguments being passed to m_copym():

#10 0xc0551805 in m_copym (m=0x0, off0=737, len=1222, wait=1)
    at /usr/src/sys/kern/uipc_mbuf.c:380

We don't know the original value of len that was passed to m_copym(),
because it could have been decremented if m_copym() iterated a few times
before it panicked, but it was at least 1222.
If we add that to off0, then the length of the original mbuf chain passed
to m_copym() should have been at least 1959 (737 + 1222).

Now take a look at the call to m_copy():

#11 0xc059ed5a in tcp_output (tp=0xc3f50000)
    at /usr/src/sys/netinet/tcp_output.c:748
748             m->m_next = m_copy(so->so_snd.sb_mb, off, (int) len);

It would be interesting to see the value of len in stack frame 11, so
that we know the original value passed to m_copym().  Also the contents
of *so are interesting.

(kgdb) p *so
[snip]
    sb_cc = 975,
    sb_hiwat = 33580,
    sb_mbcnt = 1536,
    sb_mbmax = 262144,

I'm not sure if sb_cc or sb_mbcnt is the important member, but I think
it is sb_cc.  I think this means that the mbuf chain contains 975 bytes
of data but tcp_output() is telling m_copy() to copy (at least) 1222
bytes of data starting at offset 737.

It looks to me like tcp_output() is passing a bogus len value to
m_copy().
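
To make the arithmetic concrete, here is a rough userland sketch of the
same chain walk.  It is not kernel code: struct fake_mbuf, chain_len()
and copy_shortfall() are hypothetical names invented for illustration,
and only the two-loop structure is taken from the m_copym() excerpt
quoted earlier.  It assumes, as guessed above, that sb_cc is the number
of data bytes in the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only the fields the walk needs. */
struct fake_mbuf {
	struct fake_mbuf *m_next;
	int m_len;
};

/* Total data bytes in the chain (assumed to be what sb_cc reports). */
static int
chain_len(const struct fake_mbuf *m)
{
	int total = 0;

	for (; m != NULL; m = m->m_next)
		total += m->m_len;
	return (total);
}

/*
 * Walk the chain the way the two loops in m_copym() do, without copying.
 * Returns 0 if the range [off, off + len) fits in the chain, otherwise
 * the number of requested bytes that lie past the end of the chain.
 */
static int
copy_shortfall(const struct fake_mbuf *m, int off, int len)
{
	assert(off >= 0 && len >= 0);

	/* First loop: skip whole mbufs until off falls inside one. */
	while (off > 0) {
		if (m == NULL)
			return (off + len);	/* offset is already past the end */
		if (off < m->m_len)
			break;
		off -= m->m_len;
		m = m->m_next;
	}

	/* Second loop: consume len, one mbuf at a time. */
	while (len > 0) {
		int n;

		if (m == NULL)
			return (len);	/* where the kernel KASSERT would fire */
		n = m->m_len - off;
		if (n > len)
			n = len;
		len -= n;
		off = 0;
		m = m->m_next;
	}
	return (0);
}
```

With a made-up chain of 400 + 400 + 175 = 975 bytes (the real split
between mbufs is not known from the trace), copy_shortfall(chain, 737,
1222) comes out 984 bytes short, which is just 1959 - 975.  Any len
larger than 975 - 737 = 238 would run off the end of such a chain.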