From owner-freebsd-current@FreeBSD.ORG Sat Jul 10 23:24:57 2004
Message-Id: <200407102324.i6ANOlEs015698@gw.catspoiler.org>
Date: Sat, 10 Jul 2004 16:24:47 -0700 (PDT)
From: Don Lewis
To: rwatson@FreeBSD.org
cc: ps@FreeBSD.org
cc: current@FreeBSD.org
cc: dl@leo.org
Subject: Re: panic: m_copym, length > size of mbuf chain
List-Id: Discussions about the use of FreeBSD-current

On 10 Jul, Robert Watson wrote:
>
> On Sat, 10 Jul 2004, Daniel Lang wrote:
>
>> So I come back to the issue. As I already wrote, I guess I can rule out
>> hardware problems now. I did a very thorough test with the Dell
>> diagnosis utilities which showed no problems.
>
> Thanks!
>
>> Also, after John's patch I did not see any WITNESS related problems (so
>> far) again. But I had the m_copy panic again (see subject). This time I
>> did file a PR and did some more detailed gdb analysis.
>> It is all documented at:
>>
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
>>
>> I am puzzled, because the stack frame on entering m_copym has 0x0 as
>> first argument (m), however in the previous frame when m_copy() is
>> called, the struct mbuf* argument is valid.
>>
>> Ok, I just realized that there is a difference: m_copy() and m_copym()
>> are apparently different functions. Is this a macro/#define discrepancy?
>> It seems that m_copym() is the function which is called in this
>> line of code.
>>
>> Ah, I found it:
>>
>> sys/mbuf.h:#define m_copy(m, o, l) m_copym((m), (o), (l), M_DONTWAIT)
>>
>> so, the puzzle remains, since the arguments passed are kept, except that
>> the M_DONTWAIT flag is added.
>>
>> Is this a trashed stack?
>
> Possibly, but notice that the m_copym() function modifies its copy of 'm'
> in the stack as part of its work -- it uses 'm' to iterate the mbuf chain
> passed in in order to move to the necessary starting offset for the copy.
> Note that the requested offset ('off0') is 737, and the requested 'len' is
> at least 1222, so the loop starting at line 369 will walk until it either
> gets far enough or the "offset > size" assertion triggers:
>
>         while (off > 0) {
>                 KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
>                 if (off < m->m_len)
>                         break;
>                 off -= m->m_len;
>                 m = m->m_next;
>         }
>
> Since that assertion didn't trigger, we can assume m_copym() successfully
> walked at least 'off0' (737) bytes. The problem appears to be that it ran
> out of mbufs in which to find data to copy, as it hit the end of the chain
> (m == NULL):
>
>         while (len > 0) {
>                 if (m == NULL) {
>                         KASSERT(len == M_COPYALL,
>                             ("m_copym, length > size of mbuf chain"));
>                         break;
>                 }
>
> So the initial conclusion is that the caller requested that more data be
> copied from the chain than is actually present in the chain. This
> suggests a bug in socket buffer management or the TCP code.
> It's interesting to note that the socket buffer believes it contains less
> than the requested length -- 'so_snd.sb_mbcnt' is 1536, which is arguably
> less than 737 + 1222 (although we don't know, I think, if it's iterated or
> not and therefore decreased the value of 'len'). Could you print the value
> of 'top' in the m_copym() stack? That will tell us if it's on the first
> mbuf or not.
>
> It sounds like the socket buffer state may be inconsistent with the TCP
> PCB state, or that the expectations in tcp_offset() are wrong. I've CC'd
> Paul because he's had his hands in the new SACK code that was merged, and
> it has its hands in that bit of the output code. Here are some things you
> might want to try:
>
> (1) Try running with TCP SACK disabled. Set the
>     'net.inet.tcp.sack.enable' sysctl to 0 to try this.
>
> (2) Try adding some assertions just before the copy to m_copy() in
>     tcp_output(). I'd suggest something like the following:

I'm very suspicious of the SACK code. In the non-SACK case, len gets
set here:

        if (!sack_rxmit)
                len = ((long)ulmin(so->so_snd.sb_cc, sendwin) - off);

but when the system panics len+off > sb_cc. It would be interesting to
look at *tp and *p in the tcp_output stack frame. If I had to guess,
I'd say that either tp->snd_recover-tp->snd_una or p->end-tp->snd_una
is greater than so->so_snd.sb_cc.
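
The bound under discussion (off + len must not exceed the bytes actually
present in the chain) can be tried out in userland. What follows is a
minimal sketch, not kernel code: struct toy_mbuf, chain_len() and
toy_copym_walk() are hypothetical names invented for illustration, and
the M_COPYALL case is left out. It only mirrors the two loops of
m_copym() quoted above and reports which of the two checks would fire.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only the fields the walk needs. */
struct toy_mbuf {
        struct toy_mbuf *next;  /* plays the role of m_next */
        int len;                /* plays the role of m_len */
};

enum walk_result {
        WALK_OK,                /* off and len both fit in the chain */
        WALK_OFFSET_PAST_END,   /* "offset > size of mbuf chain" */
        WALK_LENGTH_PAST_END    /* "length > size of mbuf chain" */
};

/* Total number of data bytes held by the chain. */
static int
chain_len(const struct toy_mbuf *m)
{
        int total = 0;

        for (; m != NULL; m = m->next)
                total += m->len;
        return (total);
}

/*
 * Walk the chain the way the quoted m_copym() loops do, without copying
 * anything, and report where the walk would run off the end.
 */
static enum walk_result
toy_copym_walk(const struct toy_mbuf *m, int off, int len)
{
        assert(off >= 0 && len >= 0);

        /* First loop: skip whole mbufs until 'off' falls inside one. */
        while (off > 0) {
                if (m == NULL)
                        return (WALK_OFFSET_PAST_END);
                if (off < m->len)
                        break;
                off -= m->len;
                m = m->next;
        }

        /* Second loop: consume 'len' bytes, at most one mbuf per pass. */
        while (len > 0) {
                int avail;

                if (m == NULL)
                        return (WALK_LENGTH_PAST_END);
                avail = m->len - off;
                len -= (len < avail) ? len : avail;
                off = 0;
                m = m->next;
        }
        return (WALK_OK);
}
```

With a two-mbuf chain of 1024 and 512 bytes (1536 in total, a figure
chosen only to echo the sb_mbcnt value mentioned in the thread),
toy_copym_walk(chain, 737, 1222) returns WALK_LENGTH_PAST_END, while a
length of 799 fits exactly and returns WALK_OK. The same comparison,
off + len against chain_len(), is the kind of check one could place
just before the copy.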