Date: Tue, 09 Jan 2007 09:37:05 -0500
From: Sven Willenberger <sven@dmv.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: stable@FreeBSD.org, freebsd-amd64@FreeBSD.org
Subject: Re: Panic in 6.2-PRERELEASE with bge on amd64
Message-ID: <1168353425.29047.8.camel@lanshark.dmv.com>
In-Reply-To: <20070109124826.M79616@delplex.bde.org>
References: <1168211205.22629.6.camel@lanshark.dmv.com> <20070108154433.C75042@delplex.bde.org> <1168271935.23549.10.camel@lanshark.dmv.com> <20070109124826.M79616@delplex.bde.org>
On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote:
> On Mon, 8 Jan 2007, Sven Willenberger wrote:
>
> > On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
> >> On Sun, 7 Jan 2007, Sven Willenberger wrote:
>
> >>> The short and dirty of the dump:
> >>> ...
> >>> --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
> >>> bge_rxeof() at bge_rxeof+0x3b7
> >>
> >> What is the instruction here?
> >
> > I will do my best to ferret out the information you need. For the
> > bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:
> >
> > 0xffffffff801d5f17 <bge_rxeof+951>: mov %r15,0x28(%r14)
> > ...
> >> Looks like a null pointer panic anyway. I guess the instruction is
> >> movl to/from 0x28(%reg) where %reg is a null pointer.
> >>
> >
> > from the above lines, apparently %r14 is null then.
>
> Yes. It's a bit surprising that the access is a write.
>
> >>> ...
> >>> #8 0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
> >>
> >> What is the statement here? It presumably follows a null pointer and only
> >> the expression for the pointer is interesting. xsc is already null but
> >> that is probably a bug in gdb, or the result of excessive optimization.
> >> Compiling kernels with -O2 has little effect except to break debugging.
> >
> > the block of code from if_bge.c:
> >
> > 2705         if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
> > 2706                 /* Check RX return ring producer/consumer. */
> > 2707                 bge_rxeof(sc);
> > 2708
> > 2709                 /* Check TX ring producer/consumer. */
> > 2710                 bge_txeof(sc);
> > 2711         }
>
> Oops. I should have asked for the statement in bge_rxeof().

#7 0xffffffff801d5f17 in bge_rxeof (sc=0xffffffff8836b000) at /usr/src/sys/dev/bge/if_bge.c:2528
2528         m->m_pkthdr.len = m->m_len = cur_rx->bge_len - ETHER_CRC_LEN;

(where m is defined as:
2449         struct mbuf *m = NULL;
)

>
> > By default -O2 is passed to CC (I don't use any custom make flags other
> > than and only define CPUTYPE in my /etc/make.conf).
>
> -O2 is unfortunately the default for COPTFLAGS for most arches in
> sys/conf/kern.pre.mk. All of my machines and most FreeBSD cluster
> machines override this default in /etc/make.conf.
>
> With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
> so your environment must be a little different to get even the above
> info. I think gdb can show the correct line numbers but not the call
> frames (since there is no call). ddb and the kernel stack trace can
> only show the call frames for actual calls.
>
> With -O1, I couldn't find any instruction similar to the mov to the
> null pointer + 28. 28 is a popular offset in mbufs

If you have a suggestion for an /etc/make.conf line, I can recompile the
kernel accordingly, assuming it still panics or locks up after the change
of interface noted below.

>
> > The short of it is that this interface sees pretty much non-stop traffic
> > as this is a mailserver (final destination) and is constantly being
> > delivered to (direct disk access) and mail being retrieved (remote
> > machine(s) with nfs mounted mail spools). If a momentary down of the
> > interface is enough to completely panic the driver and then the kernel,
> > this hardly seems "robust" if, in fact, this is what is happening. So
> > the question arises as to what would be causing the down/up of the
> > interface; I could start looking at the cable, the switch it's connected
> > to and ... any other ideas? (I don't have watchdog enabled or anything
> > like that, for example).
>
> I don't think down/up can occur in normal operation, since it takes ioctls
> or a watchdog timeout to do it. Maybe some ioctls other than a full
> down/up can cause problems... bge_init() is called for the following
> ioctls:
> - mtu changes
> - some near down/up (possibly only these)
> Suspend/resume and of course detach/attach do much the same things as
> down/up.
>
> BTW, I added some sysctls and found it annoying to have to do down/up
> to make the sysctls take effect. Sysctls in several other NIC drivers
> require the same, since doing a full reinitialization is easiest.
> Since I am tuning using sysctls, I got used to doing down/up too much.
>
> Similarly for the mtu ioctl. I think a full reinitialization is used
> for mtu changes mainly in cases the change switches on/off support for
> jumbo buffers. Then there is a lot of buffer reallocation to be
> done, and interfaces have to be stopped to ensure that the buffers
> being deallocated are not in use, etc.
>
> Bruce

As this was connected to a gigE switch with mtu left at 1500, I suppose
it is possible that some mtu discovery/change may have been happening on
the switch, but that seems a bit out in left field. For now I am using
the fxp interface connected to the same switch to see if the issue
continues (the change of interface was driven by a hard lockup yesterday
where I could not even type anything on the term).

Sven
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1168353425.29047.8.camel>