From: Chris Cowart
To: George Neville-Neil
Cc: freebsd-net@freebsd.org
Date: Mon, 7 Sep 2009 12:10:01 -0700
Subject: Re: Crash in ether_input
Organization: RSSP-IT, UC Berkeley

George Neville-Neil wrote:
> On Sep 4, 2009, at 18:31 , Chris Cowart wrote:
> > Starting about a week ago, our primary webserver (then running FreeBSD
> > 7.0) began crashing several times a day, typically during our
> > higher-load times of day.
> > We have since upgraded to 7.1p7, but continued to see the frequent
> > crashes.
> >
> > We are running an apache22 webserver with a lot of perl, logging via
> > syslog-ng, and using IPSec in transport mode between the webserver and
> > both the fileserver and logserver. Everything is IPv4.
> >
> > From uname:
> >
> > | FreeBSD mug.rescomp.berkeley.edu 7.1-RELEASE-p7 FreeBSD 7.1-RELEASE-p7
> > | #0: Wed Sep 2 17:56:59 PDT 2009
> > | root@mug.rescomp.berkeley.edu:/usr/obj/usr/src/sys/GENERIC amd64
> >
> > Some information that appears typical across many crashes:
> >
> > | Unread portion of the kernel message buffer:
> > |
> > | Fatal trap 27: stack fault while in kernel mode
> > | cpuid = 0; apic id = 00
> > | instruction pointer = 0x8:0xffffffff80559fb4
> > | stack pointer = 0x10:0xffffffffae39faf0
> > | frame pointer = 0x10:0xf85ecc37f9239402
> > | code segment = base 0x0, limit 0xfffff, type 0x1b
> > | = DPL 0, pres 1, long 1, def32 0, gran 1
> > | processor eflags = interrupt enabled, resume, IOPL = 0
> > | current process = 27 (em0 taskq)
> > | trap number = 27
> > | panic: stack fault
> > | cpuid = 0
> > | Uptime: 43m44s
> > | Physical memory: 4082 MB
> > | Dumping 361 MB: 346 330 314 298 282 266 250 234 218 202em0: watchdog timeout -- resetting
> >
> > | (kgdb) bt
> > | #0 doadump () at pcpu.h:195
> > | #1 0x0000000000000004 in ?? ()
> > | #2 0xffffffff804bd9b9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
> > | #3 0xffffffff804bddc2 in panic (fmt=0x104 > bounds>) at /usr/src/sys/kern/kern_shutdown.c:574
> > | #4 0xffffffff807b9f23 in trap_fatal (frame=0xffffff00012d66e0, eva=Variable "eva" is not available.
> > | ) at /usr/src/sys/amd64/amd64/trap.c:764
> > | #5 0xffffffff807baa75 in trap (frame=0xffffffffae39fa40) at /usr/src/sys/amd64/amd64/trap.c:565
> > | #6 0xffffffff807a042e in calltrap () at /usr/src/sys/amd64/amd64/exception.S:209
> > | #7 0xffffffff80559fb4 in ether_input (ifp=0xffffff00012bf000, m=0xffffff0003576000) at /usr/src/sys/net/if_ethersubr.c:545
> > | #8 0xffffffff802bd645 in em_rxeof (adapter=0xffffffff80e4c000, count=99) at /usr/src/sys/dev/e1000/if_em.c:4539
> > | #9 0xffffffff802be55e in em_handle_rxtx (context=Variable "context" is not available.
> > | ) at /usr/src/sys/dev/e1000/if_em.c:1702
> > | #10 0xffffffff804f2afd in taskqueue_run (queue=0xffffff00012c8c80) at /usr/src/sys/kern/subr_taskqueue.c:282
> > | #11 0xffffffff804f2da6 in taskqueue_thread_loop (arg=Variable "arg" is not available.
> > | ) at /usr/src/sys/kern/subr_taskqueue.c:401
> > | #12 0xffffffff8049b2f3 in fork_exit (callout=0xffffffff804f2d40 , arg=0xffffffff80e50588, frame=0xffffffffae39fc80) at /usr/src/sys/kern/kern_fork.c:804
> > | #13 0xffffffff807a07fe in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:455
> > | #14 0x0000000000000000 in ?? ()
> > | #15 0x0000000000000000 in ?? ()
> > | #16 0x0000000000000001 in ?? ()
> > [...]
> >
> > | (kgdb) source debug/gdb6
> > | (kgdb) frame 7
> > | #7 0xffffffff80559fb4 in ether_input (ifp=0xffffff00012bf000, m=0xffffff0003576000) at /usr/src/sys/net/if_ethersubr.c:545
> > | 545 eh = mtod(m, struct ether_header *);
> > | (kgdb) info locals
> > | eh = (struct ether_header *) 0xf85ecc37f9239402
> > | (kgdb) info args
> > | ifp = (struct ifnet *) 0xffffff00012bf000
> > | m = (struct mbuf *) 0xffffff0003576000
> > | (kgdb) mbuf m
> > | 0xffffff0003576000: 125 bytes ext 0xaf29dcb45d53e701 packet: 125 bytes received via em0
> > | 0xbb763383e10eda22Cannot access memory at address 0xbb763383e10eda3a
> > | (kgdb)
> >
> > If anyone can provide some points on other things I can try to get
> > useful data out of these core dumps, I'm open to it.
> >
> > We did decide to stop mounting NFS, upgrade to syslog-ng3 (which
> > supports TLS), and revert the webserver back to a GENERIC kernel.
> > Since booting the GENERIC kernel, the system has been up for nearly
> > 2 days.
> >
> > Right now, we're logging via TLS to a temporary/testing logserver.
> > That logserver is one of our default builds with IPSec. It is
> > configured to forward logs over udp/syslog (via IPSec in transport
> > mode) to our primary logserver.
> >
> > Within hours of beginning to pass the production webserver's logs
> > through this temporary logserver (and thus having its syslog-ng
> > forward to the primary logserver), the temporary logserver began
> > exhibiting the same behavior that the webserver was previously
> > showing.
> >
> > We're totally grasping at straws here, but it's looking like some kind
> > of bug related to IPSec. Maybe related to long messages? High volume
> > of messages?
> >
> > We would love to get this hammered out, so please let me know if
> > there's any debugging we can perform or patches we can try.
> >
>
> Hi Chris,
>
> Sorry to hear y'all are having problems.
> You mention that you switched to GENERIC? Can you send out a copy of
> your modified kernel config? I'm a bit confused because the kernel
> panic you do show looks like it's from a GENERIC kernel.

Sorry, the uname was from a stable boot after we had switched to GENERIC.
The config from the crashing kernel:

| include GENERIC
|
| ident RCBSD_REL7
|
| # Enabling IPSec (see SysAdmin.IPSec in TWiki)
| options IPSEC
| options IPSEC_FILTERTUNNEL
| device crypto
|
| # Enabling quota support
| options QUOTA

> Are you setting any of the em device's kernel tunables?

Nope. Some more details on the hardware:

| [ccowart@mug /usr/local/rescomp/etc/online-helpdesk]$ dmesg | grep em0
| em0: port 0xece0-0xecff mem 0xfc3e0000-0xfc3fffff,0xfc3c0000-0xfc3dffff irq 16 at device 0.0 on pci14
| em0: Using MSI interrupt
| em0: [FILTER]
| em0: Ethernet address: 00:15:17:7a:b5:e0
| em0: link state changed to UP

> How is your IPSEC stuff built in to the kernel and are you setting any
> kernel variables for it?

See the kernel config above. We are not tuning via sysctls or the
loader.conf. All of that should be stock for IPSEC.

The host is configured with 2 peers for IPSec communication. We use
racoon and x509 certificates.

> My best guess from the stack trace is a buffer that is not being
> returned properly somewhere between the device and the IPsec
> subsystem.

One of my coworkers has been able to reproduce the crash by syslogging
really long lines. I'm planning to spend some time today to try to
create a smaller example case on a couple of vms.
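For anyone who wants to try a similar test, here is a minimal sketch
of a long-line generator. It is only an illustration under stated
assumptions, not the procedure that actually triggered the crash: it
assumes plain udp/syslog on port 514, and the function names, the tag,
the message length, and the datagram count are all made up for the
example. IPSec is not visible to the script; whether the datagrams
travel in transport mode depends entirely on the security policy
configured between the two hosts.

```python
import socket


def build_syslog_line(length, facility=1, severity=6, tag="longline-test"):
    """Return a BSD-syslog-style datagram with a body of `length` bytes.

    The priority value is facility * 8 + severity (1 * 8 + 6 = 14 here,
    i.e. user.info). The tag and the filler byte are arbitrary.
    """
    pri = facility * 8 + severity
    header = "<%d>%s: " % (pri, tag)
    return (header + "A" * length).encode("ascii")


def send_long_lines(host, port=514, count=1000, length=4000):
    """Send `count` oversized syslog datagrams to host:port over UDP.

    Returns the size in bytes of each datagram sent. The defaults are
    guesses; the thread does not say how long the lines need to be.
    """
    payload = build_syslog_line(length)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for _ in range(count):
            sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return len(payload)
```

Calling send_long_lines() with the address of a test logserver (not a
production one) and a few different length values, above and below the
interface MTU, would be one way to check whether message size alone is
enough to provoke the panic.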
-- 
Chris Cowart
Network Technical Lead
Network & Infrastructure Services, RSSP-IT
UC Berkeley