Date: Sun, 20 Jun 2010 15:15:54 +0200 From: Fabian Keil <freebsd-listen@fabiankeil.de> To: Lawrence Stewart <lstewart@freebsd.org> Cc: freebsd-current@freebsd.org Subject: Re: [CFT] SIFTR - Statistical Information For TCP Research: Uncle Lawrence needs YOU! Message-ID: <20100620151554.58e486db@r500.local> In-Reply-To: <4C1E0E14.3090506@freebsd.org> References: <4C1492D0.6020704@freebsd.org> <4C1C3922.2050102@freebsd.org> <20100619195823.53a7baaa@r500.local> <4C1DED16.8020209@freebsd.org> <20100620131544.495ddecd@r500.local> <4C1E019F.6060802@freebsd.org> <20100620142841.4803dac3@r500.local> <4C1E0E14.3090506@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
--Sig_/KOfakMhdKJkZMtuiyHRGJ_N Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Lawrence Stewart <lstewart@freebsd.org> wrote: > On 06/20/10 22:28, Fabian Keil wrote: > > Lawrence Stewart<lstewart@freebsd.org> wrote: > > > >> On 06/20/10 21:15, Fabian Keil wrote: > >>> Lawrence Stewart<lstewart@freebsd.org> wrote: > >>> > >>>> On 06/20/10 03:58, Fabian Keil wrote: > >>>>> Lawrence Stewart<lstewart@freebsd.org> wrote: > >>>>> > >>>>>> On 06/13/10 18:12, Lawrence Stewart wrote: > >>>>> > >>>>>>> The time has come to solicit some external testing for my SIFTR t= ool. > >>>>>>> I'm hoping to commit it within a week or so unless problems are d= iscovered. > >>>>> > >>>>>>> I'm interested in all feedback and reports of success/failure, al= ong > >>>>>>> with details of the architecture tested and number of CPUs if you= would > >>>>>>> be so kind. > >>>>> > >>>>> I got the following hand-transcribed panic maybe a second after > >>>>> sysctl net.inet.siftr.enabled=3D1 > >>>>> > >>>>> Fatal trap 12: page fault while in kernel mode > >>>>> cpuid =3D 1; apic id =3D 01 > >>>>> [...] > >>>>> current process =3D 12 (swi4: clock) > >>>>> [ thread pid 12 tid 100006 ] > >>>>> Stopped at siftr_chkpkt+0xd0: addq $0x1,0x8(%r14) > >>>>> db> where > >>>>> Tracing pid 12 tid 100006 td 0xffffff00034037e0 > >>>>> siftr_chkpt() at siftr_chkpkt+0xd0 > >>>>> pfil_run_hooks() at pfil_run_hooks+0xb4 > >>>>> ip_output() at ip_output+0x382 > >>>>> tcp_output() tcp_output+0xa41 > >>>>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251 > >>>>> softclock() at softclock+0x291 > >>>>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66 > >>>>> ithread_loop at ithread_loop+0x8e > >>>>> fork_exit() at fork_exit+0x112 > >>>>> fork_trampoline() at fork_trampoline+0xe > >>>>> --- trap 0, rip =3D 0, rsp =3D 0xffffff800003ad30, rbp =3D 0 --- > >>>> > >>>> So I've tracked down the line of code where the page fault is occurr= ing: > >>>> > >>>> if (dir =3D=3D PFIL_IN) > >>>> ss->n_in++; > >>>> else > >>>> ss->n_out++; > >>>> > >>>> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats > >>>> per-cpu and is initialised at the start of the function like so: > >>>> > >>>> ss =3D DPCPU_PTR(ss); > >>>> > >>>> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on = your > >>>> machine. I know very little about the inner workings of the DPCPU_* > >>>> macros, but I'm pretty sure the way I use them in SIFTR is correct o= r at > >>>> least as intended. > >>> > >>> siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing > >>> it itself. I think if ss was NULL, the panic should already occur in > >>> siftr_chkreinject(). > >> > >> Yes but siftr_chkreinject() only dereferences ss in the exceptional ca= se > >> of a malloc failure or duplicate pkt. It's unlikely either case happens > >> for you and so wouldn't trigger the panic. > >> > >>> To be sure I added: > >>> > >>> diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c > >>> index 8bc3498..b9fdfe4 100644 > >>> --- a/sys/netinet/siftr.c > >>> +++ b/sys/netinet/siftr.c > >>> @@ -788,6 +788,16 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct = ifnet *ifp, int dir, > >>> if (siftr_chkreinject(*m, dir, ss)) > >>> goto ret; > >>> > >>> + if (ss =3D=3D NULL) { > >>> + printf("ss is NULL"); > >>> + ss =3D DPCPU_PTR(ss); > >>> + if (ss =3D=3D NULL) { > >>> + printf("ss is still NULL"); > >>> + goto ret; > >>> + } > >>> + } > >>> + > >>> + > >>> if (dir =3D=3D PFIL_IN) > >>> ss->n_in++; > >>> else > >>> > >>> which doesn't seem to affect the problem. > >> > >> As in it still panics and the "ss is NULL" message is not printed? I > >> would have expected to at least see "ss is NULL" printed if my > >> hypothesis was correct... hmm. > > > > Yes, it still panics, but no message is printed. >=20 > It was just pointed out to me that ss doesn't have to be NULL in order=20 > to cause the page fault (duh). It could also just be a garbage ptr which= =20 > is why your print statement isn't firing. >=20 > Can you trigger the panic again and look for some information along the=20 > lines of "fault virtual address =3D ..." as part of the panic info.=20 > Knowing the faulting address would be useful and may help further diagnos= is. Fatal trap 12: page fault while in kernel mode cpuid =3D 1; apic id =3D 01 fault virtual address =3D 0xffffff7f808f9de8 fault code =3D supervisor write data, page not present instruction pointer =3D 0x20:0xffffffff8241f800 stack pointer =3D 0x28:0xffffff800003a7d0 frame pointer =3D 0x28:0xffffff800003a840 code segment =3D base 0x0, limit 0xfffff, type 0x1b =3D DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags =3D interrupt enabled, resume, IOPL =3D 0 current process =3D 12 (swi4: clock) [ thread pid 12 tid 100006 ] Stopped at siftr_chkpkt+0xd0: addq $0x1,0x8(%r14) db> where =20 Tracing pid 12 tid 100006 td 0xffffff00034037e0 siftr_chkpt() at siftr_chkpkt+0xd0 pfil_run_hooks() at pfil_run_hooks+0xb4 ip_output() at ip_output+0x382 tcp_output() tcp_output+0xa41 tcp_timer_rexmt() at tcp_timer_rexmt+0x251 softclock() at softclock+0x291 intr_event_execute_handlers() at intr_event_execute_handlers+0x66 ithread_loop at ithread_loop+0x8e fork_exit() at fork_exit+0x112 fork_trampoline() at fork_trampoline+0xe --- trap 0, rip =3D 0, rsp =3D 0xffffff800003ad30, rbp =3D 0 --- > >>>> Could you please go ahead and retest using a GENERIC kernel and see = if > >>>> you can reproduce? There could be something in your custom kernel > >>>> causing the offsets or linker set magic used by the DPCPU bits to br= eak > >>>> which in turn is triggering this panic in SIFTR. > >>> > >>> I'll retry without pf first, and with GENERIC afterwards. > >> > >> Sounds good, thanks. > > > > Taking pf (and altq) out of the picture doesn't seem to make > > a difference. >=20 > Wouldn't have expected it to. Will be very curious to know if the panic=20 > is triggered in GENERIC. It's not. I, too, get pfil.c related LORs though: lock order reversal: 1st 0xffffffff80e5c568 PFil hook read/write mutex (PFil hook read/write mu= tex) @ /usr/src/sys/net/pfil.c:77 2nd 0xffffffff80e5dd68 udp (udp) @ /usr/src/sys/modules/pf/../../contrib/p= f/net/pf.c:3035 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a _witness_debugger() at _witness_debugger+0x2e witness_checkorder() at witness_checkorder+0x81e _rw_rlock() at _rw_rlock+0x5f pf_socket_lookup() at pf_socket_lookup+0x1c5 pf_test_udp() at pf_test_udp+0x8b0 pf_test() at pf_test+0x1089 pf_check_in() at pf_check_in+0x39 pfil_run_hooks() at pfil_run_hooks+0xcf ip_input() at ip_input+0x2ae swi_net() at swi_net+0x151 intr_event_execute_handlers() at intr_event_execute_handlers+0x66 ithread_loop() at ithread_loop+0xb2 fork_exit() at fork_exit+0x12a fork_trampoline() at fork_trampoline+0xe --- trap 0, rip =3D 0, rsp =3D 0xffffff8000044d30, rbp =3D 0 --- lock order reversal: 1st 0xffffffff80e5c568 PFil hook read/write mutex (PFil hook read/write mu= tex) @ /usr/src/sys/net/pfil.c:77 2nd 0xffffffff80e5d788 tcp (tcp) @ /usr/src/sys/modules/siftr/../../netine= t/siftr.c:698 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2a _witness_debugger() at _witness_debugger+0x2e witness_checkorder() at witness_checkorder+0x81e _rw_rlock() at _rw_rlock+0x5f siftr_chkpkt() at siftr_chkpkt+0x3c4 pfil_run_hooks() at pfil_run_hooks+0xcf ip_input() at ip_input+0x2ae swi_net() at swi_net+0x151 intr_event_execute_handlers() at intr_event_execute_handlers+0x66 ithread_loop() at ithread_loop+0xb2 fork_exit() at fork_exit+0x12a fork_trampoline() at fork_trampoline+0xe --- trap 0, rip =3D 0, rsp =3D 0xffffff8000044d30, rbp =3D 0 --- My custom kernel normally doesn't have INVARIANTS and WITNESS enabled, so I'll try to enable them next. Fabian --Sig_/KOfakMhdKJkZMtuiyHRGJ_N Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAkweFI8ACgkQBYqIVf93VJ2cagCfSbFY5gfh6GeDW3kKiCQBv3+n OesAn33UJRR0XanMCLSSddSzKN8Tg6v4 =DAmj -----END PGP SIGNATURE----- --Sig_/KOfakMhdKJkZMtuiyHRGJ_N--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100620151554.58e486db>