Date: Sun, 20 Jun 2010 14:28:41 +0200 From: Fabian Keil <freebsd-listen@fabiankeil.de> To: Lawrence Stewart <lstewart@freebsd.org> Cc: freebsd-current@freebsd.org Subject: Re: [CFT] SIFTR - Statistical Information For TCP Research: Uncle Lawrence needs YOU! Message-ID: <20100620142841.4803dac3@r500.local> In-Reply-To: <4C1E019F.6060802@freebsd.org> References: <4C1492D0.6020704@freebsd.org> <4C1C3922.2050102@freebsd.org> <20100619195823.53a7baaa@r500.local> <4C1DED16.8020209@freebsd.org> <20100620131544.495ddecd@r500.local> <4C1E019F.6060802@freebsd.org>
index | next in thread | previous in thread | raw e-mail
[-- Attachment #1 --]
Lawrence Stewart <lstewart@freebsd.org> wrote:
> On 06/20/10 21:15, Fabian Keil wrote:
> > Lawrence Stewart<lstewart@freebsd.org> wrote:
> >
> >> On 06/20/10 03:58, Fabian Keil wrote:
> >>> Lawrence Stewart<lstewart@freebsd.org> wrote:
> >>>
> >>>> On 06/13/10 18:12, Lawrence Stewart wrote:
> >>>
> >>>>> The time has come to solicit some external testing for my SIFTR tool.
> >>>>> I'm hoping to commit it within a week or so unless problems are discovered.
> >>>
> >>>>> I'm interested in all feedback and reports of success/failure, along
> >>>>> with details of the architecture tested and number of CPUs if you would
> >>>>> be so kind.
> >>>
> >>> I got the following hand-transcribed panic maybe a second after
> >>> sysctl net.inet.siftr.enabled=1
> >>>
> >>> Fatal trap 12: page fault while in kernel mode
> >>> cpuid = 1; apic id = 01
> >>> [...]
> >>> current process = 12 (swi4: clock)
> >>> [ thread pid 12 tid 100006 ]
> >>> Stopped at siftr_chkpkt+0xd0: addq $0x1,0x8(%r14)
> >>> db> where
> >>> Tracing pid 12 tid 100006 td 0xffffff00034037e0
> >>> siftr_chkpt() at siftr_chkpkt+0xd0
> >>> pfil_run_hooks() at pfil_run_hooks+0xb4
> >>> ip_output() at ip_output+0x382
> >>> tcp_output() tcp_output+0xa41
> >>> tcp_timer_rexmt() at tcp_timer_rexmt+0x251
> >>> softclock() at softclock+0x291
> >>> intr_event_execute_handlers() at intr_event_execute_handlers+0x66
> >>> ithread_loop at ithread_loop+0x8e
> >>> fork_exit() at fork_exit+0x112
> >>> fork_trampoline() at fork_trampoline+0xe
> >>> --- trap 0, rip = 0, rsp = 0xffffff800003ad30, rbp = 0 ---
> >>
> >> So I've tracked down the line of code where the page fault is occurring:
> >>
> >> if (dir == PFIL_IN)
> >> ss->n_in++;
> >> else
> >> ss->n_out++;
> >>
> >> ss is a DPCPU (dynamic per-cpu) variable used to keep a set of stats
> >> per-cpu and is initialised at the start of the function like so:
> >>
> >> ss = DPCPU_PTR(ss);
> >>
> >> So for ss to be NULL, that implies DPCPU_PTR() is returning NULL on your
> >> machine. I know very little about the inner workings of the DPCPU_*
> >> macros, but I'm pretty sure the way I use them in SIFTR is correct or at
> >> least as intended.
> >
> > siftr_chkpkt() passes ss to siftr_chkreinject() before dereferencing
> > it itself. I think if ss was NULL, the panic should already occur in
> > siftr_chkreinject().
>
> Yes but siftr_chkreinject() only dereferences ss in the exceptional case
> of a malloc failure or duplicate pkt. It's unlikely either case happens
> for you and so wouldn't trigger the panic.
>
> > To be sure I added:
> >
> > diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
> > index 8bc3498..b9fdfe4 100644
> > --- a/sys/netinet/siftr.c
> > +++ b/sys/netinet/siftr.c
> > @@ -788,6 +788,16 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet *ifp, int dir,
> > if (siftr_chkreinject(*m, dir, ss))
> > goto ret;
> >
> > + if (ss == NULL) {
> > + printf("ss is NULL");
> > + ss = DPCPU_PTR(ss);
> > + if (ss == NULL) {
> > + printf("ss is still NULL");
> > + goto ret;
> > + }
> > + }
> > +
> > +
> > if (dir == PFIL_IN)
> > ss->n_in++;
> > else
> >
> > which doesn't seem to affect the problem.
>
> As in it still panics and the "ss is NULL" message is not printed? I
> would have expected to at least see "ss is NULL" printed if my
> hypothesis was correct... hmm.
Yes, it still panics, but no message is printed.
> Perhaps the way I discovered the line number at which the panic occurred
> was wrong. I compiled SIFTR on my amd64 dev server with "CFLAGS+=-g" in
> the SIFTR Makefile to get debug symbols, ran "objdump -Sd siftr.ko | vim
> -", searched for the instruction reported in the panic message i.e.
> "addq $0x1,0x8(%r14)" and then with a bit of trial and error, recompiled
> SIFTR with the line of code "volatile int blah = 0; blah = 2;" at
> various points in the function and looking at the change in the objdump
> output to pinpoint which line of C code corresponded with the "addq"
> instruction.
>
> The "volatile int blah = 0; blah = 2;" compiles to "movl
> $0x0,0xffffffffffffffd4(%rbp)" followed immediately by "movl
> $0x2,0xffffffffffffffd4(%rbp)". When I put that code above the "if (dir
> == PFIL_IN)" statement I see the objdump output show the assembly code
> before the "addq" instruction and when I move it after the if statement
> the assembly code moves after the "addq" instruction.
That's a neat trick.
> Perhaps you could reproduce the above procedure and see if you identify
> the same point in the siftr_chkpkt function I did for the instruction
> referenced by the panic message?
I do. Using:
diff --git a/sys/netinet/siftr.c b/sys/netinet/siftr.c
index b9fdfe4..fc6bd9a 100644
--- a/sys/netinet/siftr.c
+++ b/sys/netinet/siftr.c
@@ -797,12 +797,15 @@ siftr_chkpkt(void *arg, struct mbuf **m, struct ifnet *ifp, int dir,
}
}
+ volatile int blah = 0; blah = 2;
if (dir == PFIL_IN)
ss->n_in++;
else
ss->n_out++;
+ volatile int foo = 0; foo = 3;
+
/*
* Create a tcphdr struct starting at the correct offset
* in the IP packet. ip->ip_hl gives the ip header length
I get:
803: c7 45 d4 00 00 00 00 movl $0x0,0xffffffffffffffd4(%rbp)
80a: c7 45 d4 02 00 00 00 movl $0x2,0xffffffffffffffd4(%rbp)
811: 0f 84 8a 02 00 00 je aa1 <siftr_chkpkt+0x371>
817: 49 83 46 08 01 addq $0x1,0x8(%r14)
81c: c7 45 d0 00 00 00 00 movl $0x0,0xffffffffffffffd0(%rbp)
823: c7 45 d0 03 00 00 00 movl $0x3,0xffffffffffffffd0(%rbp)
> >> Could you please go ahead and retest using a GENERIC kernel and see if
> >> you can reproduce? There could be something in your custom kernel
> >> causing the offsets or linker set magic used by the DPCPU bits to break
> >> which in turn is triggering this panic in SIFTR.
> >
> > I'll retry without pf first, and with GENERIC afterwards.
>
> Sounds good, thanks.
Taking pf (and altq) out of the picture doesn't seem to make
a difference.
Fabian
[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)
iEYEARECAAYFAkweCXsACgkQBYqIVf93VJ1qcwCeMBy8mvXtUBqp3BUkD9K8fK6x
9GYAn3enB+kNR4h0jO8BS8YEYJA2z/dD
=ifET
-----END PGP SIGNATURE-----
help
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100620142841.4803dac3>
