Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Nov 2021 18:50:05 -0700
From:      "Kristof Provost" <kp@FreeBSD.org>
To:        "John Baldwin" <jhb@FreeBSD.org>
Cc:        jenkins-admin@FreeBSD.org, dev-ci@FreeBSD.org
Subject:   Re: FreeBSD-main-amd64-test - Build #19839 - Failure
Message-ID:  <F132C565-ADFF-4234-BB57-1C892E2B6FB2@FreeBSD.org>
In-Reply-To: <af341e1b-042b-491d-9ee6-70d83d4fedd4@FreeBSD.org>
References:  <1727420938.1071.1637194678550@jenkins.ci.freebsd.org> <263150711.1077.1637200233260@jenkins.ci.freebsd.org> <af341e1b-042b-491d-9ee6-70d83d4fedd4@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--=_MailMate_532E3070-790D-4348-839A-FFE9CD0B8790_=
Content-Type: text/plain; charset="UTF-8"; format=flowed; markup=markdown
Content-Transfer-Encoding: 8bit

On 18 Nov 2021, at 17:21, John Baldwin wrote:
> On 11/17/21 5:50 PM, jenkins-admin@FreeBSD.org wrote:
>> FreeBSD-main-amd64-test - Build #19839 
>> (4082b189d2ce00674439226c9d5a8bdcafd23d01) - Failure
>>
>> Build information: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/
>> Full change log: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/changes
>> Full build log: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/console
>>
>> Status explanation:
>> "Failure" - the build is suspected being broken by the following 
>> changes
>> "Still Failing" - the build has not been fixed by the following 
>> changes and
>>                    this is a notification to note that these changes 
>> have
>>                    not been fully tested by the CI system
>>
>> Change summaries:
>> (Those commits are likely but not certainly responsible)
>>
>> b928e924f74b0b8f882a9b735611421a93113640 by jhb:
>> rtld-elf: Use _get_tp in __tls_get_addr for aarch64 and riscv64.
>>
>> a8d885296a9dc517e731723081c83d97d2aa598f by jhb:
>> linux_linkat: Don't invert AT_* flags.
>>
>> 8b2ce7a3bbd0a754d31ff3943d918b4c84c831a3 by jhb:
>> linux_name_to_handle_at: Support AT_EMPTY_PATH.
>>
>> 1962164584a91078418afcd7c979afef13df8c4d by jhb:
>> imgact_elf: Use bool instead of boolean_t.
>>
>> 4082b189d2ce00674439226c9d5a8bdcafd23d01 by jhb:
>> elf*_brand_inuse: Change return type to bool.
>
> I don't the panic is related to these commits.
>
>> The end of the build log:
>>
>> [...truncated 4.27 MB...]
>> --- trap 0xc, rip = 0x2100c7156d9a, rsp = 0x4423f48, rbp = 0x4423f60 
>> ---
>
> The useful parts of the panic were earlier in the log:
>
> 01:50:13 sys/netpfil/common/dummynet:pf_queue_v6  ->  epair5a: 
> Ethernet address: 02:b5:98:ea:1d:0a
> 01:50:14 epair5b: Ethernet address: 02:b5:98:ea:1d:0b
> 01:50:14 epair5a: link state changed to UP
> 01:50:14 epair5b: link state changed to UP
> 01:50:14 epair5a: link state changed to DOWN
> 01:50:27 epair5b: link state changed to DOWN
> 01:50:27 passed  [13.831s]
> 01:50:27
> 01:50:27
> 01:50:27 Fatal trap 12: page fault while in kernel mode
> 01:50:27 cpuid = 0; apic id = 00
> 01:50:27 fault virtual address	= 0x10
> 01:50:27 fault code		= supervisor read data, page not present
> 01:50:27 instruction pointer	= 0x20:0xffffffff80e3c60f
> 01:50:27 stack pointer	        = 0x28:0xfffffe00a5d9ec30
> 01:50:27 frame pointer	        = 0x28:0xfffffe00a5d9ed00
> 01:50:27 code segment		= base 0x0, limit 0xfffff, type 0x1b
> 01:50:27 			= DPL 0, pres 1, long 1, def32 0, gran 1
> 01:50:27 processor eflags	= interrupt enabled, resume, IOPL = 0
> 01:50:27 current process		= 0 (dummynet)
> 01:50:27 trap number		= 12
> 01:50:27 panic: page fault
> 01:50:27 cpuid = 0
> 01:50:27 time = 1637200227
> 01:50:27 KDB: stack backtrace:
> 01:50:27 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
> 0xfffffe00a5d9e9f0
> 01:50:27 vpanic() at vpanic+0x17f/frame 0xfffffe00a5d9ea40
> 01:50:27 panic() at panic+0x43/frame 0xfffffe00a5d9eaa0
> 01:50:27 trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a5d9eb00
> 01:50:27 trap_pfault() at trap_pfault+0xab/frame 0xfffffe00a5d9eb60
> 01:50:27 calltrap() at calltrap+0x8/frame 0xfffffe00a5d9eb60
> 01:50:27 --- trap 0xc, rip = 0xffffffff80e3c60f, rsp = 
> 0xfffffe00a5d9ec30, rbp = 0xfffffe00a5d9ed00 ---
> 01:50:27 ip6_input() at ip6_input+0x4f/frame 0xfffffe00a5d9ed00
> 01:50:27 netisr_dispatch_src() at netisr_dispatch_src+0xb1/frame 
> 0xfffffe00a5d9ed60
> 01:50:27 dummynet_send() at dummynet_send+0x1dd/frame 
> 0xfffffe00a5d9eda0
> 01:50:27 dummynet_task() at dummynet_task+0x493/frame 
> 0xfffffe00a5d9ee40
> 01:50:27 taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 
> 0xfffffe00a5d9eec0
> 01:50:27 taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 
> 0xfffffe00a5d9eef0
> 01:50:27 fork_exit() at fork_exit+0x80/frame 0xfffffe00a5d9ef30
> 01:50:27 fork_trampoline() at fork_trampoline+0xe/frame 
> 0xfffffe00a5d9ef30
>
> So I suspect this is a race with teardown of the epair interfaces and 
> whatever
> traffic the sys/netpfil/common/dummynet:pf_queue_v6 test was sending 
> when it
> stopped.  I've cc'd kp@ in case he has any ideas?
>
I think I’ve seen that before, while I was doing the pf/dummynet work.
Dummynet will queue packets, delay them and eventually send them out. If 
between the enqueuing and dequeuing the packet the interface goes away 
we can end up with this panic.

I have a patch to teach dummynet to walk its queues when an interface is 
removed and to drop any packets destined for that interface. I wound up 
not committing it because I was unable to reproduce the problem at the 
time.

Can we reliably trigger this panic? I’d love to be able to add a test 
case for it, especially because the patch is non-trivial.

Here’s that patch. Its commit message has exactly the same backtrace.

Br,
Kristof 	
--=_MailMate_532E3070-790D-4348-839A-FFE9CD0B8790_=
Content-Disposition: attachment;
 filename=0001-dummynet-purge-queued-packets-on-interface-destructi.patch
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

=46rom f9f51748a8b3f1eede241019c2fd62c7115174bb Mon Sep 17 00:00:00 2001
From: Kristof Provost <kp@FreeBSD.org>
Date: Wed, 7 Jul 2021 17:21:55 +0200
Subject: [PATCH] dummynet: purge queued packets on interface destruction

When a struct ifnet goes away it's possible that dummynet will still
have packets queued for it. We must purge those to ensure we do not
attempt to send packets out a removed interface.

This scenario is triggered by the dummynet:ipfw_queue_v6 test and looks
like:

	Fatal trap 9: general protection fault while in kernel mode
	cpuid =3D 0; apic id =3D 00
	instruction pointer	=3D 0x20:0xffffffff80e33df6
	stack pointer	        =3D 0x28:0xfffffe0148d1c170
	frame pointer	        =3D 0x28:0xfffffe0148d1c240
	code segment		=3D base 0x0, limit 0xfffff, type 0x1b
				=3D DPL 0, pres 1, long 1, def32 0, gran 1
	processor eflags	=3D interrupt enabled, resume, IOPL =3D 0
	current process		=3D 0 (dummynet)
	trap number		=3D 9
	panic: general protection fault
	cpuid =3D 0
	time =3D 1625664777
	KDB: stack backtrace:
	db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0148=
d1be70
	vpanic() at vpanic+0x187/frame 0xfffffe0148d1bed0
	panic() at panic+0x43/frame 0xfffffe0148d1bf30
	trap_fatal() at trap_fatal+0x387/frame 0xfffffe0148d1bf90
	trap() at trap+0xa4/frame 0xfffffe0148d1c0a0
	calltrap() at calltrap+0x8/frame 0xfffffe0148d1c0a0
	--- trap 0x9, rip =3D 0xffffffff80e33df6, rsp =3D 0xfffffe0148d1c170, rb=
p =3D 0xfffffe0148d1c240 ---
	ip6_input() at ip6_input+0x46/frame 0xfffffe0148d1c240
	netisr_dispatch_src() at netisr_dispatch_src+0xb1/frame 0xfffffe0148d1c2=
a0
	dummynet_send() at dummynet_send+0x1dd/frame 0xfffffe0148d1c2e0
	dummynet_task() at dummynet_task+0x49c/frame 0xfffffe0148d1c380
	taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 0xfffffe0148d1=
c400
	taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe0148=
d1c430
	fork_exit() at fork_exit+0x80/frame 0xfffffe0148d1c470
	fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0148d1c470
	--- trap 0, rip =3D 0, rsp =3D 0, rbp =3D 0 ---
	KDB: enter: panic

On interface removal scan through the dummynet packet queues and remove
any packet destined for the removed interface.

MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
---
 sys/netpfil/ipfw/dn_sched_qfq.c  |   3 +-
 sys/netpfil/ipfw/dn_sched_wf2q.c |   7 +-
 sys/netpfil/ipfw/ip_dummynet.c   | 120 +++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/sys/netpfil/ipfw/dn_sched_qfq.c b/sys/netpfil/ipfw/dn_sched_=
qfq.c
index e02f27dfd85e..3a297ac322ae 100644
--- a/sys/netpfil/ipfw/dn_sched_qfq.c
+++ b/sys/netpfil/ipfw/dn_sched_qfq.c
@@ -568,7 +568,8 @@ qfq_dequeue(struct dn_sch_inst *si)
 	m =3D dn_dequeue(&cl->_q);
 =

 	if (!m) {
-		D("BUG/* non-workconserving leaf */");
+		/* This can happen if the packet has been purged, because it's
+		 * destined for a removed interface. */
 		return NULL;
 	}
 	NO(q->queued--;)
diff --git a/sys/netpfil/ipfw/dn_sched_wf2q.c b/sys/netpfil/ipfw/dn_sched=
_wf2q.c
index 7285df0d1d7d..766c6385fb32 100644
--- a/sys/netpfil/ipfw/dn_sched_wf2q.c
+++ b/sys/netpfil/ipfw/dn_sched_wf2q.c
@@ -242,8 +242,13 @@ wf2qp_dequeue(struct dn_sch_inst *_si)
 	/* ok we have at least one eligible pkt */
 	q =3D HEAP_TOP(sch)->object;
 	alg_fq =3D (struct wf2qp_queue *)q;
-	m =3D dn_dequeue(q);
+	m=3D dn_dequeue(q);
 	heap_extract(sch, NULL); /* Remove queue from heap. */
+	if (m =3D=3D NULL) {
+		/* Packet may have been purged. */
+		heap_insert(&si->idle_heap, alg_fq->F, q);
+		return (NULL);
+	}
 	si->V +=3D (uint64_t)(m->m_pkthdr.len) * si->inv_wsum;
 	alg_fq->S =3D alg_fq->F;  /* Update start time. */
 	if (q->mq.head =3D=3D 0) {	/* not backlogged any more. */
diff --git a/sys/netpfil/ipfw/ip_dummynet.c b/sys/netpfil/ipfw/ip_dummyne=
t.c
index d3242fd85817..5d51af960393 100644
--- a/sys/netpfil/ipfw/ip_dummynet.c
+++ b/sys/netpfil/ipfw/ip_dummynet.c
@@ -45,6 +45,7 @@ __FBSDID("$FreeBSD$");
 =

 #include <sys/param.h>
 #include <sys/ck.h>
+#include <sys/eventhandler.h>
 #include <sys/systm.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
@@ -64,6 +65,8 @@ __FBSDID("$FreeBSD$");
 #include <netinet/ip_var.h>	/* ip_output(), IP_FORWARDING */
 #include <netinet/ip_fw.h>
 #include <netinet/ip_dummynet.h>
+#include <net/if.h>
+#include <net/if_var.h>
 #include <net/vnet.h>
 =

 #include <netpfil/ipfw/ip_fw_private.h>
@@ -101,6 +104,8 @@ CK_LIST_HEAD(, dn_alg)	schedlist;
 CK_LIST_HEAD(, dn_aqm)	aqmlist;	/* list of AQMs */
 #endif
 =

+static eventhandler_tag	ip_dn_detach_cookie;
+
 static void
 dummynet(void *arg)
 {
@@ -2593,6 +2598,8 @@ static void
 ip_dn_vnet_destroy(void)
 {
 	DN_BH_WLOCK();
+	V_dn_cfg.init_done =3D 0;
+
 	dummynet_flush();
 	DN_BH_WUNLOCK();
 =

@@ -2603,6 +2610,114 @@ ip_dn_vnet_destroy(void)
 	DN_LOCK_DESTROY();
 }
 =

+static int
+ip_dn_mq_flush_ifnet(struct mq *mq, struct ifnet *ifp)
+{
+	struct mbuf *m, *tmp;
+	struct mbuf *mprev;
+	int removed =3D 0;
+
+	m =3D mq->head;
+	mprev =3D NULL;
+
+	while (m !=3D NULL) {
+		struct dn_pkt_tag *p =3D dn_tag_get(m);
+		if (p->ifp !=3D ifp && m->m_pkthdr.rcvif !=3D ifp) {
+			mprev =3D m;
+			m =3D m->m_nextpkt;
+			continue;
+		}
+
+		if (mprev =3D=3D NULL) {
+			/* We're removing the first element in the mq. */
+			mq->head =3D m->m_nextpkt;
+		} else {
+			MPASS(mprev->m_nextpkt =3D=3D m);
+			mprev->m_nextpkt =3D m->m_nextpkt;
+		}
+		tmp =3D m;
+		m =3D m->m_nextpkt;
+
+		tmp->m_nextpkt =3D NULL;
+		m_freem(tmp);
+
+		removed++;
+	}
+
+	mq->tail =3D mprev;
+
+	return (removed);
+}
+
+static int
+ip_dn_q_flush_ifnet(void *_q, void *_arg)
+{
+	struct dn_queue *q =3D _q;
+	struct ifnet *ifp =3D _arg;
+	int removed;
+
+	removed =3D ip_dn_mq_flush_ifnet(&q->mq, ifp);
+	q->count -=3D removed;
+
+	return (0);
+}
+
+static int
+ip_dn_fs_flush_ifnet(void *_fs, void *_arg)
+{
+	struct dn_fsk *fs =3D _fs;
+
+	if (!fs->qht)
+		return (0);
+
+	if (fs->fs.flags & DN_QHT_HASH)
+		dn_ht_scan(fs->qht, ip_dn_q_flush_ifnet, _arg);
+	else
+		ip_dn_q_flush_ifnet(fs->qht, _arg);
+
+	return (0);
+}
+
+static int
+ip_dn_schi_flush_ifnet(void *_si, void *_arg)
+{
+	struct dn_sch_inst *si =3D _si;
+	struct ifnet *ifp =3D _arg;
+
+	ip_dn_mq_flush_ifnet(&si->dline.mq, ifp);
+
+	return (0);
+}
+
+static int
+ip_dn_sched_flush_ifnet(void *_s, void *_arg)
+{
+	struct dn_schk *s =3D _s;
+
+	if (s->siht =3D=3D NULL)
+		return (0);
+
+	if (s->sch.flags & DN_HAVE_MASK)
+		dn_ht_scan(s->siht, ip_dn_schi_flush_ifnet, _arg);
+	else if (s->siht)
+		ip_dn_schi_flush_ifnet(s->siht, _arg);
+
+	return (0);
+}
+
+static void
+ip_dn_detach_ifnet_event(void *arg __unused, struct ifnet *ifp)
+{
+	if (! V_dn_cfg.init_done)
+		return;
+
+	DN_BH_WLOCK();
+	/* Purge all queued packets for this interface. */
+	dn_ht_scan(V_dn_cfg.fshash, ip_dn_fs_flush_ifnet, ifp);
+	dn_ht_scan(V_dn_cfg.schedhash, ip_dn_sched_flush_ifnet, ifp);
+	DN_BH_WUNLOCK();
+}
+
 static void
 ip_dn_init(void)
 {
@@ -2617,6 +2732,9 @@ ip_dn_init(void)
 	    taskqueue_thread_enqueue, &dn_tq);
 	taskqueue_start_threads(&dn_tq, 1, PI_NET, "dummynet");
 =

+	ip_dn_detach_cookie =3D EVENTHANDLER_REGISTER(ifnet_departure_event,
+	    ip_dn_detach_ifnet_event, NULL, EVENTHANDLER_PRI_ANY);
+
 	CK_LIST_INIT(&schedlist);
 	callout_init(&dn_timeout, 1);
 	dn_reschedule();
@@ -2635,6 +2753,8 @@ ip_dn_destroy(int last)
 		ip_dn_io_ptr =3D NULL;
 	}
 =

+	EVENTHANDLER_DEREGISTER(ifnet_departure_event, ip_dn_detach_cookie);
+
 	callout_drain(&dn_timeout);
 	taskqueue_drain(dn_tq, &dn_task);
 	taskqueue_free(dn_tq);
-- =

2.33.1


--=_MailMate_532E3070-790D-4348-839A-FFE9CD0B8790_=--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?F132C565-ADFF-4234-BB57-1C892E2B6FB2>