From nobody Sat Nov 20 00:36:23 2021
X-Original-To: dev-ci@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id C290D188D4B3
	for <dev-ci@mlmmj.nyi.freebsd.org>; Sat, 20 Nov 2021 00:36:28 +0000 (UTC)
	(envelope-from kp@FreeBSD.org)
Received: from smtp.freebsd.org (smtp.freebsd.org [IPv6:2610:1c1:1:606c::24b:4])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "smtp.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4HwvjS58lQz3kX6;
	Sat, 20 Nov 2021 00:36:28 +0000 (UTC)
	(envelope-from kp@FreeBSD.org)
Received: from venus.codepro.be (venus.codepro.be [5.9.86.228])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mx1.codepro.be", Issuer "R3" (verified OK))
	(Authenticated sender: kp)
	by smtp.freebsd.org (Postfix) with ESMTPSA id 62F1997A5;
	Sat, 20 Nov 2021 00:36:28 +0000 (UTC)
	(envelope-from kp@FreeBSD.org)
Received: by venus.codepro.be (Postfix, authenticated sender kp)
 id 3E8B62E0A5;
	Sat, 20 Nov 2021 01:36:26 +0100 (CET)
From: "Kristof Provost" <kp@FreeBSD.org>
To: "John Baldwin" <jhb@FreeBSD.org>
Cc: jenkins-admin@FreeBSD.org, dev-ci@FreeBSD.org
Subject: Re: FreeBSD-main-amd64-test - Build #19839 - Failure
Date: Fri, 19 Nov 2021 17:36:23 -0700
X-Mailer: MailMate (1.13.2r5673)
Message-ID: <CB9A8932-8EF8-4022-A5AD-C6FF0826CB00@FreeBSD.org>
In-Reply-To: <F132C565-ADFF-4234-BB57-1C892E2B6FB2@FreeBSD.org>
References: <1727420938.1071.1637194678550@jenkins.ci.freebsd.org>
 <263150711.1077.1637200233260@jenkins.ci.freebsd.org>
 <af341e1b-042b-491d-9ee6-70d83d4fedd4@FreeBSD.org>
 <F132C565-ADFF-4234-BB57-1C892E2B6FB2@FreeBSD.org>
List-Id: Continuous Integration Build and Test Results <dev-ci.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/dev-ci
List-Help: <mailto:dev-ci+help@freebsd.org>
List-Post: <mailto:dev-ci@freebsd.org>
List-Subscribe: <mailto:dev-ci+subscribe@freebsd.org>
List-Unsubscribe: <mailto:dev-ci+unsubscribe@freebsd.org>
Sender: owner-dev-ci@freebsd.org
X-BeenThere: dev-ci@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed; markup=markdown
Content-Transfer-Encoding: 8bit
X-ThisMailContainsUnwantedMimeParts: N

On 18 Nov 2021, at 18:50, Kristof Provost wrote:
> On 18 Nov 2021, at 17:21, John Baldwin wrote:
>> On 11/17/21 5:50 PM, jenkins-admin@FreeBSD.org wrote:
>>> FreeBSD-main-amd64-test - Build #19839 (4082b189d2ce00674439226c9d5a8bdcafd23d01) - Failure
>>>
>>> Build information: https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/
>>> Full change log: https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/changes
>>> Full build log: https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/console
>>>
>>> Status explanation:
>>> "Failure" - the build is suspected being broken by the following changes
>>> "Still Failing" - the build has not been fixed by the following changes and
>>>                    this is a notification to note that these changes have
>>>                    not been fully tested by the CI system
>>>
>>> Change summaries:
>>> (Those commits are likely but not certainly responsible)
>>>
>>> b928e924f74b0b8f882a9b735611421a93113640 by jhb:
>>> rtld-elf: Use _get_tp in __tls_get_addr for aarch64 and riscv64.
>>>
>>> a8d885296a9dc517e731723081c83d97d2aa598f by jhb:
>>> linux_linkat: Don't invert AT_* flags.
>>>
>>> 8b2ce7a3bbd0a754d31ff3943d918b4c84c831a3 by jhb:
>>> linux_name_to_handle_at: Support AT_EMPTY_PATH.
>>>
>>> 1962164584a91078418afcd7c979afef13df8c4d by jhb:
>>> imgact_elf: Use bool instead of boolean_t.
>>>
>>> 4082b189d2ce00674439226c9d5a8bdcafd23d01 by jhb:
>>> elf*_brand_inuse: Change return type to bool.
>>
>> I don't think the panic is related to these commits.
>>
>>> The end of the build log:
>>>
>>> [...truncated 4.27 MB...]
>>> --- trap 0xc, rip = 0x2100c7156d9a, rsp = 0x4423f48, rbp = 0x4423f60 ---
>>
>> The useful parts of the panic were earlier in the log:
>>
>> 01:50:13 sys/netpfil/common/dummynet:pf_queue_v6  ->  epair5a: Ethernet address: 02:b5:98:ea:1d:0a
>> 01:50:14 epair5b: Ethernet address: 02:b5:98:ea:1d:0b
>> 01:50:14 epair5a: link state changed to UP
>> 01:50:14 epair5b: link state changed to UP
>> 01:50:14 epair5a: link state changed to DOWN
>> 01:50:27 epair5b: link state changed to DOWN
>> 01:50:27 passed  [13.831s]
>> 01:50:27
>> 01:50:27
>> 01:50:27 Fatal trap 12: page fault while in kernel mode
>> 01:50:27 cpuid = 0; apic id = 00
>> 01:50:27 fault virtual address	= 0x10
>> 01:50:27 fault code		= supervisor read data, page not present
>> 01:50:27 instruction pointer	= 0x20:0xffffffff80e3c60f
>> 01:50:27 stack pointer	        = 0x28:0xfffffe00a5d9ec30
>> 01:50:27 frame pointer	        = 0x28:0xfffffe00a5d9ed00
>> 01:50:27 code segment		= base 0x0, limit 0xfffff, type 0x1b
>> 01:50:27 			= DPL 0, pres 1, long 1, def32 0, gran 1
>> 01:50:27 processor eflags	= interrupt enabled, resume, IOPL = 0
>> 01:50:27 current process		= 0 (dummynet)
>> 01:50:27 trap number		= 12
>> 01:50:27 panic: page fault
>> 01:50:27 cpuid = 0
>> 01:50:27 time = 1637200227
>> 01:50:27 KDB: stack backtrace:
>> 01:50:27 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a5d9e9f0
>> 01:50:27 vpanic() at vpanic+0x17f/frame 0xfffffe00a5d9ea40
>> 01:50:27 panic() at panic+0x43/frame 0xfffffe00a5d9eaa0
>> 01:50:27 trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a5d9eb00
>> 01:50:27 trap_pfault() at trap_pfault+0xab/frame 0xfffffe00a5d9eb60
>> 01:50:27 calltrap() at calltrap+0x8/frame 0xfffffe00a5d9eb60
>> 01:50:27 --- trap 0xc, rip = 0xffffffff80e3c60f, rsp = 0xfffffe00a5d9ec30, rbp = 0xfffffe00a5d9ed00 ---
>> 01:50:27 ip6_input() at ip6_input+0x4f/frame 0xfffffe00a5d9ed00
>> 01:50:27 netisr_dispatch_src() at netisr_dispatch_src+0xb1/frame 0xfffffe00a5d9ed60
>> 01:50:27 dummynet_send() at dummynet_send+0x1dd/frame 0xfffffe00a5d9eda0
>> 01:50:27 dummynet_task() at dummynet_task+0x493/frame 0xfffffe00a5d9ee40
>> 01:50:27 taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 0xfffffe00a5d9eec0
>> 01:50:27 taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe00a5d9eef0
>> 01:50:27 fork_exit() at fork_exit+0x80/frame 0xfffffe00a5d9ef30
>> 01:50:27 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00a5d9ef30
>>
>> So I suspect this is a race between teardown of the epair interfaces and
>> whatever traffic the sys/netpfil/common/dummynet:pf_queue_v6 test was
>> sending when it stopped.  I've cc'd kp@ in case he has any ideas.
>>
> I think I’ve seen that before, while I was doing the pf/dummynet work.
> Dummynet will queue packets, delay them and eventually send them out.
> If the interface goes away between the packet being enqueued and
> dequeued, we can end up with this panic.
>
> I have a patch to teach dummynet to walk its queues when an interface 
> is removed and to drop any packets destined for that interface. I 
> wound up not committing it because I was unable to reproduce the 
> problem at the time.
>
> Can we reliably trigger this panic? I’d love to be able to add a 
> test case for it, especially because the patch is non-trivial.
>
I managed to build a test case that reliably triggers the problem (with 
IPv4, but the same principle applies):

The fix: https://reviews.freebsd.org/D33064
The test: https://reviews.freebsd.org/D33065
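
For anyone following along, the idea described above (walk the queues when an interface is removed and drop any packets destined for it) can be sketched in userland C. This is only a toy model and not the code in D33064: `toy_ifnet`, `toy_pkt`, `toy_queue` and the `toy_*` functions are hypothetical names invented for illustration, and the real dummynet queues, mbufs and locking are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins, NOT the kernel's struct ifnet / struct mbuf. */
struct toy_ifnet {
	int index;
};

struct toy_pkt {
	struct toy_pkt *next;
	struct toy_ifnet *ifp;	/* interface the delayed packet refers to */
};

struct toy_queue {
	struct toy_pkt *head;
	struct toy_pkt **tailp;	/* address of the link to append through */
	size_t len;
};

static void
toy_queue_init(struct toy_queue *q)
{
	q->head = NULL;
	q->tailp = &q->head;
	q->len = 0;
}

/* Delay a packet: store it together with its interface pointer. */
static int
toy_enqueue(struct toy_queue *q, struct toy_ifnet *ifp)
{
	struct toy_pkt *p;

	assert(q != NULL && ifp != NULL);
	p = malloc(sizeof(*p));
	if (p == NULL)
		return (-1);
	p->next = NULL;
	p->ifp = ifp;
	*q->tailp = p;
	q->tailp = &p->next;
	q->len++;
	return (0);
}

/*
 * Called when an interface is about to go away: walk the queue and drop
 * every packet that still references it, so nothing is dequeued later
 * with a stale interface pointer. Returns the number of packets dropped.
 */
static size_t
toy_purge_ifp(struct toy_queue *q, const struct toy_ifnet *ifp)
{
	struct toy_pkt **pp = &q->head;
	size_t dropped = 0;

	while (*pp != NULL) {
		struct toy_pkt *p = *pp;

		if (p->ifp == ifp) {
			*pp = p->next;	/* unlink the packet */
			free(p);
			dropped++;
		} else {
			pp = &p->next;
		}
	}
	q->tailp = pp;		/* pp now addresses the final NULL link */
	q->len -= dropped;
	return (dropped);
}
```

In this model, calling `toy_purge_ifp(&q, ifp)` before the interface is freed leaves only packets bound to other interfaces in the queue. A kernel version would also have to hold the appropriate locks while walking the queues, which the sketch ignores.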

Kristof