From: "Kristof Provost"
To: "Peter Blok"
Cc: "FreeBSD Stable"
Subject: Re: Commit 367705+367706 causes a pabic
Date: Fri, 20 Nov 2020 12:53:29 +0100

Can you share your kernel config file (and src.conf / make.conf if they
exist)?

This second panic is in the IPSec code. My current thinking is that your
kernel config is triggering a bug that’s manifesting in multiple places,
but not actually caused by those places.

I’d like to be able to reproduce it so we can debug it.

Best regards,
Kristof

On 20 Nov 2020, at 12:02, Peter Blok wrote:

> Hi Kristof,
>
> This is 12-stable. With the previous bridge epochification that was
> backed out my config had a panic too.
>
> I don’t have any local modifications. I did a clean rebuild after
> removing /usr/obj/usr
>
> My kernel is custom - I only have zfs.ko, opensolaris.ko, vmm.ko and
> nmdm.ko as modules. Everything else is statically linked. I have
> removed all drivers not needed for the hardware at hand.
>
> My bridge is between two vlans from the same trunk and the jail epair
> devices as well as the bhyve tap devices.
>
> The panic happens when the jails are starting.
>
> I can try to narrow it down over the weekend and make the crash dump
> available for analysis.
>
> Previously I had the following crash with 363492
>
> kernel trap 12 with interrupts disabled
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 2; apic id = 02
> fault virtual address   = 0xffffffff00000410
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80692326
> stack pointer           = 0x28:0xfffffe00c06097b0
> frame pointer           = 0x28:0xfffffe00c06097f0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = resume, IOPL = 0
> current process         = 2030 (ifconfig)
> trap number             = 12
> panic: page fault
> cpuid = 2
> time = 1595683412
> KDB: stack backtrace:
> #0 0xffffffff80698165 at kdb_backtrace+0x65
> #1 0xffffffff8064d67b at vpanic+0x17b
> #2 0xffffffff8064d4f3 at panic+0x43
> #3 0xffffffff809cc311 at trap_fatal+0x391
> #4 0xffffffff809cc36f at trap_pfault+0x4f
> #5 0xffffffff809cb9b6 at trap+0x286
> #6 0xffffffff809a5b28 at calltrap+0x8
> #7 0xffffffff803677fd at ck_epoch_synchronize_wait+0x8d
> #8 0xffffffff8069213a at epoch_wait_preempt+0xaa
> #9 0xffffffff807615b7 at ipsec_ioctl+0x3a7
> #10 0xffffffff8075274f at ifioctl+0x47f
> #11 0xffffffff806b5ea7 at kern_ioctl+0x2b7
> #12 0xffffffff806b5b4a at sys_ioctl+0xfa
> #13 0xffffffff809ccec7 at amd64_syscall+0x387
> #14 0xffffffff809a6450 at fast_syscall_common+0x101
>
>
>
>> On 20 Nov 2020, at 11:30, Kristof Provost wrote:
>>
>> On 20 Nov 2020, at 11:18, peter.blok@bsd4all.org wrote:
>>> I’m afraid the last Epoch fix for bridge is not solving the
>>> problem ( or perhaps creates a new ).
>>>
>> We’re talking about the stable/12 branch, right?
>>
>>> This seems to happen when the jail epair is added to the bridge.
>>>
>> There must be something more to it than that. I’ve run the bridge
>> tests on stable/12 without issue, and this is a problem we didn’t
>> see when the bridge epochification initially went into stable/12.
>>
>> Do you have a custom kernel config? Other patches?
>> What exact commands do you run to trigger the panic?
>>
>>> kernel trap 12 with interrupts disabled
>>>
>>>
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid = 6; apic id = 06
>>> fault virtual address   = 0xc10
>>> fault code              = supervisor read data, page not present
>>> instruction pointer     = 0x20:0xffffffff80695e76
>>> stack pointer           = 0x28:0xfffffe00bf14e6e0
>>> frame pointer           = 0x28:0xfffffe00bf14e720
>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags        = resume, IOPL = 0
>>> current process         = 1686 (jail)
>>> trap number             = 12
>>> panic: page fault
>>> cpuid = 6
>>> time = 1605811310
>>> KDB: stack backtrace:
>>> #0 0xffffffff8069bb85 at kdb_backtrace+0x65
>>> #1 0xffffffff80650a4b at vpanic+0x17b
>>> #2 0xffffffff806508c3 at panic+0x43
>>> #3 0xffffffff809d0351 at trap_fatal+0x391
>>> #4 0xffffffff809d03af at trap_pfault+0x4f
>>> #5 0xffffffff809cf9f6 at trap+0x286
>>> #6 0xffffffff809a98c8 at calltrap+0x8
>>> #7 0xffffffff80368a8d at ck_epoch_synchronize_wait+0x8d
>>> #8 0xffffffff80695c8a at epoch_wait_preempt+0xaa
>>> #9 0xffffffff80757d40 at vnet_if_init+0x120
>>> #10 0xffffffff8078c994 at vnet_alloc+0x114
>>> #11 0xffffffff8061e3f7 at kern_jail_set+0x1bb7
>>> #12 0xffffffff80620190 at sys_jail_set+0x40
>>> #13 0xffffffff809d0f07 at amd64_syscall+0x387
>>> #14 0xffffffff809aa1ee at fast_syscall_common+0xf8
>>
>> This panic is rather odd. This isn’t even the bridge code. This is
>> during initial creation of the vnet. I don’t really see how this
>> could even trigger panics.
>> That panic looks as if something corrupted the net_epoch_preempt, by
>> overwriting the epoch->e_epoch. The bridge patches only access this
>> variable through the well-established functions and macros. I see no
>> obvious way that they could corrupt it.
>>
>> Best regards,
>> Kristof