FreeBSD Mail Archives

Date:      Fri, 20 Nov 2020 12:53:29 +0100
From:      "Kristof Provost" <kp@FreeBSD.org>
To:        "Peter Blok" <pblok@bsd4all.org>
Cc:        "FreeBSD Stable" <freebsd-stable@freebsd.org>
Subject:   Re: Commit 367705+367706 causes a panic
Message-ID:  <BD8D114E-9F12-4580-A0D5-5A5BAC75DF27@FreeBSD.org>
In-Reply-To: <665757BF-DA06-4503-9ACD-8A4630E23FF4@bsd4all.org>
References:  <CD3B0F62-3790-4C63-A92C-9694256823CD@bsd4all.org> <1753B4A3-2FFC-47A5-9D0C-DC0B71BA22E8@FreeBSD.org> <665757BF-DA06-4503-9ACD-8A4630E23FF4@bsd4all.org>

Can you share your kernel config file (and src.conf / make.conf if they 
exist)?

This second panic is in the IPsec code. My current thinking is that your 
kernel config is triggering a bug that manifests in multiple places 
but is not actually caused by any of them.

I’d like to be able to reproduce it so we can debug it.

Best regards,
Kristof

On 20 Nov 2020, at 12:02, Peter Blok wrote:
> Hi Kristof,
>
> This is 12-stable. My config also panicked with the previous bridge 
> epochification, which was backed out.
>
> I don’t have any local modifications. I did a clean rebuild after 
> removing /usr/obj/usr.
>
> My kernel is custom - I only have zfs.ko, opensolaris.ko, vmm.ko and 
> nmdm.ko as modules. Everything else is statically linked. I have 
> removed all drivers not needed for the hardware at hand.
>
> My bridge is between two vlans from the same trunk and the jail epair 
> devices as well as the bhyve tap devices.
>
> The panic happens when the jails are starting.
>
> I can try to narrow it down over the weekend and make the crash dump 
> available for analysis.
>
> Previously I had the following crash with 363492
>
> kernel trap 12 with interrupts disabled
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 2; apic id = 02
> fault virtual address	= 0xffffffff00000410
> fault code		= supervisor read data, page not present
> instruction pointer	= 0x20:0xffffffff80692326
> stack pointer	        = 0x28:0xfffffe00c06097b0
> frame pointer	        = 0x28:0xfffffe00c06097f0
> code segment		= base 0x0, limit 0xfffff, type 0x1b
> 			= DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags	= resume, IOPL = 0
> current process		= 2030 (ifconfig)
> trap number		= 12
> panic: page fault
> cpuid = 2
> time = 1595683412
> KDB: stack backtrace:
> #0 0xffffffff80698165 at kdb_backtrace+0x65
> #1 0xffffffff8064d67b at vpanic+0x17b
> #2 0xffffffff8064d4f3 at panic+0x43
> #3 0xffffffff809cc311 at trap_fatal+0x391
> #4 0xffffffff809cc36f at trap_pfault+0x4f
> #5 0xffffffff809cb9b6 at trap+0x286
> #6 0xffffffff809a5b28 at calltrap+0x8
> #7 0xffffffff803677fd at ck_epoch_synchronize_wait+0x8d
> #8 0xffffffff8069213a at epoch_wait_preempt+0xaa
> #9 0xffffffff807615b7 at ipsec_ioctl+0x3a7
> #10 0xffffffff8075274f at ifioctl+0x47f
> #11 0xffffffff806b5ea7 at kern_ioctl+0x2b7
> #12 0xffffffff806b5b4a at sys_ioctl+0xfa
> #13 0xffffffff809ccec7 at amd64_syscall+0x387
> #14 0xffffffff809a6450 at fast_syscall_common+0x101
>
>
>
>
>> On 20 Nov 2020, at 11:30, Kristof Provost <kp@FreeBSD.org> wrote:
>>
>> On 20 Nov 2020, at 11:18, peter.blok@bsd4all.org wrote:
>>> I’m afraid the last epoch fix for the bridge does not solve the 
>>> problem (or perhaps creates a new one).
>>>
>> We’re talking about the stable/12 branch, right?
>>
>>> This seems to happen when the jail epair is added to the bridge.
>>>
>> There must be something more to it than that. I’ve run the bridge 
>> tests on stable/12 without issue, and this is a problem we didn’t 
>> see when the bridge epochification initially went into stable/12.
>>
>> Do you have a custom kernel config? Other patches? What exact 
>> commands do you run to trigger the panic?
>>
>>> kernel trap 12 with interrupts disabled
>>>
>>>
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid = 6; apic id = 06
>>> fault virtual address	= 0xc10
>>> fault code		= supervisor read data, page not present
>>> instruction pointer	= 0x20:0xffffffff80695e76
>>> stack pointer	        = 0x28:0xfffffe00bf14e6e0
>>> frame pointer	        = 0x28:0xfffffe00bf14e720
>>> code segment		= base 0x0, limit 0xfffff, type 0x1b
>>> 			= DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags	= resume, IOPL = 0
>>> current process		= 1686 (jail)
>>> trap number		= 12
>>> panic: page fault
>>> cpuid = 6
>>> time = 1605811310
>>> KDB: stack backtrace:
>>> #0 0xffffffff8069bb85 at kdb_backtrace+0x65
>>> #1 0xffffffff80650a4b at vpanic+0x17b
>>> #2 0xffffffff806508c3 at panic+0x43
>>> #3 0xffffffff809d0351 at trap_fatal+0x391
>>> #4 0xffffffff809d03af at trap_pfault+0x4f
>>> #5 0xffffffff809cf9f6 at trap+0x286
>>> #6 0xffffffff809a98c8 at calltrap+0x8
>>> #7 0xffffffff80368a8d at ck_epoch_synchronize_wait+0x8d
>>> #8 0xffffffff80695c8a at epoch_wait_preempt+0xaa
>>> #9 0xffffffff80757d40 at vnet_if_init+0x120
>>> #10 0xffffffff8078c994 at vnet_alloc+0x114
>>> #11 0xffffffff8061e3f7 at kern_jail_set+0x1bb7
>>> #12 0xffffffff80620190 at sys_jail_set+0x40
>>> #13 0xffffffff809d0f07 at amd64_syscall+0x387
>>> #14 0xffffffff809aa1ee at fast_syscall_common+0xf8
>>
>> This panic is rather odd. This isn’t even the bridge code. This is 
>> during initial creation of the vnet. I don’t really see how this 
>> could even trigger panics.
>> That panic looks as if something corrupted net_epoch_preempt by 
>> overwriting epoch->e_epoch. The bridge patches only access this 
>> variable through the well-established functions and macros. I see no 
>> obvious way that they could corrupt it.
>>
>> Best regards,
>> Kristof
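
The two backtraces quoted above can be compared mechanically to see where they agree. The following is a minimal sketch in Python, not part of the original thread: the frame lists are copied from the two panics, while the helper names (frame_functions, common_prefix) are made up for illustration. It strips addresses and offsets and reports the shared innermost frames and the first caller at which the traces differ.

```python
import re

# Frames copied from the panic reported against 363492 (process: ifconfig).
TRACE_IPSEC = """\
#0 0xffffffff80698165 at kdb_backtrace+0x65
#1 0xffffffff8064d67b at vpanic+0x17b
#2 0xffffffff8064d4f3 at panic+0x43
#3 0xffffffff809cc311 at trap_fatal+0x391
#4 0xffffffff809cc36f at trap_pfault+0x4f
#5 0xffffffff809cb9b6 at trap+0x286
#6 0xffffffff809a5b28 at calltrap+0x8
#7 0xffffffff803677fd at ck_epoch_synchronize_wait+0x8d
#8 0xffffffff8069213a at epoch_wait_preempt+0xaa
#9 0xffffffff807615b7 at ipsec_ioctl+0x3a7
#10 0xffffffff8075274f at ifioctl+0x47f
"""

# Frames copied from the panic seen while starting jails (process: jail).
TRACE_VNET = """\
#0 0xffffffff8069bb85 at kdb_backtrace+0x65
#1 0xffffffff80650a4b at vpanic+0x17b
#2 0xffffffff806508c3 at panic+0x43
#3 0xffffffff809d0351 at trap_fatal+0x391
#4 0xffffffff809d03af at trap_pfault+0x4f
#5 0xffffffff809cf9f6 at trap+0x286
#6 0xffffffff809a98c8 at calltrap+0x8
#7 0xffffffff80368a8d at ck_epoch_synchronize_wait+0x8d
#8 0xffffffff80695c8a at epoch_wait_preempt+0xaa
#9 0xffffffff80757d40 at vnet_if_init+0x120
#10 0xffffffff8078c994 at vnet_alloc+0x114
"""

# One KDB backtrace line: "#N 0xADDR at function+0xOFFSET".
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+at\s+(\w+)\+0x[0-9a-f]+")


def frame_functions(trace):
    """Return function names in frame order, ignoring addresses and offsets."""
    return [m.group(2) for m in FRAME_RE.finditer(trace)]


def common_prefix(a, b):
    """Return the leading run of entries on which both lists agree."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


ipsec = frame_functions(TRACE_IPSEC)
vnet = frame_functions(TRACE_VNET)
shared = common_prefix(ipsec, vnet)

print("shared innermost frames:", ", ".join(shared))
print("first differing callers:", ipsec[len(shared)], "vs", vnet[len(shared)])
```

Both traces fault in ck_epoch_synchronize_wait, called from epoch_wait_preempt, and only differ in the caller one frame further out (ipsec_ioctl versus vnet_if_init). That matches the reading in this message that the bug shows up in several places without being caused by those places.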

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?BD8D114E-9F12-4580-A0D5-5A5BAC75DF27>
