From: Nicolas Goldman
Date: Mon, 13 Feb 2023 14:15:23 -0300
Subject: Kernel panics when given a high workload
To: freebsd-questions@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

Hello! Good Monday to all the FreeBSD community.
I am working on the FreeBSD kernel for my university thesis. The idea is to change the FreeBSD short-term scheduler so that all its operations are based on the concept of Petri nets. We already have a model of the scheduler and the first tests running.

I am currently running into a problem that has left me out of ideas. Seemingly at random, the kernel takes a page fault, panics, and reboots the OS. My thesis partner and I tried to pin down when this happens but found no pattern that reproduces it. We see it mostly when the processor is heavily loaded, and even then only in some simulations.

Below is the information we collected from the logs; any help is appreciated.

Code:
uname -a
FreeBSD pielihueso 13.1-RELEASE FreeBSD 13.1-RELEASE DrudiGoldmanPI/update_petriNetScheduler-13.1.0-n250157-cb2e622cf22d PI_KERNELCONF amd64

Differences between PI_KERNELCONF and GENERIC are:

1. We are working on the 4BSD scheduler instead of the ULE:
Code:
# options     SCHED_ULE         # ULE scheduler
options     SCHED_4BSD        # 4BSD scheduler

2. We added some debugger options:

Code:
options        DDB
options        GDB
options        KDB_UNATTENDED
-------
Code:
cd /usr/obj/usr/src/amd64.amd64/sys/PI_KERNELCONF/
kgdb kernel.debug /var/crash/vmcore.last

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xffffffffffffffa8
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80ca0822
stack pointer           = 0x28:0xfffffe00cd879b60
frame pointer           = 0x28:0x0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 922 (sshd)
trap number             = 12
panic: page fault
cpuid = 1
time = 1676286448
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00cd879920
vpanic() at vpanic+0x17f/frame 0xfffffe00cd879970
panic() at panic+0x43/frame 0xfffffe00cd8799d0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00cd879a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cd879a90
calltrap() at calltrap+0x8/frame 0xfffffe00cd879a90
--- trap 0xc, rip = 0xffffffff80ca0822, rsp = 0xfffffe00cd879b60, rbp = 0 ---
kern_select() at kern_select+0x942
Uptime: 34s
Dumping 371 out of 8085 MB:..5%..13%..22%..31%..44%..52%..61%..74%..82%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55        __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,

(kgdb) where

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c2f521 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3  0xffffffff80c2f99e in vpanic (fmt=0xffffffff811dfeea "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4  0xffffffff80c2f7a3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5  0xffffffff810d7855 in trap_fatal (frame=0xfffffe00cd879aa0, eva=18446744073709551528) at /usr/src/sys/amd64/amd64/trap.c:944
#6  0xffffffff810d78af in trap_pfault (frame=0xfffffe00cd879aa0, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  0xffffffff80ca0822 in selrescan (td=<error reading variable: Cannot access memory at address 0xffffffffffffffd0>, ibits=<optimized out>, obits=<optimized out>) at /usr/src/sys/kern/sys_generic.c:1325
#9  kern_select (td=<optimized out>, nd=<error reading variable: Cannot access memory at address 0xffffffffffffff90>, fd_in=<optimized out>, fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>,
    abi_nfdbits=<error reading variable: Cannot access memory at address 0x10>) at /usr/src/sys/kern/sys_generic.c:1206
Backtrace stopped: Cannot access memory at address 0x8
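
A note on reading these numbers: the fault address and the two addresses kgdb could not read are all just below 2^64, while the frame pointer is reported as 0x0. Read as signed 64-bit values they are small negative offsets, which would be consistent with %rbp-relative accesses made while %rbp was zero. That is only an assumption drawn from the values above, not a confirmed diagnosis. A minimal Python sketch of the arithmetic (the helper name `as_signed64` is made up for this illustration):

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


# Addresses copied from the panic message and the kgdb backtrace.
addresses = {
    "fault virtual address": 0xFFFFFFFFFFFFFFA8,
    "selrescan td": 0xFFFFFFFFFFFFFFD0,
    "kern_select nd": 0xFFFFFFFFFFFFFF90,
}

offsets = {name: as_signed64(addr) for name, addr in addresses.items()}

for name, off in offsets.items():
    # Print each address as a signed decimal and as a negative hex offset.
    print(f"{name}: {off} (-{-off:#x})")

# The decimal eva shown in the trap_fatal frame is the same fault address.
assert 18446744073709551528 == 0xFFFFFFFFFFFFFFA8
```

The three values come out as -0x58, -0x30 and -0x70, that is, offsets of the kind a compiler uses for locals relative to a frame pointer.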

Code:
# nm -n /boot/kernel/kernel | grep  0xffffffff80ca0822
# nm -n /boot/kernel/kernel | grep  0xffffffff80ca0822
# nm -n /boot/kernel/kernel | grep  0xffffffff80c
# nm -n /boot/kernel/kernel | grep  0xffffffff
# nm -n /boot/kernel/kernel | grep  0xfffff
# nm -n /boot/kernel/kernel | grep  0xff
# nm -n /boot/kernel/kernel | grep  0x

ffffffff80388c30 t cam_compat_handle_0x17
ffffffff803891e0 t cam_compat_handle_0x18
ffffffff803895f0 t cam_compat_handle_0x19
ffffffff80389730 t cam_compat_translate_dev_match_0x18
ffffffff80aba6a0 t xl_check_maddr_90xB
ffffffff80aba6f0 t xl_check_maddr_90x
ffffffff80abad90 t xl_txeof_90xB
ffffffff80abb090 t xl_start_90xB_locked
ffffffff80ebac80 t mlx5e_fec_mask_10x_25x_handler
ffffffff80ebb050 t mlx5e_fec_avail_10x_25x_handler
ffffffff80ebb0f0 t mlx5e_fec_mask_50x_handler
ffffffff80ebb4e0 t mlx5e_fec_avail_50x_handler
ffffffff810af6c0 T Xint0x80_syscall_pti
ffffffff810af740 T Xint0x80_syscall
ffffffff810af743 t int0x80_syscall_common
ffffffff817fe180 r db_inst_0f0x
ffffffff8180cc50 r mouse10x14_120
ffffffff8180cd40 r mouse10x16_50
ffffffff8180cd90 r mouse10x16_75
ffffffff8180cde0 r mouse10x16_90
ffffffff8180ce30 r mouse10x16_100
ffffffff8180ce80 r mouse10x16_120
ffffffff8180ced0 r mouse10x16_133
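
The `grep` attempts for the full address return nothing because `nm` prints addresses without a `0x` prefix, and because the faulting instruction lies inside a function rather than at its first byte, so an exact match would not exist anyway. What is needed is the closest symbol at or below the address. A minimal Python sketch of that lookup over `nm -n` style output follows; the function names `parse_nm` and `resolve` are hypothetical, and the sample lines are taken from the listing above only to exercise the code.

```python
import bisect


def parse_nm(text: str) -> list[tuple[int, str]]:
    """Parse `nm -n` style lines ("<hex addr> <type> <name>") into sorted (addr, name) pairs."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and entries without an address
        addr, _type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return "name+0xoffset" for the closest symbol at or below address, or None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address is below every known symbol
    base, name = symbols[i]
    return f"{name}+{address - base:#x}"


# Sample lines copied from the nm listing, used only as test input.
SAMPLE = """\
ffffffff810af6c0 T Xint0x80_syscall_pti
ffffffff810af740 T Xint0x80_syscall
ffffffff810af743 t int0x80_syscall_common
"""

symbols = parse_nm(SAMPLE)
print(resolve(symbols, 0xFFFFFFFF810AF741))
```

With real data, the input would be the full `nm -n` output of the kernel file that matches the crash dump, and the address would be the instruction pointer from the panic message.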

We also tried DTrace, but with no luck. Do you have other recommendations on how we can keep debugging this issue? We believe it is something we broke in the scheduler, because the GENERIC kernel works fine.

P.S.: If anyone is interested in how we implemented the Petri net for the scheduler, contact me by email and I can share the paper we are working on.