Date:      Sun, 08 Oct 2023 02:27:19 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 274346] kernel panic/page fault in nfs_commonkrpc.c::newnfs_request(), due to duplicate hostid's
Message-ID:  <bug-274346-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274346

            Bug ID: 274346
           Summary: kernel panic/page fault in
                    nfs_commonkrpc.c::newnfs_request(), due to duplicate
                    hostid's
           Product: Base System
           Version: 14.0-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: freebsd@kumba.dev

So I have managed to trigger a kernel panic in 14.0-BETA5 in the NFS subsystem,
but this is partly due to a mistake I myself made by not changing the hostid of
a system that's a clone of another active system.  The first system is running
13.2-RELEASE-p4 and the cloned system is running 14.0-BETA5, newly upgraded.

There are several elements that lead to this panic:
  - Both systems have the same hostid
  - Both systems mount the same remote NFS share from a third system
  - Have the 13.2-RELEASE-p4 system start doing a job on the remote share, like
compiling code (e.g., /usr/ports is on this share)
  - Have the cloned system running 14.0-BETA5 attempt to unmount the remote
share
  - The 14.0-BETA5 system will crash
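
The first condition in the list above can be checked ahead of time by comparing
the /etc/hostid contents of every NFS client.  Below is a minimal Python sketch,
assuming those contents have already been collected somehow; the function name,
the host names, and the UUID values are all made up for illustration.

```python
from collections import defaultdict


def find_duplicate_hostids(hostids):
    """Return {hostid: [hosts]} for every hostid shared by two or more hosts."""
    groups = defaultdict(list)
    for host, hostid in hostids.items():
        # Normalize so stray whitespace or letter case cannot hide a duplicate.
        groups[hostid.strip().lower()].append(host)
    return {hid: sorted(hosts) for hid, hosts in groups.items() if len(hosts) > 1}


# Hypothetical inventory: host name -> contents of that host's /etc/hostid.
inventory = {
    "original-13.2": "11111111-2222-3333-4444-555555555555",
    "clone-14.0": "11111111-2222-3333-4444-555555555555\n",
    "nfs-server": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
}

duplicates = find_duplicate_hostids(inventory)
for hostid, hosts in duplicates.items():
    print(f"duplicate hostid {hostid}: {', '.join(hosts)}")
```

Any hostid reported here is shared by more than one client and would need to be
regenerated on all but one of them.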

I know it's due to duplicate hostids, because the below message is printed on
the console immediately before the kernel crashes:
>
> Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
>

And the printf() for that exact string is in the crashing function right where
GDB says the crash happens, in nfs_commonkrpc.c, function newnfs_request(),
line 1212.  I'm just not sure if it's the if statement immediately preceding
the printf() call or the if statement that happens after.  The next call is
memcmp() in machine code, so I am assuming a NULL deref of some kind.
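
The NULL deref assumption can be checked roughly against the register dump in
the crash output quoted later in this report.  The Python sketch below is only
an illustration, and it rests on two assumptions: that memcmp() follows the
SysV AMD64 calling convention (s1 in rdi, s2 in rsi, n in rdx), and that the
registers still held the incoming arguments when the fault occurred.  The
function name is hypothetical.

```python
# Values copied from the register dump in the crash output of this report.
FAULT_ADDR = 0x4
RDI = 0xFFFFF800077761E4
RSI = 0x0000000000000004
RDX = 0x0000000000000010

PAGE_SIZE = 4096  # assumption: 4 KiB base pages on amd64


def describe_memcmp_fault(rdi, rsi, rdx, fault_addr):
    """Map registers onto memcmp(s1, s2, n) and flag the likely bad pointer."""
    args = {"s1": rdi, "s2": rsi, "n": rdx}
    # The pointer argument equal to the faulting address is the suspect.
    bad = [name for name in ("s1", "s2") if args[name] == fault_addr]
    # An address inside the first page suggests a NULL base plus a small offset.
    null_based = fault_addr < PAGE_SIZE
    return args, bad, null_based


args, bad, null_based = describe_memcmp_fault(RDI, RSI, RDX, FAULT_ADDR)
print(f"memcmp(s1={args['s1']:#x}, s2={args['s2']:#x}, n={args['n']})")
print(f"suspect argument(s): {bad}, NULL-based: {null_based}")
```

Under those assumptions, the second memcmp() argument is the faulting address
0x4, which is consistent with a 16-byte compare against something at offset 4
from a NULL pointer.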

My kernel is a custom build, but this can be triggered on a GENERIC kernel as
well, as my first crash happened on GENERIC right before I was set to reboot
into my rebuilt custom kernel after doing the second `freebsd-update install`
phase to upgrade to 14.0-BETA5.  At that time, I had crashdumps disabled.  So
the below crash info is from that custom kernel, after I enabled crashdumps and
re-triggered the crash (it's at least reproducible...):

> Unread portion of the kernel message buffer:
> [179]
> [179]
> [179] Fatal trap 12: page fault while in kernel mode
> [179] cpuid = 0; apic id = 00
> [179] fault virtual address     = 0x4
> [179] fault code                = supervisor read data, page not present
> [179] instruction pointer       = 0x20:0xffffffff809e9893
> [179] stack pointer             = 0x28:0xfffffe00a233e800
> [179] frame pointer             = 0x28:0xfffffe00a233e800
> [179] code segment              = base 0x0, limit 0xfffff, type 0x1b
> [179]                   = DPL 0, pres 1, long 1, def32 0, gran 1
> [179] processor eflags  = interrupt enabled, resume, IOPL = 0
> [179] current process           = 87256 (umount)
> [179] rdi: fffff800077761e4 rsi: 0000000000000004 rdx: 0000000000000010
> [179] rcx: 0000000000000000  r8: 0000000000000024  r9: fffffe00a233f000
> [179] rax: 0000000000000000 rbx: fffffe00a251b020 rbp: fffffe00a233e800
> [179] r10: 0000000000000585 r11: 000000007ff9687f r12: fffff80007776010
> [180] r13: fffff80003abb800 r14: fffffe00a233ea18 r15: fffff80007776000
> [180] trap number               = 12
> [180] panic: page fault
> [180] cpuid = 0
> [180] time = 1696723338
> [180] KDB: stack backtrace:
> [180] #0 0xffffffff806b5edd at kdb_backtrace+0x5d
> [180] #1 0xffffffff8066aa20 at vpanic+0x130
> [180] #2 0xffffffff8066a8e3 at panic+0x43
> [180] #3 0xffffffff809ee34c at trap_fatal+0x40c
> [180] #4 0xffffffff809ee39e at trap_pfault+0x4e
> [180] #5 0xffffffff809c6288 at calltrap+0x8
> [180] #6 0xffffffff8053f804 at newnfs_request+0x10a4
> [180] #7 0xffffffff8054dbad at nfsrpc_destroysession+0x11d
> [180] #8 0xffffffff80557252 at nfscl_umount+0x312
> [180] #9 0xffffffff80589470 at nfs_unmount+0x70
> [180] #10 0xffffffff8073c4ad at vfs_unmount_sigdefer+0x2d
> [180] #11 0xffffffff80741e37 at dounmount+0x787
> [180] #12 0xffffffff80741645 at kern_unmount+0x2f5
> [180] #13 0xffffffff809eeaf9 at amd64_syscall+0x109
> [180] #14 0xffffffff809c6b9b at fast_syscall_common+0xf8
> [180] Timeout initializing vt_vga
> [180] Uptime: 3m0s
> [180] Dumping 447 out of 8077 MB:..4%..11%..22%..33%..43%..51%..61%..72%..83%..93%
>
> __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> 57      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
> (kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
> #1  doadump (textdump=<optimized out>) at ../../../kern/kern_shutdown.c:405
> #2  0xffffffff8066a5b7 in kern_reboot (howto=260)
>     at ../../../kern/kern_shutdown.c:526
> #3  0xffffffff8066aa8d in vpanic (fmt=0xffffffff80a3bcd1 "%s",
>     ap=ap@entry=0xfffffe00a233e680) at ../../../kern/kern_shutdown.c:970
> #4  0xffffffff8066a8e3 in panic (fmt=<unavailable>)
>     at ../../../kern/kern_shutdown.c:894
> #5  0xffffffff809ee34c in trap_fatal (frame=0xfffffe00a233e740, eva=4)
>     at ../../../amd64/amd64/trap.c:952
> #6  0xffffffff809ee39e in trap_pfault (frame=0xfffffe00a233e740,
>     usermode=false, signo=<optimized out>, ucode=<optimized out>)
>     at ../../../amd64/amd64/trap.c:760
> #7  <signal handler called>
> #8  memcmp () at ../../../amd64/amd64/support.S:115
> #9  0xffffffff8053f804 in newnfs_request (nd=nd@entry=0xfffffe00a233ea18,
>     nmp=nmp@entry=0xfffff80003abb800, clp=clp@entry=0x0,
>     nrp=nrp@entry=0xfffff80003abbcd8, vp=vp@entry=0x0,
>     td=td@entry=0xfffffe00a251b020, cred=0xfffff8000765aa00, prog=100003,
>     vers=4, retsum=0x0, toplevel=1, xidp=0x0, dssep=0x0)
>     at ../../../fs/nfs/nfs_commonkrpc.c:1212
> #10 0xffffffff8054dbad in nfsrpc_destroysession (
>     nmp=nmp@entry=0xfffff80003abb800, tsep=0xfffff80007776010,
>     tsep@entry=0x0, cred=cred@entry=0xfffff8000765aa00,
>     p=p@entry=0xfffffe00a251b020) at ../../../fs/nfs/nfs_commonsubs.c:5151
> #11 0xffffffff80557252 in nfscl_umount (nmp=nmp@entry=0xfffff80003abb800,
>     p=p@entry=0xfffffe00a251b020, dhp=dhp@entry=0x0)
>     at ../../../fs/nfsclient/nfs_clstate.c:2094
> #12 0xffffffff80589470 in nfs_unmount (mp=0xfffffe00a4058000,
>     mntflags=<optimized out>) at ../../../fs/nfsclient/nfs_clvfsops.c:1903
> #13 0xffffffff8073c4ad in vfs_unmount_sigdefer (mp=0xfffffe00a4058000,
>     mntflags=134217728) at ../../../kern/vfs_init.c:185
> #14 0xffffffff80741e37 in dounmount (mp=0xfffff800077761e4,
>     mp@entry=0xfffffe00a4058000, flags=flags@entry=134217728,
>     td=td@entry=0xfffffe00a251b020) at ../../../kern/vfs_mount.c:2327
> #15 0xffffffff80741645 in kern_unmount (td=0xfffffe00a251b020,
>     path=<optimized out>, flags=134217728) at ../../../kern/vfs_mount.c:1785
> #16 0xffffffff809eeaf9 in syscallenter (td=0xfffffe00a251b020)
>     at ../../../amd64/amd64/../../kern/subr_syscall.c:187
> #17 amd64_syscall (td=0xfffffe00a251b020, traced=0)
>     at ../../../amd64/amd64/trap.c:1197
> #18 <signal handler called>
> #19 0x0000244bc41489ba in ?? ()
> Backtrace stopped: Cannot access memory at address 0x244bc20f4c18
> (kgdb)

-- 
You are receiving this mail because:
You are the assignee for the bug.


