Date:      Mon, 22 Apr 2019 13:46:01 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 237470] [ppc][pseries] panic: Error evicting page: -7
Message-ID:  <bug-237470-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237470

            Bug ID: 237470
           Summary: [ppc][pseries] panic: Error evicting page: -7
           Product: Base System
           Version: CURRENT
          Hardware: powerpc
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: luporl@FreeBSD.org

I have seen this issue happen a couple of times. It is difficult to
reproduce. In my case, it started happening more often when using a clang
built with debug info to perform large parallel builds.

This is the panic message observed:
panic: Error evicting page: -7
cpuid = 10
time = 1555531979
KDB: stack backtrace:
0xe000000033634910: at .kdb_backtrace+0x5c
0xe000000033634a40: at .vpanic+0x1b4
0xe000000033634b00: at .panic+0x38
0xe000000033634b90: at .mphyp_pte_insert+0x304
0xe000000033634cb0: at .moea64_pvo_enter+0x164
0xe000000033634d40: at .moea64_enter+0x520
0xe000000033634e40: at .moea64_enter_object+0xa8
0xe000000033634ef0: at .pmap_enter_object+0xa8
0xe000000033634fa0: at .vm_map_pmap_enter+0x2d0
0xe000000033635070: at .vm_map_insert+0x550
0xe000000033635170: at .vm_map_fixed+0x134
0xe000000033635240: at .vm_mmap_object+0x484
0xe000000033635350: at .vn_mmap+0x190
0xe000000033635430: at .kern_mmap+0x474
0xe000000033635550: at .sys_mmap+0x30
0xe0000000336355d0: at .trap+0x654
0xe000000033635770: at .powerpc_interrupt+0x290
0xe000000033635810: user SC trap by 0x81004e768: srr1=0x800000000000d032
            r1=0x3fffffffffffb790 cr=0x22024024 xer=0 ctr=0x81004e940
r2=0x810075d80 frame=0xe000000033635840
KDB: enter: panic


This seems to indicate that the PTE chosen for eviction was not found.

After some debugging, there appears to be a race condition between
mphyp_pte_unset() and mphyp_pte_insert(): the page chosen for eviction may be
removed by mphyp_pte_unset() before mphyp_pte_insert() gets to evict it.

This can be explained as follows:
- mphyp_pte_insert() locks the pvo to be inserted
- mphyp_pte_insert() obtains read access to mphyp_eviction_lock
- mphyp_pte_insert() tries to insert the corresponding pte but fails
- mphyp_pte_insert() releases mphyp_eviction_lock
- mphyp_pte_insert() acquires mphyp_eviction_lock for write
- mphyp_pte_insert() chooses a pte to evict - let's call it p
- mphyp_pte_unset(), on another thread, locks the pvo that corresponds to
pte p
- mphyp_pte_unset() removes p (without holding mphyp_eviction_lock)
- mphyp_pte_insert() tries to remove p, but fails, because it was already
removed by another thread
- the system panics
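
The interleaving above can be replayed deterministically in user space. The
following is only a minimal sketch, not kernel code: struct model_pteg and all
model_* names are hypothetical stand-ins invented for illustration, and the
value -7 for the "not found" result is taken from the panic message in this
report. Both threads are replayed in sequence on a single thread to show why
the second removal of the same entry cannot succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Return codes modelled on the hypervisor's; -7 matches the panic message. */
#define MODEL_H_SUCCESS     0
#define MODEL_H_NOT_FOUND (-7)

#define MODEL_SLOTS 8

/* Hypothetical stand-in for a PTE group: each slot is valid or empty. */
struct model_pteg {
	bool valid[MODEL_SLOTS];
};

/* Fill every slot, so that an insert has no choice but to evict. */
static void
model_fill(struct model_pteg *g)
{
	for (int i = 0; i < MODEL_SLOTS; i++)
		g->valid[i] = true;
}

/* "chooses a pte to evict": first valid slot, or -1 if there is none. */
static int
model_choose_victim(const struct model_pteg *g)
{
	for (int i = 0; i < MODEL_SLOTS; i++)
		if (g->valid[i])
			return (i);
	return (-1);
}

/* Remove the entry in a slot; report "not found" if it is already gone. */
static int
model_remove(struct model_pteg *g, int slot)
{
	if (slot < 0 || slot >= MODEL_SLOTS || !g->valid[slot])
		return (MODEL_H_NOT_FOUND);
	g->valid[slot] = false;
	return (MODEL_H_SUCCESS);
}

/*
 * Replay the steps from the list in order and return the result seen by
 * the inserting thread when it finally tries to evict p.
 */
static int
model_replay_race(void)
{
	struct model_pteg g;
	int victim, unset_rc;

	model_fill(&g);
	victim = model_choose_victim(&g);	/* insert: picks p */
	unset_rc = model_remove(&g, victim);	/* unset: removes p first */
	assert(unset_rc == MODEL_H_SUCCESS);
	(void)unset_rc;
	return (model_remove(&g, victim));	/* insert: eviction of p */
}
```

In this model, model_replay_race() returns MODEL_H_NOT_FOUND, which is the
condition the inserting thread currently treats as fatal.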

KDB's acttrace on this panic supports the hypothesis above:

Tracing command clang-8 pid 44763 tid 100504 td 0xc00000000f220000 (CPU 10)
0xe000000033634a40: at .vpanic+0x1d4
0xe000000033634b00: at .panic+0x38
0xe000000033634b90: at .mphyp_pte_insert+0x304
0xe000000033634cb0: at .moea64_pvo_enter+0x164
0xe000000033634d40: at .moea64_enter+0x520
0xe000000033634e40: at .moea64_enter_object+0xa8
0xe000000033634ef0: at .pmap_enter_object+0xa8
0xe000000033634fa0: at .vm_map_pmap_enter+0x2d0
0xe000000033635070: at .vm_map_insert+0x550
0xe000000033635170: at .vm_map_fixed+0x134
0xe000000033635240: at .vm_mmap_object+0x484
0xe000000033635350: at .vn_mmap+0x190
0xe000000033635430: at .kern_mmap+0x474
0xe000000033635550: at .sys_mmap+0x30
0xe0000000336355d0: at .trap+0x654
0xe000000033635770: at .powerpc_interrupt+0x290
0xe000000033635810: user SC trap by 0x81004e768: srr1=0x800000000000d032
            r1=0x3fffffffffffb790 cr=0x22024024 xer=0 ctr=0x81004e940
r2=0x810075d80 frame=0xe000000033635840

Tracing command clang-8 pid 44640 tid 100690 td 0xc00000004913b000 (CPU 15)
0xe00000003404fad0: at .intr_event_handle+0xf0
0xe00000003404fb90: at .powerpc_dispatch_intr+0xf0
0xe00000003404fc40: at .xicp_dispatch+0x274
0xe00000003404fd00: at .powerpc_interrupt+0xc8
0xe00000003404fda0: kernel EXI trap by .lock_delay+0x5c:
srr1=0x8000000000009032
            r1=0xe000000034050050 cr=0x2000f084 xer=0 ctr=0x1b78
r2=0xc00000000155e890 frame=0xe00000003404fdd0
0xe000000034050050: at 0xc0000000fa530c1c
0xe0000000340500e0: at .__mtx_lock_sleep+0x238
0xe0000000340501d0: at .__mtx_lock_flags+0x160
0xe000000034050280: at .moea64_remove_pages+0x134
0xe000000034050340: at .pmap_remove_pages+0x78
0xe0000000340503d0: at .vmspace_exit+0xf8
0xe000000034050480: at .exit1+0x6d8
0xe000000034050550: at .sys_sys_exit+0x1c
0xe0000000340505d0: at .trap+0x654
0xe000000034050770: at .powerpc_interrupt+0x290
0xe000000034050810: user SC trap by 0x8133eafd8: srr1=0x800000000000f032
            r1=0x3fffffffffffd890 cr=0x22000242 xer=0x20000000 ctr=0x8133eafd0
r2=0x8134bdbf0 frame=0xe000000034050840


Process 44640 is exiting and removing its pages while process 44763 is
mapping pages into memory.

It seems to me this could be fixed in 2 ways:
1- make mphyp_pte_unset() acquire the eviction lock before removing ptes
2- change the mphyp_pte_insert() KASSERT to tolerate H_NOT_FOUND, as
mphyp_pte_unset() does

If correct, 2 is preferable, as it avoids extra locking points.
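
To make option 2 concrete, here is a rough user-space sketch of the check
being proposed. It is not the actual KASSERT from the kernel source: the
sketch_* names are hypothetical, and the numeric value of the "not found"
code is assumed from the -7 in the panic message. The strict predicate
reflects the behaviour this report describes (any failure to evict is fatal),
while the relaxed one also accepts the case where another thread has already
removed the victim.

```c
#include <stdbool.h>

/* Assumed result codes; -7 is the value printed by the panic. */
#define SKETCH_H_SUCCESS     0
#define SKETCH_H_NOT_FOUND (-7)

/* Strict check: only a successful eviction is acceptable. */
static bool
sketch_evict_ok_strict(int rc)
{
	return (rc == SKETCH_H_SUCCESS);
}

/*
 * Relaxed check (option 2): a victim that is already gone is also fine,
 * because its slot is free for the new entry either way.
 */
static bool
sketch_evict_ok_relaxed(int rc)
{
	return (rc == SKETCH_H_SUCCESS || rc == SKETCH_H_NOT_FOUND);
}
```

With the relaxed predicate, the result produced by the race no longer trips
the check, while every other error code is still rejected.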

-- 
You are receiving this mail because:
You are the assignee for the bug.


