Date:      Mon, 04 Mar 2024 14:12:20 +0000
From:      bugzilla-noreply@freebsd.org
To:        ports-bugs@FreeBSD.org
Subject:   [Bug 277476] amdgpu/drm-kmod periodic hangs due to phys contig allocations
Message-ID:  <bug-277476-7788@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277476

            Bug ID: 277476
           Summary: amdgpu/drm-kmod periodic hangs due to phys contig
                    allocations
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: Individual Port(s)
          Assignee: ports-bugs@FreeBSD.org
          Reporter: jeffpc@josefsipek.net

Two weeks ago I replaced an ancient nvidia graphics card with an AMD RX580 card
to run open source drivers. Everything works fine most of the time, but
occasionally the system hangs for a few seconds (5-10, usually).  The longer
the system has been up, the worse it gets.

Digging into it a bit, it is because userspace (it always looks like X) does an
ioctl into drm which then tries to allocate a large-ish piece of physically
contiguous memory.  This explains why it gets worse as uptime increases (free
physical memory fragmentation) and when running firefox (the most memory hungry
application I use).  I know nothing about graphics cards, the software stack
supporting them, or the linux kernel API compatibility layer, but clearly it'd
be beneficial if amdgpu/drm/whatever could make use of *virtually* contiguous
pages or some kind of allocation caching/reuse to avoid repeatedly asking the
vm code for physically contiguous ranges.
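
As a rough illustration of why fragmentation hurts here, the following toy
Python model shows a region with plenty of free pages but no usable contiguous
run. It is only a sketch: `longest_free_run` and the page map are made up for
illustration and have nothing to do with how the FreeBSD vm code actually
tracks free memory.

```python
def longest_free_run(free_map):
    """Return the length of the longest run of consecutive free pages."""
    best = run = 0
    for is_free in free_map:
        # Extend the current run on a free page, reset it on a used page.
        run = run + 1 if is_free else 0
        best = max(best, run)
    return best


# Hypothetical 2048-page region in which every other page is in use.
fragmented = [i % 2 == 0 for i in range(2048)]
free_pages = sum(fragmented)

# Half of the region is free, yet no two free pages are adjacent.
print(free_pages, longest_free_run(fragmented))
```

In this model a request for even two contiguous pages cannot be satisfied
without first moving or reclaiming pages, despite 1024 pages being free.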

To reach the above conclusion, I did a handful of dtrace-based experiments.

While one of the "temporary hangs" was happening, the following was the most
common (non-idle) profiler stack:

# dtrace -n 'profile-97{@[stack()]=count()}'
...
              kernel`vm_phys_alloc_contig+0x11d
              kernel`linux_alloc_pages+0x8f
              ttm.ko`ttm_pool_alloc+0x2cb
              ttm.ko`ttm_tt_populate+0xc5
              ttm.ko`ttm_bo_handle_move_mem+0xc3
              ttm.ko`ttm_bo_validate+0xb4
              ttm.ko`ttm_bo_init_reserved+0x199
              amdgpu.ko`amdgpu_bo_create+0x1eb
              amdgpu.ko`amdgpu_bo_create_user+0x21
              amdgpu.ko`amdgpu_gem_create_ioctl+0x1e2
              drm.ko`drm_ioctl_kernel+0xc6
              drm.ko`drm_ioctl+0x2b5
              kernel`linux_file_ioctl+0x312
              kernel`kern_ioctl+0x255
              kernel`sys_ioctl+0x123
              kernel`amd64_syscall+0x109
              kernel`0xffffffff80fe43eb

The latency of vm_phys_alloc_contig (entry to return) is bimodal - with
latencies in the single digit *milli*seconds during the "temporary hangs":

# dtrace -n 'fbt::vm_phys_alloc_contig:entry{self->ts=timestamp}' \
    -n 'fbt::vm_phys_alloc_contig:return/self->ts/{this->delta=timestamp-self->ts;
@=quantize(this->delta);}' \
    -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        2606
            1024 |@@@@@@@@@                                18207
            2048 |@                                        2534
            4096 |                                         894
            8192 |                                         34
           16384 |                                         78
           32768 |                                         58
           65536 |                                         219
          131072 |                                         306
          262144 |                                         310
          524288 |                                         735
         1048576 |                                         174
         2097152 |@@                                       4364
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@                 47475
         8388608 |@                                        1546
        16777216 |                                         2
        33554432 |                                         0
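
For reference, dtrace's `timestamp` is in nanoseconds, and each quantize() row
is labeled with the lower bound of a power-of-two bucket. A small Python sketch
converting the two most populated buckets above into friendlier units (the
helper name `ns_to_human` is made up for illustration):

```python
def ns_to_human(ns):
    """Format a nanosecond count using the largest unit that keeps it >= 1."""
    for unit, scale in (("s", 1e9), ("ms", 1e6), ("us", 1e3)):
        if ns >= scale:
            return f"{ns / scale:.2f} {unit}"
    return f"{ns} ns"


# Lower bounds of the two most populated buckets in the histogram above.
fast_mode = 1024
slow_mode = 4194304

print(ns_to_human(fast_mode))   # 1.02 us
print(ns_to_human(slow_mode))   # 4.19 ms
print(slow_mode // fast_mode)   # 4096
```

So the slow mode is roughly three orders of magnitude slower than the fast one.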

The number of pages being allocated:

# dtrace -n 'fbt::vm_phys_alloc_contig:entry/arg1>1/{@=quantize(arg1)}' \
    -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count
               1 |                                         0
               2 |@@@                                      15
               4 |@                                        7
               8 |@@@                                      16
              16 |@@                                       10
              32 |@@                                       10
              64 |@                                        7
             128 |@@@                                      12
             256 |@@@                                      12
             512 |@@@@@@@@@@@@@@                           68
            1024 |@@@@@@@                                  32
            2048 |                                         0

I did a few more dtrace experiments, but they all point to the same thing - a
drm/amdgpu related ioctl wants 4MB of physically contiguous memory often enough
to become a headache.  4MB isn't too much given that the system has 32GB of
RAM, but a physically contiguous range takes a while to fulfill sometimes.
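
The page counts in the histogram above map to sizes as follows. This is a
back-of-the-envelope Python sketch that assumes the 4096-byte base page size of
amd64 (the report does not state the page size); `pages_to_mib` is a made-up
helper name.

```python
PAGE_SIZE = 4096  # assumption: base page size on amd64, in bytes


def pages_to_mib(npages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return npages * page_size / (1024 * 1024)


# Lower bounds of the two largest populated buckets in the histogram.
for bucket in (512, 1024):
    print(bucket, pages_to_mib(bucket))
```

Under that assumption the 1024-page bucket starts at 4 MiB and the 512-page
bucket at 2 MiB. Since quantize() buckets are power-of-two ranges, a row
labeled 512 covers requests of 512 to 1023 pages.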


The card:

vgapci0@pci0:1:0:0:     class=0x030000 rev=0xe7 hdr=0x00 vendor=0x1002
device=0x67df subvendor=0x1da2 subdevice=0xe353
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]'
    class      = display
    subclass   = VGA

$ pkg info|grep -i amd
gpu-firmware-amd-kmod-aldebaran-20230625 Firmware modules for aldebaran AMD
GPUs
gpu-firmware-amd-kmod-arcturus-20230625 Firmware modules for arcturus AMD GPUs
gpu-firmware-amd-kmod-banks-20230625 Firmware modules for banks AMD GPUs
gpu-firmware-amd-kmod-beige-goby-20230625 Firmware modules for beige_goby AMD
GPUs
gpu-firmware-amd-kmod-bonaire-20230625 Firmware modules for bonaire AMD GPUs
gpu-firmware-amd-kmod-carrizo-20230625 Firmware modules for carrizo AMD GPUs
gpu-firmware-amd-kmod-cyan-skillfish2-20230625 Firmware modules for
cyan_skillfish2 AMD GPUs
gpu-firmware-amd-kmod-dimgrey-cavefish-20230625 Firmware modules for
dimgrey_cavefish AMD GPUs
gpu-firmware-amd-kmod-fiji-20230625 Firmware modules for fiji AMD GPUs
gpu-firmware-amd-kmod-green-sardine-20230625 Firmware modules for green_sardine
AMD GPUs
gpu-firmware-amd-kmod-hainan-20230625 Firmware modules for hainan AMD GPUs
gpu-firmware-amd-kmod-hawaii-20230625 Firmware modules for hawaii AMD GPUs
gpu-firmware-amd-kmod-kabini-20230625 Firmware modules for kabini AMD GPUs
gpu-firmware-amd-kmod-kaveri-20230625 Firmware modules for kaveri AMD GPUs
gpu-firmware-amd-kmod-mullins-20230625 Firmware modules for mullins AMD GPUs
gpu-firmware-amd-kmod-navi10-20230625 Firmware modules for navi10 AMD GPUs
gpu-firmware-amd-kmod-navi12-20230625 Firmware modules for navi12 AMD GPUs
gpu-firmware-amd-kmod-navi14-20230625 Firmware modules for navi14 AMD GPUs
gpu-firmware-amd-kmod-navy-flounder-20230625 Firmware modules for navy_flounder
AMD GPUs
gpu-firmware-amd-kmod-oland-20230625 Firmware modules for oland AMD GPUs
gpu-firmware-amd-kmod-picasso-20230625 Firmware modules for picasso AMD GPUs
gpu-firmware-amd-kmod-pitcairn-20230625 Firmware modules for pitcairn AMD GPUs
gpu-firmware-amd-kmod-polaris10-20230625 Firmware modules for polaris10 AMD
GPUs
gpu-firmware-amd-kmod-polaris11-20230625 Firmware modules for polaris11 AMD
GPUs
gpu-firmware-amd-kmod-polaris12-20230625 Firmware modules for polaris12 AMD
GPUs
gpu-firmware-amd-kmod-raven-20230625 Firmware modules for raven AMD GPUs
gpu-firmware-amd-kmod-raven2-20230625 Firmware modules for raven2 AMD GPUs
gpu-firmware-amd-kmod-renoir-20230625 Firmware modules for renoir AMD GPUs
gpu-firmware-amd-kmod-si58-20230625 Firmware modules for si58 AMD GPUs
gpu-firmware-amd-kmod-sienna-cichlid-20230625 Firmware modules for
sienna_cichlid AMD GPUs
gpu-firmware-amd-kmod-stoney-20230625 Firmware modules for stoney AMD GPUs
gpu-firmware-amd-kmod-tahiti-20230625 Firmware modules for tahiti AMD GPUs
gpu-firmware-amd-kmod-tonga-20230625 Firmware modules for tonga AMD GPUs
gpu-firmware-amd-kmod-topaz-20230625 Firmware modules for topaz AMD GPUs
gpu-firmware-amd-kmod-vangogh-20230625 Firmware modules for vangogh AMD GPUs
gpu-firmware-amd-kmod-vega10-20230625 Firmware modules for vega10 AMD GPUs
gpu-firmware-amd-kmod-vega12-20230625 Firmware modules for vega12 AMD GPUs
gpu-firmware-amd-kmod-vega20-20230625 Firmware modules for vega20 AMD GPUs
gpu-firmware-amd-kmod-vegam-20230625 Firmware modules for vegam AMD GPUs
gpu-firmware-amd-kmod-verde-20230625 Firmware modules for verde AMD GPUs
gpu-firmware-amd-kmod-yellow-carp-20230625 Firmware modules for yellow_carp AMD
GPUs
suitesparse-amd-3.3.0          Symmetric approximate minimum degree
suitesparse-camd-3.3.0         Symmetric approximate minimum degree
suitesparse-ccolamd-3.3.0      Constrained column approximate minimum degree
ordering
suitesparse-colamd-3.3.0       Column approximate minimum degree ordering
algorithm
webcamd-5.17.1.2_1             Port of Linux USB webcam and DVB drivers into
userspace
xf86-video-amdgpu-22.0.0_1     X.Org amdgpu display driver
$ pkg info|grep -i drm
drm-515-kmod-5.15.118_3        DRM drivers modules
drm-kmod-20220907_1            Metaport of DRM modules for the linuxkpi-based
KMS components
gpu-firmware-kmod-20230210_1,1 Firmware modules for the drm-kmod drivers
libdrm-2.4.120_1,1             Direct Rendering Manager library and headers

-- 
You are receiving this mail because:
You are the assignee for the bug.


