Date:      Thu, 16 May 2024 06:59:03 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 279021] Random phantom files by g_new_bio() failure
Message-ID:  <bug-279021-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279021

            Bug ID: 279021
           Summary: Random phantom files by g_new_bio() failure
           Product: Base System
           Version: 14.0-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: seigo.tanimura@gmail.com

A bug in g_new_bio() is suspected to cause random phantom files, often
silently.  It was exposed during the poudriere-bulk(8) test on bug #275594,
comment #147.

* Test Environment: Hypervisor
- CPU: Intel Core i7-13700KF 3.4GHz (24 threads)
- RAM: 128 GB
- OS: Windows 10
- Storage: NVMe and SATA HDDs
- Hypervisor: VMware Workstation 17.5

* Test Environment: VM & OS
- vCPUs: 16
- RAM: 16 GB
- Swap: 128 GB on NVMe
- OS: FreeBSD 14.1-BETA2
  - All of the releng/14.1 fixes in bug #275594, comment #147 applied.
- Storage & Filesystems: ZFS mainly
  - Main pool: 1.5G on SATA HDD
  - ZIL: 16 GB on NVMe
  - L2ARC: 64 GB on NVMe

* Application
- poudriere
  - Number of ports to build: 2325 (including dependencies)
  - Major configurations for port building
    - poudriere.conf
      - #NO_ZFS=yes (ZFS enabled)
      - USE_PORTLINT=no
      - USE_TMPFS="wrkdir data localbase"
      - TMPFS_LIMIT=32
      - DISTFILES_CACHE=(configured in ZFS)
      - CCACHE_DIR=(configured in ZFS)
        - The cache is cleared in advance.
      - CCACHE_STATIC_PREFIX=/usr/local
      - PARALLEL_JOBS=16 (actually given via "poudriere bulk -J")
    - make.conf
      - MAKE_JOBS_NUMBER=4

* Steps
1. Remove the package output directory, so that all packages are built.
2. Clear the ccache contents by "ccache -C".
3. Run 'poudriere bulk' to start the parallel build.
4. Observe the system and build progress by top(1), poudriere web UI,
cmdwatch(1) + sysctl(8), etc.

* Expected results
- All of the ports are built successfully.

* Observed behaviors during building
- In about 2 hours, the RAM was exhausted and the kernel started swapping
out pages.
- The bulk port build failed at random.
  + A header file or a library provided via a dependency was often missing.
- The kernel occasionally logged "swap_pager: cannot allocate bio".
- vm.uma.g_bio.stats.fails increased up to ~5000.

* Analysis
g_new_bio(), the kernel function that allocates a new bio in a non-blocking
manner, returns NULL if the g_bio uma(9) zone has no free items.  While such
a case is regarded as a rare error with an ordinary HDD, nvme(4) storage is
likely to trigger the issue because of its high capacity for parallel I/O
operations.
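
The failure mode described above can be modelled in userland.  The following
is a minimal sketch only, not the kernel's uma(9) code: struct bio_pool,
pool_alloc_nowait(), pool_free() and issue_burst() are hypothetical names
introduced for illustration, and the fixed item limit is an assumption that
stands in for "the zone has no free items".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userland model of a bounded item pool (not uma(9) itself). */
struct bio_pool {
	size_t   limit;		/* items the pool can hand out at once */
	size_t   in_use;	/* items currently allocated */
	uint64_t fails;		/* models vm.uma.g_bio.stats.fails */
};

/*
 * Non-blocking allocation: never waits.  Returns 1 on success and 0 (the
 * model's stand-in for NULL) when no free item is left.
 */
static int
pool_alloc_nowait(struct bio_pool *p)
{
	if (p->in_use >= p->limit) {
		p->fails++;	/* the only trace the failure leaves */
		return (0);
	}
	p->in_use++;
	return (1);
}

/* Return one item to the pool. */
static void
pool_free(struct bio_pool *p)
{
	assert(p->in_use > 0);
	p->in_use--;
}

/*
 * Issue nreq requests back to back, with no completion in between, and
 * return how many of them could not get an item.
 */
static size_t
issue_burst(struct bio_pool *p, size_t nreq)
{
	size_t dropped = 0;

	for (size_t i = 0; i < nreq; i++) {
		if (!pool_alloc_nowait(p))
			dropped++;
	}
	return (dropped);
}
```

In this model, a device that keeps only a few requests in flight rarely hits
the limit, while a deep burst of parallel requests exhausts the pool and the
excess requests fail without any message other than the counter.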

Although not confirmed precisely, the effects of this issue seem to include
phantom files, i.e. newly created files that do not become visible
immediately.  Under poudriere-bulk(8), it is suspected that the files
installed during build-depends and lib-depends are not detected as expected.
The problem happens at random; it depends on the state of the g_bio zone.

g_new_bio() emits no log upon an allocation failure.  An exception is the
swap pager, which logs "swap_pager: cannot allocate bio".  Otherwise, the
increase of vm.uma.g_bio.stats.fails is the sole record of the errors.
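
Since the counter is the only trace, it can be sampled before and after a
build.  The following is a minimal sketch: read_g_bio_fails() and
fails_delta() are hypothetical helper names, and it is an assumption that the
sysctl exports a 64-bit counter.  sysctlbyname(3) is used on FreeBSD only; on
other systems the reader simply reports failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Read vm.uma.g_bio.stats.fails into *out.
 * Returns 0 on success and -1 on failure.
 * Assumption: the value is a 64-bit counter; any other size is rejected.
 */
static int
read_g_bio_fails(uint64_t *out)
{
	assert(out != NULL);
#ifdef __FreeBSD__
	uint64_t val = 0;
	size_t len = sizeof(val);

	if (sysctlbyname("vm.uma.g_bio.stats.fails", &val, &len, NULL, 0) != 0 ||
	    len != sizeof(val))
		return (-1);
	*out = val;
	return (0);
#else
	(void)out;
	return (-1);	/* the sysctl does not exist on this system */
#endif
}

/* Failures recorded between two samples; 0 if the counter did not grow. */
static uint64_t
fails_delta(uint64_t before, uint64_t after)
{
	return (after >= before ? after - before : 0);
}
```

Taking one sample before "poudriere bulk" starts and another after it ends,
a non-zero fails_delta() indicates that bio allocations failed during the
build.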

* Proposed Fix and Test Results
Reserve some bios for non-blocking allocation.  uma(9) supports item
reservation, which can be used to implement the fix.  Note that, in
practice, the item reservation of uma(9) can be configured at boot time
only.
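
The idea of a reserve can be illustrated with a userland model.  This is a
sketch under stated assumptions and not the code of the proposed fix: struct
rpool, rpool_alloc() and the ALLOC_* flags are hypothetical names, and the
policy shown (ordinary requests must leave the reserve untouched, while
reserve-enabled requests may consume it) is an assumption about how a
reservation is typically applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for the model below. */
enum {
	ALLOC_NORMAL      = 0,	/* must leave the reserve untouched */
	ALLOC_USE_RESERVE = 1	/* may consume reserved items */
};

/* Hypothetical userland model of an item pool with a reserve. */
struct rpool {
	size_t   limit;		/* total items in the pool */
	size_t   reserve;	/* items held back for reserve users */
	size_t   in_use;	/* items currently allocated */
	uint64_t denied;	/* requests that could not be served */
};

/*
 * Returns 1 on success and 0 when the request cannot be served.  In a real
 * kernel a blocking request would sleep instead of failing; the model only
 * records that it could not proceed.
 */
static int
rpool_alloc(struct rpool *p, int flags)
{
	size_t free_items, floor;

	assert(p->in_use <= p->limit);
	free_items = p->limit - p->in_use;
	floor = (flags & ALLOC_USE_RESERVE) ? 0 : p->reserve;
	if (free_items <= floor) {
		p->denied++;
		return (0);
	}
	p->in_use++;
	return (1);
}
```

With a limit of 8 and a reserve of 3, ordinary requests stop succeeding once
5 items are in use, and the remaining 3 items stay available to the
reserve-enabled (non-blocking) requests.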

The proposed fix has been committed to the submitter's GitHub repository and
made public.

New Loader Tunable:
- kern.geom.reserved_new_bios
  The number of bios reserved for non-blocking allocation.  (Default: 65536)
  Zero means no bios are reserved.  Due to a limitation of the uma(9) zone,
  this setting cannot be changed on a running host.

All of the sources are under
https://github.com/altimeter-130ft/freebsd-freebsd-src.

            |                                   | Git Commit Hash
Base Branch | Fix Branch                        | Base            | Fix
============+===================================+=================+============
main        | topic-bio-reservation             | c1ebd76c3f      | c784b64b8a
------------+-----------------------------------+-----------------+------------
stable/14   | stable/14-topic-bio-reservation   | 3c414a8c2f      | aeaac96a7a
------------+-----------------------------------+-----------------+------------
releng/14.1 | releng/14.1-topic-bio-reservation | e3e57ae30c      | 8f0281d20d
------------+-----------------------------------+-----------------+------------
releng/14.0 | releng/14.0-topic-bio-reservation | d338712beb      | 6f8fed52ee
------------+-----------------------------------+-----------------+------------
stable/13   | stable/13-topic-bio-reservation   | 85e63d952d      | 64b9962cec
------------+-----------------------------------+-----------------+------------
releng/13.3 | releng/13.3-topic-bio-reservation | be4f1894ef      | 4d233d7419
------------+-----------------------------------+-----------------+------------
releng/13.2 | releng/13.2-topic-bio-reservation | f5ac4e174f      | 7b156cbac8

Poudriere-bulk(8) has been tested with the releng/14.1-topic-bio-reservation
branch (and the ZFS fix on bug #275594, comment #147), with the following
results proving the fix:
- vm.uma.g_bio.stats.fails did not increase at all.
- "swap_pager: cannot allocate bio" did not appear in the log at all.
- The random build errors disappeared completely.
  + Only one port (graphics/gimp-app) failed, but due to a separate problem
(an internal error of clang).

-- 
You are receiving this mail because:
You are the assignee for the bug.


