Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 04 Dec 2019 14:29:16 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 242427] pmap_remove() sometimes is very slow causing 10+ minutes long reboots
Message-ID:  <bug-242427-227@https.bugs.freebsd.org/bugzilla/>

next in thread | raw e-mail | index | archive | help
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D242427

            Bug ID: 242427
           Summary: pmap_remove() sometimes is very slow causing 10+
                    minutes long reboots
           Product: Base System
           Version: 11.3-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: pen@lysator.liu.se

I've noticed that on our file servers it can take a _very_ long time to reb=
oot
- where most of the time spent is during the "shutdown" phase - time spent
after the "Uptime: " line is printed.

After a long debugging session I've pinpointed it to ZFS freeing the "zio"
cache, which, after many levels of function calls, ends up it in pmap_remov=
e()
where some calls to it takes approximately 1 second. On a basically idle te=
st
server it can take up to 10-20 minutes for the server to "shut down" (or ev=
en
more - the time seems to be relative to the server uptime). We've seen
production machines that seem to be "hung" (atleast an hour or more) so we =
gave
up and sent them a "hard reset" via IPMI.

Hardware:
Dell PowerEdge R730xd with dual Intel Xeon E5-2620v4 CPUs (32 "cpus") and 2=
56GB
of RAM. No swap.

Software:
FreeBSD 11.3-RELEASE-p5. ZFS on boot & data. ZFS ARC limited to 128GB. Appr=
ox
24000 ZFS filesystems (empty on this test server). Snapshots taken every ho=
ur.


An example of how long it can take (I'm only printing timing info for calls
that take >=3D1s (the top 4) or >=3D2s (the rest):

kmem_unback: pmap_remove(kernel_pmap, 18446741877714755584,
18446741877714767872) took 1 seconds
kmem_free: kmem_unback(kmem_object, 18446741877714755584, 12288) took 1 sec=
onds
page_free: kmem_free(kmem_arena, 18446741877714755584, 12288) took 1 seconds
keg_free_slab: keg->uk_freef(mem) {page_free} took 1 seconds

keg_drain: while-keg_free_slab-loop took 14 seconds [20021 loops, 14 slow
calls]
zone_drain_wait: zone_foreach_keg(zone, &keg_drain) took 14 seconds
zone_dtor: zone_drain_wait(zone, M_WAITOK) took 14 seconds
zone_free_item(zone=3DUMA Zones): zone->uz_dtor() took 14 seconds
uma_zdestroy(zio_buf_12288) took 14 seconds
kmem_cache_destroy: uma_zdestroy(0xfffff803467c8ac0) [zio_buf_12288] took 14
seconds
kmem_cache_destroy(zio_buf_cache[20]) took 14 seconds

Called from kern_shutdown() -> EVENTHANDLER_INVOKE(shutdown_post_sync) ->
zfsshutdown()


Ie, ~14 out of 20021 calls to keg_free_slab() takes 1 second instead of
executing really quickly (in this case).
But some kmem_cache:s are much bigger causing delays for 300-600 seconds (or
more).

(I use "time_second" for time measurements, should probably use something w=
ith
better granularity for the top 4 calls :-)

I added a sysctl() kern.shutdown.verbose that I can set to a number to make=
 it
be more verbose (and added a lot of printf()s to get this info)... With it I
now can see the number of filesystems being unmounted (since that too can t=
ake
a little while - nothing close to the times above though).

- Peter

--=20
You are receiving this mail because:
You are the assignee for the bug.=



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-242427-227>