Date:      Sun, 29 Dec 2019 11:50:32 -0500
From:      Mark Johnston <markj@freebsd.org>
To:        Oliver Pinter <oliver.pntr@gmail.com>
Cc:        "src-committers@freebsd.org" <src-committers@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>
Subject:   Re: svn commit: r356159 - head/sys/vm
Message-ID:  <20191229165032.GC30375@raichu>
In-Reply-To: <CAPjTQNFNrM1iWm8JygbWnsnNNVN24PMaitsQv+EDgG8dbZm9Fg@mail.gmail.com>
References:  <201912281904.xBSJ4T19064948@repo.freebsd.org> <CAPjTQNFNrM1iWm8JygbWnsnNNVN24PMaitsQv+EDgG8dbZm9Fg@mail.gmail.com>

On Sun, Dec 29, 2019 at 03:39:55AM +0100, Oliver Pinter wrote:
> Are there any performance measurements from before and after?  It
> would be nice to see them.

I did not do extensive benchmarking.  The aim of the patch set was
simply to remove the use of the hashed page lock, since it shows up
prominently in lock profiles of some workloads.  The problem is that we
acquire these locks any time a page's LRU state is updated, and the use
of the hash lock means that we get false sharing.  The solution is to
implement these state updates using atomic operations on the page
structure itself, making data contention much less likely.  Another
option was to embed a mutex into the vm_page structure, but this would
bloat a structure which is already too large.
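Schematically, the new update path looks something like the sketch
below.  The field, flag, and function names here are made up for
illustration and do not match the actual vm_page interface; the point
is just that a compare-and-swap loop on a packed per-page state word
replaces the hashed lock, so two threads contend only when they update
the same page at the same time.

	#include <stdatomic.h>
	#include <stdint.h>

	struct page {
		_Atomic uint32_t astate;  /* queue index + flags, packed */
	};

	#define	PG_QUEUE_MASK	0x000000ffu	/* which LRU queue, if any */
	#define	PG_ENQUEUED	0x00000100u	/* page is on a queue */

	/*
	 * Move a page to a new queue with an atomic CAS loop on the
	 * page itself, rather than by taking a hashed lock that may be
	 * shared with unrelated pages.
	 */
	static void
	page_set_queue(struct page *m, uint32_t queue)
	{
		uint32_t new, old;

		old = atomic_load_explicit(&m->astate,
		    memory_order_relaxed);
		do {
			new = (old & ~PG_QUEUE_MASK) | queue | PG_ENQUEUED;
		} while (!atomic_compare_exchange_weak_explicit(&m->astate,
		    &old, new, memory_order_release, memory_order_relaxed));
	}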

A secondary goal was to reduce the number of locks held during page
queue scans.  Such scans frequently call pmap_ts_referenced() to collect
info about recent references to the page.  This operation can be
expensive since it may require a TLB shootdown, and it can block for a
long time on the pmap lock, for example if the lock holder is copying
the page tables as part of a fork().  With this change, the active
queue scan body executes without any locks held, so a page daemon
thread blocked on a pmap lock can no longer stall other threads by
holding on to a shared page lock.  Previously, the page daemon could
block faulting threads for a long time, hurting latency.  I don't have
any benchmarks that capture this, but it is something I have observed
in production workloads.
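
The structure of such a scan is roughly the following.  This is a
simplified userland sketch with hypothetical names (struct page_queue,
process_page(), and so on), not the actual vm_pageout code, which is
considerably more involved:

	#include <pthread.h>
	#include <stddef.h>

	struct page {
		struct page	*next;
		/* ... */
	};

	struct page_queue {
		pthread_mutex_t	 lock;
		struct page	*head;
	};

	#define	SCAN_BATCH	32

	/* Placeholder for per-page work, e.g., pmap_ts_referenced(). */
	static void
	process_page(struct page *m)
	{
		(void)m;
	}

	static void
	scan_queue(struct page_queue *pq)
	{
		struct page *batch[SCAN_BATCH];
		int i, n;

		/* Dequeue a batch while holding the queue lock... */
		pthread_mutex_lock(&pq->lock);
		for (n = 0; n < SCAN_BATCH && pq->head != NULL; n++) {
			batch[n] = pq->head;
			pq->head = pq->head->next;
		}
		pthread_mutex_unlock(&pq->lock);

		/*
		 * ...then do the expensive per-page work with no locks
		 * held, so a thread that sleeps on a pmap lock here
		 * cannot stall faulting threads.
		 */
		for (i = 0; i < n; i++)
			process_page(batch[i]);
	}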

I used some microbenchmarks to verify that the change did not penalize
the single-threaded case.  Here are some results on a 64-core arm64
system I have been playing with:
https://people.freebsd.org/~markj/arm64_page_lock/

The benchmark from will-it-scale simply maps 128MB of anonymous memory,
faults on each page, and unmaps it, in a loop.  In the fault handler we
allocate a page and insert it into the active queue, and the unmap
operation removes all of those pages from the queue.  I collected the
throughput for 1, 2, 4, 8, 16 and 32 concurrent processes.
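
For reference, the core of that benchmark is roughly equivalent to the
standalone loop below.  This is my reconstruction for illustration, not
the will-it-scale source itself:

	#include <sys/mman.h>
	#include <err.h>
	#include <unistd.h>

	#define	MAPLEN	(128UL * 1024 * 1024)	/* 128MB, as in the test */

	int
	main(void)
	{
		char *p;
		size_t i, pagesz;

		pagesz = (size_t)sysconf(_SC_PAGESIZE);
		/* Loop forever; the harness counts iterations/sec. */
		for (;;) {
			p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
			    MAP_ANON | MAP_PRIVATE, -1, 0);
			if (p == MAP_FAILED)
				err(1, "mmap");
			/* Fault on each page; each fault enqueues a page. */
			for (i = 0; i < MAPLEN; i += pagesz)
				*(volatile char *)(p + i) = 1;
			/* Unmap, removing all of those pages from the queue. */
			if (munmap(p, MAPLEN) != 0)
				err(1, "munmap");
		}
	}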

With my patches we see some modest gains at low concurrency.  At higher
levels of concurrency we actually get lower throughput than before, as
contention moves from the page locks and the page queue lock to just the
page queue lock.  I don't believe this is a real regression: first, the
benchmark is quite extreme relative to any useful workload, and second,
arm64 suffers from using a much smaller batch size than amd64 for
batched page queue operations.  Increasing the batch size pushes the
crossover point out to higher levels of concurrency.  Some earlier
testing on a 2-socket Xeon system showed a similar pattern with smaller
differences.


