Date: Sun, 29 Dec 2019 23:18:42 +0100
From: Oliver Pinter <oliver.pntr@gmail.com>
To: Mark Johnston <markj@freebsd.org>
Cc: "src-committers@freebsd.org" <src-committers@freebsd.org>,
    "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>,
    "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>
Subject: Re: svn commit: r356159 - head/sys/vm
Message-ID: <CAPjTQNGOton_wJRj7b2_oqo1H=DUG1_VNTm-Zkerb6gBdQ6dMg@mail.gmail.com>
In-Reply-To: <20191229165032.GC30375@raichu>
References: <201912281904.xBSJ4T19064948@repo.freebsd.org>
    <CAPjTQNFNrM1iWm8JygbWnsnNNVN24PMaitsQv+EDgG8dbZm9Fg@mail.gmail.com>
    <20191229165032.GC30375@raichu>
Thanks for the detailed answer Mark!

On Sunday, December 29, 2019, Mark Johnston <markj@freebsd.org> wrote:
> On Sun, Dec 29, 2019 at 03:39:55AM +0100, Oliver Pinter wrote:
> > Is there any performance measurement from before and after. It would be
> > nice to see them.
>
> I did not do extensive benchmarking.  The aim of the patch set was
> simply to remove the use of the hashed page lock, since it shows up
> prominently in lock profiles of some workloads.  The problem is that we
> acquire these locks any time a page's LRU state is updated, and the use
> of the hash lock means that we get false sharing.  The solution is to
> implement these state updates using atomic operations on the page
> structure itself, making data contention much less likely.  Another
> option was to embed a mutex into the vm_page structure, but this would
> bloat a structure which is already too large.
>
> A secondary goal was to reduce the number of locks held during page
> queue scans.  Such scans frequently call pmap_ts_referenced() to collect
> info about recent references to the page.  This operation can be
> expensive since it may require a TLB shootdown, and it can block for a
> long time on the pmap lock, for example if the lock holder is copying
> the page tables as part of a fork().  Now, the active queue scan body is
> executed without any locks held, so a page daemon thread blocked on a
> pmap lock no longer has the potential to block other threads by holding
> on to a shared page lock.  Before, the page daemon could block faulting
> threads for a long time, hurting latency.  I don't have any benchmarks
> that capture this, but it's something that I've observed in production
> workloads.
>
> I used some microbenchmarks to verify that the change did not penalize
> the single-threaded case.
> Here are some results on a 64-core arm64 system I have been playing
> with: https://people.freebsd.org/~markj/arm64_page_lock/
>
> The benchmark from will-it-scale simply maps 128MB of anonymous memory,
> faults on each page, and unmaps it, in a loop.  In the fault handler we
> allocate a page and insert it into the active queue, and the unmap
> operation removes all of those pages from the queue.  I collected the
> throughput for 1, 2, 4, 8, 16 and 32 concurrent processes.
>
> With my patches we see some modest gains at low concurrency.  At higher
> levels of concurrency we actually get lower throughput than before as
> contention moves from the page locks and the page queue lock to just the
> page queue lock.  I don't believe this is a real regression: first, the
> benchmark is quite extreme relative to any useful workload, and second,
> arm64 suffers from using a much smaller batch size than amd64 for
> batched page queue operations.  Changing that pushes the results out
> somewhat.  Some earlier testing on a 2-socket Xeon system showed a
> similar pattern with smaller differences.