Date: Thu, 29 Aug 2024 23:22:51 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Mark Johnston <markj@freebsd.org>
Cc: src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org
Subject: Re: git: ff1ae3b3639d - main - rangelocks: restore caching of the single rl entry in the struct thread
Message-ID: <ZtDYm7BVSdYuFkpA@kib.kiev.ua>
In-Reply-To: <ZtDUOiXuJ5VDbO2i@nuc>
References: <202408060406.47646dvJ004709@gitrepo.freebsd.org> <ZtDUOiXuJ5VDbO2i@nuc>
On Thu, Aug 29, 2024 at 04:04:10PM -0400, Mark Johnston wrote:
> On Tue, Aug 06, 2024 at 04:06:39AM +0000, Konstantin Belousov wrote:
> > The branch main has been updated by kib:
> >
> > URL: https://cgit.FreeBSD.org/src/commit/?id=ff1ae3b3639d39a6485cfc655bf565cd04b9caa6
> >
> > commit ff1ae3b3639d39a6485cfc655bf565cd04b9caa6
> > Author:     Konstantin Belousov <kib@FreeBSD.org>
> > AuthorDate: 2023-08-23 23:29:50 +0000
> > Commit:     Konstantin Belousov <kib@FreeBSD.org>
> > CommitDate: 2024-08-06 04:05:58 +0000
> >
> >     rangelocks: restore caching of the single rl entry in the struct thread
> >
> >     Reviewed by:    markj, Olivier Certner <olce.freebsd@certner.fr>
> >     Tested by:      pho
> >     Sponsored by:   The FreeBSD Foundation
> >     Differential revision:  https://reviews.freebsd.org/D41787
> > ---
> >  sys/kern/kern_rangelock.c | 33 +++++++++++++++++++++++++++------
> >  sys/kern/kern_thread.c    |  1 +
> >  sys/sys/rangelock.h       |  1 +
> >  3 files changed, 29 insertions(+), 6 deletions(-)
> >
> > diff --git a/sys/kern/kern_rangelock.c b/sys/kern/kern_rangelock.c
> > index 355b2125dd9b..2f16569b19aa 100644
> > --- a/sys/kern/kern_rangelock.c
> > +++ b/sys/kern/kern_rangelock.c
> > @@ -82,8 +82,15 @@ static struct rl_q_entry *
> >  rlqentry_alloc(vm_ooffset_t start, vm_ooffset_t end, int flags)
> >  {
> >  	struct rl_q_entry *e;
> > -
> > -	e = uma_zalloc_smr(rl_entry_zone, M_WAITOK);
> > +	struct thread *td;
> > +
> > +	td = curthread;
> > +	if (td->td_rlqe != NULL) {
> > +		e = td->td_rlqe;
> > +		td->td_rlqe = NULL;
> > +	} else {
> > +		e = uma_zalloc_smr(rl_entry_zone, M_WAITOK);
> > +	}
> >  	e->rl_q_next = NULL;
> >  	e->rl_q_free = NULL;
> >  	e->rl_q_start = start;
> > @@ -95,6 +102,12 @@ rlqentry_alloc(vm_ooffset_t start, vm_ooffset_t end, int flags)
> >  	return (e);
> >  }
> >
> > +void
> > +rangelock_entry_free(struct rl_q_entry *e)
> > +{
> > +	uma_zfree_smr(rl_entry_zone, e);
> > +}
> > +
> >  void
> >  rangelock_init(struct rangelock *lock)
> >  {
> > @@ -106,6 +119,7 @@ void
> >  rangelock_destroy(struct rangelock *lock)
> >  {
> >  	struct rl_q_entry *e, *ep;
> > +	struct thread *td;
> >
> >  	MPASS(!lock->sleepers);
> >  	for (e = (struct rl_q_entry *)atomic_load_ptr(&lock->head);
> > @@ -386,8 +400,10 @@ rangelock_lock_int(struct rangelock *lock, bool trylock, vm_ooffset_t start,
> >      vm_ooffset_t end, int locktype)
> >  {
> >  	struct rl_q_entry *e, *free, *x, *xp;
> > +	struct thread *td;
> >  	enum RL_INSERT_RES res;
> >
> > +	td = curthread;
> >  	for (res = RL_LOCK_RETRY; res == RL_LOCK_RETRY;) {
> >  		free = NULL;
> >  		e = rlqentry_alloc(start, end, locktype);
> > @@ -401,10 +417,15 @@ rangelock_lock_int(struct rangelock *lock, bool trylock, vm_ooffset_t start,
> >  			e = NULL;
> >  		}
> >  		for (x = free; x != NULL; x = xp) {
> > -			MPASS(!rl_e_is_marked(x));
> > -			xp = x->rl_q_free;
> > -			MPASS(!rl_e_is_marked(xp));
> > -			uma_zfree_smr(rl_entry_zone, x);
> > +			MPASS(!rl_e_is_marked(x));
> > +			xp = x->rl_q_free;
> > +			MPASS(!rl_e_is_marked(xp));
> > +			if (td->td_rlqe == NULL) {
> > +				smr_synchronize(rl_smr);
>
> I think I had the impression that this was a rare case, but empirically
> it is not.  As far as I can see, this smr_synchronize() call happens
> every time an entry is freed, which could be very frequent.
> smr_synchronize() bumps the global sequence counter and blocks until all
> CPUs have had a chance to observe the new value, so is quite expensive.
>
> I didn't try benchmarking yet, but pwrite3 from will-it-scale should be
> a good candidate.
>
> Why do we maintain this per-thread cache at all?
> IMO it would make more sense to unconditionally free the entry here.
> Or perhaps I'm missing something here.

Normally threads perform i/o syscalls, and for these syscalls they need to
do just one rangelock operation.  In the absence of races with conflicting
locks, the rangelock acquisition does not need another rl_q_entry item.
In the previous rangelock implementation this caching was a significant
improvement, especially on microbenchmarks where the rangelock overhead is
measurable, like a repeated write or read of the same single byte.
Remembering that, I decided to keep the caching.
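
The quoted diff is cut off right after the smr_synchronize() call, so here
is a condensed sketch of how that loop tail is meant to work.  The helper
name rlqentry_recycle() is invented for the sketch only; the real code is
inline in rangelock_lock_int().  One entry is parked in td_rlqe for the
thread's next rangelock operation, the rest go back to the zone:

/*
 * Condensed sketch, not the literal committed hunk.
 */
static void
rlqentry_recycle(struct thread *td, struct rl_q_entry *free)
{
	struct rl_q_entry *x, *xp;

	for (x = free; x != NULL; x = xp) {
		xp = x->rl_q_free;
		if (td->td_rlqe == NULL) {
			/*
			 * Wait for all CPUs to leave their current SMR
			 * sections, so that no reader can still reference
			 * the entry and it can be reused directly.  This
			 * is the smr_synchronize() call whose cost is
			 * being questioned above.
			 */
			smr_synchronize(rl_smr);
			td->td_rlqe = x;	/* refill the per-thread cache */
		} else {
			/* Cheap deferred free through the SMR zone. */
			uma_zfree_smr(rl_entry_zone, x);
		}
	}
}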
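
For "repeated write of the same single byte", a minimal standalone loop of
that flavor is below.  It is only a sketch in the spirit of the
will-it-scale pwrite tests, not the pwrite3 testcase itself; the file path
and the iteration count are arbitrary.  Running it under time(1) on kernels
with and without the caching is the kind of comparison I mean:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/rl_bench";	/* arbitrary scratch file */
	char byte = 0;
	long i, iters = 10000000;		/* arbitrary iteration count */
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd == -1)
		err(1, "open");
	/* Each pwrite(2) is one range-lock acquire/release on the vnode. */
	for (i = 0; i < iters; i++) {
		if (pwrite(fd, &byte, 1, 0) != 1)
			err(1, "pwrite");
	}
	printf("%ld pwrite(2) calls of one byte at offset 0\n", iters);
	close(fd);
	return (0);
}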