Date:      Fri, 10 Jul 2020 17:13:52 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Mark Johnston <markj@freebsd.org>
Cc:        Mateusz Guzik <mjguzik@gmail.com>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Right-sizing the geli thread pool
Message-ID:  <CAOtMX2jSGZk2kgG=EZXc0Yu7zA1pNgCDtNnxWLOPg79jqzp3zw@mail.gmail.com>
In-Reply-To: <20200710175452.GA9380@raichu>
References:  <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com> <CAGudoHE8zvDqXYr5+OEYJbM8uKZ_SvaQVg1Ys5TXqzYEZUgzig@mail.gmail.com> <20200710175452.GA9380@raichu>

On Fri, Jul 10, 2020 at 11:55 AM Mark Johnston <markj@freebsd.org> wrote:

> On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
> > On 7/9/20, Alan Somers <asomers@freebsd.org> wrote:
> > > Currently, geli creates a separate thread pool for each provider, and
> > > by default each thread pool contains one thread per cpu.  On a large
> > > server with many encrypted disks, that can balloon into a very large
> > > number of threads!  I have a patch in progress that switches from
> > > per-provider thread pools to a single thread pool for the entire
> > > module.  Happily, I see read IOPs increase by up to 60%.  But to my
> > > surprise, write IOPs _decreases_ by up to 25%.  dtrace suggests that
> > > the CPU usage is dominated by the vmem_free call in biodone, as in
> > > the below stack.
> > >
> > >               kernel`lock_delay+0x32
> > >               kernel`biodone+0x88
> > >               kernel`g_io_deliver+0x214
> > >               geom_eli.ko`g_eli_write_done+0xf6
> > >               kernel`g_io_deliver+0x214
> > >               kernel`md_kthread+0x275
> > >               kernel`fork_exit+0x7e
> > >               kernel`0xffffffff8104784e
> > >
> > > I only have one idea for how to improve things from here.  The geli
> > > thread pool is still fed by a single global bio queue.  That could
> > > cause cache thrashing, if bios get moved between cores too often.  I
> > > think a superior design would be to use a separate bio queue for each
> > > geli thread, and use work-stealing to balance them.  However,
> > >
> > > 1) That doesn't explain why this change benefits reads more than
> > > writes, and
> > > 2) work-stealing is hard to get right, and I can't find any examples
> > > in the kernel.
> > >
> > > Can anybody offer tips or code for implementing work stealing?  Or any
> > > other suggestions about why my write performance is suffering?  I
> > > would like to get this change committed, but not without resolving
> > > that issue.
> > >
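To make the queue-per-thread idea above concrete, here is roughly the
shape I have in mind.  This is an untested sketch, not the patch; all of
the names (g_eli_queue, g_eli_enqueue, g_eli_dequeue) are made up, and
initialization of the locks and tailqs is omitted:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/pcpu.h>
#include <sys/queue.h>

/*
 * One bio queue per geli worker thread.  Producers push onto the queue
 * matching the current CPU; a worker drains its own queue first and
 * only takes another queue's lock when it has to steal.
 */
struct g_eli_queue {
	struct mtx		gq_lock;
	struct bio_queue	gq_bios;	/* TAILQ of struct bio */
} __aligned(CACHE_LINE_SIZE);

static struct g_eli_queue	*g_eli_queues;	/* one per worker */
static int			 g_eli_nqueues;

/* Producer side: push the bio onto the current CPU's queue. */
static void
g_eli_enqueue(struct bio *bp)
{
	struct g_eli_queue *gq;

	gq = &g_eli_queues[curcpu % g_eli_nqueues];
	mtx_lock(&gq->gq_lock);
	TAILQ_INSERT_TAIL(&gq->gq_bios, bp, bio_queue);
	mtx_unlock(&gq->gq_lock);
	wakeup_one(gq);
}

/* Consumer side: try the local queue first, then scan siblings. */
static struct bio *
g_eli_dequeue(int me)
{
	struct g_eli_queue *gq;
	struct bio *bp;
	int i, victim;

	for (i = 0; i < g_eli_nqueues; i++) {
		victim = (me + i) % g_eli_nqueues;
		gq = &g_eli_queues[victim];
		mtx_lock(&gq->gq_lock);
		bp = TAILQ_FIRST(&gq->gq_bios);
		if (bp != NULL)
			TAILQ_REMOVE(&gq->gq_bios, bp, bio_queue);
		mtx_unlock(&gq->gq_lock);
		if (bp != NULL)
			return (bp);
	}
	return (NULL);	/* caller sleeps on its own queue */
}

The part I'm unsure about is the stealing policy itself: when a worker
should scan, and how to keep bios from ping-ponging between queues,
which is why I'm asking for prior art.
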
> >
> > I can't comment on revamping the design, but:
> >
> > void
> > vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
> > {
> >         qcache_t *qc;
> >         MPASS(size > 0);
> >
> >         if (size <= vm->vm_qcache_max &&
> >             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
> >                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
> >                 uma_zfree(qc->qc_cache, (void *)addr);
> >         } else
> >                 vmem_xfree(vm, addr, size);
> > }
> >
> > What sizes are being passed here? Or more to the point, is it feasible
> > to bump qcache to stick to uma in this call? If lock contention is
> > indeed coming from vmem_xfree this change would get rid of the problem
> > without having to rework anything.
>
> We would have to enable the quantum cache in the transient KVA arena.
> This itself should not have many downsides on platforms with plenty of
> KVA, but it only solves the immediate problem: before freeing the KVA
> biodone() has to perform a global TLB shootdown, and the quantum cache
> doesn't help at all with that.
>
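For anyone following along: the quantum cache Mateusz refers to is
configured when the arena is created; the fifth argument to
vmem_create(9) is the largest allocation size that gets backed by
per-CPU UMA zones, so a vmem_free() at or below that size becomes a
uma_zfree() instead of taking the arena lock in vmem_xfree().  A
minimal sketch of what "bump qcache" would mean, with placeholder
base/size values rather than the real transient-KVA range:

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

static vmem_t *sketch_arena;

static void
sketch_arena_init(void)
{
	sketch_arena = vmem_create("transient arena (sketch)",
	    0,			/* base: placeholder */
	    1024 * PAGE_SIZE,	/* size: placeholder */
	    PAGE_SIZE,		/* quantum */
	    16 * PAGE_SIZE,	/* qcache_max: cache frees up to 16 pages */
	    M_WAITOK);
}

If I'm reading subr_vmem.c right, the cache tops out at 16 quanta
(VMEM_QCACHE_IDX_MAX), so it wouldn't cover a full MAXPHYS-sized
transient mapping anyway.
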
> kib's suggestion of using sf_buf(9) to transiently map crypto(9)
> payloads in software crypto drivers that require a mapping, and using
> unmapped cryptop requests for hardware drivers that do not, sounds like
> the right solution.  cryptop structures can already handle multiple data
> container types, like uios and mbufs, so it should be possible to also
> support vm_page arrays in OCF like we do for unmapped BIOs, and let
> crypto(9) drivers create transient mappings when necessary.
>
> > For read performance, while it is nice there is a win, it may still be
> > less than it should be. I think it is prudent to get a flamegraph from
> > both cases.
>

And, it works!  Using sf_buf was a good suggestion, because the write path
only needs to access the user data for a short amount of time, in a
CPU-pinned thread.  So I can create my sf_buf and immediately free it.  I
didn't even need to change crypto(9).  Writes are much faster now.  I
haven't incorporated sf_buf into the read path yet, though.
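
Roughly, the write path now does something like the sketch below.  This
is simplified and not the actual patch, and g_eli_copyin_unmapped() is a
made-up name; it's just meant to show where the mapping happens.
sf_buf_alloc(9) maps one page of the unmapped bio at a time, the
plaintext is copied out, and the mapping is dropped immediately.
Because the worker is pinned to a CPU the mapping never needs to be
visible anywhere else, and on amd64 sf_buf_kva() is just the direct map,
so there's no TLB shootdown at all:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Copy the plaintext out of an unmapped write bio into a driver-owned
 * buffer ("dst"), mapping one source page at a time with sf_buf(9).
 */
static void
g_eli_copyin_unmapped(struct bio *bp, char *dst)
{
	struct sf_buf *sf;
	vm_page_t m;
	int i, len, off, resid;

	off = bp->bio_ma_offset;
	resid = bp->bio_length;
	for (i = 0; resid > 0; i++) {
		m = bp->bio_ma[i];
		len = MIN((int)(PAGE_SIZE - off), resid);
		sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
		memcpy(dst, (char *)sf_buf_kva(sf) + off, len);
		sf_buf_free(sf);
		dst += len;
		resid -= len;
		off = 0;
	}
}

The destination buffer is ordinary mapped kernel memory, so it can be
handed to crypto(9) as-is, which is why I didn't need to change
crypto(9) itself.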

Preliminary benchmarks:

                                              Write IOPs  Read IOPs
baseline                                           66908      60351
kern.geom.eli.threads=1                           156719     144117
kern.geom.eli.threads=1, direct dispatch          205326     201867
direct dispatch, shared thread pool               156631     226874
direct dispatch, shared thread pool, sf_buf       432573     247275

-Alan


