Date:      Fri, 10 Jul 2020 13:54:52 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        Alan Somers <asomers@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Right-sizing the geli thread pool
Message-ID:  <20200710175452.GA9380@raichu>
In-Reply-To: <CAGudoHE8zvDqXYr5+OEYJbM8uKZ_SvaQVg1Ys5TXqzYEZUgzig@mail.gmail.com>
References:  <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com> <CAGudoHE8zvDqXYr5+OEYJbM8uKZ_SvaQVg1Ys5TXqzYEZUgzig@mail.gmail.com>

On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
> On 7/9/20, Alan Somers <asomers@freebsd.org> wrote:
> > Currently, geli creates a separate thread pool for each provider, and by
> > default each thread pool contains one thread per cpu.  On a large server
> > with many encrypted disks, that can balloon into a very large number of
> > threads!  I have a patch in progress that switches from per-provider thread
> > pools to a single thread pool for the entire module.  Happily, I see read
> > IOPs increase by up to 60%.  But to my surprise, write IOPs _decrease_ by
> > up to 25%.  dtrace suggests that the CPU usage is dominated by the
> > vmem_free call in biodone, as in the below stack.
> >
> >               kernel`lock_delay+0x32
> >               kernel`biodone+0x88
> >               kernel`g_io_deliver+0x214
> >               geom_eli.ko`g_eli_write_done+0xf6
> >               kernel`g_io_deliver+0x214
> >               kernel`md_kthread+0x275
> >               kernel`fork_exit+0x7e
> >               kernel`0xffffffff8104784e
> >
> > I only have one idea for how to improve things from here.  The geli thread
> > pool is still fed by a single global bio queue.  That could cause cache
> > thrashing, if bios get moved between cores too often.  I think a superior
> > design would be to use a separate bio queue for each geli thread, and use
> > work-stealing to balance them.  However,
> >
> > 1) That doesn't explain why this change benefits reads more than writes,
> > and
> > 2) work-stealing is hard to get right, and I can't find any examples in the
> > kernel.
> >
> > Can anybody offer tips or code for implementing work stealing?  Or any
> > other suggestions about why my write performance is suffering?  I would
> > like to get this change committed, but not without resolving that issue.
> >
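
On the work-stealing question: I don't know of a canonical in-kernel
example to point at either.  The usual shape is one queue per worker
thread, with each worker draining its own queue first and only scanning
its siblings' queues when it runs dry.  Below is a rough, untested
sketch of the dequeue side using bioq(9) and a per-worker mutex; none of
these names exist in g_eli today, they are purely illustrative:

#include <sys/param.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Illustrative only; one instance per worker, mtx_init()ed at setup. */
struct eli_worker {
        struct mtx              w_lock;
        struct bio_queue_head   w_queue;
};

static struct eli_worker *eli_workers;  /* array of eli_nworkers entries */
static int eli_nworkers;

/*
 * Pop from our own queue; if it is empty, walk the other workers'
 * queues and steal the first bio we find.
 */
static struct bio *
eli_next_bio(int self)
{
        struct eli_worker *w;
        struct bio *bp;
        int i, victim;

        w = &eli_workers[self];
        mtx_lock(&w->w_lock);
        bp = bioq_takefirst(&w->w_queue);
        mtx_unlock(&w->w_lock);
        if (bp != NULL)
                return (bp);

        for (i = 1; i < eli_nworkers; i++) {
                victim = (self + i) % eli_nworkers;
                w = &eli_workers[victim];
                mtx_lock(&w->w_lock);
                bp = bioq_takefirst(&w->w_queue);
                mtx_unlock(&w->w_lock);
                if (bp != NULL)
                        return (bp);
        }
        return (NULL);  /* caller sleeps until more work is queued */
}

The parts the sketch glosses over are the hard ones: batching steals so
an idle thread doesn't bounce every victim's lock one bio at a time, and
preserving whatever per-provider ordering you rely on.  That said, I'd
chase the vmem_free() cost first; see below.
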
> 
> I can't comment on revamping the design, but:
> 
> void
> vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
> {
>         qcache_t *qc;
>         MPASS(size > 0);
> 
>         if (size <= vm->vm_qcache_max &&
>             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
>                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>                 uma_zfree(qc->qc_cache, (void *)addr);
>         } else
>                 vmem_xfree(vm, addr, size);
> }
> 
> What sizes are being passed here? Or more to the point, is it feasible
> to bump qcache to stick to uma in this call? If lock contention is
> indeed coming from vmem_xfree this change would get rid of the problem
> without having to rework anything.

We would have to enable the quantum cache in the transient KVA arena.
This itself should not have many downsides on platforms with plenty of
KVA, but it only solves the immediate problem: before freeing the KVA,
biodone() has to perform a global TLB shootdown, and the quantum cache
doesn't help at all with that.
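
For reference, the quantum cache is a property of the arena itself: a
vmem(9) arena created with a nonzero qcache_max serves allocations and
frees up to that size from per-CPU UMA zones, which is the uma_zfree()
path in the vmem_free() you quoted, so small frees never take the arena
lock in vmem_xfree().  A minimal sketch; the arena name and sizes here
are made up, not the actual transient arena setup:

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

vmem_t *example_arena;

void
example_arena_init(vmem_addr_t base, vmem_size_t size)
{
        /*
         * qcache_max of 8 pages: allocations and frees of up to
         * 8 pages are handled by per-CPU UMA caches rather than
         * vmem_xalloc()/vmem_xfree().
         */
        example_arena = vmem_create("example arena", base, size,
            PAGE_SIZE, 8 * PAGE_SIZE, M_WAITOK);
}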

kib's suggestion of using sf_buf(9) to transiently map crypto(9)
payloads in software crypto drivers that require a mapping, and using
unmapped cryptop requests for hardware drivers that do not, sounds like
the right solution.  cryptop structures can already handle multiple data
container types, like uios and mbufs, so it should be possible to also
support vm_page arrays in OCF like we do for unmapped BIOs, and let
crypto(9) drivers create transient mappings when necessary.
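
To make the software-driver side concrete: the per-page mapping part is
just the existing sf_buf(9) KPI, while the vm_page-array plumbing is the
piece OCF would need to grow, so the helper below and its parameters are
hypothetical.  Something along these lines, where "off" is the offset
into the first page and "cipher" stands in for the driver's processing
routine:

#include <sys/param.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Hypothetical helper: walk an unmapped payload described by an array
 * of vm_page pointers, mapping one page at a time with sf_buf(9) and
 * handing the KVA to the cipher routine.
 */
static void
swcr_process_vmpages(vm_page_t *pages, vm_offset_t off, size_t len,
    void (*cipher)(void *kva, size_t n))
{
        struct sf_buf *sf;
        size_t todo;
        int i;

        for (i = 0; len > 0; i++) {
                sf = sf_buf_alloc(pages[i], SFB_DEFAULT);
                todo = MIN(len, PAGE_SIZE - off);
                cipher((char *)sf_buf_kva(sf) + off, todo);
                sf_buf_free(sf);
                off = 0;
                len -= todo;
        }
}

On amd64 sf_buf_kva() comes straight from the direct map, so there is no
transient KVA allocation and no TLB shootdown to pay for on that path.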

> For read performance, while it is nice that there is a win, it may still be
> less than it should be.  I think it is prudent to get a flamegraph from
> both cases.
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>


