Date:      Fri, 10 Jul 2020 05:55:50 +0200
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Alan Somers <asomers@freebsd.org>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Right-sizing the geli thread pool
Message-ID:  <CAGudoHE8zvDqXYr5+OEYJbM8uKZ_SvaQVg1Ys5TXqzYEZUgzig@mail.gmail.com>
In-Reply-To: <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com>
References:  <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com>

On 7/9/20, Alan Somers <asomers@freebsd.org> wrote:
> Currently, geli creates a separate thread pool for each provider, and by
> default each thread pool contains one thread per cpu.  On a large server
> with many encrypted disks, that can balloon into a very large number of
> threads!  I have a patch in progress that switches from per-provider thread
> pools to a single thread pool for the entire module.  Happily, I see read
> IOPs increase by up to 60%.  But to my surprise, write IOPs _decrease_ by
> up to 25%.  dtrace suggests that the CPU usage is dominated by the
> vmem_free call in biodone, as in the below stack.
>
>               kernel`lock_delay+0x32
>               kernel`biodone+0x88
>               kernel`g_io_deliver+0x214
>               geom_eli.ko`g_eli_write_done+0xf6
>               kernel`g_io_deliver+0x214
>               kernel`md_kthread+0x275
>               kernel`fork_exit+0x7e
>               kernel`0xffffffff8104784e
>
> I only have one idea for how to improve things from here.  The geli thread
> pool is still fed by a single global bio queue.  That could cause cache
> thrashing, if bios get moved between cores too often.  I think a superior
> design would be to use a separate bio queue for each geli thread, and use
> work-stealing to balance them.  However,
>
> 1) That doesn't explain why this change benefits reads more than writes,
> and
> 2) work-stealing is hard to get right, and I can't find any examples in the
> kernel.
>
> Can anybody offer tips or code for implementing work stealing?  Or any
> other suggestions about why my write performance is suffering?  I would
> like to get this change committed, but not without resolving that issue.
>

I can't comment on revamping the design, but:

void
vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
{
        qcache_t *qc;
        MPASS(size > 0);

        if (size <= vm->vm_qcache_max &&
            __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
                /* Small frees go to the per-size UMA quantum cache. */
                qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
                uma_zfree(qc->qc_cache, (void *)addr);
        } else
                /* Everything else takes the arena lock in vmem_xfree(). */
                vmem_xfree(vm, addr, size);
}

What sizes are being passed here? Or more to the point, is it feasible
to bump the qcache limit so that these frees stay within uma? If the
lock contention is indeed coming from vmem_xfree, that change would get
rid of the problem without having to rework anything.
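
If you want a quick answer to the size question, something along these
lines should show the distribution reaching vmem_free (untested sketch;
it assumes the fbt arg layout matches vmem_free(vm, addr, size) above):

dtrace -n '
fbt::vmem_free:entry
{
        /* arg2 is the size argument; compare against vm_qcache_max. */
        @["vmem_free size (bytes)"] = quantize(arg2);
}
tick-10s
{
        exit(0);
}'

If most frees land just above the qcache limit, bumping it may be all
that is needed.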

As for read performance, while it is nice that there is a win, it may
still be smaller than it should be. I think it is prudent to get a
flamegraph from both cases.
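
For collecting them, the usual profile + FlameGraph combination works
(untested sketch; assumes Brendan Gregg's stackcollapse.pl and
flamegraph.pl scripts are in $PATH):

# Sample on-CPU kernel stacks for ~30 seconds while the benchmark runs.
dtrace -x stackframes=100 \
    -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }' \
    -o out.stacks

# Fold the DTrace aggregation output and render the graph.
stackcollapse.pl out.stacks > out.folded
flamegraph.pl out.folded > geli.svg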

-- 
Mateusz Guzik <mjguzik gmail.com>


