Date:      Fri, 10 Jul 2020 11:52:26 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Alan Somers <asomers@freebsd.org>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Right-sizing the geli thread pool
Message-ID:  <20200710085226.GC2866@kib.kiev.ua>
In-Reply-To: <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com>
References:  <CAOtMX2g0UTT1wG+_rUNssVvaJH1LfG-UoEGvYhYGQZVn26dNFA@mail.gmail.com>

On Thu, Jul 09, 2020 at 03:26:41PM -0600, Alan Somers wrote:
> Currently, geli creates a separate thread pool for each provider, and by
> default each thread pool contains one thread per CPU.  On a large server
> with many encrypted disks, that can balloon into a very large number of
> threads!  I have a patch in progress that switches from per-provider thread
> pools to a single thread pool for the entire module.  Happily, I see read
> IOPs increase by up to 60%.  But to my surprise, write IOPs _decrease_ by
> up to 25%.  dtrace suggests that the CPU usage is dominated by the
> vmem_free call in biodone, as in the below stack.
> 
>               kernel`lock_delay+0x32
>               kernel`biodone+0x88
>               kernel`g_io_deliver+0x214
>               geom_eli.ko`g_eli_write_done+0xf6
>               kernel`g_io_deliver+0x214
>               kernel`md_kthread+0x275
>               kernel`fork_exit+0x7e
>               kernel`0xffffffff8104784e
> 
> I only have one idea for how to improve things from here.  The geli thread
> pool is still fed by a single global bio queue.  That could cause cache
> thrashing if bios get moved between cores too often.  I think a superior
> design would be to use a separate bio queue for each geli thread, and use
> work-stealing to balance them.  However,
Geli uses mapped I/O, and the fact that vmem_free() is called from biodone()
means that GEOM has to enable transient remapping to handle unmapped requests
coming to the geli provider.

This path was never supposed to be fast.  Geli might need access to
the bio's data, e.g. for AES-NI processing, or rather, the crypto(9) aesni
driver needs it.  But it might be very beneficial to declare geli
as supporting unmapped I/O and only do transient remapping on a pinned
thread, to avoid global KVA allocations and TLB shootdowns.
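
As a rough sketch (G_PF_ACCEPT_UNMAPPED, BIO_UNMAPPED and the
pmap_quick_enter_page() KPI all exist; the geli glue around them here is
only illustrative, not the actual patch):

    /*
     * Advertise unmapped-I/O support when the geli provider is
     * created, so GEOM stops transient-mapping bios on our behalf.
     */
    pp = g_new_providerf(gp, "%s%s", bpp->name, G_ELI_SUFFIX);
    pp->flags |= G_PF_ACCEPT_UNMAPPED;

    /*
     * When a worker thread actually needs the data, map one page at
     * a time with the per-CPU quick mapping (plain DMAP on amd64):
     * no global KVA arena, no TLB shootdowns.
     */
    if ((bp->bio_flags & BIO_UNMAPPED) != 0) {
            vm_offset_t va;
            int i;

            for (i = 0; i < bp->bio_ma_n; i++) {
                    va = pmap_quick_enter_page(bp->bio_ma[i]);
                    /* ... process the page mapped at va ... */
                    pmap_quick_remove_page(va);
            }
    }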

Another possibly huge optimization could be in the aesni crypto(9) driver.
I am not sure what the state of crypto(9) is WRT unmapped requests; there
was a lot of work improving the framework, so it might support unmapped
buffers.  On amd64, aesni can work with unmapped requests through the
DMAP, which means that no remapping is needed.
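
For amd64, that direct-map access could look roughly like this (bio_ma,
bio_ma_offset and PHYS_TO_DMAP() are existing interfaces; the loop itself
is just a sketch of the idea):

    /*
     * Walk an unmapped bio's page array through the direct map,
     * creating no mapping at all.  amd64-only; other architectures
     * would fall back to pmap_quick_enter_page() or transient KVA.
     */
    vm_offset_t off;
    off_t resid;
    size_t len;
    char *p;
    int i;

    off = bp->bio_ma_offset;
    resid = bp->bio_length;
    for (i = 0; resid > 0; i++) {
            p = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(bp->bio_ma[i])) + off;
            len = MIN(PAGE_SIZE - off, (size_t)resid);
            /* ... hand p/len to the AES-NI code ... */
            off = 0;
            resid -= len;
    }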

> 
> 1) That doesn't explain why this change benefits reads more than writes, and
> 2) work-stealing is hard to get right, and I can't find any examples in the
> kernel.
> 
> Can anybody offer tips or code for implementing work stealing?  Or any
> other suggestions about why my write performance is suffering?  I would
> like to get this change committed, but not without resolving that issue.
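
One minimal shape for per-thread queues with stealing, just to make the
idea concrete (struct eli_queue and eli_dequeue() are made-up names for
illustration, not existing kernel interfaces):

    /* Hypothetical per-worker queue, one per geli thread. */
    struct eli_queue {
            struct mtx       eq_lock;
            struct bio_queue eq_bios;       /* TAILQ_HEAD(, bio) */
    };

    /* Pop from our own queue; if it is empty, steal from a sibling. */
    static struct bio *
    eli_dequeue(struct eli_queue *qs, int nq, int self)
    {
            struct bio *bp;
            int i, j;

            for (j = 0; j < nq; j++) {
                    i = (self + j) % nq;
                    mtx_lock(&qs[i].eq_lock);
                    bp = TAILQ_FIRST(&qs[i].eq_bios);
                    if (bp != NULL)
                            TAILQ_REMOVE(&qs[i].eq_bios, bp, bio_queue);
                    mtx_unlock(&qs[i].eq_lock);
                    if (bp != NULL)
                            return (bp);
            }
            return (NULL);
    }

Starting the scan at the thread's own index keeps the common case down to
one lock and one cache line; only otherwise-idle threads pay for the
stealing walk.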
> 
> -Alan
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"


