Date: Tue, 7 Jul 2020 16:02:56 -0600
From: Alan Somers <asomers@freebsd.org>
To: Paweł Jakub Dawidek <pawel@dawidek.net>, freebsd-geom@freebsd.org
Subject: Re: Single-threaded bottleneck in geli
Message-ID: <CAOtMX2hhPawY39HHydELD8wQubFsE3VfE=PsSBMayJrgNKR1BA@mail.gmail.com>
In-Reply-To: <20200704231642.GU4213@funkthat.com>
References: <CAOtMX2hHaEzOT0jmc_QcukVZjRKUtCm55bTT9Q5=BNCcL9rf+g@mail.gmail.com>
 <80B62FE6-FCFB-42B8-A34C-B28E7DDBF45D@dawidek.net>
 <CAOtMX2idj0pfpo4k+xOjaZ9Tk9tLNavNqAo9nGyYOs6OkG7r8w@mail.gmail.com>
 <20200704231642.GU4213@funkthat.com>
Ok, this turned out to be embarrassingly easy.  I just turned it on, and
everything worked.  On a dual-socket Intel Xeon E5-2680 (16 cores,
hyperthreading disabled), this change roughly doubles geli's IOPs when
reading or writing to 32 malloc-backed md devices.  But I don't have any
hardware crypto accelerator to test it on.  Could anybody help me out
with that?
https://reviews.freebsd.org/D25587

On Sat, Jul 4, 2020 at 5:16 PM John-Mark Gurney <jmg@funkthat.com> wrote:

> Alan Somers wrote this message on Sat, Jul 04, 2020 at 15:59 -0600:
> > I might give this a shot.  What is the best way to tell if geli
> > ought to use direct dispatch?  Is there a generic "are crypto
> > instructions available" macro that would cover aesni as well as
> > other platform-specific instructions?
>
> Direct dispatch has the advantage of saving scheduling context
> switches...
>
> Geli has two modes: one is hardware acceleration mode, where it does
> a bit of work to put together the request and then hands it off to
> the crypto framework (say, an accelerator card), and then there is
> the mode where the work has to be done in software, where it
> dispatches to a set of worker threads...
>
> In both modes, it would make sense for GELI to be able to do the work
> of constructing those requests via direct dispatch...  This would
> eliminate a context switch, which is always a good thing...  I
> haven't looked at the OpenCrypto code in a while, so I don't know
> what the locking requirements are...
>
> The key cache is already protected by an mtx, but I believe it's a
> leaf lock, and so shouldn't be an issue...
>
> I'll add this to my list of things to look at...
>
> Also, if you have that many geli devices, you might also want to set:
> kern.geom.eli.threads=1
>
> As it stands, geli fires up ncpu threads for EACH geli device, so you
> likely have thousands of geli threads...
>
> > On Sat, Jul 4, 2020, 2:55 PM Paweł Jakub Dawidek <pawel@dawidek.net>
> > wrote:
> >
> > > Direct dispatch would be great for geli, especially since geli
> > > can use its own (multiple) threads when necessary (e.g. when
> > > using crypto cards).  With AES-NI you could go straight to the
> > > disk.
> > >
> > > > On Jul 3, 2020, at 13:22, Alan Somers <asomers@freebsd.org>
> > > > wrote:
> > > >
> > > > I don't.  What I meant was that a single thread (geom) is
> > > > limiting the performance of the system overall.  I'm certain,
> > > > based on top, gstat, and zpool iostat, that geom is the
> > > > limiting factor on this system.
> > > > -Alan
> > > >
> > > >> On Fri, Jul 3, 2020 at 2:18 PM Paweł Jakub Dawidek
> > > >> <pawel@dawidek.net> wrote:
> > > >>
> > > >> Hi Alan,
> > > >>
> > > >> why do you think it will hurt single-threaded performance?
> > > >>
> > > >>> On Jul 3, 2020, at 12:30, Alan Somers <asomers@freebsd.org>
> > > >>> wrote:
> > > >>>
> > > >>> I'm using geli, gmultipath, and ZFS on a large system, with
> > > >>> hundreds of drives.  What I'm seeing is that under at least
> > > >>> some workloads, the overall performance is limited by the
> > > >>> single geom kernel process.  procstat and kgdb aren't much
> > > >>> help in telling exactly why this process is using so much
> > > >>> CPU, but it certainly must be related to the fact that over
> > > >>> 15,000 IOPs are going through that thread.  What can I do to
> > > >>> improve this situation?  Would it make sense to enable direct
> > > >>> dispatch for geli?
> > > >>> That would hurt single-threaded performance, but probably
> > > >>> improve performance for highly multithreaded workloads like
> > > >>> mine.
> > > >>>
> > > >>> Example top output:
> > > >>>   PID USERNAME  PRI NICE   SIZE    RES STATE   C    TIME   WCPU COMMAND
> > > >>>    13 root       -8    -     0B    96K CPU46  46   82.7H 70.54% geom{g_down}
> > > >>>    13 root       -8    -     0B    96K -       9   35.5H 25.32% geom{g_up}
>
> --
> John-Mark Gurney                              Voice: +1 415 225 5579
>
>      "All that I will do, has been done, All that I have, has not."
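
For readers following along, here is a minimal sketch of what "direct
dispatch" means at the GEOM level.  The flag names are the standard ones
from sys/geom/geom.h; the function name is made up for illustration, and
this is not the actual diff from D25587 (geli's real setup code lives in
g_eli_create() in sys/geom/eli/g_eli.c):

/*
 * Illustrative fragment: a GEOM class advertising that its start and
 * done routines may run in the caller's context instead of always
 * being queued to the g_down / g_up kernel threads.  A class may only
 * set these flags if those routines are safe without the
 * serialization that the g_down/g_up threads normally provide.
 */
#include <geom/geom.h>

static void
example_enable_direct_dispatch(struct g_provider *pp, struct g_consumer *cp)
{
	/*
	 * On our provider (the device we export upward): allow I/O
	 * requests to reach our start routine directly, and allow us
	 * to deliver completions without a trip through g_up.
	 */
	pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE;

	/*
	 * On our consumer (our attachment to the device below): allow
	 * us to push requests down without the g_down thread, and
	 * allow our done routine to be called from the lower layer's
	 * completion context.
	 */
	cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE;
}

Setting the flags only removes the g_up/g_down context switches; the
crypto work itself is still done wherever the class chooses, either in
geli's per-device worker threads or in the opencrypto framework, which is
the distinction John-Mark draws above.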
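
On Alan's question about a generic "are crypto instructions available"
macro: as far as I know there is no machine-independent test; each
platform exposes its own feature bits (e.g. the AES field of
ID_AA64ISAR0_EL1 on arm64).  A sketch of the x86 check, the same CPUID
bit the aesni(4) driver probes; the function name is hypothetical:

/*
 * Sketch: detecting AES-NI from kernel code on x86.
 */
#include <sys/types.h>
#include <machine/md_var.h>	/* cpu_feature2 */
#include <machine/specialreg.h>	/* CPUID2_AESNI */

static int
example_have_aesni(void)
{

	return ((cpu_feature2 & CPUID2_AESNI) != 0);
}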
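
And on the kern.geom.eli.threads suggestion: it is a kernel tunable
documented in geli(8), so on a box with hundreds of providers it would
normally be set before the providers attach, e.g. in /boot/loader.conf.
The value 1 simply follows the suggestion above; the right number depends
on the workload:

# One geli worker thread per provider instead of one per CPU:
kern.geom.eli.threads=1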