From owner-freebsd-hackers@freebsd.org Tue Jul 14 16:37:12 2020
From: Alan Somers
Date: Tue, 14 Jul 2020 10:36:56 -0600
Subject: Re: Right-sizing the geli thread pool
Cc: Mark Johnston, Mateusz Guzik, FreeBSD Hackers

On Fri, Jul 10, 2020 at 5:13 PM
Alan Somers wrote:

> On Fri, Jul 10, 2020 at 11:55 AM Mark Johnston wrote:
>
>> On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
>> > On 7/9/20, Alan Somers wrote:
>> > > Currently, geli creates a separate thread pool for each provider, and by default each thread pool contains one thread per cpu. On a large server with many encrypted disks, that can balloon into a very large number of threads! I have a patch in progress that switches from per-provider thread pools to a single thread pool for the entire module. Happily, I see read IOPs increase by up to 60%. But to my surprise, write IOPs _decreases_ by up to 25%. dtrace suggests that the CPU usage is dominated by the vmem_free call in biodone, as in the below stack.
>> > >
>> > >         kernel`lock_delay+0x32
>> > >         kernel`biodone+0x88
>> > >         kernel`g_io_deliver+0x214
>> > >         geom_eli.ko`g_eli_write_done+0xf6
>> > >         kernel`g_io_deliver+0x214
>> > >         kernel`md_kthread+0x275
>> > >         kernel`fork_exit+0x7e
>> > >         kernel`0xffffffff8104784e
>> > >
>> > > I only have one idea for how to improve things from here. The geli thread pool is still fed by a single global bio queue. That could cause cache thrashing, if bios get moved between cores too often. I think a superior design would be to use a separate bio queue for each geli thread, and use work-stealing to balance them. However,
>> > >
>> > > 1) That doesn't explain why this change benefits reads more than writes, and
>> > > 2) work-stealing is hard to get right, and I can't find any examples in the kernel.
>> > >
>> > > Can anybody offer tips or code for implementing work stealing? Or any other suggestions about why my write performance is suffering? I would like to get this change committed, but not without resolving that issue.
>> > >
>> >
>> > I can't comment on revamping the design, but:
>> >
>> > void
>> > vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
>> > {
>> >         qcache_t *qc;
>> >         MPASS(size > 0);
>> >
>> >         if (size <= vm->vm_qcache_max &&
>> >             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
>> >                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>> >                 uma_zfree(qc->qc_cache, (void *)addr);
>> >         } else
>> >                 vmem_xfree(vm, addr, size);
>> > }
>> >
>> > What sizes are being passed here? Or more to the point, is it feasible to bump qcache to stick to uma in this call? If lock contention is indeed coming from vmem_xfree this change would get rid of the problem without having to rework anything.
>>
>> We would have to enable the quantum cache in the transient KVA arena. This itself should not have many downsides on platforms with plenty of KVA, but it only solves the immediate problem: before freeing the KVA biodone() has to perform a global TLB shootdown, and the quantum cache doesn't help at all with that.
>>
>> kib's suggestion of using sf_buf(9) to transiently map crypto(9) payloads in software crypto drivers that require a mapping, and using unmapped cryptop requests for hardware drivers that do not, sounds like the right solution. cryptop structures can already handle multiple data container types, like uios and mbufs, so it should be possible to also support vm_page arrays in OCF like we do for unmapped BIOs, and let crypto(9) drivers create transient mappings when necessary.
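To make that concrete for the archives, the transient mapping in the write path ends up looking roughly like the sketch below. This is illustrative only, not the actual patch: the function name is invented, the transform callback is a stand-in for geli's real per-page encryption step, and error handling is pared down.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Sketch: transiently map one page of an unmapped write BIO so a
 * software transform can read it, then drop the mapping right away.
 */
static int
g_eli_map_and_transform_page(vm_page_t m, size_t len,
    int (*transform)(void *kva, size_t len))
{
        struct sf_buf *sf;
        int error;

        /*
         * sf_buf_alloc(9) gives a short-lived KVA mapping without going
         * through the transient KVA arena, so biodone() never has to do
         * the global TLB shootdown that showed up in the profile.  With
         * flags == 0 this may sleep; SFB_NOWAIT makes it fail instead.
         */
        sf = sf_buf_alloc(m, 0);
        if (sf == NULL)
                return (ENOMEM);

        error = transform((void *)sf_buf_kva(sf), len);

        /* The mapping is only needed while the transform runs. */
        sf_buf_free(sf);
        return (error);
}

Since the worker thread is pinned to a CPU while it touches the data, SFB_CPUPRIVATE may also be worth trying, to keep the mapping CPU-private and avoid remote invalidations altogether.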
>> > For read performance, while it is nice there is a win, it may still be less than it should. I think it is prudent to get a flamegraph from both cases.
>
> And, it works! Using sf_buf was a good suggestion, because the write path only needs to access the user data for a short amount of time, in a CPU-pinned thread. So I can create my sf_buf and immediately free it. I didn't even need to change crypto(9). Writes are much faster now. I haven't incorporated sf_buf into the read path yet, though.
>
> Preliminary benchmarks:
>
>                                               Write IOPs   Read IOPs
> baseline                                           66908       60351
> kern.geom.eli.threads=1                           156719      144117
> kern.geom.eli.threads=1, direct dispatch          205326      201867
> direct dispatch, shared thread pool               156631      226874
> direct dispatch, shared thread pool, sf_buf       432573      247275
>
> -Alan

Ok, I got ahead of myself. Actually, it _doesn't_ work. I was initially elated by how much extra speed I got by using sf_bufs. But even so, writes are still slower (and reads faster) when I use a shared thread pool. I'll work on committing the sf_buf changes, then try to debug the shared thread pool performance degradation again.

-Alan
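P.S. For anyone who wants to experiment with the per-thread queue idea from upthread, the shape I have in mind is roughly the sketch below. It is only a sketch: struct g_eli_worker_queue, g_eli_wq, and g_eli_nworkers are invented names, there is no batching or clever victim selection, and a real patch would still have to handle worker wakeups and draining.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* One bio queue per worker thread (invented layout for this sketch). */
struct g_eli_worker_queue {
        struct mtx              wq_lock;
        struct bio_queue_head   wq_queue;
} __aligned(CACHE_LINE_SIZE);

static struct g_eli_worker_queue *g_eli_wq;     /* g_eli_nworkers entries */
static u_int g_eli_nworkers;

/* Producer side: the start routine picks a queue, e.g. by current CPU. */
static void
g_eli_wq_enqueue(struct bio *bp, u_int idx)
{
        struct g_eli_worker_queue *wq;

        wq = &g_eli_wq[idx % g_eli_nworkers];
        mtx_lock(&wq->wq_lock);
        bioq_insert_tail(&wq->wq_queue, bp);
        mtx_unlock(&wq->wq_lock);
        /* A real implementation would wake the owning worker here. */
}

/*
 * Consumer side: worker "me" drains its own queue first and only scans
 * the other queues (steals) when its own is empty, so a bio is normally
 * processed on the CPU that queued it.
 */
static struct bio *
g_eli_wq_dequeue(u_int me)
{
        struct g_eli_worker_queue *wq;
        struct bio *bp;
        u_int i;

        for (i = 0; i < g_eli_nworkers; i++) {
                wq = &g_eli_wq[(me + i) % g_eli_nworkers];
                mtx_lock(&wq->wq_lock);
                bp = bioq_takefirst(&wq->wq_queue);
                mtx_unlock(&wq->wq_lock);
                if (bp != NULL)
                        return (bp);
        }
        return (NULL);          /* Caller goes back to sleep. */
}

The point is that stealing is the slow path: as long as every worker keeps up, bios stay on the queue (and CPU) where they were enqueued, which is exactly the cache behavior a single shared queue can't guarantee.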